Happy families are all alike; every unhappy family is unhappy in its own way.
– Leo Tolstoy
Tidy datasets are all alike, but every messy dataset is messy in its own way.
– Hadley Wickham
The idea is called the Anna Karenina principle. For the purposes of digital marketing, I’ve found that “tidy” data are deduplicated, merged from multiple sources, and formatted in a sensible, actionable way.
In this post, we’ll focus on merging and deduplicating, a process generally known as identity resolution. (We won’t talk about formatting in this post, but get excited! We’ll address it in a future post.)
Identity resolution is no small task — if it were easier, you wouldn’t be reading a blog post like this — but with the right tools and the right framework, it can become a lot more manageable.
Entities and Fragments
Let’s start by defining a term we’ll encounter a lot: an entity is a thing with a distinct and independent existence. For our purposes here, an entity is a data representation of something you care about, with data coming from multiple sources. An entity could be a building, with data coming from government or bank records, and addresses in multiple forms. An entity could be a music album, with data coming from music reviews and blog posts, or structured data sources like Wikidata for genre information. But if you’re reading this blog post, the entities you care about are probably humans, with data coming from web traffic, email activity, support tickets, social media, etc.
That’s actually a pretty big et cetera — ChiefMartec’s 2017 supergraphic has 5,381 “marketing technology solutions,” each powered by its own data, and each one generating potentially many data streams. So any approach we take to resolution will need to work with data from an arbitrary number of sources.
In this post, we’ll use the term data stream to describe data of a similar kind from the same data source. Each data stream tells a slightly different story about an entity, and usually necessitates summarizing that data a little bit differently. For a data stream containing email event data, we’d probably want to summarize that stream with total email open count, subscription status, or the timestamp of the last email a user was sent. For a data stream containing web activity data, we’d want to summarize the stream with things like URLs viewed, last active date, or a list of all of a user’s HTTP referrers.
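To make that concrete, here’s a minimal sketch of summarizing a raw email event stream into a single fragment-style summary. The field names and events here are invented for illustration, not any particular vendor’s schema:

```python
from collections import Counter

# Hypothetical raw email events; field names are illustrative only.
email_events = [
    {"email": "ana@example.com", "event": "open",  "ts": "2017-11-01T09:15:00Z"},
    {"email": "ana@example.com", "event": "click", "ts": "2017-11-02T10:30:00Z"},
    {"email": "ana@example.com", "event": "open",  "ts": "2017-11-05T08:05:00Z"},
]

def summarize_email_stream(events):
    """Collapse a raw event stream into one fragment summary."""
    counts = Counter(e["event"] for e in events)
    return {
        "open_count": counts["open"],
        "click_count": counts["click"],
        # ISO-8601 timestamps sort lexicographically, so max() gives the latest.
        "last_event_ts": max(e["ts"] for e in events),
    }

fragment = summarize_email_stream(email_events)
```

A web-activity stream would get the same treatment with different summary fields (URLs viewed, last active date, referrers), producing a separate fragment.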
Each of those stream summaries is called an entity fragment. A customer’s entity can contain data from one or more fragments, so the task of identity resolution lies in connecting an entity to all of its entity fragments.
Lots of times when you want to represent connections between a bunch of different things, like entity fragments, you’ll put it in a graph — the same kind of mathematical model Mark Zuckerberg referred to when he used to talk about Facebook as a “social graph” and the same kind we use in our topic graph inside our Lytics Content Affinity Engine. Modeling entity fragments in a graph allows us to be realistic about the relationships we expect to see in the data, and provides the flexibility we need to connect an entity to one or more fragments from one or more data streams.
Let’s take a look at a graph of two entities — my grandmother and me. My browsing behavior, which is all over the place, might lead to web activity fragments from my phone, my work laptop, and my personal laptop, with each fragment being identified by a different cookie on each browser I use (and I clear my cookies pretty frequently). On the other hand, my grandmother’s browsing behavior would realistically be identified on a single fragment, as she only uses her desktop computer and has no idea what a cookie is, let alone how to delete one.
A graph makes no assumptions about how many relationships might exist between fragments, and lets the data speak entirely for themselves. Let’s have a look at what my grandmother’s entity graph and mine might look like, visually.
This is straightforward enough for a single data stream, but the graph model allows us to maintain the same flexible model even when we add other data sources. Let’s continue with one of the more common and simple cases of multi-channel data management: web + email.
While web fragments are typically identified by cookies, email fragments are typically identified by an email address. Every time I register web activity, the web fragment identified by my current web cookie is augmented with new data. Every time I engage with email, the email fragment identified by my email address is augmented with new data.
Now let’s say I’m browsing on my favorite clothing site, and I’m always logged into my account on my personal laptop, but never on my phone. As far as this retailer is concerned, those two fragments represent two entirely different humans, as there’s no evidence linking the two. Later, I read an email on my phone about some holiday promotion and click a link in it, which brings me to the site. That email click, which opened the link in my phone’s browser, associated my phone’s web fragment with my email fragment, which was already associated with my laptop’s web fragment. Now, the three fragments — two web fragments and one email fragment — are all tied to the same customer entity.
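The merging step above can be sketched with a union-find structure, a common way to track which fragments have been resolved into the same entity. The fragment identifiers are invented, and this glosses over everything a production system would add (persistence, probabilistic edges, un-merging), but it shows how one new edge collapses two groups into one:

```python
# Minimal union-find sketch: each fragment starts as its own entity,
# and every observed edge merges the two groups it touches.
parent = {}

def find(x):
    """Return the representative fragment for x's current entity."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps lookups fast
        x = parent[x]
    return x

def add_edge(a, b):
    """Record an observed link between two fragments, merging their entities."""
    parent[find(a)] = find(b)

# Logged-in laptop browsing links the laptop cookie to the email address.
add_edge("cookie:laptop", "email:me@example.com")

# Before the email click, the phone cookie is a separate entity.
phone_was_separate = find("cookie:phone") != find("cookie:laptop")

# The email click opens in the phone browser, creating the missing edge.
add_edge("cookie:phone", "email:me@example.com")

# All three fragments now resolve to one entity.
all_merged = (
    find("cookie:phone") == find("cookie:laptop") == find("email:me@example.com")
)
```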
The Beauty of Edges
Graph practitioners will describe a graph in terms of nodes and edges. So far, we’ve been using node and fragment interchangeably.
We’ve also been calling connections what most graph practitioners would call edges. In the previous example, we used an edge to represent the newly found connection between the web and email fragments. Edges have a couple of really nice characteristics that help us maintain flexibility during identity resolution:
- Flexible creation and removal: Since we can add a connection between fragments without changing the underlying fragment data, we can also take it away without losing any fragment data. That way, if we observe evidence that two fragments shouldn’t be connected — like a spurious fragment that’s connected to way too many fragments, generated on a publicly accessed computer — we can remove its edges and maintain entity integrity.
- Source-independent edge creation: Edges can be added when we explicitly observe a link between fragments, but they can also be added by machine learning models that probabilistically determine that two fragments should be linked. The probability threshold for declaring edges can be set by humans, and custom models can predict relationships in a configurable way. Machine learning processes can create edges either in real time or in a supervised batch process, and the end result is the same.
- No constraint on edge counts: Often in a relational data model (think SQL), representing connections with many-to-many joins can be cumbersome, to put it lightly. In a graph model, nothing in the framework dictates how many relationships a given fragment can share, and the assumption that connections will exist is fully baked into the model.
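The first of those properties — removing edges without losing fragment data — falls out naturally when edges are stored separately from the fragments themselves. Here’s a small sketch of that idea; the fragment names and data are illustrative only:

```python
# Edges live apart from fragment data, so pruning a suspect connection
# never destroys the underlying fragments. All names/values are invented.
fragments = {
    "cookie:kiosk": {"page_views": 9001},       # spurious public-computer cookie
    "email:ana@example.com": {"open_count": 12},
    "email:ben@example.com": {"open_count": 3},
}
edges = {
    ("cookie:kiosk", "email:ana@example.com"),
    ("cookie:kiosk", "email:ben@example.com"),  # one cookie tied to too many people
}

def remove_edges_touching(node, edge_set):
    """Drop every edge incident to a suspect fragment; data stays intact."""
    return {e for e in edge_set if node not in e}

edges = remove_edges_touching("cookie:kiosk", edges)
```

After pruning, the kiosk fragment still holds its web activity; it simply no longer pulls Ana’s and Ben’s entities together.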
Here’s an example of what an actual, in-the-wild entity looks like in Lytics’ graph database. In all likelihood, you’re not going to see such a comprehensive, 360° view of your customers without a graph database powered by an underlying, flexible fragment model.