Building Customer Identity Resolution: notes on an AI-first Customer Data Platform

Patrick Donahue · Levelbrook Consulting

The modern Customer Data Platform (CDP) promises a single, unified view of the customer. It's a compelling pitch: ingest every touchpoint—website visits, mobile app usage, support tickets, purchases—and distill it into a coherent "golden record" for each person. The engineering challenge, however, isn't just about volume or velocity; it's about ambiguity. The core problem is identity resolution: knowing that the user who browsed your site on their laptop last Tuesday is the same person who just bought a product on their phone using a different email address.

A traditional, rule-based CDP struggles here. It relies on deterministic joins such as `user_a.email === user_b.email`, which fail the moment the same person shows up under a second email address. An "AI-first" approach reframes the problem: it treats identity not as a set of exact matches but as a probabilistic inference over a graph of signals. The task becomes a messy, statistical one of inferring stable entities from a chaotic stream of partial, often conflicting identifiers.

An Architectural Sketch

Let's consider how we might build a system to tackle this, using a stack that favors data transformation and scalability. A plausible core would involve Kafka for ingestion, Spark (with Clojure) for batch graph processing, Presto for ad-hoc querying, and Clojure/ClojureScript for the real-time feedback loop and human-in-the-loop tooling.

Data Model: The Identity Graph

The central data structure isn't a table of users; it's a graph.

Nodes are either Identifiers (e.g., `email:p.donahue@example.com`, `device:abc-123`, `phone:+15551234567`) or Profiles (the canonical "person" entity, e.g., `profile:xyz-789`).
Edges connect Identifiers to other Identifiers when they are observed together in an event (e.g., a form submission links an email to a device ID). They also connect Identifiers to the Profile they've been resolved to.

Each edge carries a weight, a confidence score that the two nodes it connects belong to the same person. This score is the output of our model: it might be derived from heuristics (an email and phone number submitted on the same form get a high score) or from a more sophisticated machine learning model that considers factors like time proximity, IP address similarity, and behavioral patterns.
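One simple way to turn several observed signals into a single edge weight is a noisy-OR combination, which treats each signal as independent evidence. The sketch below is only an illustration: the signal names `:same_login` and `:shared_ip`, the per-signal confidences, and the independence assumption are all hypothetical, not something a real model would take for granted.

```clojure
;; Hypothetical per-signal confidences; real values would be tuned or learned.
(def signal-confidence
  {:form_submission 0.95
   :same_login      0.90
   :shared_ip       0.10})

(defn combine-evidence
  "Noisy-OR: the probability that at least one of the observed signals
  reflects a true link, assuming the signals are independent.
  Unknown signals contribute nothing."
  [signals]
  (- 1.0
     (reduce * 1.0
             (map #(- 1.0 (get signal-confidence % 0.0)) signals))))
```

Under these assumptions, repeated weak evidence accumulates slowly (two `:shared_ip` observations give roughly 0.19), while a single strong signal dominates.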


```clojure
;; A simplified Clojure data representation
{:nodes [{:id "email:p.donahue@example.com" :type :identifier}
         {:id "device:abc-123" :type :identifier}
         {:id "profile:xyz-789" :type :profile}]
 :edges [{:source "email:p.donahue@example.com"
          :target "device:abc-123"
          :weight 0.95
          :reason :form_submission}
         {:source "email:p.donahue@example.com"
          :target "profile:xyz-789"
          :weight 0.99
          :reason :graph_resolution_v2}]}
```
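To query a structure like this, it usually helps to index the edge list by node. The following is a minimal sketch, assuming edges are treated as undirected; the helper names `adjacency` and `neighbors` are made up for illustration.

```clojure
;; The edges of the example graph above.
(def graph
  {:edges [{:source "email:p.donahue@example.com"
            :target "device:abc-123"
            :weight 0.95
            :reason :form_submission}
           {:source "email:p.donahue@example.com"
            :target "profile:xyz-789"
            :weight 0.99
            :reason :graph_resolution_v2}]})

(defn adjacency
  "Builds an undirected index: node id -> {neighbor id -> edge weight}."
  [{:keys [edges]}]
  (reduce (fn [adj {:keys [source target weight]}]
            (-> adj
                (assoc-in [source target] weight)
                (assoc-in [target source] weight)))
          {}
          edges))

(defn neighbors
  "Ids directly linked to `id` by an edge of weight >= `min-weight`."
  [adj id min-weight]
  (->> (get adj id)
       (filter (fn [[_ w]] (>= w min-weight)))
       (map first)
       set))
```

For example, `(neighbors (adjacency graph) "email:p.donahue@example.com" 0.97)` keeps only the profile edge, because the device edge at 0.95 falls below the cutoff.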

Processing Pipeline

1. Ingestion (Kafka): All raw events (`page_view`, `login`, `purchase`) are published to Kafka topics as immutable, replayable facts. Enforcing a schema with something like Avro keeps producers and consumers in agreement as event formats change.

2. Batch Resolution (Spark + Clojure): This is the heavy lifting. A periodic Spark job (daily, hourly) reads from the event log, builds the complete graph of observed identifier co-occurrences, and runs a graph clustering algorithm (like Connected Components or a custom variant) to group identifiers into profiles. Clojure's functional, data-oriented nature is a superb fit for expressing these transformations. Its JVM interoperability means we can use it directly with Spark's APIs, writing clear data pipelines that transform RDDs of events into a graph structure and finally into a set of resolved profiles. The output is a "golden record" table written to a data warehouse, queryable via Presto.

3. Real-time & UX (Clojure/ClojureScript): The batch job provides correctness over large datasets, but we need low-latency updates. When a new event arrives in Kafka, a separate stream processor (built in Clojure, perhaps using Kafka Streams) can perform a *provisional* update. It can look up the involved identifiers in a key-value store (like Redis or RocksDB) and make a quick, localized decision. This is critical for the UX. If a user logs in, we need to immediately associate their anonymous session with their known profile. The UI, built with ClojureScript and a data-centric framework like Re-frame, can subscribe to these real-time updates via WebSockets or SSE, reflecting the most current state of identity without waiting for the next batch run.
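The batch resolution in step 2 can be sketched in plain, in-memory Clojure. This is not Spark code: it only shows the logic (extract co-occurring identifier pairs from events, then group them with a union-find version of Connected Components) that a distributed job would have to reproduce. The event shape `{:type ... :identifiers [...]}` and all function names are assumptions made for the example.

```clojure
(defn co-occurrence-edges
  "All unordered pairs of distinct identifiers observed in one event."
  [{:keys [identifiers]}]
  (let [ids (vec (sort (distinct identifiers)))]
    (for [i (range (count ids))
          j (range (inc i) (count ids))]
      [(ids i) (ids j)])))

(defn find-root
  "Follows parent pointers until reaching a node that is its own root.
  Nodes missing from `parents` are treated as their own root."
  [parents id]
  (let [p (get parents id id)]
    (if (= p id)
      id
      (recur parents p))))

(defn union
  "Merges the clusters containing a and b by pointing one root at the other."
  [parents [a b]]
  (let [ra (find-root parents a)
        rb (find-root parents b)]
    (if (= ra rb)
      parents
      (assoc parents ra rb))))

(defn resolve-profiles
  "Returns a set of identifier sets, one per connected component."
  [events]
  (let [ids     (distinct (mapcat :identifiers events))
        pairs   (mapcat co-occurrence-edges events)
        parents (reduce union {} pairs)]
    (->> ids
         (group-by #(find-root parents %))
         vals
         (map set)
         set)))
```

Note that this sketch ignores edge weights entirely, so every co-occurrence merges transitively. A weighted variant would presumably drop edges below some confidence threshold before the union step.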

Pragmatic Tradeoffs and Where Things Break

This architecture is powerful, but complex systems have complex failure modes.

Scale and the "Hairball" Problem: The biggest challenge at scale is the emergence of massive, densely connected components in the graph. The classic example is a university computer lab IP address or a corporate NAT gateway linking thousands of unrelated student or employee profiles into a single, monstrous "hairball." This can cause catastrophic over-merging. The resolution process must include heuristics to identify and down-weight these "promiscuous" identifiers. Graph processing itself also becomes computationally expensive, potentially requiring specialized engines or careful partitioning strategies.
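One possible heuristic for promiscuous identifiers is to penalize edges by the degree of their busiest endpoint. The sketch below assumes shared identifiers such as an IP address appear as nodes (the `ip:` prefix is hypothetical), and the `1/degree` penalty and the `max-degree` cutoff are illustrative choices rather than a recommended formula.

```clojure
(defn degrees
  "Number of distinct neighbors per node, treating edges as undirected."
  [edges]
  (->> edges
       (mapcat (fn [{:keys [source target]}]
                 [[source target] [target source]]))
       distinct
       (map first)
       frequencies))

(defn down-weight-promiscuous
  "Divides an edge's weight by the larger degree of its two endpoints
  whenever that degree exceeds `max-degree`; other edges are unchanged."
  [edges max-degree]
  (let [deg (degrees edges)]
    (map (fn [{:keys [source target] :as edge}]
           (let [d (max (deg source) (deg target))]
             (if (> d max-degree)
               (update edge :weight / d)
               edge)))
         edges)))
```

With a cutoff of 3, an IP node linked to four devices has each of its edges divided by 4, while an ordinary email-to-device edge keeps its original weight.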

The Human-in-the-Loop Imperative: No automated system will be 100% accurate. The cost of a false positive (incorrectly merging two distinct people) is often far higher than a false negative (failing to merge two profiles of the same person). This mandates a human-in-the-loop workflow. The system must surface low-confidence merges to a data steward via an internal tool. This tool needs to visualize the evidence for a potential merge and allow a human to confirm or deny it. The steward's decision is then fed back into the system as ground truth: a confirmed merge becomes a high-weight edge, and a denied one becomes a constraint that keeps the two profiles apart in future runs. This feedback loop is not a "nice-to-have"; it is a core component of a correct system.
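The review workflow boils down to two small decisions: which candidate merges go to a human, and how the verdict is written back. Here is a minimal sketch, assuming two hypothetical thresholds and a hypothetical `:steward_review` reason. Representing a denial as a zero-weight edge is also an assumption; a real system might store it as a separate cannot-link constraint.

```clojure
;; Hypothetical cutoffs. They are asymmetric because a false merge
;; costs more than a missed one.
(def thresholds {:auto-merge 0.98 :review 0.70})

(defn route-merge
  "Decides what to do with a candidate merge edge based on its weight."
  [{:keys [weight]}]
  (cond
    (>= weight (:auto-merge thresholds)) :auto-merge
    (>= weight (:review thresholds))     :needs-review
    :else                                :keep-separate))

(defn apply-steward-decision
  "Rewrites an edge as ground truth: weight 1.0 when the steward confirms,
  0.0 when the steward denies. Any other decision value throws."
  [edge decision]
  (assoc edge
         :weight (case decision
                   :confirm 1.0
                   :deny    0.0)
         :reason :steward_review))
```

For instance, `(route-merge {:weight 0.8})` returns `:needs-review`, and a denied edge comes back with weight 0.0 so that later runs do not propose the same merge again.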

Correctness vs. Latency: The dual batch/real-time path (Lambda architecture) is a classic tradeoff. The real-time path is fast but can make mistakes based on incomplete information. The batch path is slow but more accurate. The system must be designed to handle this eventual consistency, ensuring that the batch process can correct any errors made by the streaming component. The UI must also be clear about the confidence of its assertions, perhaps flagging profiles that have been recently updated but not yet confirmed by the batch process.
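The interplay between the two paths can be sketched with an in-memory map standing in for the key-value store. Everything here is illustrative: the store layout (`identifier -> {:profile ... :status ...}`), the `:provisional` and `:confirmed` statuses, and the function names are assumptions, and a real streaming job would read and write Redis or RocksDB rather than a Clojure map.

```clojure
(defn provisional-update
  "Streaming path: if any identifier in the event already maps to a profile,
  attach the event's unknown identifiers to that profile as :provisional.
  Identifiers already in the store are left untouched."
  [store {:keys [identifiers]}]
  (if-let [profile (some #(get-in store [% :profile]) identifiers)]
    (reduce (fn [s id]
              (if (contains? s id)
                s
                (assoc s id {:profile profile :status :provisional})))
            store
            identifiers)
    store))

(defn reconcile
  "Batch path: every assignment produced by the batch job overwrites the
  stored entry and is marked :confirmed, correcting provisional guesses."
  [store batch-assignments]
  (reduce-kv (fn [s id profile]
               (assoc s id {:profile profile :status :confirmed}))
             store
             batch-assignments))
```

The `:status` field is what would let a UI flag profiles that were recently updated but not yet confirmed by a batch run.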

Closing Reflection

Building an identity resolution system is less about wrangling big data and more about modeling a fundamentally fuzzy concept—human identity—with deterministic code. The engineering isn't just in the choice of scalable tools, but in the careful design of the data model, the statistical heuristics, and the critical escape hatches that allow for human judgment. The most robust systems are not those that claim perfect accuracy, but those that acknowledge the inherent uncertainty of the task and provide the tools to manage it.