Levelbrook Labs

Building a Customer Journey Analytics Dashboard: Notes on Product Analytics and Innovation


The Domain Problem: Beyond the Page View

Product analytics has evolved far beyond simple page view counters. The interesting engineering problem isn't just counting events; it's reconstructing user intent from a sparse, often chaotic stream of those events. A "customer journey" is an abstraction we impose on this data. A user doesn't think in terms of funnels; they click, type, hesitate, and navigate. Our job is to build a system that can reliably translate this firehose of low-level interactions (`button_clicked`, `form_submitted`, `page_viewed`) into high-level narratives: "user struggled with checkout," "user compared three products before adding one to cart," or "user dropped off during onboarding."

This is technically compelling because it's a stateful problem at its core. To understand one event, you need the context of the events that preceded it. This immediately complicates architectures designed for stateless, horizontally scalable services. It requires robust sessionization, handling of out-of-order and late-arriving data, and a data model that can answer complex sequence-based queries efficiently. At scale, this becomes a significant distributed systems challenge where throughput, latency, and correctness are in constant tension.
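To make the statefulness concrete, here is a minimal batch sketch of gap-based sessionization in Python. The `Event` class, the `sessionize` helper, and the 30-minute inactivity gap are illustrative assumptions, not part of any particular pipeline. Sorting on the event's own timestamp, rather than on arrival order, is what lets out-of-order events land in the right session.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby
from operator import attrgetter


@dataclass(frozen=True)
class Event:
    # Hypothetical minimal event shape, for illustration only.
    user_id: str
    name: str
    ts: datetime  # event time assigned at the source, not arrival time


def sessionize(events, gap=timedelta(minutes=30)):
    """Group events into per-user sessions, splitting on inactivity gaps.

    The 30-minute default is an assumed convention, not a requirement.
    """
    sessions = []
    # Sort by (user, event time) so events that arrived out of order
    # are placed where they actually happened.
    ordered = sorted(events, key=attrgetter("user_id", "ts"))
    for _, user_events in groupby(ordered, key=attrgetter("user_id")):
        current = []
        for ev in user_events:
            # A pause longer than `gap` closes the current session.
            if current and ev.ts - current[-1].ts > gap:
                sessions.append(current)
                current = []
            current.append(ev)
        if current:
            sessions.append(current)
    return sessions
```

This version assumes the whole batch is available before it runs. A streaming processor cannot sort everything up front; it would have to hold per-user state and decide how long to wait for late events before closing a session, which is where the tension between latency and correctness shows up.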

An Architectural Sketch

To tackle this, we need a pipeline that separates concerns: ingestion, processing, storage for different query patterns, and visualization. Here’s a pragmatic stack choice for a high-throughput system, orchestrated on Google Cloud Platform (GCP) with Kubernetes.

Logical Flow: Event Emitters (Web/Mobile) → Kafka → Real-time Processor → MongoDB (Raw Events/Profiles) & ClickHouse (Aggregates) → Grafana / Custom UI

Data Models, Edge Cases, and Scaling Pains

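The core analytical table in ClickHouse might look something like: (user_id, session_id, event_timestamp, event_name, properties, ...), partitioned by day and ordered by user and timestamp. The ordering matters for performance: a MergeTree table stores rows sorted by its ordering key within each data part, so one user's events sit next to each other in time order, and a journey query for that user reads a narrow range instead of scanning the whole day's partition.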
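As a rough sketch of what that table could look like, the snippet below holds a ClickHouse DDL statement and a matching journey query as Python strings. The table name `events`, the column types, and storing `properties` as a JSON-encoded string are all assumptions made for illustration; the column list covers only the fields named above. No client library is used here, so the strings would be passed to whichever ClickHouse client you already have.

```python
# Hypothetical table name and column types; only the column names,
# the daily partitioning, and the (user, timestamp) ordering come
# from the description above.
EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS events
(
    user_id         String,
    session_id      String,
    event_timestamp DateTime64(3, 'UTC'),
    event_name      LowCardinality(String),
    properties      String  -- assumed: JSON-encoded payload
)
ENGINE = MergeTree
PARTITION BY toDate(event_timestamp)
ORDER BY (user_id, event_timestamp)
"""

# A single user's journey for one day. The {name:Type} placeholders are
# ClickHouse server-side query parameters, filled in by the client.
USER_JOURNEY_SQL = """
SELECT event_timestamp, event_name
FROM events
WHERE user_id = {user_id:String}
  AND toDate(event_timestamp) = {day:Date}
ORDER BY event_timestamp
"""
```

The day filter lets ClickHouse skip every other partition, and the `user_id` filter matches the leading column of the ordering key, so the query touches only a small slice of the table.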

Things inevitably break at scale:

Pragmatic Tradeoffs and the Human in the Loop

A senior engineer’s role is to make pragmatic decisions. "Real-time" is a spectrum. Does the product team truly need sub-second latency, or is a 5-minute micro-batch that's 10x cheaper and 100x more reliable the better choice? The answer is almost always the latter.
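For a sense of how little machinery the micro-batch option needs, here is a small sketch that floors each event timestamp to a 5-minute window and counts events per window. The helper names `window_start` and `count_by_window` are hypothetical, and the input is assumed to be `(timestamp, event_name)` pairs with timezone-aware timestamps.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)  # the micro-batch size discussed above
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, window: timedelta = WINDOW) -> datetime:
    """Floor a timezone-aware timestamp to the start of its window."""
    return ts - (ts - EPOCH) % window


def count_by_window(events):
    """Count events per (window start, event name).

    `events` is an iterable of (timestamp, event_name) pairs.
    """
    counts = Counter()
    for ts, name in events:
        counts[(window_start(ts), name)] += 1
    return counts
```

Because each window is a pure function of the event timestamp, a failed batch can simply be recomputed from the raw events, which is a large part of why this style tends to be easier to operate than a long-running streaming job.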

Similarly, choosing managed services on GCP (like Pub/Sub over Kafka, or BigQuery over ClickHouse) trades control and potential cost savings for reduced operational overhead. For a small team, this is often the right call. For a large-scale system where performance tuning is critical, self-hosting on GCE instances might be justified.

Most importantly, no automated system is perfect. Data will be corrupted. A bug in a new app release might send malformed events for hours. The architecture must include a "human-in-the-loop" component. This means building administrative tools for data stewards to inspect raw event streams, manually correct user journeys, flag anomalous sessions, and trigger re-processing of data for a specific time range. Correctness is not just an algorithmic property; it's an operational one. The system must be debuggable and repairable by people.
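One way to make that repairability concrete is to never silently drop a bad event. The sketch below, under assumed field names and a deliberately simple validation rule, routes malformed events to a quarantine list with the reasons attached, and selects raw events in a time range for replay. All three helpers (`validate`, `triage`, `select_for_reprocessing`) are hypothetical names for illustration.

```python
from datetime import datetime

# Assumed minimal schema; a real pipeline would likely check more.
REQUIRED_FIELDS = ("user_id", "event_name", "event_timestamp")


def validate(event: dict) -> list:
    """Return a list of problems; an empty list means the event looks fine."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if not event.get(f)]
    ts = event.get("event_timestamp")
    if ts is not None and not isinstance(ts, datetime):
        problems.append("event_timestamp is not a datetime")
    return problems


def triage(events):
    """Split events into accepted ones and quarantined ones with reasons."""
    accepted, quarantined = [], []
    for ev in events:
        problems = validate(ev)
        if problems:
            # Keep the original payload so a data steward can inspect it.
            quarantined.append({"event": ev, "problems": problems})
        else:
            accepted.append(ev)
    return accepted, quarantined


def select_for_reprocessing(raw_events, start: datetime, end: datetime):
    """Pick raw events with a timestamp in [start, end) for replay."""
    return [
        ev for ev in raw_events
        if isinstance(ev.get("event_timestamp"), datetime)
        and start <= ev["event_timestamp"] < end
    ]
```

The quarantine records are what an administrative tool would display: the untouched payload plus a human-readable reason, so that a steward can decide whether to fix, discard, or replay the affected range once the upstream bug is resolved.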

Closing Reflection

Building a system for customer journey analytics is a fascinating microcosm of modern data engineering. It forces a synthesis of distributed systems principles for scale, meticulous data modeling for performance, and a deep, empathetic understanding of the product domain to ensure the final output is not just a collection of metrics, but a source of genuine insight. The ultimate goal isn't a perfect, hands-off machine, but a powerful tool that augments human intuition, helping us understand the narrative hidden within the noise.