Building an Alternative Data Explorer: Notes on Market Research and Analytics

Patrick Donahue · Levelbrook Consulting

The world of market research is undergoing a fundamental shift. Traditional analysis, reliant on quarterly earnings reports, government statistics, and company disclosures, is slow and backward-looking. The interesting frontier is in "alternative data"—vast, messy, high-frequency datasets that act as proxies for economic activity. This includes everything from anonymized credit card transactions and satellite imagery of parking lots to web scraping data and app usage logs.

From an engineering perspective, this domain is fascinating not because of the novelty of the data sources, but because of the foundational challenge they present: transforming high-volume, low-veracity, and wildly unstructured data into a low-latency, queryable source of truth. The goal is to build a system that allows an analyst to ask a simple question like, "What was the week-over-week growth in consumer spending at coffee shops in California?" and get a reliable answer in under a second. This is a non-trivial systems design problem.

System Architecture Sketch

Let's consider building an insights dashboard focused on aggregated consumer transaction data. The core task is to ingest millions of daily transaction records, clean them, aggregate them along various dimensions, and serve them via a fast, interactive UI. Here’s a pragmatic architecture using a stack of Go, Python, MySQL, Redis, and a message queue.

The data flow would look something like this:

Raw Data (SFTP/S3) → Python ETL Workers → Message Queue → Go Aggregation Service → MySQL (Analytics DB) + Redis (Cache) → Go API → Frontend

The Components and Their Roles

Python for ETL: Python, with its rich ecosystem of data manipulation libraries like Pandas and Dask, is the natural choice for the initial ingestion and transformation (ETL) layer. These workers would fetch raw data files (e.g., daily CSVs of transactions), perform initial cleaning, schema validation, and, crucially, entity resolution (mapping a messy string like "SQ *STARBUCKS #1234 NY" to a canonical `merchant_id`). Once processed, each record is published as a message to the queue.
Message Queue (e.g., RabbitMQ or NATS): This is the system's shock absorber. It decouples the batch-oriented ETL process from the real-time aggregation services. If a data provider delivers a massive file, the queue simply fills up, preventing the downstream services from being overwhelmed. It also provides durability; if a processing service crashes, the messages remain in the queue to be processed later.
Go for Processing and API: Go's concurrency model and performance characteristics make it ideal for the services that require high throughput. A pool of Go workers subscribes to the message queue. Each worker consumes a transaction message, updates the relevant aggregate tables in the database, and invalidates any related caches. A separate Go service provides the REST or GraphQL API for the frontend, querying the database and cache.
MySQL for the Source of Truth: While not a "big data" tool in the traditional sense, a well-tuned relational database like MySQL or PostgreSQL is perfect for storing the structured, aggregated results. The key is not to store the raw transaction data here, but rather pre-computed aggregates.
Redis for Speed: Redis serves as a caching layer for the API. When the frontend requests data for a common query (e.g., "last 30 days of spend for Merchant X"), the API can serve it directly from Redis. It's also useful for intermediate computations or managing distributed state.

The Data Model: Pre-Aggregation is Key

The secret to a fast analytics dashboard is to never, ever run queries against raw event data at request time. The data model must be built around pre-aggregated tables. The raw, anonymized transactions might live in a data lake like S3, but the queryable database should use a star schema.

-- Dimension table for merchants
CREATE TABLE dim_merchants (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    category VARCHAR(100),
    ticker_symbol VARCHAR(10),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Pre-aggregated daily data
CREATE TABLE agg_daily_merchant_spend (
    merchant_id BIGINT,
    metric_date DATE,
    total_spend_cents BIGINT,
    transaction_count INT,
    unique_customers INT,
    PRIMARY KEY (merchant_id, metric_date),
    FOREIGN KEY (merchant_id) REFERENCES dim_merchants(id)
);

When a new transaction message arrives, the Go processing service performs an `INSERT ... ON DUPLICATE KEY UPDATE` on the `agg_daily_merchant_spend` table for the corresponding merchant and date. Queries from the API are then simple `SELECT` statements against these aggregate tables, which are incredibly fast. You can create further aggregates for weekly or monthly rollups to make larger date range queries instantaneous.

Where It Breaks at Scale

Database Write Contention: With millions of daily events, the `agg_daily_merchant_spend` table can become a write hotspot. This can be mitigated by batching updates within the Go service (e.g., accumulating updates in memory for 1 second before flushing to the DB) or, at a larger scale, sharding the database by `merchant_id`.
Entity Resolution Failures: The Python ETL step is the most fragile. A new transaction description string that the mapping logic doesn't recognize will result in data being dropped or miscategorized. This is less a scaling problem and more a correctness problem.
Cache Invalidation: If historical data is re-processed or corrected, how do you correctly invalidate every cached query result in Redis that depends on that data? This is a famously hard problem. A pragmatic solution is often time-based (e.g., TTLs of 1 hour) combined with a manual flush mechanism for known corrections.

Pragmatic Tradeoffs and the Human in the Loop

Building a system like this is an exercise in managing tradeoffs. One of the most critical is recognizing that full automation is a myth. The system will never be perfect at entity resolution.

The most robust solution involves a human-in-the-loop workflow. When the Python ETL worker encounters a merchant string it cannot map with high confidence, it shouldn't discard it. Instead, it should flag it and insert it into a separate `unmapped_transactions` table. An internal admin tool can then be built for a data operations team to review these unmapped strings, assign them to the correct canonical merchant, and perhaps add a new rule to the mapping logic.

This feedback loop is vital. The system's accuracy is not a static property but one that improves over time through a partnership between the automated pipeline and human oversight. It's a pragmatic acknowledgment that for noisy, real-world data, the engineering task is to build tools that augment human intelligence, not replace it.

Another tradeoff is latency vs. consistency. For a dashboard, it's often acceptable to serve slightly stale data from a Redis cache to provide a sub-100ms user experience. The API can return a header indicating the age of the data, and the UI can offer a "refresh" button that bypasses the cache to fetch the absolute latest numbers from MySQL. This gives the user control while defaulting to a fast experience.

Closing Reflection

The engineering challenge of building an alternative data platform is ultimately about creating a stable, reliable abstraction over a chaotic reality. The raw data is a torrent of messy, incomplete signals. The system's purpose is to distill this chaos into a structured, queryable model of the world that is clean enough to be actionable, yet flexible enough to evolve as the underlying data sources inevitably change.

The most elegant code or the most scalable architecture is secondary to this core purpose. The interesting work lies at the intersection of robust system design and a deep understanding of the data's inherent imperfections. It's about building a system that not only processes data but also learns from its own mistakes with a little help from its human operators.