Levelbrook Labs

Building an ERP Data Connector Demo: Notes on Market Research and Alternative Data Analytics

Try the interactive demo

The world of market research is increasingly turning to "alternative data" to gain an edge. While satellite imagery and credit card transactions get much of the attention, a vast, largely untapped source of signal lies within the operational guts of private companies: their ERP and e-commerce systems. Purchase orders, invoices, inventory levels, and supply chain logistics data offer a granular, real-time view of economic activity. The technical challenge, and the reason this is an interesting engineering problem, is that this data is notoriously difficult to access, normalize, and analyze at scale.

Unlike public APIs, there is no "Stripe" for ERP data. Every connection is a bespoke integration into a heterogeneous landscape of legacy systems, modern SaaS platforms, and everything in between. The data is messy, inconsistent, and siloed. The value isn't in a single invoice but in the aggregate trends across thousands of companies. This essay documents my notes on architecting a system to tackle this problem, born from building a proof-of-concept to explore the domain.

Architecting the Pipeline

A system like this is fundamentally a distributed data processing pipeline. The goal is to create a reliable, scalable, and observable flow from raw, third-party data to structured, queryable insights. Here's a breakdown of a potential architecture using a common set of battle-tested components.

Connectors & Ingestion (Go / Python)

The entry point is a fleet of connectors, each responsible for pulling data from a specific ERP or commerce platform. Go is an excellent choice for the core connector logic due to its concurrency primitives. A single Go binary can manage thousands of concurrent connections with a small memory footprint, making it cost-effective for polling many different endpoints.

Python, with its rich ecosystem of data manipulation libraries (like Pandas) and broad support for various APIs, serves as a powerful tool for building the business logic for each specific integration. You might have a core Go service that handles scheduling and network I/O, which then shells out to or calls a Python script for the actual data transformation.

The primary challenges here are authentication (OAuth2, API keys, custom flows), rate limiting, and handling idiosyncratic API responses. Each connector must be a robust, stateful agent that understands its target system's constraints.

Staging & Decoupling (Message Queue & Redis)

Raw data from connectors should never be written directly to a primary database. The ingestion layer must be decoupled from the processing layer. A message queue (like RabbitMQ or Kafka) is perfect for this. Connectors publish raw data payloads (e.g., a JSON blob of invoices) to a topic. This provides durability—if a downstream processor fails, the data is safe in the queue—and allows for independent scaling of consumers.

Redis plays a supporting role for managing state across the distributed system: tracking API rate limit counters, storing short-lived authentication tokens, or implementing distributed locks to prevent multiple connectors from polling the same account simultaneously.

Normalization & Storage (Python, MySQL, NoSQL)

A pool of Python workers consumes messages from the queue. Their job is to parse the raw, source-specific data and transform it into a canonical model. For example, an "invoice" object should have a consistent internal structure regardless of whether it came from QuickBooks, NetSuite, or a Shopify store.

The most critical decision here is choosing the right storage for different types of data. A hybrid approach is often best.

Analytics & Querying (Spark & Elasticsearch)

Once the data is normalized and stored, the final step is to make it useful. This is where large-scale analytics and fast querying come in.

Pragmatism, Scale, and the Human-in-the-Loop

A senior engineer's perspective is about understanding where systems break and making pragmatic tradeoffs. This architecture is no exception.

Where Things Break at Scale

The most common failure modes aren't clever bugs but mundane operational problems. Schema drift is a constant threat; a third-party API adds, removes, or changes a field, and your normalization code breaks. A malformed message—a "poison pill"—can get stuck in your queue, blocking all subsequent processing. Backpressure is another concern: a sudden influx of data from a large new customer can overwhelm your Python workers, causing the message queue to fill up and latency to spike.

Robust monitoring, dead-letter queues, and circuit breakers are not optional luxuries; they are fundamental requirements for a system like this to operate reliably.

The Correctness Angle: Human-in-the-Loop

No automated normalization process will ever be 100% perfect. There will always be ambiguous data, typos in source systems, or edge cases your code doesn't handle. The difference between a toy system and a production-grade one is acknowledging this and building tools to manage it.

This means a "human-in-the-loop" workflow is critical. You need an internal dashboard where a data operations team can:

This operational component is as important as the backend code. It ensures data quality and builds trust in the final analytical output.

Closing Reflection

Building a platform to aggregate and analyze operational business data is a fascinating problem that sits at the intersection of distributed systems, data engineering, and product design. The raw technical challenge lies in building a pipeline that is resilient to the chaos of external systems. But the real goal is more subtle: to find and amplify a clear, reliable signal from an immense amount of noise. It's about turning millions of individual, messy data points into a coherent, high-level understanding of economic trends as they happen.