Levelbrook Labs

Building an E-commerce Bot Detection System

Modern e-commerce platforms are rapidly evolving beyond simple keyword search. The new frontier is "AI-first discovery," where semantic vector search, large language models for conversational queries, and generative recommendations create a deeply personalized and intuitive user experience. This shift, however, opens a new and expensive attack surface for automated bots. These are my notes on architecting a real-time system to detect and mitigate that traffic.

The Problem Domain: Why Bots Are Uniquely Damaging to AI-First Platforms

In a traditional e-commerce stack, bot traffic is a known problem primarily associated with inventory sniping, price scraping, and denial-of-service attacks. While these remain threats, the move to AI-powered features introduces two more critical and costly dimensions:

The technical challenge is to distinguish malicious bots from legitimate users (and from desirable bots like search engine crawlers) in real time, without introducing user-facing latency or incorrectly blocking a high-value customer.

System Architecture: A Decoupled, Real-Time Approach

A viable system must be fast on the read path (checking whether a request comes from a bot) but can tolerate slight delays on the write path (analyzing behavior to update a bot score). This suggests a decoupled architecture: a fast key-value store for enforcement and a message queue for asynchronous processing.

Here’s a blueprint using a common AWS stack, though the principles are portable:

  1. Ingress & Data Collection: A request first hits an Application Load Balancer or API Gateway. The primary application backend (written in Go, Java, C#, etc., running on ECS) includes a lightweight middleware. On the collection side, its job is to capture request metadata (IP, headers, path, timestamp, user ID if available) and send it to an Amazon SQS queue as a JSON payload. The send is non-blocking and fire-and-forget, so it adds minimal request latency.
  2. Asynchronous Analysis & Scoring: A fleet of AWS Lambda functions (or an auto-scaling ECS service) written in Python consumes messages from the SQS queue. Python is a good fit here due to its rich ecosystem of data analysis and machine learning libraries. These workers perform the heavy lifting:
    • Enrichment: Augment the raw data. Use a service like IPinfo to get geolocation and ASN data. Parse the User-Agent string to identify known bot signatures.
    • Feature Calculation: Aggregate data over time windows. How many requests from this IP in the last minute? The last hour? Is the user navigating like a human (e.g., product -> cart -> checkout) or jumping between thousands of product pages directly?
    • Scoring: Apply a model to these features to generate a "bot score." This can start as a simple rules engine and evolve into a trained ML model.
  3. State Management with DynamoDB: The computed score and features are written to a DynamoDB table. DynamoDB is chosen for its single-digit millisecond latency on reads and writes, which is critical. The data model is key.
    // DynamoDB Item Structure
    {
      "PK": "IP#123.45.67.89",   // Partition Key (can also be USER#{id})
      "SK": "METADATA",           // Sort Key for single-item lookups
      "bot_score": 75,            // Aggregate score (0-100)
      "last_seen_ts": 1677610000,
      "counts_1m": 50,            // Request count in last 1 min
      "counts_1h": 300,
      "reasons": ["HIGH_REQ_RATE", "KNOWN_HOSTING_PROVIDER"],
      "is_blocked": false,
      "ttl": 1677696400         // Expire item to keep table clean
    }
  4. Real-Time Enforcement: On subsequent requests, the same middleware that logs the request first performs a quick, synchronous `GetItem` call to DynamoDB using the request's IP address as the key. This is the critical, latency-sensitive step. Based on the returned `bot_score`, it can:
    • Score < 40: Allow request to proceed.
    • Score 40-70: Serve the request, but perhaps from a slightly stale cache to protect expensive AI backends.
    • Score > 70: Challenge with a CAPTCHA or return a 403 Forbidden.
  5. Logging & Model Training: All raw event payloads sent to SQS are also asynchronously streamed to an S3 bucket via Kinesis Data Firehose. This provides a durable log of all traffic for offline analysis, debugging, and, crucially, for training more sophisticated machine learning models to improve the scoring logic over time. Tools like Amazon Athena can query this data directly in S3.
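
To make steps 2 through 4 more concrete, here is a minimal, in-memory Python sketch of the windowed counts, the stored item, and the score-to-action mapping. It is an illustration under stated assumptions, not the production design: `WindowCounter`, `build_item`, and `enforcement_action` are hypothetical names, the counter keeps timestamps in process memory rather than in DynamoDB, it assumes events for a key arrive in time order, the 24-hour TTL is inferred from the gap between `last_seen_ts` and `ttl` in the sample item, and allowing clients with no stored score is my assumption.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional

# Window lengths in seconds, named after the attributes in the sample item.
WINDOWS = {"counts_1m": 60, "counts_1h": 3600}
# Assumption: 24 h, inferred from the sample item (ttl - last_seen_ts = 86400).
TTL_SECONDS = 86_400


class WindowCounter:
    """In-memory sliding-window request counter keyed by client (e.g. an IP)."""

    def __init__(self) -> None:
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, key: str, ts: float) -> Dict[str, int]:
        """Record one request at time ts and return the count per window."""
        events = self._events[key]
        events.append(ts)
        # Drop timestamps that fall outside the longest window.
        horizon = ts - max(WINDOWS.values())
        while events and events[0] <= horizon:
            events.popleft()
        return {
            name: sum(1 for t in events if t > ts - seconds)
            for name, seconds in WINDOWS.items()
        }


def build_item(ip: str, bot_score: int, counts: Dict[str, int],
               reasons: List[str], now: int, is_blocked: bool = False) -> dict:
    """Build a dict shaped like the DynamoDB item shown in step 3."""
    return {
        "PK": f"IP#{ip}",
        "SK": "METADATA",
        "bot_score": bot_score,
        "last_seen_ts": now,
        **counts,
        "reasons": reasons,
        "is_blocked": is_blocked,   # assumption: set separately from the score
        "ttl": now + TTL_SECONDS,   # lets DynamoDB TTL expire idle clients
    }


def enforcement_action(bot_score: Optional[int]) -> str:
    """Map a bot score (0-100) to the three tiers listed in step 4."""
    if bot_score is None or bot_score < 40:
        return "allow"          # assumption: unknown clients are allowed
    if bot_score <= 70:
        return "serve_cached"   # protect expensive AI backends
    return "challenge"          # CAPTCHA or 403
```

In this sketch a worker would call `record` for each event, pass the counts and a score to `build_item`, and write the result; the middleware would then only need `enforcement_action` on the value it reads back.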

Where It Breaks at Scale

Any distributed system has failure modes. For this architecture, the main concerns are:

Pragmatic Tradeoffs and the Human-in-the-Loop

The most dangerous failure mode is not technical; it's a false positive. Blocking a legitimate customer, especially during checkout, is far more costly than letting a scraper through for another five minutes. This reality dictates a pragmatic, incremental approach.

Start with a Rules Engine, Not ML: A complex gradient-boosted model is tempting, but a simple, transparent rules engine is the right place to start. A rule like `IF (requests_per_minute > 100) AND (user_agent contains 'python') THEN increment_score(30)` is debuggable, predictable, and easy to reason about. You ship this first, collect data on its performance, and then use that labeled data to train a V2 model.
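
A rules engine of this kind can be very small. The following Python sketch is one possible shape, not a reference implementation: the `Rule` class, the `score_request` function, and the feature names are hypothetical. The first rule mirrors the example above (with case-insensitive matching on the user agent, which is my choice); the second rule and its 25 points are an invented illustration based on the `KNOWN_HOSTING_PROVIDER` reason code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    reason: str                        # label stored in the item's "reasons"
    points: int                        # how much the rule adds to the score
    predicate: Callable[[dict], bool]  # returns True when the rule fires


RULES = [
    # Mirrors the example rule from the text.
    Rule(
        "HIGH_REQ_RATE_PYTHON_UA",
        30,
        lambda f: f.get("requests_per_minute", 0) > 100
        and "python" in f.get("user_agent", "").lower(),
    ),
    # Hypothetical second rule; the weight is an assumption.
    Rule(
        "KNOWN_HOSTING_PROVIDER",
        25,
        lambda f: bool(f.get("is_hosting_asn", False)),
    ),
]


def score_request(features: dict, rules: List[Rule] = RULES) -> Tuple[int, List[str]]:
    """Sum the points of every rule that fires, capped at 100."""
    total = 0
    reasons: List[str] = []
    for rule in rules:
        if rule.predicate(features):
            total += rule.points
            reasons.append(rule.reason)
    return min(total, 100), reasons
```

Because each rule carries its own reason code, the returned list explains why a client received its score, which is what keeps this version debuggable before any model is trained.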

Build an Internal Dashboard: The system must not be a black box. An internal tool is non-negotiable. It should allow an analyst to input an IP address or user ID and see:

This "human-in-the-loop" capability is the most important feature. The overrides serve not only as an escape hatch but also as a critical feedback mechanism. Every manual override is a high-quality label that can be used to retrain and improve the automated system.

Closing Reflection

Bot detection in the context of AI-driven platforms is a fascinating intersection of real-time data engineering, infrastructure design, and product thinking. The core tension is between protecting expensive computational resources and preserving a frictionless user experience. The solution isn't a single algorithm but a resilient, observable system that can evolve. It's a continuous cat-and-mouse game where the goal is not perfect prevention, but rapid, intelligent, and cost-effective mitigation.