Building an E-commerce Bot Detection System: Notes on AI-First Search and Discovery

Modern e-commerce platforms are rapidly evolving beyond simple keyword search. The new frontier is "AI-first discovery," where semantic vector search, large language models for conversational queries, and generative recommendations create a deeply personalized and intuitive user experience. This shift, however, opens a new and expensive attack surface: automated bots. This article outlines my notes on architecting a real-time system to detect and mitigate this traffic.


The Problem Domain: Why Bots are Uniquely Damaging to AI-First Platforms

In a traditional e-commerce stack, bot traffic is a known problem primarily associated with inventory sniping, price scraping, and denial-of-service attacks. While these remain threats, the move to AI-powered features introduces two more critical and costly dimensions:

- Inference Cost Exploitation: Every query to a semantic search endpoint or a generative recommender hits a GPU-backed model. These computations are orders of magnitude more expensive than a simple database lookup. A bot running thousands of sophisticated queries ("show me dark wash jeans that would pair well with a vintage leather jacket") can silently generate thousands of dollars in cloud infrastructure costs in a very short time.
- Data Pollution: AI models are only as good as their data. Recommendation engines and personalization models rely on user interaction signals: clicks, add-to-carts, purchases. Bots executing repetitive, non-human browsing patterns can poison this data, leading to skewed analytics, degraded model performance, and a worse experience for real users. Imagine a bot repeatedly clicking on an obscure product, causing it to trend and be recommended to everyone.

The technical challenge is to distinguish malicious bots from legitimate users—and desirable bots like search engine crawlers—in real time, without introducing user-facing latency or incorrectly blocking a high-value customer.

System Architecture: A Decoupled, Real-Time Approach

A viable system must be fast on the read path (checking whether a request comes from a bot) but can tolerate slight delays on the write path (analyzing behavior to update a bot score). This suggests a decoupled architecture: a fast key-value store for enforcement and a message queue for asynchronous processing.

Here’s a blueprint using a common AWS stack, though the principles are portable:

Ingress & Data Collection: A request first hits an Application Load Balancer or API Gateway. The primary application backend (written in Go, Java, C#, etc., running on ECS) has a lightweight middleware. Its only job is to capture request metadata (IP, headers, path, timestamp, user ID if available) and fire it off to an AWS SQS queue as a JSON payload. This is a non-blocking, fire-and-forget operation to minimize impact on request latency.
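As a rough illustration of the payload this middleware might emit, here is a minimal sketch. It is written in Python only to keep the examples in this article in one language; the real middleware would live in the backend's own language. The field names (`ip`, `path`, `user_agent`, `ts`, `user_id`) and the `build_event` helper are assumptions for illustration, not a fixed schema, and only the User-Agent header is kept here for brevity.

```python
import json
import time
from typing import Mapping, Optional


def build_event(ip: str, path: str, headers: Mapping[str, str],
                user_id: Optional[str] = None,
                now: Optional[float] = None) -> str:
    """Serialize request metadata into the JSON body queued for analysis."""
    event = {
        "ip": ip,
        "path": path,
        "user_agent": headers.get("User-Agent", ""),
        # Epoch seconds; `now` is injectable so the function is testable.
        "ts": int(now if now is not None else time.time()),
        "user_id": user_id,  # None for anonymous traffic
    }
    return json.dumps(event, separators=(",", ":"))
```

The resulting string would be handed to the queue client as the message body (with boto3, for example, via `send_message`), ideally from a background task so the request itself never waits on the queue.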
Asynchronous Analysis & Scoring: A fleet of AWS Lambda functions (or an auto-scaling ECS service) written in Python consumes messages from the SQS queue. Python is a good fit here due to its rich ecosystem of data analysis and machine learning libraries. These workers perform the heavy lifting:
- Enrichment: Augment the raw data. Use a service like IPinfo to get geolocation and ASN data. Parse the User-Agent string to identify known bot signatures.
- Feature Calculation: Aggregate data over time windows. How many requests from this IP in the last minute? The last hour? Is the user navigating like a human (e.g., product -> cart -> checkout) or jumping between thousands of product pages directly?
- Scoring: Apply a model to these features to generate a "bot score." This can start as a simple rules engine and evolve into a trained ML model.
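The feature-calculation step can be sketched as a sliding-window counter. This is an in-memory illustration of the logic only: the `SlidingWindowCounter` class is a hypothetical name, it assumes timestamps arrive in order for each key, and in a real Lambda fleet the state would have to live in a shared store, since memory is not shared across workers.

```python
from collections import deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per key over the trailing `window_s` seconds (in memory)."""

    def __init__(self, window_s: int) -> None:
        self.window_s = window_s
        self._events: Dict[str, Deque[float]] = {}

    def add(self, key: str, ts: float) -> int:
        """Record one event at time `ts` and return the count inside the window."""
        q = self._events.setdefault(key, deque())
        q.append(ts)
        # Evict timestamps outside the half-open window (ts - window_s, ts].
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q)
```

For example, with `per_minute = SlidingWindowCounter(60)`, three requests from the same IP at t=0, t=30, and t=61 yield counts of 1, 2, and 2, because the first request has aged out by the time the third arrives.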

State Management with DynamoDB: The computed score and features are written to a DynamoDB table. DynamoDB is chosen for its single-digit millisecond latency on reads and writes, which is critical. The data model is key.

// DynamoDB Item Structure
{
  "PK": "IP#123.45.67.89",   // Partition Key (can also be USER#{id})
  "SK": "METADATA",           // Sort Key for single-item lookups
  "bot_score": 75,            // Aggregate score (0-100)
  "last_seen_ts": 1677610000,
  "counts_1m": 50,            // Request count in last 1 min
  "counts_1h": 300,
  "reasons": ["HIGH_REQ_RATE", "KNOWN_HOSTING_PROVIDER"],
  "is_blocked": false,
  "ttl": 1677696400         // Expire item to keep table clean
}

Real-Time Enforcement: On subsequent requests, the same middleware that logs the request first performs a quick, synchronous `GetItem` call to DynamoDB using the request's IP address as the key. This is the critical, latency-sensitive step. Based on the returned `bot_score`, it can:
- Score < 40: Allow request to proceed.
- Score 40-70: Serve the request, but perhaps from a slightly stale cache to protect expensive AI backends.
- Score > 70: Challenge with a CAPTCHA or return a 403 Forbidden.
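
The thresholds above can be expressed as a small pure function that the middleware calls on the item returned by `GetItem`. This is a sketch under a few assumptions that are not in the design itself: an unknown IP (no item) is allowed through, `is_blocked` is treated as a hard block that takes precedence over the score, and the `Action` names are made up for illustration.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"
    SERVE_CACHED = "serve_cached"
    CHALLENGE = "challenge"
    BLOCK = "block"


def decide(item: Optional[dict]) -> Action:
    """Map a stored item (or None when the IP is unknown) to an action."""
    if item is None:
        return Action.ALLOW  # assumption: fail open for unseen IPs
    if item.get("is_blocked"):
        return Action.BLOCK  # assumption: manual/explicit block wins
    score = item.get("bot_score", 0)
    if score < 40:
        return Action.ALLOW
    if score <= 70:
        return Action.SERVE_CACHED
    return Action.CHALLENGE  # or a 403, depending on policy
```

Keeping the decision separate from the lookup makes the policy easy to unit test and lets the middleware fall back to `Action.ALLOW` if the lookup itself times out, which is one way to avoid blocking real customers when the store is slow.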
Logging & Model Training: All raw event payloads sent to SQS are also asynchronously streamed to an S3 bucket via Kinesis Firehose. This provides a durable log of all traffic for offline analysis, debugging, and, crucially, for training more sophisticated machine learning models to improve the scoring logic over time. Tools like AWS Athena can query this data directly in S3.

Where It Breaks at Scale

Any distributed system has failure modes. For this architecture, the main concerns are:

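DynamoDB Hot Partitions: A single, high-volume bot from one IP could hammer a single partition key in DynamoDB. While DynamoDB has adaptive capacity, a poorly designed key schema can still cause throttling. Adding a time suffix to the partition key, like `IP#{ip_address}:YYYY-MM-DD-HH`, rotates the key every hour but does not spread load within that hour, and it complicates the real-time lookup. Spreading writes within a window would require a random or hashed shard suffix, which makes the lookup harder still. A better approach is often to rely on caching (e.g., DynamoDB Accelerator - DAX) for extremely hot keys, keeping in mind that a cache relieves read pressure, not write throttling.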
Stateful Aggregation: Calculating features like "requests in the last hour" requires state. Doing this accurately in a stateless Lambda function can be tricky. This often involves read-modify-write operations on DynamoDB, which can be slow or lead to race conditions. Using DynamoDB's atomic counters is a good pattern here. For more complex aggregations, a dedicated stream processing tool like Apache Flink or ksqlDB might be warranted, but that adds significant operational complexity.
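A minimal sketch of the atomic-counter pattern follows. It only builds the keyword arguments for a boto3 `Table.update_item` call, so it runs without AWS access. The per-minute bucket item (`SK` of the form `COUNT#1m#<bucket start>`), the `req_count` attribute, and the `counter_update` helper are assumptions for illustration and differ from the single `METADATA` item shown earlier.

```python
from typing import Any, Dict


def counter_update(ip: str, ts: int, ttl_s: int = 3600) -> Dict[str, Any]:
    """Build kwargs for an update_item call that atomically increments
    a per-minute request counter for `ip`."""
    minute = ts - ts % 60  # start of the one-minute bucket, epoch seconds
    return {
        "Key": {"PK": f"IP#{ip}", "SK": f"COUNT#1m#{minute}"},
        # ADD treats a missing numeric attribute as 0 and increments it
        # server-side, so no read-modify-write round trip is needed.
        "UpdateExpression": "ADD req_count :one SET #ttl = :expires",
        # `ttl` is aliased to stay clear of DynamoDB's reserved words.
        "ExpressionAttributeNames": {"#ttl": "ttl"},
        "ExpressionAttributeValues": {":one": 1, ":expires": minute + ttl_s},
        "ReturnValues": "UPDATED_NEW",
    }
```

A worker would call `table.update_item(**counter_update(ip, ts))` and read the new count from the response. One caveat: the increment is not idempotent, and a standard SQS queue delivers at least once, so a redelivered message would be counted twice unless the worker deduplicates.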
The "Thundering Herd": If the analysis workers go down, messages will pile up in SQS. When the workers come back online, they will process a storm of stale data, potentially mischaracterizing current traffic patterns. SQS retains messages for four days by default, so setting a short message retention period on the queue (the `MessageRetentionPeriod` attribute, minimum 60 seconds) is essential to discard irrelevant old data.
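Queue-level retention can be complemented by a worker-side staleness check, sketched below. It assumes each message body is JSON carrying an epoch-seconds `ts` field; that field name, the `is_stale` helper, and the five-minute default are illustrative choices, not part of the design above.

```python
import json
import time
from typing import Optional


def is_stale(message_body: str, max_age_s: int = 300,
             now: Optional[float] = None) -> bool:
    """Return True when the event's `ts` is more than `max_age_s` seconds old."""
    event = json.loads(message_body)
    current = now if now is not None else time.time()
    return current - event["ts"] > max_age_s
```

A worker that sees `is_stale(body)` return `True` could delete the message without scoring it, which keeps a backlog from skewing the current counters while the queue drains.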

Pragmatic Tradeoffs and the Human-in-the-Loop

The most dangerous failure mode is not technical; it's a false positive. Blocking a legitimate customer, especially during checkout, is far more costly than letting a scraper through for another five minutes. This reality dictates a pragmatic, incremental approach.

Start with a Rules Engine, Not ML: A complex gradient-boosted model is tempting, but a simple, transparent rules engine is the right place to start. A rule like `IF (requests_per_minute > 100) AND (user_agent contains 'python') THEN increment_score(30)` is debuggable, predictable, and easy to reason about. You ship this first, collect data on its performance, and then use that labeled data to train a V2 model.
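
A rules engine of this kind can be very small. The sketch below encodes the rule quoted above plus one more; the `Rule` structure, the feature names (`requests_per_minute`, `user_agent`, `is_hosting_asn`), the reason string for the first rule, and the 20-point weight on the second rule are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Features = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    reason: str                          # label stored alongside the score
    points: int                          # score increment when the rule fires
    matches: Callable[[Features], bool]  # predicate over the feature dict


RULES: List[Rule] = [
    Rule("HIGH_REQ_RATE_PYTHON_UA", 30,
         lambda f: f["requests_per_minute"] > 100
         and "python" in str(f["user_agent"]).lower()),
    Rule("KNOWN_HOSTING_PROVIDER", 20,  # hypothetical weight
         lambda f: bool(f.get("is_hosting_asn"))),
]


def score(features: Features, rules: List[Rule] = RULES) -> Tuple[int, List[str]]:
    """Sum the points of every matching rule, capped at 100, with reasons."""
    fired = [r for r in rules if r.matches(features)]
    total = min(100, sum(r.points for r in fired))
    return total, [r.reason for r in fired]
```

Because each rule carries its own reason string, the score always comes with an explanation, which is what makes this approach debuggable and gives a later ML model readable labels to learn from.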

Build an Internal Dashboard: The system must not be a black box. An internal tool is non-negotiable. It should let an analyst enter an IP address or user ID and:

- See the current bot score and the reasons behind it.
- View a timeline of recent requests.
- Manually override the score, either to immediately unblock a customer who called support or to permanently block an obvious attacker.

This "human-in-the-loop" capability is the most important feature. The overrides serve not only as an escape hatch but also as a critical feedback mechanism. Every manual override is a high-quality label that can be used to retrain and improve the automated system.

Closing Reflection

Bot detection in the context of AI-driven platforms is a fascinating intersection of real-time data engineering, infrastructure design, and product thinking. The core tension is between protecting expensive computational resources and preserving a frictionless user experience. The solution isn't a single algorithm but a resilient, observable system that can evolve. It's a continuous cat-and-mouse game where the goal is not perfect prevention, but rapid, intelligent, and cost-effective mitigation.