Levelbrook Labs

Building an AI Financial Analysis Demo: Notes on Artificial Intelligence for Financial Services and Consulting

Patrick Donahue · Levelbrook Consulting

The intersection of unstructured text and structured quantitative data is one of the most compelling, and difficult, domains for applied AI. Financial services and consulting are built on this junction: analysts read thousands of pages of SEC filings, earnings call transcripts, and market news (unstructured text) to inform their valuation models and forecasts (structured data). The core engineering challenge is not merely to automate this, but to build a system that augments the analyst's ability to discover insights, verify claims, and synthesize a narrative—all while maintaining an exceptionally high bar for correctness.

This problem space is technically interesting because it's not a straightforward application of a single model. It requires a multi-stage, hybrid architecture that blends classic data processing, specialized NLP models, and generative LLMs. Building a proof-of-concept is an exercise in systems design, exploring how to chain these disparate components into a cohesive, responsive, and—most importantly—trustworthy user experience. I recently built a small demo to explore these mechanics firsthand.


The Domain Problem: From Documents to Decisions

An analyst needs to understand the health and trajectory of a company. Their raw materials include:

Structured Data: Quarterly balance sheets, income statements (e.g., from XBRL data in SEC filings), and historical market data.
Unstructured Data: The Management's Discussion and Analysis (MD&A) section of a 10-K, transcripts of CEO interviews, competitor press releases, and industry news.

The human process involves reading the text to find context for the numbers. Why did revenue increase? The MD&A might mention a new product line. Why did margins shrink? The earnings call might reveal supply chain issues. The goal is to synthesize these qualitative insights with the quantitative facts into a coherent report.

An AI system must replicate this synthesis. It needs to parse tables of numbers and understand the nuance of executive commentary. This immediately rules out a naive "feed everything to a GPT" approach. The risk of hallucination is too high, and the ability to trace a conclusion back to a specific sentence in a specific document is non-negotiable.

System Architecture: A Pragmatic Hybrid Approach

A production-grade system for this task is inherently a set of cooperating services. Here's a breakdown of one potential architecture, using a polyglot stack of Python, PHP, and JavaScript, deployed on a cloud platform such as AWS or Azure.

1. Data Ingestion & Pre-processing (Python + Cloud Storage)

The first step is a robust ETL pipeline. A Python service running on a schedule (e.g., via AWS Lambda or Azure Functions) would fetch data from sources like the SEC EDGAR API, financial data providers, and news feeds. Documents (10-Ks, transcripts) are stored in an object store like S3 or Azure Blob Storage. Structured financial data is cleaned and inserted into a relational database (e.g., PostgreSQL).

The crucial pre-processing step for text is chunking. A 200-page 10-K is too large for most models' context windows and far too coarse a unit for retrieval. It must be split into logical, semantically aware chunks (e.g., by section or paragraph), each stored alongside metadata linking back to the source document and page number. This metadata is the foundation of verifiability later on.
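A minimal sketch of paragraph-aligned chunking with source metadata follows. It assumes the filing has already been extracted into one string per page, and that section headings look like "Item 7. ..." on their own line; the names `Chunk`, `chunk_pages`, and `max_chars` are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str   # hypothetical identifier for the source filing
    page: int     # 1-based page number the text came from
    section: str  # most recent section heading seen, "" if none
    text: str


def chunk_pages(doc_id: str, pages: list[str], max_chars: int = 1200) -> list[Chunk]:
    """Split page texts into paragraph-aligned chunks that carry source metadata."""
    chunks: list[Chunk] = []
    section = ""
    for page_no, page_text in enumerate(pages, start=1):
        buffer = ""
        for para in (p.strip() for p in page_text.split("\n\n")):
            if not para:
                continue
            # Assumption: a short paragraph starting with "Item " is a section heading.
            if para.startswith("Item ") and len(para) < 120:
                if buffer:
                    chunks.append(Chunk(doc_id, page_no, section, buffer))
                    buffer = ""
                section = para
                continue
            # Flush before the buffer would exceed the size budget.
            # A single paragraph longer than max_chars becomes its own oversized chunk.
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                chunks.append(Chunk(doc_id, page_no, section, buffer))
                buffer = ""
            buffer = f"{buffer}\n\n{para}" if buffer else para
        # Chunks never span pages, so the page number on each chunk is exact.
        if buffer:
            chunks.append(Chunk(doc_id, page_no, section, buffer))
    return chunks
```

Flushing at every page boundary costs a few undersized chunks, but it means a citation can always point at a single page.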

2. The AI Core: A Multi-Stage NLP Pipeline (Python + PyTorch/TensorFlow)

This is a set of containerized Python services, each with a specific task. They communicate via internal APIs or a message queue.

Insight Extraction Service: This service uses specialized, fine-tuned models, not a general LLM. For example, a model like FinBERT (a BERT model pre-trained on financial text) running on PyTorch or TensorFlow can score the sentiment of executive statements, and a similar model fine-tuned for token classification can perform Named Entity Recognition (NER) to tag mentions of products, competitors, or risks. The output isn't a paragraph of text, but structured JSON: {"text": "...", "source": "doc_id:page_5", "insight_type": "risk_factor", "sentiment": "negative"}. These structured insights are stored, perhaps in a NoSQL database or a search index like Elasticsearch.
Vectorization Service: The text chunks from step 1 are passed through an embedding model (e.g., from SentenceTransformers) to create vector representations, which are then stored in a vector database (like Pinecone, Weaviate, or a Postgres extension like pgvector). This enables semantic search.
Synthesis Service (RAG): This is where the LLM comes in. When a user requests a report on "Q4 revenue drivers," this service first queries the structured database for the revenue numbers. It then uses the query to perform a semantic search against the vector database to find the most relevant text chunks from filings and transcripts. This context (quantitative data + relevant text snippets) is packed into a carefully crafted prompt and sent to a powerful generative model (e.g., GPT-4 via Azure OpenAI or Claude via Amazon Bedrock). This is the Retrieval-Augmented Generation (RAG) pattern, which grounds the LLM in specific, factual data. That grounding substantially reduces hallucinations, though it does not eliminate them, which is why source attribution still matters downstream.
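The retrieve-then-prompt flow of the synthesis step can be sketched with the standard library alone. This is an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `retrieve` is a brute-force scan standing in for a vector database query, and the prompt wording and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks most similar to the query (brute force, no index)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, figures: dict[str, float], passages: list[dict]) -> str:
    """Pack structured figures and retrieved passages, each tagged with its source."""
    lines = [
        "Answer using only the figures and passages below.",
        "Cite the bracketed source after every claim.",
        "",
        "Figures:",
    ]
    lines += [f"- {name}: {value}" for name, value in figures.items()]
    lines += ["", "Passages:"]
    lines += [f"[{p['source']}] {p['text']}" for p in passages]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Each chunk here is a dict with "text" and "source" keys, mirroring the "doc_id:page_5" style of source string used above. Carrying the source tag into the prompt is what lets the generated text be traced back to a passage later.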

3. Application & Presentation Layer (PHP/JS + Real-time UX)

The user-facing application orchestrates the process. While my preference is often Rails for its integrated nature, a PHP backend (using a framework like Laravel or Symfony) is perfectly suited to serve as the API gateway.

The PHP backend would handle user authentication, manage analysis requests, and call the various Python services. For long-running report generation, it would initiate the job and immediately return a job ID. The client would then poll, or connect via WebSocket or SSE, for status updates.
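The submit-then-poll contract can be sketched as a small job store. The article places this logic in the PHP backend; the sketch below is in Python only for illustration, and `JobStore`, `JobStatus`, and the in-memory dict (a stand-in for a database table or cache) are all assumptions.

```python
import uuid
from enum import Enum
from typing import Optional


class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class JobStore:
    """In-memory stand-in for a durable jobs table."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}

    def submit(self, request: dict) -> str:
        """Register a job and return its ID immediately, before any work runs."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {"status": JobStatus.QUEUED, "request": request, "result": None}
        return job_id

    def update(self, job_id: str, status: JobStatus, result: Optional[dict] = None) -> None:
        """Called by a worker as the job moves through its stages."""
        job = self._jobs[job_id]
        job["status"] = status
        if result is not None:
            job["result"] = result

    def poll(self, job_id: str) -> dict:
        """What a status endpoint would return to the client."""
        job = self._jobs[job_id]
        return {"job_id": job_id, "status": job["status"].value, "result": job["result"]}
```

The client only ever sees the job ID and the status payload, so the worker side can change (more stages, retries) without touching the API contract.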

The frontend, built with a modern JavaScript framework like React or Vue, is where the system's value is truly expressed. A static report is not enough. The UI must be interactive:

Streaming Output: The synthesized text from the LLM should stream into the browser token-by-token (using Server-Sent Events or WebSockets). This provides immediate feedback and a much better user experience than a minutes-long loading spinner. My own work on libraries like `ai_stream` for Ruby explores exactly this kind of real-time data protocol.
Source Highlighting: Every sentence or claim generated by the AI should be interactive. Hovering over it could reveal a tooltip with the source document and page number. Clicking it could highlight the exact passage in a side-by-side view of the original source PDF. This is non-negotiable for building trust.
Data Visualization: The structured quantitative data should be rendered as interactive charts (using a library like D3.js or Chart.js), tightly integrated with the narrative text.
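For the streaming item above, the wire format is simple enough to sketch. The generator below frames each model token as a Server-Sent Events message; the event names "token" and "done" and the {"text": ...} payload shape are assumptions chosen for illustration, not a fixed protocol.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in an SSE frame and finish with a 'done' event."""
    for i, token in enumerate(tokens):
        # JSON-encode the payload so a newline inside a token cannot break the frame.
        payload = json.dumps({"text": token})
        yield f"id: {i}\nevent: token\ndata: {payload}\n\n"
    # An explicit terminal event lets the client close the connection cleanly.
    yield "event: done\ndata: {}\n\n"
```

On the browser side, an EventSource listener for "token" events would append each payload's text to the report, and the "done" event signals that the section is complete.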

The data model is key. A central AnalysisReport table would link to users, source documents, and the structured JSON output. The JSON itself needs a well-defined schema to ensure the frontend can reliably parse and render the report, including the crucial source attribution metadata for each piece of generated content.
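One possible shape for that report JSON, sketched with dataclasses. The field names (`doc_id`, `page`, `quote`, `claims`) are hypothetical; the point is only that attribution lives on each claim rather than on the report as a whole.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SourceRef:
    doc_id: str  # which document the claim rests on
    page: int    # page number for the side-by-side viewer
    quote: str   # exact passage to highlight


@dataclass
class Claim:
    text: str
    sources: list[SourceRef] = field(default_factory=list)


@dataclass
class ReportSection:
    heading: str
    claims: list[Claim]


def unsourced_claims(sections: list[ReportSection]) -> list[str]:
    """Return the text of every claim that has no source attached."""
    return [c.text for s in sections for c in s.claims if not c.sources]


def to_json(sections: list[ReportSection]) -> str:
    """Serialize the report body for storage or for the frontend."""
    return json.dumps([asdict(s) for s in sections])
```

A check like `unsourced_claims` could run before a report is saved, so that anything the frontend cannot link to a source is flagged for the analyst instead of being rendered as fact.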

Where It Breaks at Scale

This architecture has failure points. The vector search can become a bottleneck if not properly indexed. LLM API calls can be slow and expensive; aggressive caching of common queries and pre-generating reports for high-traffic companies are essential. The biggest challenge is the state management of long-running, multi-stage analysis jobs. Using a robust message queue (like RabbitMQ or AWS SQS) and designing idempotent services is critical to ensure that a failure in one stage doesn't corrupt the entire process.
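Idempotency is easiest to see in code. In the sketch below, a stage derives a deterministic key from its inputs and skips work it has already completed, so a message delivered twice has the same effect as one delivered once. `StageRunner`, `idempotency_key`, and the in-memory result dict (a stand-in for a durable store) are assumptions for illustration.

```python
import hashlib
import json
from typing import Callable


def idempotency_key(job_id: str, stage: str, payload: dict) -> str:
    """Same job, stage, and payload always hash to the same key."""
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{job_id}:{stage}:{body}".encode()).hexdigest()


class StageRunner:
    """Runs a stage handler at most once per distinct input."""

    def __init__(self, handler: Callable[[dict], object]) -> None:
        self._handler = handler
        self._results: dict[str, object] = {}  # stand-in for a durable store
        self.calls = 0  # how many times the handler actually ran

    def handle(self, job_id: str, stage: str, payload: dict) -> object:
        key = idempotency_key(job_id, stage, payload)
        if key in self._results:
            # Redelivered message: return the stored result without redoing work.
            return self._results[key]
        self.calls += 1
        result = self._handler(payload)
        # Recorded only after success, so a failed attempt is retried on redelivery.
        self._results[key] = result
        return result
```

The same key also works as a cache key for expensive LLM calls, which ties the idempotency and caching concerns together.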

Pragmatic Tradeoffs & The Human-in-the-Loop

A senior engineer's job is to make tradeoffs. In this domain, the primary tension is between automation and correctness.

1. Speed vs. Depth: A real-time, on-demand report is a fantastic UX goal. However, a deep analysis involving multiple large documents and fine-tuned models can take minutes. A pragmatic solution is a tiered approach: provide an instant "headline" summary based on cached or pre-computed data, while running the full, deep analysis in the background and streaming in the detailed sections as they become available.

2. Full Automation vs. Analyst Augmentation: The goal should not be to produce a final report that an analyst blindly forwards to a client. The risk of subtle errors or misinterpretations is too high. Instead, the system should be designed as a "first draft" generator. The UI should include an "edit and verify" mode where a human analyst can review the AI's output, correct inaccuracies, and add their own insights. The system becomes a powerful tool that eliminates 80% of the manual drudgery (finding and collating information), freeing up the analyst to focus on the 20% that requires true human expertise (critical thinking, strategic interpretation).

This human-in-the-loop model is the only responsible way to deploy such technology in high-stakes environments. The system should log which parts of the report were AI-generated and which were human-edited, creating an audit trail and a valuable feedback loop for improving the models over time.
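The audit trail could be as simple as an append-only list of revisions per claim. This is a sketch under assumptions: the `Revision` fields, the "ai"/"human" origin labels, and the `AuditLog` class are hypothetical names, and a real system would persist the log rather than hold it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    claim_id: str
    author: str   # model name or analyst identifier
    origin: str   # "ai" or "human"
    text: str
    at: datetime


class AuditLog:
    """Append-only record of who wrote each version of each claim."""

    def __init__(self) -> None:
        self._revisions: list[Revision] = []

    def record(self, claim_id: str, author: str, origin: str, text: str) -> Revision:
        if origin not in ("ai", "human"):
            raise ValueError(f"unknown origin: {origin!r}")
        rev = Revision(claim_id, author, origin, text, datetime.now(timezone.utc))
        self._revisions.append(rev)
        return rev

    def history(self, claim_id: str) -> list[Revision]:
        """All revisions of one claim, oldest first."""
        return [r for r in self._revisions if r.claim_id == claim_id]

    def current(self, claim_id: str) -> Revision:
        """The latest revision of a claim."""
        return self.history(claim_id)[-1]

    def human_edited(self) -> set[str]:
        """IDs of claims whose latest revision came from a human."""
        latest = {r.claim_id: r for r in self._revisions}
        return {cid for cid, r in latest.items() if r.origin == "human"}
```

Pairs of consecutive revisions where an "ai" entry is followed by a "human" one are exactly the corrections that could feed back into model evaluation.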

Closing Reflection

Building a system like this is a microcosm of modern software engineering. It requires a deep understanding of data flow, API design, user experience, and cloud infrastructure, all in service of orchestrating sophisticated AI models. The purely technical challenge of making these components work together is significant. But the more profound challenge is philosophical: designing a system that is transparent, verifiable, and ultimately trustworthy. The most successful AI tools in finance and consulting won't be black boxes that claim to have "the answer." They will be glass boxes that empower human experts to find the answer themselves, faster and with greater confidence than ever before.