Building an AI Skill Assessment Demo: Notes on Education Technology and Artificial Intelligence

The static nature of multiple-choice quizzes and even manually graded essays feels increasingly misaligned with how we actually apply knowledge. They test recall and pattern matching, but rarely probe the deeper, more nuanced reasoning that separates rote memorization from genuine expertise. This gap is a classic problem in education technology: how do you assess skills scalably, accurately, and in a way that reflects real-world problem-solving? It's a technically fascinating domain where UX, data modeling, and the probabilistic nature of LLMs intersect in a high-stakes environment.

This post is a set of engineering notes from a self-initiated proof-of-concept I built to explore this space. The goal was to move beyond simple Q&A bots and design a system capable of conducting a conversational, adaptive assessment of a technical skill and, if successful, issuing a verifiable credential.

System Architecture

The core of the problem requires a system that can manage a stateful, long-running conversation, evaluate responses against a body of knowledge, make a structured judgment, and persist the entire process for audit. A modern, polyglot stack is a natural fit here.

Frontend and Real-Time UX: Next.js, React, TypeScript

The user experience must be conversational and responsive. A blank, blinking cursor is intimidating; a streaming, real-time response from the AI is engaging. This is a non-negotiable UX requirement.

Next.js with the App Router provides a solid foundation. Server Components can handle initial data fetching for the assessment context, while Client Components manage the interactive chat interface.
React and TypeScript are the obvious choices for building a type-safe, component-based UI. Managing the chat history, input state, and streaming response is a classic React use case.
Real-time Streaming: The backend pushes tokens from the LLM to the client as they are generated, using a streaming protocol; Server-Sent Events (SSE) is a good fit. This avoids a multi-second wait for the complete response. Libraries like the Vercel AI SDK formalize this pattern, but the underlying principle is a simple, unidirectional data flow that is robust and easy to implement.
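
To make the streaming idea concrete, here is a minimal sketch of SSE framing on the backend side, written in Python for illustration. The `sse_events` helper, the `{"token": ...}` payload shape, and the `done` event name are assumptions for this example, not part of the demo's actual code. Each token becomes one `data:` frame terminated by a blank line, which is what the SSE wire format requires.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each LLM token in a Server-Sent Events frame.

    A frame is a `data:` line followed by a blank line. The token is
    JSON-encoded so that newlines inside a token cannot break the framing.
    """
    for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop.
    yield "event: done\ndata: {}\n\n"


def fake_llm_stream() -> Iterator[str]:
    """Stand-in for a real streaming LLM client (assumption for the sketch)."""
    yield from ["A ", "LEFT JOIN ", "keeps\nunmatched rows."]


if __name__ == "__main__":
    for frame in sse_events(fake_llm_stream()):
        print(frame, end="")
```

A real endpoint would send these frames with a `Content-Type: text/event-stream` header and flush after each one; the client appends each decoded token to the message being rendered.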

Backend Services: Node.js and Python

A split backend recognizes that different languages excel at different tasks. Node.js is excellent for handling web traffic and I/O, while Python's ecosystem for AI/ML is unmatched.

Node.js (via Next.js API Routes): This layer acts as the Backend-for-Frontend (BFF). It handles user authentication, session management, and orchestrates calls to the Python service. It's the primary interface for the React client.
Python Service (on GCP Cloud Run / AWS Lambda): This is the AI brain. It's a separate, containerized service that exposes a few key endpoints (e.g., `/generate_question`, `/evaluate_answer`). It uses libraries like LangChain to structure interactions with LLMs and LlamaIndex for Retrieval-Augmented Generation (RAG). Decoupling this logic lets the service scale independently and keeps the complex Python dependency tree isolated from the web tier.
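
As a rough illustration of the boundary between the BFF and the Python service, the sketch below validates an `/evaluate_answer` payload before any expensive work happens. It is framework-free on purpose: the `EvaluateAnswerRequest` fields, the `dispatch` helper, and the status codes are assumptions for this example, and a real service would sit behind a web framework rather than a plain dictionary of routes.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class EvaluateAnswerRequest:
    # Hypothetical request shape; the real contract may differ.
    assessment_id: str
    question: str
    answer: str


def parse_request(payload: Dict[str, Any]) -> EvaluateAnswerRequest:
    """Reject payloads with missing or non-string fields before any LLM call."""
    names = [f.name for f in fields(EvaluateAnswerRequest)]
    invalid = [n for n in names if not isinstance(payload.get(n), str)]
    if invalid:
        raise ValueError(f"missing or invalid fields: {invalid}")
    return EvaluateAnswerRequest(**{n: payload[n] for n in names})


def handle_evaluate_answer(payload: Dict[str, Any]) -> Dict[str, Any]:
    request = parse_request(payload)
    # Placeholder: retrieval and the evaluator LLM call would run here.
    return {"assessment_id": request.assessment_id, "status": "accepted"}


ROUTES: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "/evaluate_answer": handle_evaluate_answer,
}


def dispatch(path: str, payload: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    """Map a path to its handler and translate errors into status codes."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404, {"error": f"unknown endpoint: {path}"}
    try:
        return 200, handler(payload)
    except ValueError as exc:
        return 422, {"error": str(exc)}
```

Keeping validation at this edge means the Node.js layer gets a clear error for a malformed request instead of paying for an LLM call that was never going to succeed.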

AI and Data Orchestration: LangChain, LlamaIndex, Claude

This is where the core assessment logic lives. The goal is not just to chat, but to guide a conversation and produce a structured, justifiable evaluation.

LlamaIndex for RAG: To assess a skill like "Advanced SQL," you need a ground truth. I indexed official PostgreSQL documentation, articles on query optimization, and style guides into a vector store. When a user provides an answer, LlamaIndex performs a similarity search to retrieve the most relevant document chunks. These chunks are injected into the evaluator's prompt, grounding the LLM's response in fact rather than just its parametric knowledge.
LangChain for Orchestration: LangChain is the glue. It chains together the different steps: taking user input, retrieving context via LlamaIndex, formatting a detailed prompt for the evaluator LLM, and parsing the LLM's JSON output.
Claude 3 Opus as the Evaluator: I opted for Claude for the core evaluation task due to its strong reasoning capabilities and proficiency with structured data formats like JSON. The system prompt instructs the model to act as an expert evaluator, referencing the provided context, and to return a JSON object with a score, a detailed rationale, and a follow-up question.
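
The retrieve, prompt, and parse steps described above can be sketched without any framework. This is not the LangChain or LlamaIndex API; it is a plain-Python stand-in that shows the shape of the pipeline. The toy vectors, the prompt wording, the 0 to 1 score range, and the field name `follow_up_question` are all assumptions for illustration.

```python
import json
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          index: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, answer: str, chunks: List[str]) -> str:
    """Inject the retrieved chunks into the evaluator prompt."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "You are an expert evaluator. Judge the answer against the context.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer: {answer}\n"
        'Reply with JSON: {"score": 0-1, "rationale": str, "follow_up_question": str}'
    )


def parse_evaluation(raw: str) -> Dict[str, object]:
    """Validate the evaluator's JSON instead of trusting it blindly."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("evaluation must be a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    for key in ("rationale", "follow_up_question"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{key} must be a non-empty string")
    return {
        "score": float(score),
        "rationale": data["rationale"],
        "follow_up_question": data["follow_up_question"],
    }
```

The validation step matters as much as the prompt: a malformed or out-of-range evaluation should be rejected and retried or escalated, never written to the assessment record as if it were a valid score.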

Data Model: PostgreSQL

The data model must be designed for auditability from day one. In a credentialing system, you must be able to reconstruct any assessment perfectly.

-- Simplified Schema
CREATE TABLE assessments (
    id UUID PRIMARY KEY,
    user_id UUID REFERENCES users(id),
    skill_id UUID REFERENCES skills(id),
    status TEXT NOT NULL, -- 'in_progress', 'completed', 'failed'
    final_score FLOAT,
    created_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ
);

CREATE TABLE conversation_turns (
    id UUID PRIMARY KEY,
    assessment_id UUID REFERENCES assessments(id),
    turn_number INT NOT NULL,
    role TEXT NOT NULL, -- 'user', 'ai_question', 'ai_evaluation'
    content TEXT,
    raw_llm_response JSONB, -- Store the full, raw response
    retrieved_context JSONB, -- Store the RAG context used
    created_at TIMESTAMPTZ
);

CREATE TABLE credentials (
    id UUID PRIMARY KEY,
    assessment_id UUID REFERENCES assessments(id),
    user_id UUID REFERENCES users(id),
    skill_id UUID REFERENCES skills(id),
    issued_at TIMESTAMPTZ
);

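The key here is `conversation_turns`. Storing the role, the raw LLM response, and the exact RAG context used for each turn provides a complete log of the assessment. As long as the table is treated as append-only (rows are inserted, never updated or deleted), that log is effectively immutable. This is critical for handling disputes, debugging the AI's reasoning, and enabling human review.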
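
A small sketch of how such a log could be written and replayed. To stay self-contained it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, so the JSONB columns become JSON text. The `record_turn` and `replay` helpers and the uniqueness constraint on `(assessment_id, turn_number)` are assumptions for this example, not part of the schema above.

```python
import json
import sqlite3

# In-memory SQLite stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE conversation_turns (
        assessment_id TEXT NOT NULL,
        turn_number INTEGER NOT NULL,
        role TEXT NOT NULL,
        content TEXT,
        raw_llm_response TEXT,   -- JSONB in the PostgreSQL schema
        retrieved_context TEXT,  -- JSONB in the PostgreSQL schema
        UNIQUE (assessment_id, turn_number)  -- assumption: one row per turn
    )
    """
)


def record_turn(assessment_id, turn_number, role, content, raw=None, context=None):
    """Append one turn; rows are only ever inserted, never updated."""
    conn.execute(
        "INSERT INTO conversation_turns VALUES (?, ?, ?, ?, ?, ?)",
        (
            assessment_id,
            turn_number,
            role,
            content,
            json.dumps(raw) if raw is not None else None,
            json.dumps(context) if context is not None else None,
        ),
    )


def replay(assessment_id):
    """Reconstruct an assessment in turn order, with raw output and context."""
    rows = conn.execute(
        "SELECT turn_number, role, content, raw_llm_response, retrieved_context "
        "FROM conversation_turns WHERE assessment_id = ? ORDER BY turn_number",
        (assessment_id,),
    ).fetchall()
    return [
        {
            "turn": number,
            "role": role,
            "content": content,
            "raw": json.loads(raw) if raw else None,
            "context": json.loads(ctx) if ctx else None,
        }
        for number, role, content, raw, ctx in rows
    ]
```

Because `replay` orders by `turn_number` rather than insertion order, a reviewer sees the conversation as it happened even if rows were written out of order, and the uniqueness constraint rejects an accidental second write for the same turn.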

Pragmatism, Scale, and the Human in the Loop

Building a demo is one thing; deploying a production credentialing system is another. The biggest challenge isn't technical, but one of trust and correctness. LLMs hallucinate. They can be confidently wrong. Blindly trusting an AI to grant a credential would be irresponsible.

Where It Breaks

Consistency: LLMs are non-deterministic. A user taking the same test twice could get different questions and slightly different evaluations. Using a low temperature (e.g., 0.1) for evaluation prompts helps, but doesn't eliminate variability.
Edge Cases & Ambiguity: A human expert knows when a user's answer is "unconventional but correct." An LLM, even with RAG, might struggle with creative or tangential solutions, potentially misclassifying them as incorrect.
Cost at Scale: Every turn in the conversation is a series of expensive API calls (embedding, retrieval, generation). A 10-turn assessment could involve 20+ calls. Caching strategies and model choice (e.g., using a cheaper model for intermediate steps) become critical.
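
To put rough numbers on the cost point, the sketch below counts calls per assessment and shows one simple caching strategy: memoizing embeddings by a hash of the normalized text. The `CachedEmbedder` class, the normalization rule (trim and lowercase), and the figure of three calls per turn (embedding, retrieval, generation) are assumptions for illustration.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEmbedder:
    """Wrap an embedding function so identical text is only embedded once."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}
        self.api_calls = 0  # counts calls that actually reach the API

    def embed(self, text: str) -> List[float]:
        # Assumption: trimming and lowercasing is an acceptable cache key.
        normalized = text.strip().lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]


def calls_per_assessment(turns: int, calls_per_turn: int = 3) -> int:
    """Upper bound on API calls if nothing is cached."""
    return turns * calls_per_turn
```

With these assumptions, `calls_per_assessment(10)` gives 30 calls, in line with the "20+" estimate above. Caching helps most with the generated questions, which repeat across candidates; free-form user answers rarely repeat, so the generation step is where a cheaper model for intermediate turns would make the larger difference.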

The Human-in-the-Loop Imperative

The only viable path to production for a high-stakes system like this is a Human-in-the-Loop (HITL) architecture. The AI is not the final arbiter; it's a powerful tool for scaling the time and attention of human experts.

The system should be designed to facilitate this collaboration. The AI can handle the bulk of assessments autonomously, say 80%, but it must be programmed to flag the ambiguous remainder for human review. These flags could be triggered by:

Borderline pass/fail scores.
Low confidence scores from the evaluation model.
Detection of potential prompt injection or unusual conversation patterns.
User-initiated requests for review.
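
The triggers above translate naturally into a small routing function. This is a sketch under stated assumptions: the thresholds (`PASS_THRESHOLD`, `BORDERLINE_MARGIN`, `MIN_CONFIDENCE`) are placeholder values, and the `confidence` and `injection_suspected` fields are assumed to be produced somewhere upstream, since the notes do not specify how they are computed.

```python
from dataclasses import dataclass
from typing import List

# Placeholder thresholds; real values would need calibration against
# human-reviewed assessments.
PASS_THRESHOLD = 0.7
BORDERLINE_MARGIN = 0.05
MIN_CONFIDENCE = 0.6


@dataclass
class EvaluationResult:
    score: float                       # final score in [0, 1]
    confidence: float                  # evaluator's self-reported confidence
    injection_suspected: bool = False  # set by an upstream detector
    user_requested_review: bool = False


def review_reasons(result: EvaluationResult) -> List[str]:
    """Return every reason this assessment should go to a human reviewer."""
    reasons: List[str] = []
    if abs(result.score - PASS_THRESHOLD) <= BORDERLINE_MARGIN:
        reasons.append("borderline_score")
    if result.confidence < MIN_CONFIDENCE:
        reasons.append("low_confidence")
    if result.injection_suspected:
        reasons.append("prompt_injection_suspected")
    if result.user_requested_review:
        reasons.append("user_requested")
    return reasons


def needs_human_review(result: EvaluationResult) -> bool:
    """An assessment is auto-finalized only when no trigger fires."""
    return bool(review_reasons(result))
```

Returning the list of reasons rather than a bare boolean gives the reviewer UI something to display and makes it possible to track which triggers fire most often.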

The `conversation_turns` table, with its complete audit trail, becomes the foundation for the reviewer's UI. A human expert can see the exact context the AI used and quickly override or validate its conclusion. This feedback loop is also invaluable data for fine-tuning the models and prompts over time.

Closing Reflection

The engineering challenge here is more than just connecting APIs. It's about designing a socio-technical system that balances automation with the need for expert judgment. The goal isn't to create an automated credentialing machine, but to build a tool that allows for deeper, more contextual skill assessment than has previously been possible at scale. The most interesting problems lie at this intersection of conversational UI, knowledge representation, and the messy, pragmatic work of building systems that people can trust.