Mar 25, 2026 · 8 min read · Engineering

Building a Grounded Portfolio Chatbot Changed How I Think About AI Systems

I wanted a chatbot grounded in my own truth, not another thin model wrapper. That led me to build a retrieval system with local-first tuning, confidence gates, semantic caching, eval harnesses, and route-level observability.

What I actually set out to build

I wanted to build a chatbot grounded in my own truth, not another thin wrapper over an LLM API. I was not interested in shipping a polished UI whose only technical story is that I connected a model endpoint and formatted the output. I wanted first-person answers to recruiter questions, backed by my own notes and project history, with citations and hard guardrails against hallucination.

I built and tuned the system locally first: ingestion and chunking, semantic retrieval behavior, intent routing, and confidence thresholds. I did not train or fine-tune a foundation model. What I tuned was the behavior of the application layer that decides when to retrieve, when to synthesize, when to cache, and when to refuse. Once that behavior was stable, I handed embedding and generation workloads off to Runpod endpoints so the same grounded behavior could run with better operational cost and latency controls.

Why this project became bigger than a chatbot

I started this project trying to build a portfolio chatbot for recruiter-style questions. I ended up building a small AI system with real engineering constraints.

The goal was simple on paper: answer questions about my background in first person, grounded in my own notes and project history. No fake confidence. No polished nonsense. Just truthful answers with citations, and a clear fallback when the evidence was weak.

That one constraint forced decisions across ingestion quality, chunking, semantic retrieval, confidence thresholds, caching policy, regression control, infrastructure cost, and route-level observability.

Try the live version.

The project is live at ellojesse.com. Ask it questions about my experience, projects, stack, and how I think through engineering decisions.

The actual problem to solve

People do not ask one canonical prompt. They ask endless variations of the same question:

  • What tech stack have you worked with?
  • What stack do you have the most experience with?
  • What have you shipped in production?

If the system is too strict, obvious questions miss. If it is too loose, unrelated prompts collide and return wrong content with confidence. So the core challenge was not producing one good answer. It was stable correctness across language variation without turning the project into regex whack-a-mole.

What I built

I moved to a hybrid architecture:

  • Deterministic fast paths for proven, stable intents
  • Semantic retrieval for open phrasing
  • Grounding confidence gate before synthesis
  • Semantic cache for high-confidence reuse
  • Strict fallback when confidence is below threshold
  • Response-path metadata for observability and regression triage
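Read top to bottom, the list above amounts to one routing decision per request. Here is a minimal sketch of that decision, not the production router. The route labels, the `RouteInput` shape, and the `0.75` threshold are all assumptions made for illustration.

```typescript
// Hypothetical route labels; the production system may name these differently.
type Route = "deterministic" | "cache_hit" | "grounded_synthesis" | "fallback";

interface RouteInput {
  matchedIntent: string | null; // set when a proven, stable intent matched
  cachedAnswer: string | null;  // set when the semantic cache found a reusable answer
  topEvidenceScore: number;     // best retrieval similarity, assumed to lie in [0, 1]
}

// Assumed value for illustration only; a real threshold has to be tuned against evals.
const GROUNDING_THRESHOLD = 0.75;

function chooseRoute(input: RouteInput): Route {
  // Cheapest checks first: fast path, then cache, then the grounding gate.
  if (input.matchedIntent !== null) return "deterministic";
  if (input.cachedAnswer !== null) return "cache_hit";
  if (input.topEvidenceScore >= GROUNDING_THRESHOLD) return "grounded_synthesis";
  return "fallback"; // weak evidence: refuse rather than guess
}
```

The ordering is the design choice worth noticing: the cheap paths are tried first, and synthesis only runs when the evidence clears the gate.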

Architecture overview

There are two flows that explain how the system works. The first shows the answer pipeline: how a recruiter question moves through retrieval, confidence gating, synthesis, fallback, and metadata logging. The second shows the route-level decision tree that helped me understand latency and optimize the system without weakening grounding quality.

[Diagram: grounded AI recruiter answer pipeline]
Architecture 1: normalize the question, retrieve evidence, apply a confidence gate, synthesize or fall back, optionally refresh cache, and return the response with metadata.

[Diagram: grounded AI recruiter request routing and latency paths]
Architecture 2: route-level flow for deterministic matches, cache-eligible requests, cache hits, direct canonical generation, grounded synthesis, fallback, and cache writes.

Core concepts in plain English

Chunking

Long source docs are split into smaller pieces so retrieval can pull precise evidence instead of broad text blobs.
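One simple way to chunk is a sliding window of words with some overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. This is a sketch under assumptions: `chunkWords` and its default sizes are hypothetical, and the real ingestion step may split on headings or paragraphs instead.

```typescript
// Split text into overlapping word windows.
// chunkSize and overlap are counted in words; the defaults are illustrative.
function chunkWords(text: string, chunkSize = 200, overlap = 40): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // this window already reached the end
  }
  return chunks;
}
```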

Embeddings

Questions and source chunks become vectors, which lets the system match meaning instead of exact wording.

Cosine similarity

A semantic closeness score used to rank candidate evidence before the model tries to answer.
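For two vectors a and b, cosine similarity is the dot product divided by the product of their lengths. The sketch below computes it and ranks candidate chunks against a query vector. The `rankChunks` helper and the chunk shape are hypothetical, and a real deployment would usually leave this to a vector index rather than a linear scan.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface EmbeddedChunk {
  id: string;
  vector: readonly number[];
}

interface ScoredChunk {
  id: string;
  score: number;
}

// Score every chunk against the query and keep the topK highest.
function rankChunks(query: readonly number[], chunks: readonly EmbeddedChunk[], topK = 3): ScoredChunk[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```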

Grounding threshold

A quality gate that blocks unsupported answers when the retrieved evidence is weak.
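In code, the gate can be a small function that either passes supporting evidence through or returns an explicit fallback. This is a sketch, not the production gate: the `0.75` threshold, the `minSupporting` count, and the type names are assumptions.

```typescript
interface Evidence {
  chunkId: string;
  score: number; // retrieval similarity, assumed to lie in [0, 1]
}

type GateResult =
  | { kind: "answer"; evidence: Evidence[] }
  | { kind: "fallback"; reason: string };

// Allow synthesis only when enough chunks clear the threshold.
function groundingGate(evidence: readonly Evidence[], threshold = 0.75, minSupporting = 1): GateResult {
  const supporting = evidence.filter((e) => e.score >= threshold);
  if (supporting.length < minSupporting) {
    return {
      kind: "fallback",
      reason: `only ${supporting.length} chunk(s) scored >= ${threshold}`,
    };
  }
  return { kind: "answer", evidence: supporting };
}
```

Returning a tagged result instead of throwing keeps the refusal a normal, loggable outcome rather than an error path.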

Semantic cache

High-confidence answers to equivalent questions can be reused to reduce latency and cost.
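A semantic cache differs from a normal cache in that the key is a vector and a lookup succeeds on closeness rather than equality. The in-memory sketch below shows the two rules that matter: only write answers that were grounded with high confidence, and only reuse one when the new question is close enough. The class, both thresholds, and the linear scan are assumptions for illustration.

```typescript
function cosine(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("Vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface CacheEntry {
  vector: number[];
  answer: string;
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private minSimilarity: number;
  private minConfidence: number;

  // Both defaults are assumed values, not tuned numbers.
  constructor(minSimilarity = 0.95, minConfidence = 0.8) {
    this.minSimilarity = minSimilarity;
    this.minConfidence = minConfidence;
  }

  // Store only answers that passed grounding with high confidence.
  set(vector: number[], answer: string, confidence: number): boolean {
    if (confidence < this.minConfidence) return false;
    this.entries.push({ vector, answer });
    return true;
  }

  // Return the closest cached answer, or null when nothing is similar enough.
  get(query: readonly number[]): string | null {
    let bestAnswer: string | null = null;
    let bestScore = this.minSimilarity;
    for (const entry of this.entries) {
      const score = cosine(query, entry.vector);
      if (score >= bestScore) {
        bestScore = score;
        bestAnswer = entry.answer;
      }
    }
    return bestAnswer;
  }
}
```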

Route metadata

Each response records how it was handled, which makes regressions easier to debug.
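A sketch of what such a record could look like follows. Every field name here is hypothetical; the point is only that the route, the latency, and the evidence behind an answer travel together with the answer itself.

```typescript
interface RouteMetadata {
  route: string;                   // which path handled the request
  latencyMs: number;               // wall-clock time for the request
  cacheHit: boolean;
  topEvidenceScore: number | null; // null when no retrieval ran
  citedChunkIds: string[];
}

interface ChatResponse {
  answer: string;
  meta: RouteMetadata;
}

// Attach metadata at the single place where a response is finalized.
function buildResponse(
  answer: string,
  route: string,
  startedAtMs: number,
  details: { topEvidenceScore?: number; citedChunkIds?: string[] } = {},
  nowMs: number = Date.now(),
): ChatResponse {
  return {
    answer,
    meta: {
      route,
      latencyMs: nowMs - startedAtMs,
      cacheHit: route === "cache_hit", // assumes a route label of this name
      topEvidenceScore: details.topEvidenceScore ?? null,
      citedChunkIds: details.citedChunkIds ?? [],
    },
  };
}
```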

Why the benchmark mattered

I ran a benchmark because I did not want to optimize based on vibes. A few prompts can make a system feel solid, but they do not prove it behaves well across phrasing variation, route types, and cache states.

The benchmark gave me route-level truth: where the system was fast, where it was slow, and what I could safely optimize next without weakening the grounding behavior.

What the benchmark actually showed

The benchmark used a 40-request, production-safe run. I was not trying to prove the system was perfect. I wanted to answer a more practical question: when people ask similar questions in different ways, can the system stay grounded while getting faster?

The topline result was yes. Cache-enabled mode was materially faster in mixed traffic, and the biggest win showed up in tail latency. In plain English, tail latency is the slow end of the experience: the requests that make a user wonder if something is stuck.

  • Miss mean (baseline request path): 2252.8 ms
  • Hit mean (cache enabled path): 1620.2 ms
  • Mean gain: 28.1% (material speedup)
  • P90 drop: 3505 → 2092 ms (tail latency improved)

Mean latency improved 28.1%, which means the average request got meaningfully faster. I also tracked p90, which means 90% of requests finished at or below that number, while the slowest 10% took longer. I used p90 because averages can hide painful outliers. A chatbot can feel fast most of the time, but if the slowest common requests drag, users still experience it as unreliable.
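To make the mean-versus-p90 distinction concrete, here is a small sketch using the nearest-rank percentile definition. The sample latencies are made up for illustration and are not benchmark data; the only real numbers are the two means passed to `relativeGain`.

```typescript
function mean(values: readonly number[]): number {
  if (values.length === 0) throw new Error("Need at least one value");
  let total = 0;
  for (const v of values) total += v;
  return total / values.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) throw new Error("Need at least one value");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Relative improvement of `after` over `before`, as a fraction.
function relativeGain(before: number, after: number): number {
  return (before - after) / before;
}

// Made-up latencies (ms): eight fast requests and two slow outliers.
const sampleLatencies = [200, 210, 220, 230, 240, 250, 260, 270, 3000, 3500];
const sampleMean = mean(sampleLatencies);          // 838
const sampleP50 = percentile(sampleLatencies, 50); // 240
const sampleP90 = percentile(sampleLatencies, 90); // 3000

// The reported means: (2252.8 - 1620.2) / 2252.8 is about 0.281.
const reportedGain = relativeGain(2252.8, 1620.2);
```

In the made-up sample the median says the system is fast and the mean says it is mediocre, but only the p90 shows that one request in ten takes three seconds.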

That is why the p90 drop mattered: it went from 3505.2 ms to 2092.4 ms. The system was not only faster on average; the slow end of the user experience became much less painful. The median moved less because this was not a cache only workload. Some requests were already fast deterministic paths, while many still went through slower direct canonical paths.

Route breakdown

Cache hit

5 requests averaged 212.8 ms. This was the fastest path and confirmed the semantic cache was doing its job.

Deterministic fast path

4 requests averaged 264.2 ms. Known stable intents resolved quickly without full generation.

Direct canonical

19 requests averaged 2726.9 ms. This route dominated the workload and became the next optimization target.

Other routed requests

12 requests averaged 1960.6 ms. These mixed paths still need route-level inspection.

Route-level timing showed that the cache was fast, but the overall workload was still dominated by slower canonical paths.

That was the key signal. Cache strategy was not weak. Cache hits were very fast. The blended average was being pulled up by direct canonical volume.
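The arithmetic behind that signal is a count-weighted mean. Using the four route figures above, the blended mean works out to about 1936.5 ms, and direct canonical accounts for 47.5% of requests but roughly 67% of total latency. The route labels in this sketch are shorthand chosen for the example.

```typescript
interface RouteStat {
  route: string;
  count: number;
  meanMs: number;
}

// Figures from the route breakdown; labels are shorthand for this example.
const routeStats: RouteStat[] = [
  { route: "cache_hit", count: 5, meanMs: 212.8 },
  { route: "deterministic_fast_path", count: 4, meanMs: 264.2 },
  { route: "direct_canonical", count: 19, meanMs: 2726.9 },
  { route: "other", count: 12, meanMs: 1960.6 },
];

function totalCount(stats: readonly RouteStat[]): number {
  return stats.reduce((sum, s) => sum + s.count, 0);
}

function totalTimeMs(stats: readonly RouteStat[]): number {
  return stats.reduce((sum, s) => sum + s.count * s.meanMs, 0);
}

// Count-weighted mean latency across all routes.
function blendedMeanMs(stats: readonly RouteStat[]): number {
  return totalTimeMs(stats) / totalCount(stats);
}

// Fraction of total latency spent in one route.
function timeShare(stats: readonly RouteStat[], route: string): number {
  const own = stats.filter((s) => s.route === route);
  return totalTimeMs(own) / totalTimeMs(stats);
}
```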

Specific next optimization target

The next optimization target would be reducing the share of requests that land in direct_canonical, which is the slower path for prompts that still need a fuller generation step. The opportunity is to move repeatable, high-confidence questions into faster routes without weakening the grounding checks.

  • Expand cache eligibility for repeatable questions that are currently handled by direct_canonical.
  • Move known high-confidence semantic matches into faster paths when the evidence is strong enough.
  • Track direct_canonical_share, cache hit rate, and p90_by_route on future deploys.
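The three metrics in the last bullet can all be derived from the route metadata already attached to each response. A minimal sketch, assuming a hypothetical log shape and the route labels `direct_canonical` and `cache_hit`:

```typescript
interface RequestLog {
  route: string;
  latencyMs: number;
}

interface DeployMetrics {
  directCanonicalShare: number;
  cacheHitRate: number;
  p90ByRoute: Record<string, number>;
}

// Nearest-rank p90 over a non-empty list.
function p90(values: readonly number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((90 * sorted.length) / 100);
  return sorted[rank - 1];
}

function summarizeDeploy(logs: readonly RequestLog[]): DeployMetrics {
  if (logs.length === 0) throw new Error("No requests logged");
  // Group latencies by route.
  const byRoute: Record<string, number[]> = {};
  for (const log of logs) {
    const bucket = byRoute[log.route] ?? [];
    bucket.push(log.latencyMs);
    byRoute[log.route] = bucket;
  }
  const share = (route: string): number => (byRoute[route]?.length ?? 0) / logs.length;
  const p90ByRoute: Record<string, number> = {};
  for (const route of Object.keys(byRoute)) {
    p90ByRoute[route] = p90(byRoute[route]);
  }
  return {
    directCanonicalShare: share("direct_canonical"),
    cacheHitRate: share("cache_hit"),
    p90ByRoute,
  };
}
```

Comparing these three numbers between deploys would show whether a routing change moved traffic to faster paths or only shifted latency around.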

I could have kept tuning this, but at that point I had already spent quite a bit of time on the project. Perfection is the enemy of progress, and I wanted to ship the demo once I had a solid understanding of the principles at play: retrieval quality, grounding thresholds, cache behavior, latency tradeoffs, and regression testing.

The important takeaway is that I knew where the next bottleneck was. In a larger production project, this is exactly the kind of signal I would keep watching: which route is doing too much work, whether it can safely be made faster, and whether the user experience improves without sacrificing answer quality.

Tech stack and why I chose it

Next.js and TypeScript

Clear API boundaries, fast iteration, and type safety across frontend and server code.

Postgres and Prisma

Strong schema discipline and reliable state handling for sessions, messages, and memory records.

Nomic embeddings

Strong paraphrase handling for recruiter-style questions that mean the same thing but use different words.

Runpod endpoints

A cost-conscious fit for a personal project, with separate embedding and model paths for better control.

Eval suites

Regression checks so I could change routing logic without silently breaking grounded behavior.
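A regression check of this kind can be as small as a list of questions with the route and citation each one must produce. The sketch below is a simplified illustration, not the actual eval suite: the types are hypothetical, and the bot is modeled as a synchronous function where a real one would be async.

```typescript
interface EvalCase {
  question: string;
  expectedRoute: string;
  mustCite?: string; // a source that has to appear in the citations
}

interface BotResult {
  route: string;
  citations: string[];
}

type Bot = (question: string) => BotResult;

// Run every case and return human-readable failures; an empty list means no regressions.
function runEvals(bot: Bot, cases: readonly EvalCase[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const result = bot(c.question);
    if (result.route !== c.expectedRoute) {
      failures.push(`"${c.question}": expected route ${c.expectedRoute}, got ${result.route}`);
    }
    if (c.mustCite !== undefined && result.citations.indexOf(c.mustCite) === -1) {
      failures.push(`"${c.question}": missing citation ${c.mustCite}`);
    }
  }
  return failures;
}
```

Including off-topic questions whose expected route is the fallback matters as much as the happy path, because a routing change that makes the bot answer more is not automatically an improvement.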

Route metadata

Operational visibility into why a response took a given path and where latency was coming from.

Cost and latency considerations

I am cost-conscious with personal projects, so I modeled the economics early. The point was not perfect billing reconciliation. The point was to understand the tradeoff between a cheap demo and a usable demo.

For operational viability, I was not thinking about a single cold endpoint. A user is not going to sit through that latency. The more realistic setup was keeping three embedding workers and three model workers available so the app could respond quickly enough to feel usable.

The directional Runpod assumptions were:

  • Embedding endpoint: $0.00031/sec
  • Model endpoint 80GB: $0.00076/sec
  • Model endpoint 80GB PRO: $0.00116/sec

The cost model came down to three formulas:

request_cost = (embed_seconds * embed_rate) + (model_seconds * model_rate)
effective_cost_with_cache = ((1 - hit_rate) * miss_cost) + cache_overhead
monthly_floor = endpoint_rate * replica_count * seconds_per_month

I used 10,000 requests as a normalization unit for request-level cost. That does not mean the chat expects 10,000 requests. It is just a clean ruler for comparing scenarios that can scale up or down. Separately, the always-on monthly floor matters because keeping capacity warm is what makes the latency acceptable.

  • 0% hit rate: ~$12.53 per 10k requests
  • 50% hit rate: ~$6.27 per 10k requests
  • 70% hit rate: ~$3.76 per 10k requests
  • 90% hit rate: ~$1.25 per 10k requests
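The three formulas and the per-10k figures can be checked with a few lines of code. Two assumptions are baked into this sketch: the per-request miss cost is back-derived from the ~$12.53 per 10k figure at a 0% hit rate, and cache overhead is taken as zero, which is what the quoted numbers imply.

```typescript
// request_cost = (embed_seconds * embed_rate) + (model_seconds * model_rate)
function requestCost(embedSeconds: number, modelSeconds: number, embedRate: number, modelRate: number): number {
  return embedSeconds * embedRate + modelSeconds * modelRate;
}

// effective_cost_with_cache = ((1 - hit_rate) * miss_cost) + cache_overhead
function effectiveCostWithCache(hitRate: number, missCost: number, cacheOverhead = 0): number {
  if (hitRate < 0 || hitRate > 1) throw new Error("hitRate must be between 0 and 1");
  return (1 - hitRate) * missCost + cacheOverhead;
}

// monthly_floor = endpoint_rate * replica_count * seconds_per_month
function monthlyFloor(endpointRate: number, replicaCount: number, secondsPerMonth: number): number {
  return endpointRate * replicaCount * secondsPerMonth;
}

// Assumption: miss cost per request, back-derived from ~$12.53 per 10k at 0% hit rate.
const ASSUMED_MISS_COST = 12.53 / 10000;

// Cost of 10,000 requests at a given hit rate, with zero cache overhead assumed.
function costPer10k(hitRate: number): number {
  return effectiveCostWithCache(hitRate, ASSUMED_MISS_COST) * 10000;
}
```

With those assumptions, `costPer10k(0.5)`, `costPer10k(0.7)`, and `costPer10k(0.9)` come out to about 6.265, 3.759, and 1.253, which round to the figures listed above.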

The capacity cost is not an always-on 30-day estimate. The setup only costs money while the endpoints are running, so the monthly number depends on active hours. With three embedding workers and three model workers, the active runtime cost is about $11.56/hour with standard 80GB model workers, or about $15.88/hour with 80GB PRO model workers.

  • 3 embed workers: ~$3.35/hr while running
  • 3 model workers (80GB): ~$8.21/hr while running
  • 3 PRO workers (80GB PRO): ~$12.53/hr while running
  • Save 1 PRO worker: ~$4.18/hr, multiplied by active hours per month

That is where caching becomes more than a neat optimization. It cuts request-level work, but more importantly it can reduce how much warm capacity the system needs during active windows. If cache hit rate and fast-path coverage are high enough to safely run one fewer model worker, the savings are about $2.74 per active hour for a standard 80GB worker, or about $4.18 per active hour for an 80GB PRO worker. Monthly savings are simply that hourly savings multiplied by however many hours the system actually runs that month.
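The hourly numbers follow directly from the per-second rates: rate × 3600 × worker count. The sketch below reproduces them. The `activeHoursPerMonth` argument is a placeholder for the reader to fill in, not a figure from this project.

```typescript
const SECONDS_PER_HOUR = 3600;

// USD per second, from the directional Runpod assumptions.
const EMBED_RATE = 0.00031;
const MODEL_RATE_80GB = 0.00076;
const MODEL_RATE_80GB_PRO = 0.00116;

function hourlyCost(ratePerSecond: number, workers: number): number {
  return ratePerSecond * SECONDS_PER_HOUR * workers;
}

// Round to cents for comparison with the quoted figures.
function toCents(usd: number): number {
  return Math.round(usd * 100) / 100;
}

const embedHourly = hourlyCost(EMBED_RATE, 3);             // ~3.35
const standardHourly = hourlyCost(MODEL_RATE_80GB, 3);     // ~8.21
const proHourly = hourlyCost(MODEL_RATE_80GB_PRO, 3);      // ~12.53
const totalStandardHourly = embedHourly + standardHourly;  // ~11.56
const totalProHourly = embedHourly + proHourly;            // ~15.88

// Savings from running fewer workers over a month of active hours.
function monthlySavings(ratePerSecond: number, workersRemoved: number, activeHoursPerMonth: number): number {
  return hourlyCost(ratePerSecond, workersRemoved) * activeHoursPerMonth;
}
```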

Pitfalls and decision signals

  • Deterministic-first handling did not scale across language variation.
  • Global mean alone hid route-level performance truths.
  • Strict fallback policy was required to reduce hallucination risk.
  • Broad AI coding edits increased regressions and token burn. Constrained-scope workflows performed better.
  • Cost and latency had to be modeled together, not as separate concerns.

How I used AI coding workflows to stay efficient

This project also changed how I work with AI coding tools. The best results did not come from asking for broad rewrites. They came from narrow tasks, explicit acceptance criteria, and route-level verification.

I learned that the hard way during a Claude Code pass where the scope got too broad. It moved quickly, but it also introduced regressions that took time to unwind. After that, I treated AI coding help more like a focused engineering collaborator: one bounded change, one verification target, and a rollback path if the output drifted.

  • Constrained-scope prompts with clear success criteria.
  • Route-level debugging using response path metadata.
  • Eval-first iteration loops before broad changes.
  • Microbenchmarks to verify p50 and p90 shifts without expensive test runs.
  • Rollback discipline when broad edits regressed stable paths.

Closing

I did not just build a chatbot that can talk. I built a grounded system that can justify what it says, refuse when it should, and improve cost and latency through controlled reuse.

The biggest lesson was simple: a strong AI product is not the one that always answers. It is the one that answers when evidence is strong, proves it, and safely declines when it is not.

That is the standard I build toward now.