What I actually set out to build
I wanted to build a chatbot grounded in my own truth, not another thin wrapper over an LLM API. I was not interested in shipping a polished UI where the only technical story is that I connected a model endpoint and formatted the output. I wanted first-person recruiter answers backed by my own notes and project history, with citations and hard guardrails against hallucination.
I built and tuned the system locally first: ingestion and chunking, semantic retrieval behavior, intent routing, and confidence thresholds. I did not train or fine-tune a foundation model. What I did tune was the behavior of the application layer that decides when to retrieve, when to synthesize, when to cache, and when to refuse. Once the behavior was stable, I handed embedding and generation workloads off to Runpod endpoints so the same grounded behavior could run with better operational cost and latency controls.
Why this project became bigger than a chatbot
I started this project trying to build a portfolio chatbot for recruiter-style questions. I ended up building a small AI system with real engineering constraints.
The goal was simple on paper: answer questions about my background in first person, grounded in my own notes and project history. No fake confidence. No polished nonsense. Just truthful answers with citations, and a clear fallback when the evidence was weak.
That one constraint forced decisions across ingestion quality, chunking, semantic retrieval, confidence thresholds, caching policy, regression control, infrastructure cost, and route-level observability.
The project is live at ellojesse.com. Ask it questions about my experience, projects, stack, and how I think through engineering decisions.
The actual problem to solve
People do not ask one canonical prompt. They ask endless variations of the same question:
- What tech stack have you worked with?
- What stack do you have the most experience with?
- What have you shipped in production?
If the system is too strict, obvious questions miss. If it is too loose, unrelated prompts collide and return wrong content with confidence. So the core challenge was not producing one good answer. It was stable correctness across language variation without turning the project into regex whack-a-mole.
What I built
I moved to a hybrid architecture:
- Deterministic fast paths for proven stable intents
- Semantic retrieval for open phrasing
- Grounding confidence gate before synthesis
- Semantic cache for high-confidence reuse
- Strict fallback when confidence is below threshold
- Response path metadata for observability and regression triage
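The layering above can be sketched as a single routing function. This is an illustration under stated assumptions, not the production code: the route labels, the two threshold values, and the order of the checks are all hypothetical.

```typescript
// Hypothetical route labels for illustration.
type SketchRoute =
  | "deterministic_fast_path"
  | "cache_hit"
  | "grounded_synthesis"
  | "fallback";

interface RouteSignals {
  matchesStableIntent: boolean;   // a proven, stable intent matched
  cacheSimilarity: number | null; // best similarity to a cached question, if any
  evidenceScore: number;          // best retrieval score for this question
}

// Assumed thresholds; real values would come from tuning against an eval set.
const CACHE_REUSE_THRESHOLD = 0.92;
const GROUNDING_THRESHOLD = 0.75;

function chooseRoute(signals: RouteSignals): SketchRoute {
  if (signals.matchesStableIntent) return "deterministic_fast_path";
  if (
    signals.cacheSimilarity !== null &&
    signals.cacheSimilarity >= CACHE_REUSE_THRESHOLD
  ) {
    return "cache_hit";
  }
  if (signals.evidenceScore >= GROUNDING_THRESHOLD) return "grounded_synthesis";
  return "fallback"; // strict fallback: decline rather than guess
}
```

The point of the shape is that the cheapest checks run first and synthesis is only reachable through the grounding check.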
Architecture overview
There are two flows that explain how the system works. The first shows the answer pipeline: how a recruiter question moves through retrieval, confidence gating, synthesis, fallback, and metadata logging. The second shows the route-level decision tree that helped me understand latency and optimize the system without weakening grounding quality.


Core concepts in plain English
Chunking
Long source docs are split into smaller pieces so retrieval can pull precise evidence instead of broad text blobs.
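A minimal sketch of one common chunking approach: fixed-size word windows with overlap. The function name, the word-based splitting, and the parameters are assumptions for illustration; the post does not say which chunking strategy the project uses.

```typescript
// Split text into overlapping windows of words.
// chunkSize and overlap are counted in words (an assumption of this sketch).
function chunkByWords(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far each window advances
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // this window reached the end
  }
  return chunks;
}
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.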
Embeddings
Questions and source chunks become vectors, which lets the system match meaning instead of exact wording.
Cosine similarity
A semantic closeness score used to rank candidate evidence before the model tries to answer.
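For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it and ranks candidate chunks; the type and function names are hypothetical, and the vectors would come from whatever embedding model is in use.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedChunk {
  id: string;
  vector: number[];
}

interface ScoredChunk {
  id: string;
  score: number;
}

// Score every chunk against the query and keep the topK highest.
function rankEvidence(query: number[], chunks: EmbeddedChunk[], topK: number): ScoredChunk[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```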
Grounding threshold
A quality gate that blocks unsupported answers when the retrieved evidence is weak.
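One way to express such a gate, as a sketch only: keep the evidence that clears a threshold, and return an explicit fallback when nothing does. The names and the threshold passed in are assumptions, not the project's actual values.

```typescript
interface ScoredEvidence {
  id: string;
  score: number;
}

type GateDecision =
  | { kind: "answer"; evidenceIds: string[] } // ids double as citations
  | { kind: "fallback"; reason: string };

function applyGroundingGate(evidence: ScoredEvidence[], threshold: number): GateDecision {
  const supporting = evidence.filter((e) => e.score >= threshold);
  if (supporting.length === 0) {
    return { kind: "fallback", reason: "no evidence at or above threshold" };
  }
  return { kind: "answer", evidenceIds: supporting.map((e) => e.id) };
}
```

Returning the supporting ids alongside the decision means the synthesis step can only cite evidence that actually passed the gate.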
Semantic cache
High-confidence answers to equivalent questions can be reused to reduce latency and cost.
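A minimal in-memory sketch of the idea: look up the most similar cached question and reuse its answer only if the similarity clears a minimum. The names, the linear scan, and the threshold are assumptions; a real cache would also need eviction and invalidation, which are left out here.

```typescript
interface CachedAnswer {
  embedding: number[]; // embedding of the question that produced the answer
  answer: string;
}

function cacheCosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the cached answer for the closest question, or null on a miss.
function lookupCachedAnswer(
  entries: CachedAnswer[],
  query: number[],
  minSimilarity: number
): string | null {
  let bestAnswer: string | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cacheCosine(query, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      bestAnswer = entry.answer;
    }
  }
  return bestScore >= minSimilarity ? bestAnswer : null;
}
```

Setting the minimum similarity high trades hit rate for safety: a near miss falls through to retrieval instead of returning an answer to a different question.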
Route metadata
Each response records how it was handled, which makes regressions easier to debug.
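As a sketch, a per-response record could look like the following. Every field name here is hypothetical; the post only says that each response records how it was handled.

```typescript
interface RouteMetadata {
  route: string;                   // which path produced the response
  latencyMs: number;               // wall-clock time for the request
  cacheHit: boolean;
  topEvidenceScore: number | null; // null when no retrieval happened
  evidenceIds: string[];           // what the answer was grounded on
}

function buildRouteMetadata(
  route: string,
  startedAtMs: number,
  finishedAtMs: number,
  evidence: { id: string; score: number }[]
): RouteMetadata {
  let top: number | null = null;
  for (const e of evidence) {
    if (top === null || e.score > top) top = e.score;
  }
  return {
    route,
    latencyMs: finishedAtMs - startedAtMs,
    cacheHit: route === "cache_hit", // assumed label for the cache route
    topEvidenceScore: top,
    evidenceIds: evidence.map((e) => e.id),
  };
}
```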
Why the benchmark mattered
I ran a benchmark because I did not want to optimize based on vibes. A few prompts can make a system feel solid, but they do not prove it behaves well across phrasing variation, route types, and cache states.
The benchmark gave me route-level truth: where the system was fast, where it was slow, and what I could safely optimize next without weakening the grounding behavior.
What the benchmark actually showed
The benchmark used a 40-request, production-safe run. I was not trying to prove the system was perfect. I wanted to answer a more practical question: when people ask similar questions in different ways, can the system stay grounded while getting faster?
The topline result was yes. Cache-enabled mode was materially faster in mixed traffic, and the biggest win showed up in tail latency. In plain English, tail latency is the slow end of the experience: the requests that make a user wonder if something is stuck.
Mean latency improved 28.1%, which means the average request got meaningfully faster. I also tracked p90, which means 90% of requests finished at or below that number, while the slowest 10% took longer. I used p90 because averages can hide painful outliers. A chatbot can feel fast most of the time, but if the slowest common requests drag, users still experience it as unreliable.
That is why the p90 drop mattered: it went from 3505.2 ms to 2092.4 ms, a reduction of about 40%. The system was not only faster on average; the slow end of the user experience became much less painful. The median moved less because this was not a cache-only workload. Some requests were already fast deterministic paths, while many still went through slower direct canonical paths.
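To make the p90 definition concrete, here is a small sketch using the nearest-rank method. The post does not say which percentile method the benchmark used, so nearest-rank is an assumption, and both function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest value such that at least p% of
// the samples are at or below it.
function nearestRankPercentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("need at least one sample");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Percent reduction from a baseline latency to an improved latency.
function percentImprovement(beforeMs: number, afterMs: number): number {
  return ((beforeMs - afterMs) / beforeMs) * 100;
}
```

Applied to the p90 figures quoted above, `percentImprovement(3505.2, 2092.4)` comes out at roughly 40%.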
Route breakdown
Cache hit
5 requests averaged 212.8 ms. This was the fastest path and confirmed the semantic cache was doing its job.
Deterministic fast path
4 requests averaged 264.2 ms. Known stable intents resolved quickly without full generation.
Direct canonical
19 requests averaged 2726.9 ms. This route dominated the workload and became the next optimization target.
Other routed requests
12 requests averaged 1960.6 ms. These mixed paths still need route-level inspection.
That was the key signal. Cache strategy was not weak. Cache hits were very fast. The blended average was being pulled up by direct canonical volume.
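The arithmetic behind that signal can be checked directly from the route breakdown. The sketch below uses the four rows above and assumes they all come from the same 40-request run; the route labels other than direct_canonical and the function names are hypothetical.

```typescript
interface RouteStat {
  route: string;
  requests: number;
  meanMs: number;
}

// Figures copied from the route breakdown.
const routeStats: RouteStat[] = [
  { route: "cache_hit", requests: 5, meanMs: 212.8 },
  { route: "deterministic_fast_path", requests: 4, meanMs: 264.2 },
  { route: "direct_canonical", requests: 19, meanMs: 2726.9 },
  { route: "other", requests: 12, meanMs: 1960.6 },
];

function totalLatencyMs(stats: RouteStat[]): number {
  return stats.reduce((sum, s) => sum + s.requests * s.meanMs, 0);
}

// Request-weighted mean across all routes.
function blendedMeanMs(stats: RouteStat[]): number {
  const requests = stats.reduce((sum, s) => sum + s.requests, 0);
  return totalLatencyMs(stats) / requests;
}

// Fraction of total latency spent in one route.
function latencyShare(stats: RouteStat[], route: string): number {
  const inRoute = stats.filter((s) => s.route === route);
  return totalLatencyMs(inRoute) / totalLatencyMs(stats);
}
```

With these rows, direct_canonical is 19 of 40 requests (47.5%) but accounts for roughly two thirds of the total latency, which is why it pulls the blended mean up even though cache hits are fast.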
Specific next optimization target
The next optimization target would be reducing the share of requests that land in direct_canonical, which is the slower path for prompts that still need a fuller generation step. The opportunity is to move repeatable, high-confidence questions into faster routes without weakening the grounding checks.
- Expand cache eligibility for repeatable questions that are currently handled by direct_canonical.
- Move known high confidence semantic matches into faster paths when the evidence is strong enough.
- Track direct_canonical_share, cache hit rate, and p90_by_route on future deploys.
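The three tracking metrics in the list above could be derived from per-request route samples roughly like this. It is a sketch: the sample shape, the "cache_hit" label, and the nearest-rank p90 are assumptions.

```typescript
interface RouteSample {
  route: string;
  latencyMs: number;
}

interface DeployMetrics {
  directCanonicalShare: number; // fraction of requests on direct_canonical
  cacheHitRate: number;         // fraction of requests served from cache
  p90ByRoute: { [route: string]: number };
}

function sampleP90(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  return sorted[Math.ceil((90 * sorted.length) / 100) - 1]; // nearest rank
}

function summarizeDeploy(samples: RouteSample[]): DeployMetrics {
  if (samples.length === 0) throw new Error("need at least one sample");
  const byRoute: { [route: string]: number[] } = {};
  for (const s of samples) {
    if (!byRoute[s.route]) byRoute[s.route] = [];
    byRoute[s.route].push(s.latencyMs);
  }
  const countFor = (route: string): number =>
    byRoute[route] ? byRoute[route].length : 0;
  const p90ByRoute: { [route: string]: number } = {};
  for (const route of Object.keys(byRoute)) {
    p90ByRoute[route] = sampleP90(byRoute[route]);
  }
  return {
    directCanonicalShare: countFor("direct_canonical") / samples.length,
    cacheHitRate: countFor("cache_hit") / samples.length,
    p90ByRoute,
  };
}
```

Comparing these three numbers before and after a deploy shows whether traffic actually moved to faster routes or whether one route simply got slower.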
I could have kept tuning this, but at that point I had already spent quite a bit of time on the project. Perfection is the enemy of progress, and I wanted to ship the demo once I had a solid understanding of the principles at play: retrieval quality, grounding thresholds, cache behavior, latency tradeoffs, and regression testing.
The important takeaway is that I knew where the next bottleneck was. In a larger production project, this is exactly the kind of signal I would keep watching: which route is doing too much work, whether it can safely be made faster, and whether the user experience improves without sacrificing answer quality.
Tech stack and why I chose it
Next.js and TypeScript
Clear API boundaries, fast iteration, and type safety across frontend and server code.
Postgres and Prisma
Strong schema discipline and reliable state handling for sessions, messages, and memory records.
Nomic embeddings
Strong paraphrase handling for recruiter-style questions that mean the same thing but use different words.
Runpod endpoints
A cost-conscious fit for a personal project, with separate embedding and model paths for better control.
Eval suites
Regression checks so I could change routing logic without silently breaking grounded behavior.
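A regression check of this kind can be as small as a table of questions with the route and citation behavior each one should produce. The sketch below is hypothetical throughout: the case shape, the synchronous answer function, and the failure messages are illustrative, and a real suite would call the live pipeline asynchronously.

```typescript
interface EvalCase {
  question: string;
  expectedRoute: string;
  mustCite: boolean; // grounded answers should carry at least one citation
}

interface EvalAnswer {
  route: string;
  citations: string[];
}

// Runs every case and returns human-readable failures (empty means pass).
function runEvalSuite(
  cases: EvalCase[],
  answer: (question: string) => EvalAnswer
): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const result = answer(c.question);
    if (result.route !== c.expectedRoute) {
      failures.push(
        `${c.question}: expected route ${c.expectedRoute}, got ${result.route}`
      );
    }
    if (c.mustCite && result.citations.length === 0) {
      failures.push(`${c.question}: answered without citations`);
    }
  }
  return failures;
}
```

Running a suite like this before and after a routing change turns "did I break grounding?" into a diff of failure lists.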
Route metadata
Operational visibility into why a response took a given path and where latency was coming from.
Cost and latency considerations
I am cost-conscious with personal projects, so I modeled the economics early. The point was not perfect billing reconciliation. The point was to understand the tradeoff between a cheap demo and a usable demo.
For operational viability, I was not thinking about a single cold endpoint. A user is not going to sit through that latency. The more realistic setup was keeping three embedding workers and three model workers available so the app could respond quickly enough to feel usable.
The directional Runpod assumptions were:
- Embedding endpoint: $0.00031/sec
- Model endpoint 80GB: $0.00076/sec
- Model endpoint 80GB PRO: $0.00116/sec
request_cost = (embed_seconds * embed_rate) + (model_seconds * model_rate)
effective_cost_with_cache = ((1 - hit_rate) * miss_cost) + cache_overhead
monthly_floor = endpoint_rate * replica_count * seconds_per_month
I used 10,000 requests as a normalization unit for request-level cost. That does not mean the chat expects 10,000 requests. It is just a clean ruler for comparing scenarios that can scale up or down. Separately, the always-on monthly floor matters because keeping capacity warm is what makes the latency acceptable.
None of these numbers assume the endpoints run for a full 30 days. The setup only costs money while the endpoints are running, so the monthly number depends on active hours rather than a full calendar month. With three embedding workers and three model workers, the active runtime cost is about $11.56/hour with standard 80GB model workers, or about $15.88/hour with 80GB PRO model workers.
That is where caching becomes more than a neat optimization. It cuts request-level work, but more importantly it can reduce how much warm capacity the system needs during active windows. If cache hit rate and fast-path coverage are high enough to safely run one fewer model worker, the savings are about $2.74 per active hour for a standard 80GB worker, or about $4.18 per active hour for an 80GB PRO worker. Monthly savings are simply that hourly savings multiplied by however many hours the system actually runs that month.
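The hourly figures above follow directly from the per-second rates. The sketch below reproduces that arithmetic and the cost formulas from this section; the constant and function names are hypothetical, and the rates are the directional assumptions listed earlier, not billing data.

```typescript
// Directional per-second rates from the assumptions above (USD).
const RATE_PER_SECOND = {
  embedding: 0.00031,
  model80gb: 0.00076,
  model80gbPro: 0.00116,
};

const SECONDS_PER_HOUR = 3600;

// Cost of keeping the given number of workers running for one hour.
function hourlyCost(embedWorkers: number, modelWorkers: number, modelRate: number): number {
  const perSecond = embedWorkers * RATE_PER_SECOND.embedding + modelWorkers * modelRate;
  return perSecond * SECONDS_PER_HOUR;
}

// request_cost = (embed_seconds * embed_rate) + (model_seconds * model_rate)
function requestCost(embedSeconds: number, modelSeconds: number, modelRate: number): number {
  return embedSeconds * RATE_PER_SECOND.embedding + modelSeconds * modelRate;
}

// effective_cost_with_cache = ((1 - hit_rate) * miss_cost) + cache_overhead
function effectiveCostWithCache(hitRate: number, missCost: number, cacheOverhead: number): number {
  return (1 - hitRate) * missCost + cacheOverhead;
}

// Monthly cost depends on how many hours the endpoints actually run.
function monthlyCost(hourly: number, activeHours: number): number {
  return hourly * activeHours;
}
```

For example, `hourlyCost(3, 3, RATE_PER_SECOND.model80gb)` gives about 11.56, and dropping to two model workers lowers it by about 2.74, matching the per-hour savings quoted above.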
Pitfalls and decision signals
- Deterministic-first handling did not scale across language variation.
- The global mean alone hid route-level performance differences.
- Strict fallback policy was required to reduce hallucination risk.
- Broad AI coding edits increased regressions and token burn. Constrained-scope workflows performed better.
- Cost and latency had to be modeled together, not as separate concerns.
How I used AI coding workflows to stay efficient
This project also changed how I work with AI coding tools. The best results did not come from asking for broad rewrites. They came from narrow tasks, explicit acceptance criteria, and route-level verification.
I learned that the hard way during a Claude Code pass where the scope got too broad. It moved quickly, but it also introduced regressions that took time to unwind. After that, I treated AI coding help more like a focused engineering collaborator: one bounded change, one verification target, and a rollback path if the output drifted.
- Constrained-scope prompts with clear success criteria.
- Route-level debugging using response path metadata.
- Eval-first iteration loops before broad changes.
- Micro benchmarks to verify p50 and p90 shifts without expensive test runs.
- Rollback discipline when broad edits regressed stable paths.
Closing
I did not just build a chatbot that can talk. I built a grounded system that can justify what it says, refuse when it should, and improve cost and latency through controlled reuse.
The biggest lesson was simple: a strong AI product is not the one that always answers. It is the one that answers when evidence is strong, proves it, and safely declines when it is not.
That is the standard I build toward now.
