Context Construction That Doesn't Blow Up Latency: A Budgeted RAG Assembly for FastAPI
Latency is not a model problem; it is a context assembly problem
If your RAG endpoint is slow, you are probably “being helpful” in the most expensive place: the request path.
It works in a notebook because nothing is contending for CPU, network I/O, or database connections. Then you ship it behind FastAPI, add real traffic, and suddenly every answer is gated by chunk parsing, embedding calls, and over-wide retrieval.
Context Construction That Doesn’t Blow Up Latency means treating context as a budgeted artifact you assemble under hard caps, not a pile of text you keep appending until the model stops complaining.
Context construction is feature engineering with a latency bill
Grounding this in systems terms helps because it changes what you optimize.
RAG is not “retrieve then prompt.” It is a per-request context construction pipeline: take a query, fetch candidate evidence, and assemble a context payload that the generator can actually use. Chip Huyen’s framing is useful here: context construction for foundation models is equivalent to feature engineering for classical ML models. The difference is that your “features” are tokens, and tokens have a direct cost and latency footprint.
How this breaks in production is predictable:
Every extra retrieved chunk increases prompt tokens, which increases generation latency and cost.
Every retrieval expansion increases database work and network I/O.
Every “just in case” document you include increases the chance the model attends to the wrong thing, not the right thing.
Here is the shift that fixes it: stop thinking of context as “more is safer,” and start thinking of context as a constrained payload with a strict budget and clear provenance. In real deployments, the fastest RAG systems are not the ones with the fanciest prompts. They are the ones that treat context assembly like an algorithm with guardrails.
Screenshottable line: The fastest RAG systems do not retrieve more. They retrieve with a budget and enforce it like an SLO.
The architecture that keeps FastAPI fast
Understanding the architecture matters because latency is usually dominated by where you put work, not how clever your retrieval is.
The codebase you are working from already encodes the right macro-pattern: async-first at the HTTP layer, background-first for heavy workflows. The trick is to make that separation explicit in your RAG design so you do not accidentally drag ingestion costs into the request path.
A budgeted RAG architecture in this FastAPI setup looks like this:
Offline or background ingestion path (Celery + Redis)
Parse and normalize text
Split into chunks
Embed chunks
Persist chunks and embeddings
Optionally insert into a knowledge-graph pipeline (LightRAG) with progress tracking
Online request path (FastAPI)
Authenticate and scope the request (user, case)
Retrieve top-k chunks from the active store (PGVector similarity_search with filter)
Assemble context under a token budget
Call the LLM microservice (network I/O bounded)
Return answer plus references
The consequence of not drawing this line is that your “RAG endpoint” becomes a grab bag of slow operations. Under load, you will see timeouts, connection pool exhaustion, and tail latency spikes that have nothing to do with the model.
The fix is structural: enforce that ingestion and enrichment are never on the critical path. In this repository’s design, that means Celery owns parsing, chunking, embedding, and optional graph insertion. FastAPI owns retrieval, assembly, and the LLM call.
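The online path then reads as a thin composition: scoped retrieval, budgeted assembly, one LLM call. A minimal sketch, assuming the retrieval, assembly, and LLM callables are injected placeholders (the names are illustrative, not from the repository); note that nothing here parses, chunks, or embeds:

```python
# Hypothetical sketch of the online request path as a thin composition.
# retrieve, assemble, and call_llm are injected so this function owns no
# heavy work of its own.
async def handle_rag_request(query, case_id, retrieve, assemble, call_llm):
    chunks = retrieve(query=query, case_id=case_id)        # scoped similarity search
    context = assemble(chunks)                             # token-budgeted packer
    answer = await call_llm(query=query, context=context)  # network I/O bounded
    references = [c.get("source") for c in context]        # provenance, no extra queries
    return {"answer": answer, "references": references}
```

Everything slow that is not retrieval or the LLM call itself belongs in a Celery task, not in this function.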
Screenshottable line: If chunking or embedding happens inside your /rag request, you do not have a RAG endpoint. You have an outage generator.
A budgeted RAG assembly algorithm you can actually implement
This matters because “set k=3” is not a strategy. It is a guess. You need an assembly algorithm that makes trade-offs explicit and repeatable.
Below is a practical algorithm that matches the system you have: case-scoped retrieval using PGVector, optional graph retrieval present but not the active runtime path, and a FastAPI layer designed to stay thin.
Start by defining three budgets:
Retrieval budget: max number of chunks you are willing to consider (k_retrieve)
Context budget: max tokens or characters you will send as context (t_context)
Latency budget: max time you will spend assembling context before you degrade gracefully (t_assemble_ms)
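Making the three budgets an explicit object keeps them enforceable and loggable instead of scattered as magic numbers. A minimal sketch; the field names and defaults are illustrative, not taken from the repository:

```python
# Hypothetical budget config: one frozen object passed through the pipeline.
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextBudgets:
    k_retrieve: int = 20      # max candidate chunks to consider
    t_context: int = 3000     # max tokens to send as context
    t_assemble_ms: int = 250  # max assembly time before graceful degradation
```

Freezing the dataclass means no step of the pipeline can quietly loosen the budgets mid-request.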
Then implement assembly as a deterministic pipeline.
Scope first, then retrieve
Why it matters: scoping is your highest-leverage latency control because it reduces the search space before you pay for similarity search.
In this codebase, scoping is naturally case-based. Apply it aggressively:
Filter retrieval by case_id.
When possible, filter further by document IDs the user selected (common in the repository’s RFQ and context-doc patterns).
Gate access to expensive endpoints with FastAPI dependencies (get_current_user, require_admin) so you do not expose broad retrieval surfaces by accident.
Consequence if you skip this: you will compensate by increasing k, which increases both retrieval time and context size, and you still get worse relevance.
Fix: treat scoping as a required input to retrieval, not an optional metadata field.
Retrieve more than you need, but only within a hard cap
Why it matters: you want enough candidates to survive deduplication and budget trimming, but you do not want unbounded fan-out.
A practical pattern is:
k_retrieve = 3x to 5x the number of chunks you expect to include
Always apply metadata filters at query time (PGVector similarity_search(query, k=..., filter=...))
Consequence if you retrieve exactly what you plan to send: any noisy chunk forces you to either ship bad context or re-query, which adds latency.
Fix: retrieve a bounded candidate set, then assemble with a budget.
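The candidate-set sizing rule is one line of arithmetic worth writing down. A sketch; the 4x multiplier and the hard cap of 50 are illustrative defaults, not repository values:

```python
# Hypothetical sizing rule: over-retrieve relative to what you plan to
# include, but never beyond a hard cap.
def candidate_k(k_include: int, multiplier: int = 4, hard_cap: int = 50) -> int:
    return min(hard_cap, max(k_include, k_include * multiplier))
```

With k_include=5 this yields 20 candidates; with k_include=20 the cap bounds the fan-out at 50 instead of 80.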
Assemble context with a token-aware packer
Why it matters: “join all chunk texts” is how context blows up. You need a packer that stops when the budget is hit.
A simple, effective packer:
Deduplicate by (doc_id, chunk_id) or by chunk hash
Prefer diversity across documents when relevance scores are close
Pack chunks until you hit t_context
Add lightweight provenance per chunk (doc title, source link, chunk index) so you can return references without extra queries
Consequence if you ignore packing: you will exceed context limits or inflate generation latency. Longer context also increases the chance the model focuses on the wrong evidence.
Fix: make the packer deterministic and budget-driven. If you cannot fit a chunk, you do not “squeeze it in.” You drop it.
Degrade gracefully when budgets are exceeded
Why it matters: production systems need a degraded mode that preserves correctness even when caches fail, stores are slow, or the query is pathological.
Degradation options that fit this architecture:
If retrieval is slow, reduce k_retrieve and proceed.
If assembly hits t_assemble_ms, stop packing and proceed with what you have.
If no chunks fit, call the LLM with no context and explicitly label the answer as ungrounded in your response schema.
Consequence if you do not degrade: you will time out requests and create retry storms, which makes everything slower.
Fix: implement time-bounded assembly and a “no-context” fallback path.
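Time-bounded assembly is a deadline check inside the packing loop plus an explicit grounded flag in the result. A sketch; pack_one stands in for whatever per-chunk packing logic you use, and the shape of the return value is illustrative:

```python
# Hypothetical time-bounded assembly with an explicit ungrounded fallback.
import time

def assemble_with_deadline(candidates, pack_one, t_assemble_ms: int) -> dict:
    deadline = time.monotonic() + t_assemble_ms / 1000.0
    packed = []
    for chunk in candidates:
        if time.monotonic() > deadline:
            break  # budget exceeded: stop packing, proceed with what we have
        item = pack_one(chunk)
        if item is not None:
            packed.append(item)
    # If nothing fit, the caller calls the LLM with no context and labels
    # the answer as ungrounded in the response schema.
    return {"context": packed, "grounded": bool(packed)}
```

The grounded flag is what lets the rest of the system, and the user, distinguish a degraded answer from a grounded one.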
Cache the assembled context, not just the final answer
Why it matters: repeated queries often differ slightly, and caching only the final answer is brittle. Caching the assembled context reduces repeated retrieval and packing work while preserving the ability to regenerate.
The repository already uses Redis-based read-through caching with TTL and namespace-prefix invalidation, with the important property that cache failure should not break correctness.
Consequence if you cache incorrectly: stale context becomes a silent correctness bug.
Fix: cache with explicit keys that include:
case_id
retrieval parameters (k_retrieve, filters)
an index freshness marker (more on this below)
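A cache key that covers all three ingredients can be built from a canonical hash. A sketch; the key layout and the index_version marker (bumped when ingestion for a case completes) are hypothetical, but the namespace prefix mirrors the repository's prefix-invalidation pattern:

```python
# Hypothetical cache key: scope + retrieval parameters + index freshness,
# hashed canonically so dict ordering does not split the cache.
import hashlib
import json

def context_cache_key(case_id: str, query: str, k_retrieve: int,
                      filters: dict, index_version: str) -> str:
    payload = json.dumps(
        {"case_id": case_id, "query": query, "k": k_retrieve,
         "filters": filters, "index_version": index_version},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()
    # The "ragctx:{case_id}:" prefix makes per-case invalidation cheap.
    return f"ragctx:{case_id}:{digest}"
```

Including index_version in the key means a completed ingestion invalidates stale contexts implicitly: old entries simply stop being looked up and expire via TTL.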
The production lessons hiding in the “simple” parts
Most latency failures come from the boring edges: freshness, concurrency, and where you pay network I/O.
The first lesson is that freshness beats cleverness. In most production systems, the most common RAG bug is not “bad embeddings.” It is “the user uploaded a document and the system answered as if it did not exist.” Your architecture already anticipates this by pushing ingestion into Celery and tracking document status for polling. Use that status in the request path:
If a case has pending ingestion, either block retrieval for that document set or answer with a “still indexing” state.
Do not pretend the index is fresh when it is not.
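Using ingestion status in the request path can be as simple as a readiness check over the per-document statuses the system already tracks. A sketch; the status strings and the three-way result are illustrative, not the repository's actual values:

```python
# Hypothetical freshness gate over tracked document statuses.
def retrieval_readiness(doc_statuses: list[str]) -> str:
    if any(s in ("pending", "processing") for s in doc_statuses):
        return "indexing"  # block retrieval or answer with a "still indexing" state
    if all(s == "ready" for s in doc_statuses):
        return "fresh"
    return "partial"       # some documents failed ingestion: retrieve, but say so
```

The point is that the endpoint consults this before retrieving, so the system never answers as if a still-indexing document does not exist.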
The second lesson is that async does not save you from slow dependencies. Your FastAPI endpoint can be async, but if it fans out into multiple network calls (vector store, LLM microservice, optional graph), your tail latency will still spike. The repository’s current runtime truth is that PGVector is the active retrieval path, while LightRAG exists but is not the default. Treat that as a latency decision:
Keep the default path single-store for predictable performance.
If you add hybrid retrieval (PGVector + graph), do it behind a feature flag and measure p95, not averages.
The third lesson is that “structured generation” is a latency tool, not just a quality tool. The proposal-generation flow described in the codebase retrieves per section, assembles per-section context, and runs section generation concurrently before merging. That pattern is an implicit budget:
Each section has its own small context budget.
Retrieval is targeted, not global.
Concurrency is controlled at the section level.
If you instead retrieve one giant context for the entire proposal, you pay in tokens and latency, and you make it harder for the model to stay on-topic.
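The per-section pattern can be sketched with bounded concurrency. Assuming build_section handles its own targeted retrieval and small context budget (names and the concurrency limit are illustrative):

```python
# Hypothetical per-section generation: each section retrieves and packs its
# own small context; concurrency is controlled at the section level.
import asyncio

async def generate_proposal(sections, build_section, max_concurrency: int = 4):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(section):
        async with sem:  # cap concurrent LLM/retrieval fan-out
            return await build_section(section)

    # gather preserves section order, so merging is a no-op.
    return await asyncio.gather(*(bounded(s) for s in sections))
```

The semaphore is the budget knob: raising it trades tail latency on shared dependencies for wall-clock time on the whole proposal.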
The Context Budget Triangle you can reuse
This framework came from real systems, not theory, and it is the fastest way to reason about RAG latency without guessing.
You are always trading off three budgets:
Scope budget: how wide your eligible corpus is (case, document IDs, metadata filters)
Evidence budget: how much text you can send (tokens, chunks, diversity)
Time budget: how long you can spend assembling before you must answer (p95 target)
The mistake is trying to optimize evidence budget directly by tuning k. The lever order that works in practice is:
Tighten scope until retrieval is cheap and relevant
Case scoping is your default.
Document scoping is your accelerator.
Access control dependencies prevent accidental wide queries.
Enforce evidence with a packer, not with hope
Retrieve a bounded candidate set.
Pack deterministically to a token budget.
Prefer provenance and diversity over raw volume.
Enforce time with explicit degradation
Cap assembly time.
Reduce k under load.
Fall back to no-context answers when necessary.
When you apply this triangle, you stop asking “what k should I use?” and start asking “what scope and time budgets make k safe?”
Screenshottable line: k is not a retrieval parameter. It is a latency commitment you make on behalf of your entire system.
The real takeaway is that context is an engineered artifact
Treating context as an engineered artifact changes how you build RAG endpoints in FastAPI.
You precompute what you can in Celery so the request path stays thin. You scope retrieval so you do not pay to search irrelevant data. You assemble context with a packer that enforces a token budget. You degrade gracefully under time pressure. And you make freshness visible so the system never lies about what it knows.
A fast RAG endpoint is not the one that retrieves the most relevant chunks. It is the one that can prove it stayed within its budgets while doing it.
Will you enforce your context budget at the packer, or will you keep paying for it at p95 when traffic shows up?
