Building a RAG Pipeline with MCP: A Practical Architecture
Production RAG architecture: intelligent chunking, hybrid search (vector + BM25) with RRF fusion, cross-encoder reranking, and context management.
Lee Li
Independent Developer · MCP Enthusiast
RAG (Retrieval-Augmented Generation) is one of the top production use cases for MCP. Building a RAG pipeline sounds straightforward—chunk text, embed, retrieve, inject—but the details determine whether your RAG system actually helps or just adds latency.
Transparency note: The architecture and code in this article are based on patterns I have seen work in production and internal experiments. I do not have a publicly available benchmark repository for RAG-specific numbers. Where I cite specific performance numbers, I will note the context. The general patterns (chunk size, hybrid search, context management) reflect widely accepted practices in the RAG community.
Chunking: The Foundation Everything Else Depends On
Chunk size is the most important hyperparameter in RAG. Too small (under 100 tokens): you lose context and the retrieved chunks are uninformative. Too large (over 2000 tokens): precision drops because irrelevant content dilutes relevant content.
What I have observed in practice: For prose content, 500-800 tokens with 100-token overlap is a reasonable starting point. This preserves paragraph-level context while keeping chunks small enough to be precise.
For code: do not use token-based chunking. Use tree-sitter to parse by AST node—functions, classes, modules. Code has intrinsic structure that token-based chunking destroys.
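As a minimal sketch of structure-aware chunking, the example below splits Python source into one chunk per top-level function or class. It deliberately swaps tree-sitter for Python's standard-library `ast` module so that it runs without extra dependencies. The trade-off is that it only handles Python, whereas tree-sitter covers many languages. The name `chunk_python_source` and the sample source are illustrative assumptions.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function or class.

    Stand-in for tree-sitter: the stdlib `ast` module only parses Python.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the source text spanned by the node
            # (decorators, if any, are not included).
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


example = '''
import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, doc):
        self.docs.append(doc)
'''

for chunk in chunk_python_source(example):
    print(chunk.splitlines()[0])
# prints "def load(path):" then "class Index:"
```

Top-level statements that are not functions or classes (the import here) are skipped in this sketch. A real pipeline would probably collect them into a separate module-level chunk.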
Note on chunking strategies: There is no universally optimal chunk size. The right size depends on your content type, embedding model, and retrieval task. Test with your actual content before committing to a chunk size.
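import tiktoken

def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """
    Token-based text chunking with overlap.
    Note: chunk_size and overlap are in tokens, not characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    encoder = tiktoken.get_encoding('cl100k_base')
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(encoder.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # last window reached; avoids a trailing chunk that is pure overlap
        start = end - overlap  # step back so adjacent chunks share `overlap` tokens
    return chunks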
Embedding Model Selection
Embedding model selection determines what "similar" means in your vector search. For technical content (code, documentation): consider a code-aware model such as CodeBERT, or a strong general-purpose model such as OpenAI's text-embedding-3-large. For general prose: text-embedding-3-large or Cohere's embed models.
Important: Benchmark your embedding model on your specific content. A model that performs well on general benchmarks may underperform on your domain-specific content.
Embedding dimension matters for storage and retrieval speed. text-embedding-3-small (1536 dimensions by default) is faster to search than text-embedding-3-large (3072 dimensions by default) at the cost of some accuracy. Both models accept a dimensions parameter that returns shorter vectors. For large corpora where search speed matters, consider requesting fewer dimensions or applying dimensionality reduction after embedding.
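One simple form of dimensionality reduction is to keep the leading components of each vector and rescale the result to unit length. The sketch below shows that, plus a rough storage estimate. It assumes the embedding model was trained so that leading dimensions carry most of the signal, which is the case for the text-embedding-3 family. For other models, a learned projection such as PCA is the safer choice. The function names are illustrative.

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length."""
    if dims <= 0 or dims > len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # zero vector: nothing to rescale
    return [x / norm for x in head]


def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 storage for the vectors alone, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


print(index_size_gb(10_000_000, 3072))  # full-size vectors
print(index_size_gb(10_000_000, 256))   # truncated vectors
```

Measure retrieval accuracy before and after truncation on your own queries, since the accuracy cost varies by content.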
Hybrid Search: Vector Similarity Plus Keyword Matching
Pure vector search misses exact matches. A query for "Error 500" will not match a chunk that contains "HTTP 500 Internal Server Error" if the embedding model does not place these close in vector space.
Hybrid search combines vector similarity with BM25 keyword matching. In my experience, this typically improves retrieval accuracy by 10-30% for queries with specific terms, while maintaining the semantic understanding of vector search.
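To make the keyword side concrete, here is a minimal sketch of Okapi BM25 scoring in pure Python. It uses whitespace tokenization and the common defaults k1 = 1.5 and b = 0.75. Both are assumptions for illustration, and a production system would normally rely on a search engine or a dedicated BM25 library rather than this function.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores


docs = [
    "HTTP 500 Internal Server Error returned by the gateway",
    "The server restarted after a configuration change",
    "Rate limiting returns HTTP 429 to the client",
]
scores = bm25_scores("Error 500", docs)
print(scores)  # only the first document has a non-zero score
```

Because the score is built from exact term overlap, the "Error 500" query finds the first document regardless of where an embedding model would place it.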
from collections import defaultdict

async def hybrid_search(query, namespace, top_k=5):
    """
    Hybrid search combining vector and keyword-based retrieval.
    Note: Actual implementation requires a vector DB and BM25 library.
    vector_search, bm25_search, and embed are placeholders for your own stack.
    """
    vector_results = await vector_search(embed(query), namespace, top_k * 2)
    keyword_results = await bm25_search(query, namespace, top_k * 2)
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank), with k = 60
    rrf_scores = defaultdict(float)
    by_id = {}
    for results in (vector_results, keyword_results):
        for rank, r in enumerate(results, start=1):  # ranks start at 1
            rrf_scores[r.id] += 1 / (60 + rank)
            by_id[r.id] = r
    ranked_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
    # Return the result objects (not just ids) so callers can read .content and .source
    return [by_id[i] for i in ranked_ids]
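To see how the fusion step behaves on concrete rankings, here is a self-contained sketch of Reciprocal Rank Fusion that needs no vector database. The function name `rrf_fuse` and the document ids are illustrative. Ranks are counted from 1 and k = 60 is the commonly used constant.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[tuple[str, float]]:
    """Fuse several ranked id lists; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


vector_ranking = ["a", "b", "c"]   # best match first
keyword_ranking = ["c", "a", "d"]
fused = rrf_fuse([vector_ranking, keyword_ranking])
print([doc_id for doc_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise above documents that appear in only one, even when a single-list document ranked higher in its own list.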
When you might not need hybrid search: If your queries are primarily semantic (conceptual, exploratory) rather than specific (error codes, exact product names), pure vector search may be sufficient.
Context Management: Staying Within the LLM's Context Budget
When your retrieved chunks exceed the LLM's context budget, you need a strategy. Based on what has worked for me:
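async def build_context(query, namespace, max_tokens=4000):
    """
    Build context from retrieved chunks within token budget.
    Note: count_tokens is implementation-dependent.
    Assumes hybrid_search returns chunk objects with .content and .source, and
    that cross_encoder.rerank returns the same chunk objects, best first.
    """
    chunks = await hybrid_search(query, namespace, top_k=10)
    reranked = await cross_encoder.rerank(query, chunks)
    parts = []
    used_tokens = 0
    for chunk in reranked:
        entry = f"\n---\nSource: {chunk.source}\n{chunk.content}"
        entry_tokens = count_tokens(entry)
        # Compare token counts, not character counts, against the budget
        if used_tokens + entry_tokens > max_tokens:
            break
        parts.append(entry)
        used_tokens += entry_tokens
    return "".join(parts)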
Performance Numbers (With Caveats)
I have seen hybrid search on a 10M chunk corpus across 5 namespaces achieve roughly 200ms at P50 and 600ms at P99 in internal testing.
Important caveat: these figures come from internal testing on one setup, so do not treat them as benchmarks. Profile your specific system under your actual workload.
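A minimal sketch of how you might collect your own P50 and P99 figures: time each query and take nearest-rank percentiles over the samples. The helpers `percentile` and `time_calls` are illustrative names, and `fake_search` is a stand-in for a real search function. The helper times synchronous callables, so an async search function would need to be wrapped (for example with `asyncio.run`).

```python
import math
import time


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn, queries: list[str]) -> list[float]:
    """Run fn once per query and return wall-clock latencies in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        fn(query)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def fake_search(query: str) -> int:
    # Stand-in workload; replace with a call into your own pipeline.
    return sum(range(10_000))


latencies = time_calls(fake_search, ["error 500", "reset password", "rate limit"] * 10)
print(f"P50={percentile(latencies, 50):.3f}ms  P99={percentile(latencies, 99):.3f}ms")
```

Use queries sampled from real traffic and enough of them that the P99 is not decided by a single outlier.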
Vector Database Selection
For corpora under 1M chunks: Pinecone, Weaviate, or Qdrant are all reasonable choices. For 1M-10M chunks: consider pgvector (PostgreSQL extension) for simpler deployment or Weaviate for better performance. For 10M+ chunks: Vespa or Milvus for horizontal scaling.
pgvector is underrated for mid-size deployments: If you are already running PostgreSQL, adding pgvector avoids operating a new system. Performance can be competitive with dedicated vector databases for corpora under 5M chunks.
Evaluation: How to Know If Your RAG Is Working
RAG systems are hard to evaluate because relevance is subjective. Build a test set: 50 queries, each with a golden retrieved chunk and expected response. Run your RAG pipeline against the test set monthly.
Retrieval accuracy: does the pipeline retrieve the right chunks? Use hit rate (is the relevant chunk in top-k?) and MRR (mean reciprocal rank of first relevant chunk).
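Both retrieval metrics are short to compute once you have a test set. The sketch below assumes one golden chunk id per query and a ranked list of retrieved chunk ids per query. The chunk ids and function names are made up for illustration.

```python
def hit_rate_at_k(results: list[list[str]], golden: list[str], k: int) -> float:
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = sum(1 for ranked, gold in zip(results, golden) if gold in ranked[:k])
    return hits / len(golden)


def mean_reciprocal_rank(results: list[list[str]], golden: list[str]) -> float:
    """Average of 1 / rank of the golden chunk; a miss contributes 0."""
    total = 0.0
    for ranked, gold in zip(results, golden):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(golden)


# Three queries: golden chunk at rank 1, at rank 2, and not retrieved at all.
results = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
golden = ["c1", "c2", "c0"]

print(hit_rate_at_k(results, golden, k=1))    # 1 of 3 queries hit at rank 1
print(hit_rate_at_k(results, golden, k=3))    # 2 of 3 queries hit within top 3
print(mean_reciprocal_rank(results, golden))  # (1 + 1/2 + 0) / 3 = 0.5
```

Tracking both numbers over time shows whether a change improved recall (hit rate) or only reordered results that were already retrieved (MRR).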
Answer quality: does the LLM generate correct answers given the retrieved context? This is harder to measure—use LLM-as-judge with a reference answer for automated scoring, supplemented by human evaluation for edge cases.
When NOT to Follow This Architecture
This architecture (hybrid search, cross-encoder reranking, token budget management) is appropriate for:
You may not need this complexity if:
Common Questions
Q: Should I use hybrid search from the start?
A: Start with pure vector search. Add BM25 if you find that specific term queries ("Error 500", product names) are returning poor results.
Q: How do I choose chunk size?
A: Test with your actual content. A reasonable starting point: 500-800 tokens for prose, AST-based for code. Measure retrieval accuracy on a held-out query set.
Q: When is cross-encoder reranking worth the latency cost?
A: When retrieval accuracy matters more than latency. In my experience, cross-encoder reranking adds 50-200ms per query but typically improves top-1 accuracy by 5-15%.
Lee Li
Independent Developer · MCP Enthusiast
Building and breaking things with AI tools since 2023. MCP Find started as a personal project to track the rapidly evolving MCP ecosystem. Based in Hong Kong.