Building a RAG Pipeline with MCP: A Practical Architecture

Production RAG architecture: intelligent chunking, hybrid search (vector + BM25) fused with RRF, cross-encoder reranking, and context management.

Lee Li

Independent Developer · MCP Enthusiast

RAG (Retrieval-Augmented Generation) is one of the top production use cases for MCP. Building a RAG pipeline sounds straightforward (chunk text, embed, retrieve, inject), but the details determine whether your RAG system actually helps or just adds latency.

Transparency note: The architecture and code in this article are based on patterns I have seen work in production and internal experiments. I do not have a publicly available benchmark repository for RAG-specific numbers. Where I cite specific performance numbers, I will note the context. The general patterns (chunk size, hybrid search, context management) reflect widely accepted practices in the RAG community.

Chunking: The Foundation Everything Else Depends On

Chunk size is the most important hyperparameter in RAG. Too small (under 100 tokens): you lose context and the retrieved chunks are uninformative. Too large (over 2000 tokens): precision drops because irrelevant content dilutes relevant content.

What I have observed in practice: For prose content, 500-800 tokens with 100-token overlap is a reasonable starting point. This preserves paragraph-level context while keeping chunks small enough to be precise.

For code: do not use token-based chunking. Use tree-sitter to parse by AST node—functions, classes, modules. Code has intrinsic structure that token-based chunking destroys.
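
As a rough illustration of structure-aware chunking, the sketch below splits Python source into one chunk per top-level function or class. It swaps in Python's built-in `ast` module instead of tree-sitter, so it only handles Python, and the `chunk_python_source` name and the sample source are made up for this example. Decorators and module-level statements are skipped here; a real pipeline would probably want to keep them.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        # Keep only definitions; imports and loose statements are skipped in this sketch.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks

sample = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, doc):
        self.docs.append(doc)
'''

chunks = chunk_python_source(sample)
for chunk in chunks:
    print(chunk)
    print("---")
```

Each chunk is a complete definition, so a retrieved chunk never starts or ends in the middle of a function body.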

Note on chunking strategies: There is no universally optimal chunk size. The right size depends on your content type, embedding model, and retrieval task. Test with your actual content before committing to a chunk size.

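import tiktoken

def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """
    Token-based text chunking with overlap.
    Note: chunk_size and overlap are in tokens, not characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    encoder = tiktoken.get_encoding('cl100k_base')
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        # Decode the window directly so the function name is not shadowed.
        chunks.append(encoder.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # Last window reached; skip a trailing chunk that only repeats the overlap.
        start = end - overlap  # Step back so consecutive chunks share `overlap` tokens.
    return chunks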

Embedding Model Selection

Embedding model selection determines what "similar" means in your vector search. For technical content (code, documentation): use a model that handles code well, such as OpenAI's general-purpose text-embedding-3-large or a code-specific model such as CodeBERT. For general prose: text-embedding-3-large or Cohere's embed models.

Important: Benchmark your embedding model on your specific content. A model that performs well on general benchmarks may underperform on your domain-specific content.

Embedding dimension matters for storage and retrieval speed. text-embedding-3-small (1536 dimensions by default) is faster to search than text-embedding-3-large (3072 dimensions by default) at the cost of some accuracy. For large corpora where search speed matters, consider reducing dimensionality; the text-embedding-3 models accept a `dimensions` parameter that returns shortened embeddings.
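
To make the storage side of that trade-off concrete, here is a back-of-the-envelope calculation. It assumes vectors are stored as raw float32 values (4 bytes each) with no index overhead or compression, which real vector databases will not match exactly. The 10M corpus size and the reduced size of 1024 dimensions are illustrative choices, and `index_size_gb` is a made-up helper name.

```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in gigabytes, assuming float32 values and no index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9

full_size = index_size_gb(10_000_000, 3072)     # full-width embeddings
reduced_size = index_size_gb(10_000_000, 1024)  # hypothetical reduced width

print(f"3072 dims: {full_size:.2f} GB")
print(f"1024 dims: {reduced_size:.2f} GB")
```

Under these assumptions, cutting the dimension to a third cuts raw vector storage to a third as well, which is often the difference between an index that fits in memory and one that does not.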

Hybrid Search: Vector Similarity Plus Keyword Matching

Pure vector search misses exact matches. A query for "Error 500" will not match a chunk that contains "HTTP 500 Internal Server Error" if the embedding model does not place these close in vector space.

Hybrid search combines vector similarity with BM25 keyword matching. In my experience, this typically improves retrieval accuracy by 10-30% for queries with specific terms, while maintaining the semantic understanding of vector search.
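
For readers who have not used BM25 before, the sketch below scores a few documents against a query using the Okapi BM25 formula in plain Python. It assumes whitespace tokenization, lowercasing, and a non-empty document list, and the `bm25_scores` name, the parameter defaults, and the sample documents are illustrative. A production system would more likely rely on a search engine or a dedicated library.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            # Length normalization: longer documents need more matches for the same score.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "HTTP 500 Internal Server Error",
    "Timeout while connecting to the database",
    "User login succeeded",
]
scores = bm25_scores("Error 500", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

Only the first document shares the literal terms "error" and "500" with the query, so it is the only one with a non-zero score. That exact-term signal is what the keyword side contributes to hybrid search.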

from collections import defaultdict

RRF_K = 60  # Smoothing constant commonly used for Reciprocal Rank Fusion

async def hybrid_search(query, namespace, top_k=5):
    """
    Hybrid search combining vector and keyword-based retrieval.
    Note: Actual implementation requires a vector DB and BM25 library;
    embed, vector_search, and bm25_search are placeholders for them.
    """
    vector_results = await vector_search(embed(query), namespace, top_k * 2)
    keyword_results = await bm25_search(query, namespace, top_k * 2)

    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank), with ranks starting at 1
    rrf_scores = defaultdict(float)
    for results in (vector_results, keyword_results):
        for rank, r in enumerate(results, start=1):
            rrf_scores[r.id] += 1 / (RRF_K + rank)

    # Returns (chunk id, fused score) pairs, best first
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

When you might not need hybrid search: If your queries are primarily semantic (conceptual, exploratory) rather than specific (error codes, exact product names), pure vector search may be sufficient.

Context Management: Staying Within the LLM's Context Budget

When your retrieved chunks exceed the LLM's context budget, you need a strategy. Based on what has worked for me:

  • Re-rank results with a cross-encoder model (more accurate but slower)
  • Summarize chunks to 200 tokens before passing to LLM
  • For complex queries, use iterative retrieval (retrieve, summarize, retrieve again)
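
async def build_context(query, namespace, max_tokens=4000):
    """
    Build context from retrieved chunks within token budget.
    Note: count_tokens is implementation-dependent; fetch_chunks is a
    placeholder that loads chunk objects (source, content) by id.
    """
    ranked = await hybrid_search(query, namespace, top_k=10)
    chunks = await fetch_chunks([chunk_id for chunk_id, _score in ranked])
    # cross_encoder.rerank is assumed to return the chunks reordered by relevance
    reranked = await cross_encoder.rerank(query, chunks)
    parts = []
    used_tokens = 0
    for chunk in reranked:
        entry = f"---\nSource: {chunk.source}\n{chunk.content}"
        entry_tokens = count_tokens(entry)
        # Compare token counts against the budget, not character counts.
        if used_tokens + entry_tokens > max_tokens:
            break
        parts.append(entry)
        used_tokens += entry_tokens
    return "\n\n".join(parts)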
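
The cross-encoder reranking step can be sketched independently of any particular model. In the example below, `rerank` takes a scoring function that maps (query, passage) pairs to relevance scores. In practice that function could be something like the `predict` method of a sentence-transformers `CrossEncoder`, which accepts such pairs. Here a toy word-overlap scorer stands in so the example runs without downloading a model. The names `rerank` and `overlap_scorer` are illustrative assumptions, not part of any library.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]

def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[list[Pair]], Sequence[float]],
    top_n: int = 3,
) -> list[str]:
    """Reorder passages by the relevance score assigned to each (query, passage) pair."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [passage for passage, _score in ranked[:top_n]]

def overlap_scorer(pairs: list[Pair]) -> list[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / len(query_words))
    return scores

passages = [
    "billing cycles explained",
    "error codes overview",
    "restart the server to clear error 500",
]
top = rerank("error 500 server", passages, overlap_scorer, top_n=2)
print(top)
```

Keeping the scorer pluggable makes it easy to measure what a real cross-encoder adds in accuracy and latency compared with a cheap baseline before committing to it.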

Performance Numbers (With Caveats)

I have seen hybrid search on a 10M chunk corpus across 5 namespaces achieve roughly 200ms at P50 and 600ms at P99 in internal testing. Important caveats:

  • These numbers are from one specific production system I worked with, not a general benchmark
  • Your vector database, embedding model, and query complexity will produce different results
  • P99 is highly dependent on vector database load and indexing strategy

Do not treat these numbers as benchmarks. Profile your specific system under your actual workload.

Vector Database Selection

For corpora under 1M chunks: Pinecone, Weaviate, or Qdrant are all reasonable choices. For 1M-10M chunks: consider pgvector (PostgreSQL extension) for simpler deployment or Weaviate for better performance. For 10M+ chunks: Vespa or Milvus for horizontal scaling.

pgvector is underrated for mid-size deployments: If you are already running PostgreSQL, adding pgvector avoids a new system. Performance is competitive with dedicated vector databases for corpora under 5M chunks.

Evaluation: How to Know If Your RAG Is Working

RAG systems are hard to evaluate because relevance is subjective. Build a test set: 50 queries, each with a golden chunk (the chunk that should be retrieved) and an expected response. Run your RAG pipeline against the test set monthly.

Retrieval accuracy: does the pipeline retrieve the right chunks? Use hit rate (is the relevant chunk in top-k?) and MRR (mean reciprocal rank of first relevant chunk).
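
Both retrieval metrics are only a few lines of code. The sketch below assumes one golden chunk id per query and a ranked list of retrieved chunk ids per query. The function names, query ids, and chunk ids are made up for illustration.

```python
def hit_rate_at_k(results: dict[str, list[str]], golden: dict[str, str], k: int) -> float:
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = sum(1 for query, gold in golden.items() if gold in results.get(query, [])[:k])
    return hits / len(golden)

def mean_reciprocal_rank(results: dict[str, list[str]], golden: dict[str, str]) -> float:
    """Average of 1 / rank of the golden chunk; a miss contributes 0."""
    total = 0.0
    for query, gold in golden.items():
        ranked = results.get(query, [])
        if gold in ranked:
            total += 1 / (ranked.index(gold) + 1)
    return total / len(golden)

golden = {"q1": "c3", "q2": "c7", "q3": "c1"}
results = {
    "q1": ["c3", "c9", "c2"],  # golden chunk at rank 1
    "q2": ["c4", "c7", "c8"],  # golden chunk at rank 2
    "q3": ["c5", "c6", "c2"],  # golden chunk missing
}

hit = hit_rate_at_k(results, golden, k=3)
mrr = mean_reciprocal_rank(results, golden)
print(f"hit rate@3 = {hit:.3f}, MRR = {mrr:.3f}")
```

Hit rate only says whether the golden chunk made the cut, while MRR also rewards placing it near the top, so tracking both shows whether a change improved recall, ranking, or neither.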

Answer quality: does the LLM generate correct answers given the retrieved context? This is harder to measure: use LLM-as-judge with a reference answer for automated scoring, supplemented by human evaluation for edge cases.

When NOT to Follow This Architecture

This architecture (hybrid search, cross-encoder reranking, token budget management) is appropriate for:

  • Large document collections where retrieval precision matters
  • Technical content where exact matches (error codes, product names) are important
  • Use cases where recall (finding all relevant chunks) is as important as precision

You may not need this complexity if:

  • Your corpus is small (under 10K chunks)—simple vector search may suffice
  • Your queries are always semantic/conceptual rather than specific
  • Latency is more important than retrieval accuracy
  • You have no ML/embedding infrastructure

Common Questions

Q: Should I use hybrid search from the start?
A: Start with pure vector search. Add BM25 if you find that specific term queries ("Error 500", product names) are returning poor results.

Q: How do I choose chunk size?
A: Test with your actual content. A reasonable starting point: 500-800 tokens for prose, AST-based for code. Measure retrieval accuracy on a held-out query set.

Q: When is cross-encoder reranking worth the latency cost?
A: When retrieval accuracy matters more than latency. Cross-encoder adds 50-200ms per query but typically improves top-1 accuracy by 5-15%.

Related Tools

  • [Exa MCP Server](/tools/exa-mcp-server) — Neural search for AI pipelines. One of the two tools used in our RAG architecture comparisons.
  • [Firecrawl MCP Server](/tools/firecrawl-mcp-server) — Document ingestion for RAG. How we populated the knowledge base in our test pipeline.
Lee Li

Independent Developer · MCP Enthusiast

Building and breaking things with AI tools since 2023. MCP Find started as a personal project to track the rapidly evolving MCP ecosystem. Based in Hong Kong.

info@mcp-find.org
📍 Sai Kung, Kowloon, Hong Kong