Building a RAG Pipeline with MCP: A Practical Architecture

RAG (Retrieval-Augmented Generation) is one of the top production use cases for MCP. Building a RAG pipeline sounds straightforward—chunk text, embed, retrieve, inject—but the details determine whether your RAG system actually helps or just adds latency.

Transparency note: The architecture and code in this article are based on patterns I have seen work in production and internal experiments. I do not have a publicly available benchmark repository for RAG-specific numbers. Where I cite specific performance numbers, I will note the context. The general patterns (chunk size, hybrid search, context management) reflect widely accepted practices in the RAG community.

Chunking: The Foundation Everything Else Depends On

Chunk size is the most important hyperparameter in RAG. Too small (under 100 tokens): you lose context and the retrieved chunks are uninformative. Too large (over 2000 tokens): precision drops because irrelevant content dilutes relevant content.

What I have observed in practice: For prose content, 500-800 tokens with 100-token overlap is a reasonable starting point. This preserves paragraph-level context while keeping chunks small enough to be precise.

For code: do not use token-based chunking. Use tree-sitter to parse by AST node—functions, classes, modules. Code has intrinsic structure that token-based chunking destroys.
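
To make the idea of structure-aware chunking concrete, here is a minimal sketch that splits Python source into one chunk per top-level function or class. It uses Python's built-in ast module as a stand-in for tree-sitter, so it only handles Python input. The function name chunk_python_source and the dictionary layout are illustrative assumptions, not part of any library.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        # Only definitions become chunks; imports and loose statements are skipped
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # get_source_segment returns the source text spanned by the node
                "text": ast.get_source_segment(source, node),
            })
    return chunks

sample = '''
import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, doc):
        self.docs.append(doc)
'''

for chunk in chunk_python_source(sample):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk keeps its name and line range, which is useful metadata to store alongside the embedding. A tree-sitter version would follow the same shape while supporting other languages.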

Note on chunking strategies: There is no universally optimal chunk size. The right size depends on your content type, embedding model, and retrieval task. Test with your actual content before committing to a chunk size.

import tiktoken

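def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """
    Token-based text chunking with overlap.
    Note: chunk_size and overlap are in tokens, not characters.
    """
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would keep the window from advancing
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    encoder = tiktoken.get_encoding('cl100k_base')
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(encoder.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # last window reached; avoids a trailing chunk that is pure overlap
        start = end - overlap  # step back so consecutive chunks share `overlap` tokens
    return chunks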
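
Overlap is not free: every overlapping token is embedded and stored twice. The sketch below works on token offsets only (no tokenizer needed) to show how many chunks a document produces and how many tokens end up embedded. The helper names window_bounds and expected_chunk_count are hypothetical, and the windowing assumes the last chunk is clipped at the end of the document.

```python
import math

def window_bounds(n_tokens: int, chunk_size: int = 600, overlap: int = 100) -> list[tuple[int, int]]:
    """(start, end) token offsets of each chunk; the last chunk is clipped to n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return bounds

def expected_chunk_count(n_tokens: int, chunk_size: int = 600, overlap: int = 100) -> int:
    """Closed-form chunk count that matches window_bounds."""
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    return math.ceil((n_tokens - overlap) / (chunk_size - overlap))

bounds = window_bounds(10_000)
embedded_tokens = sum(end - start for start, end in bounds)
print(len(bounds), expected_chunk_count(10_000), embedded_tokens)
```

For a 10,000-token document with the defaults above, this gives 20 chunks and 11,900 embedded tokens, about 19% more than the source text. That overhead scales with the overlap-to-chunk-size ratio, which is worth keeping in mind when estimating embedding cost.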

Embedding Model Selection

Embedding model selection determines what "similar" means in your vector search. For technical content (code, documentation), consider a code-aware model such as CodeBERT or a strong general-purpose model such as OpenAI's text-embedding-3-large. For general prose: text-embedding-3-large or Cohere's embed models.

Important: Benchmark your embedding model on your specific content. A model that performs well on general benchmarks may underperform on your domain-specific content.

Embedding dimension matters for storage and retrieval speed. text-embedding-3-small (1536 dimensions by default) is faster to search than text-embedding-3-large (3072 dimensions by default) at the cost of some accuracy. For large corpora where search speed matters, consider reducing dimensionality, for example by requesting shorter vectors through the dimensions parameter that the text-embedding-3 models accept.
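
A quick back-of-the-envelope calculation shows why dimension matters at scale. This sketch assumes vectors are stored as float32 (4 bytes per value) and counts raw vector storage only; index structures and metadata add more. The function name index_size_gb is illustrative.

```python
def index_size_gb(n_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead and metadata."""
    return n_vectors * dimensions * bytes_per_value / 1e9

# Compare a few vector widths for a 10M chunk corpus
for dims in (256, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_gb(10_000_000, dims):.2f} GB")
```

Storage grows linearly with dimension, and so does the cost of each distance computation, which is why shorter vectors can be attractive when the accuracy loss is acceptable for your content.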

Hybrid Search: Vector Similarity Plus Keyword Matching

Pure vector search misses exact matches. A query for "Error 500" will not match a chunk that contains "HTTP 500 Internal Server Error" if the embedding model does not place these close in vector space.

Hybrid search combines vector similarity with BM25 keyword matching. In my experience, this typically improves retrieval accuracy by 10-30% for queries with specific terms, while maintaining the semantic understanding of vector search.

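from collections import defaultdict

RRF_K = 60  # smoothing constant commonly used for Reciprocal Rank Fusion

async def hybrid_search(query, namespace, top_k=5):
    """
    Hybrid search combining vector and keyword-based retrieval.
    Note: vector_search, bm25_search, and embed are placeholders; an actual
    implementation requires a vector DB and BM25 library.
    """
    # Over-fetch from each retriever so fusion has enough candidates
    vector_results = await vector_search(embed(query), namespace, top_k * 2)
    keyword_results = await bm25_search(query, namespace, top_k * 2)

    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank), rank starting at 1
    rrf_scores = defaultdict(float)
    for results in (vector_results, keyword_results):
        for rank, r in enumerate(results, start=1):
            rrf_scores[r.id] += 1 / (RRF_K + rank)

    # Returns (chunk_id, score) pairs, best first
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]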
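
The fusion step can be tried on its own without a vector database. The sketch below applies Reciprocal Rank Fusion to two hand-written rankings; the chunk ids and the function name reciprocal_rank_fusion are made up for illustration, and ranks are 1-based as in the common formulation of RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[tuple[str, float]]:
    """Fuse several ranked id lists; each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical results from a vector retriever and a keyword retriever
vector_ids = ["c3", "c1", "c7", "c4"]
keyword_ids = ["c1", "c9", "c3", "c2"]

fused = reciprocal_rank_fusion([vector_ids, keyword_ids], top_k=3)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Chunks that appear in both lists (c1 and c3 here) rise to the top even when neither retriever ranked them first in both, which is the behavior that makes RRF a convenient default: it needs only ranks, not comparable scores.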

When you might not need hybrid search: If your queries are primarily semantic (conceptual, exploratory) rather than specific (error codes, exact product names), pure vector search may be sufficient.

Context Management: Staying Within the LLM's Context Budget

When your retrieved chunks exceed the LLM's context budget, you need a strategy. Based on what has worked for me:

Re-rank results with a cross-encoder model (more accurate but slower)

Summarize chunks to 200 tokens before passing to LLM

For complex queries, use iterative retrieval (retrieve, summarize, retrieve again)

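async def build_context(query, namespace, max_tokens=4000):
    """
    Build context from retrieved chunks within token budget.
    Note: count_tokens, fetch_chunks, and cross_encoder are implementation-dependent
    placeholders. fetch_chunks is assumed to load chunk records (with .source and
    .content) by id.
    """
    # hybrid_search returns (chunk_id, score) pairs, so load the chunk records first
    hits = await hybrid_search(query, namespace, top_k=10)
    chunks = await fetch_chunks([chunk_id for chunk_id, _ in hits])
    # Assumes rerank returns the same chunk objects, most relevant first
    reranked = await cross_encoder.rerank(query, chunks)
    parts = []
    used_tokens = 0
    for chunk in reranked:
        entry = f"\n\n---\nSource: {chunk.source}\n{chunk.content}"
        entry_tokens = count_tokens(entry)
        # Compare tokens with tokens (not characters) against the budget
        if used_tokens + entry_tokens > max_tokens:
            break
        parts.append(entry)
        used_tokens += entry_tokens
    return "".join(parts)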

Performance Numbers (With Caveats)

I have seen hybrid search on a 10M chunk corpus across 5 namespaces achieve roughly 200ms at P50 and 600ms at P99 in internal testing. Important caveats:

These numbers are from one specific production system I worked with, not a general benchmark

Your vector database, embedding model, and query complexity will produce different results

P99 is highly dependent on vector database load and indexing strategy

Do not treat these numbers as benchmarks. Profile your specific system under your actual workload.
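
One way to profile your own system is to time real queries and compute the percentiles yourself. The following is a minimal sketch using only the standard library; fake_search is a hypothetical stand-in for your retrieval call, and the function names are illustrative.

```python
import statistics
import time

def measure_latencies_ms(search_fn, queries):
    """Time each call to search_fn and return per-query latencies in milliseconds."""
    latencies = []
    for query in queries:
        started = time.perf_counter()
        search_fn(query)
        latencies.append((time.perf_counter() - started) * 1000.0)
    return latencies

def summarize_latencies(latencies_ms):
    """Return P50, P99, and max latency. Needs at least two samples."""
    # quantiles(n=100) returns 99 cut points: index 49 is P50, index 98 is P99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98], "max": max(latencies_ms)}

def fake_search(query):
    """Hypothetical stand-in for a real retrieval call."""
    time.sleep(0.001)
    return []

latencies = measure_latencies_ms(fake_search, [f"query {i}" for i in range(20)])
print(summarize_latencies(latencies))
```

With only a handful of samples, P99 is effectively the slowest request, so collect a large number of queries drawn from real traffic before drawing conclusions about tail latency.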

Vector Database Selection

For corpora under 1M chunks: Pinecone, Weaviate, or Qdrant are all reasonable choices. For 1M-10M chunks: consider pgvector (PostgreSQL extension) for simpler deployment or Weaviate for better performance. For 10M+ chunks: Vespa or Milvus for horizontal scaling.

pgvector is underrated for mid-size deployments: If you are already running PostgreSQL, adding pgvector avoids a new system. Performance is competitive with dedicated vector databases for corpora under 5M chunks.

Evaluation: How to Know If Your RAG Is Working

RAG systems are hard to evaluate because relevance is subjective. Build a test set of about 50 queries, each paired with a golden chunk (the chunk that should be retrieved) and an expected response. Run your RAG pipeline against the test set monthly.

Retrieval accuracy: does the pipeline retrieve the right chunks? Use hit rate (is the relevant chunk in top-k?) and MRR (mean reciprocal rank of first relevant chunk).
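
Both retrieval metrics are simple to compute once you have a golden chunk per query. The sketch below assumes one golden chunk id per query and a ranked list of retrieved chunk ids per query; the query ids, chunk ids, and function names are made up for illustration.

```python
def hit_rate_at_k(results: dict[str, list[str]], golden: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = sum(1 for query, gold in golden.items() if gold in results.get(query, [])[:k])
    return hits / len(golden)

def mean_reciprocal_rank(results: dict[str, list[str]], golden: dict[str, str]) -> float:
    """Average of 1 / rank of the golden chunk; a miss contributes 0."""
    total = 0.0
    for query, gold in golden.items():
        ranked = results.get(query, [])
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(golden)

# Hypothetical test set: one golden chunk id per query
golden = {"q1": "c1", "q2": "c5", "q3": "c9"}
# Hypothetical pipeline output: ranked chunk ids per query
results = {
    "q1": ["c1", "c2", "c3"],
    "q2": ["c4", "c5", "c6"],
    "q3": ["c7", "c8", "c2"],
}

print(hit_rate_at_k(results, golden, k=3))
print(mean_reciprocal_rank(results, golden))
```

Hit rate tells you whether the right chunk is reachable at all, while MRR also rewards ranking it near the top. Tracking both makes it easier to tell a recall problem from a ranking problem.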

Answer quality: does the LLM generate correct answers given the retrieved context? This is harder to measure—use LLM-as-judge with a reference answer for automated scoring, supplemented by human evaluation for edge cases.

When NOT to Follow This Architecture

This architecture (hybrid search, cross-encoder reranking, token budget management) is appropriate for:

Large document collections where retrieval precision matters

Technical content where exact matches (error codes, product names) are important

Use cases where recall (finding all relevant chunks) is as important as precision

You may not need this complexity if:

Your corpus is small (under 10K chunks)—simple vector search may suffice

Your queries are always semantic/conceptual rather than specific

Latency is more important than retrieval accuracy

You have no ML/embedding infrastructure

Common Questions

Q: Should I use hybrid search from the start?
A: Start with pure vector search. Add BM25 if you find that specific term queries ("Error 500", product names) are returning poor results.

Q: How do I choose chunk size?
A: Test with your actual content. A reasonable starting point: 500-800 tokens for prose, AST-based for code. Measure retrieval accuracy on a held-out query set.

Q: When is cross-encoder reranking worth the latency cost?
A: When retrieval accuracy matters more than latency. In my experience, a cross-encoder reranker adds 50-200ms per query but typically improves top-1 accuracy by 5-15%.

Related Tools

[Exa MCP Server](/tools/exa-mcp-server) — Neural search for AI pipelines. One of the two tools used in our RAG architecture comparisons.

[Firecrawl MCP Server](/tools/firecrawl-mcp-server) — Document ingestion for RAG. How we populated the knowledge base in our test pipeline.

Lee Li
