
Who is this guide for?

This guide is suitable for intermediate level developers looking to understand or implement MCP in their projects.

How does MCP relate to AI development?

The Model Context Protocol (MCP) is an open standard developed by Anthropic that enables AI applications such as Claude and Cursor to connect with external tools, data sources, and APIs through a standardized interface.


Deploying MCP Servers to Production: Lessons from Real Failures


When I built my first Model Context Protocol (MCP) server 18 months ago, I thought deployment would be trivial. MCP servers are lightweight services that just fetch and format context for large language models, right? I got it running locally on my laptop in an afternoon, wired it up to our team’s internal chatbot, and immediately shared the connection string with the product team. That was my first mistake. Over the next six months, I deployed three iterations of that same MCP server to production, and I broke it in three separate, very public, very annoying ways. Those failures taught me more about production MCP deployment than any documentation ever could. In this guide, I’ll walk through everything I learned, from why local MCP servers never work for teams, to how I fixed three catastrophic production failures, to what my current production setup costs and how it’s held up for 12 months.

Why Local Servers Aren’t Enough for Teams

It’s tempting to just leave your MCP server running locally, especially when you’re the only one using it. I test new MCP tools locally all the time, and it’s fast and free. But as soon as you have more than one person who needs access, local deployments fall apart for four key reasons I’ve experienced firsthand.

First, availability: My laptop goes to sleep when I’m in meetings, I turn it off when I go home, and I reboot it when I install OS updates. There were multiple weeks where the product team couldn’t get context for their user research because my laptop was off. Even if you leave a desktop running 24/7, your home internet goes down occasionally, and you’ll have to deal with dynamic IP changes that break the connection for your team.

Second, dependency drift: When I shared the repo with another engineer to test, they got a weird embedding error that I never saw. We spent two hours debugging just to find out they had a minor version difference in the OpenAI SDK that changed how embeddings were formatted. Local environments are unique to every machine, and even a small difference can break your MCP server in subtle ways that are hard to debug.

Third, no access control: Local servers don’t come with any built-in authentication. If you expose a local server to the internet to let your team access it, you’re basically leaving your front door open. Anyone can hit your endpoint, use your API keys, and access your internal data. I’ve heard multiple stories of teams accidentally exposing their internal MCP server to the public, leading to thousands of dollars in unexpected API bills from stolen keys.

Fourth, no scalability: When three people are hitting your local server at the same time, it’s fine, but when 20 people are querying it during a sprint planning meeting, your laptop’s network and memory get overwhelmed, and responses slow to a crawl. I once had my laptop freeze completely during a company-wide demo because 15 people all hit my local MCP server at the same time. That’s not a mistake I make twice.

The bottom line is: If your MCP server is used by more than just you, it needs to be deployed to a production environment that’s always available, consistent, and secure.

Deployment Options: Docker, Cloud Functions, VMs (Tradeoffs Included)

Once you decide you need a production deployment, you have three main options to choose from, each with clear tradeoffs I’ve tested firsthand. Let’s break them down.

Docker Containers

Docker is my default starting point for almost any MCP server. It packages your server, all its dependencies, and your environment config into a single image that runs exactly the same anywhere, from your local laptop to a cloud VM to a managed orchestration service like ECS or Kubernetes. The biggest advantages of Docker are consistency and portability: I build the image once, test it locally, and deploy it anywhere without worrying about dependency issues.

**Tradeoffs:** You still need somewhere to run the Docker container. If you’re running it on your own VM, you have to manage container restarts, updates, and networking. If you use a managed container service like Fargate or GCP Cloud Run, that overhead goes away, but you pay a small premium for that management. Still, for 90% of production MCP use cases, Docker is the sweet spot.

Serverless Cloud Functions

The big draw here is pay-per-use pricing: you only pay for the time your server is actually handling requests, so if you have low traffic, it can be almost free. It also auto-scales instantly, so you never have to worry about a spike of requests overwhelming your server.

**Tradeoffs:** There are two big gotchas that make it a bad fit for most MCP servers. First, cold starts: If your server hasn’t handled a request in a few minutes, the provider spins it down, and the next request has to wait for it to spin back up, which can add 2-5 seconds of latency. For a tool that your team is using to get quick context, that extra latency is incredibly annoying. Second, many MCP workflows use long-lived connections for streaming context updates as the LLM builds its response. Many serverless setups cut connections after a short time limit, which breaks streaming. I tested a serverless deployment of my MCP server for a month, and half of my team complained about constantly getting disconnected mid-response. I switched away pretty quickly.

That said, if you have a very low-traffic MCP server that only gets a few requests a day, and you don’t use streaming, serverless can work great and save you a lot of money.

Virtual Machines

VMs give you full control over your entire environment. You can install any dependency you want, adjust any kernel setting, and run as many MCP servers on one VM as your resource limit allows. For a small team running a couple of MCP servers, a single $12/month 2GB DigitalOcean Droplet can handle everything you need, which is way cheaper than a managed container service.

**Tradeoffs:** You have to do all the maintenance yourself: patch the OS, set up backups, configure firewalls, manage container restarts if your server crashes. If you’re comfortable with that, it’s a great low-cost option. If you don’t want to spend Friday afternoon debugging a failed OS update, you’re better off with a managed service.

For me, I’ve gone back and forth: when I was testing my first MCP server, I ran it on a cheap DigitalOcean VM, and it worked fine. Now that it’s a critical tool used by the whole company, I run it on managed Fargate to avoid the maintenance overhead.

---

Failure 1: Random Connection Timeouts (And The Fix That Works)

My first production deployment was on that $12 DigitalOcean VM I just mentioned. I got the Docker container running, exposed port 8080 directly to the internet, added a firewall rule allowing all incoming traffic, and shared it with the team. Within an hour, I started getting messages: half the time people try to get context, they get a connection timeout error. Sometimes it works, sometimes it doesn’t.

I couldn’t reproduce it when I tested it from my desk, because my requests were fast and finished in 10 seconds. The problem only happened when people left the connection open to stream context, which would take 60 seconds or more for large docs. I spent half a day debugging this, checking logs, changing network settings, until I finally used tcpdump to look at what was actually happening on the network.

What I found was that the DigitalOcean firewall was killing any connection that had been idle for longer than 60 seconds. Idle here just meant no data was being sent while the LLM processed the context, so the connection looked dead to the firewall. My MCP server also didn’t have TCP keep-alive enabled, so it never sent the small keep-alive packets that tell firewalls the connection is still active. So after 60 seconds of silence, the firewall dropped the connection, resulting in a timeout on the client side.

That’s a problem that almost no one talks about in MCP deployment guides, but it’s extremely common. Many cloud firewalls and load balancers drop idle connections after a default timeout, often somewhere around 60-90 seconds, and most default MCP server setups don’t enable keep-alive.

The fix I settled on has two parts: add a reverse proxy (I use Nginx) in front of the MCP server, and configure both Nginx and the MCP server to use longer timeouts and enable keep-alive. Here’s the exact Nginx config I use today for all my MCP servers; you can copy it directly:

```nginx
server {
    listen 80;
    server_name mcp.your-domain.com;
    client_max_body_size 10M;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;

        proxy_connect_timeout 60s;
        proxy_send_timeout 300s; # 5 minutes, enough for large context streams
        proxy_read_timeout 300s;

        # Clear the Connection header so it is not forwarded as "close"
        proxy_set_header Connection "";
        proxy_set_header Proxy-Connection "keep-alive";

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

Then, I also enable keep-alive on the Node.js/Express MCP server itself, with this runnable config:

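```javascript
// MCP server keep-alive config for Node.js/Express
const express = require('express');

const app = express();

const server = app.listen(8080, () => {
  console.log('MCP server running on port 8080');
});

// Enable TCP keep-alive on every incoming socket, with probes starting
// after 30 seconds of inactivity
server.on('connection', (socket) => socket.setKeepAlive(true, 30 * 1000));

// HTTP keep-alive: how long an idle connection stays open between requests
server.keepAliveTimeout = 30 * 1000;

// Maximum time to wait for a request's complete HTTP headers
server.headersTimeout = 60 * 1000;
```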

That’s all you need to fix 99% of random connection timeout issues for MCP servers. The tradeoff here is that longer timeouts mean more connections are held open at any given time, which uses a small amount of extra memory. For my setup, holding 100 open connections uses less than 20MB of extra memory, which is a tiny price to pay to eliminate random timeout errors. If you’re really tight on memory, you can drop the timeout to 2 minutes, which is still enough for almost all workflows.
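Since the root cause was a stream that went silent while the LLM was working, one complementary option is to make sure the stream never goes silent in the first place. This is an addition to the two-part fix above, not part of it. The sketch below assumes the server streams over Server-Sent Events (SSE), where lines starting with `:` are comments that clients ignore. The helper name `startHeartbeat` and the 15-second default are illustrative choices, not something from my actual setup.

```typescript
// Minimal shape of the response object this helper needs.
// Node's http.ServerResponse has a compatible write() method.
interface WritableLike {
  write(chunk: string): boolean;
}

// Writes an SSE comment immediately, then again every intervalMs, so that
// bytes keep flowing on an otherwise quiet stream.
// Returns a function that stops the heartbeat.
function startHeartbeat(res: WritableLike, intervalMs: number = 15000): () => void {
  const beat = (): void => {
    res.write(": keep-alive\n\n");
  };
  beat();
  const timer = setInterval(beat, intervalMs);
  return () => clearInterval(timer);
}
```

In a request handler you would call `startHeartbeat(res)` when the stream opens and call the returned function when the request closes, so the timer does not outlive the connection.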

---

Failure 2: Slow Memory Leak That Took Me Out (Debugging Process + Gotcha)

After I fixed the connection timeout issue, my MCP server was stable for a whole week. I was feeling pretty proud of myself, so I took a long weekend off to go hiking, left the server running on my DigitalOcean VM, and turned off my phone notifications. At 2am on Sunday, our head of engineering pinged me on the company Slack: the entire internal chatbot was down, and the error was "connection refused to MCP server". I had no cell service where I was hiking, so I didn’t see the message until I got back to town on Monday morning.

When I logged into the VM, I found the MCP server had been killed by the Linux OOM killer 24 hours earlier. It had run out of memory, crashed, and I didn’t have a process manager set up to restart it automatically. That was my second big failure, and it led me to find a memory leak that I never would have caught in short local tests.

The debugging process I went through is straightforward, and it works for any MCP server, regardless of what language you’re writing it in:

First, I enabled basic memory logging to confirm the leak. I added a simple cron job that logged memory usage of the MCP process every 10 minutes. What I saw was that memory usage grew steadily, 40-60MB per hour, and never dropped back down, even when there were no requests coming in. That’s a classic sign of a memory leak: memory isn’t being released after requests are done.

Next, I took a heap snapshot to see where the memory was going. For Node.js, I used clinic.js, a great open source tool for profiling Node.js apps. I ran the server under load for an hour, took a heap snapshot, and immediately saw the problem: 85% of the memory was being used by a global object I was using to cache embeddings.

I had added caching to the MCP server to avoid re-calculating embeddings for the same doc over and over again. Embeddings are expensive and slow, so caching makes total sense. But I implemented it the lazy way: I just used a global JavaScript object, and added every new embedding to it, with no eviction policy. So every time someone requested a doc that wasn’t already cached, it got added to the object, and it just grew forever, no matter how big it got. That’s such a simple mistake, but I see it all the time in small MCP servers. No one thinks to add an eviction policy to a cache when they’re building the first version.

Here’s what the bad (leaking) code looked like, and what the fixed code looks like, using the popular `lru-cache` package for Node.js/TypeScript:

```typescript
// Bad: Unbounded cache, no eviction = memory leak
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const embeddingCache: Record<string, number[]> = {};

export async function getEmbedding(docId: string, text: string): Promise<number[]> {
  // Return cached embedding if it exists
  if (embeddingCache[docId]) return embeddingCache[docId];

  // Otherwise fetch from OpenAI
  const embedding = await openai.embeddings.create({ input: text, model: "text-embedding-3-small" });
  const vector = embedding.data[0].embedding;

  // Cache forever
  embeddingCache[docId] = vector;
  return vector;
}
```

```typescript
// Fixed: Bounded LRU cache, automatic eviction of least recently used items
import OpenAI from 'openai';
// Named export; recent lru-cache releases no longer ship a default export
import { LRUCache } from 'lru-cache';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const embeddingCache = new LRUCache<string, number[]>({
  maxSize: 200 * 1024 * 1024, // 200MB max cache size
  sizeCalculation: (embedding) => embedding.length * 8, // Each number = 8 bytes
});

export async function getEmbedding(docId: string, text: string): Promise<number[]> {
  // Return cached embedding if it exists (this also marks it as recently used)
  const cached = embeddingCache.get(docId);
  if (cached) return cached;

  // Otherwise fetch from OpenAI
  const embedding = await openai.embeddings.create({ input: text, model: "text-embedding-3-small" });
  const vector = embedding.data[0].embedding;

  // Cache, automatic eviction when we hit max size
  embeddingCache.set(docId, vector);
  return vector;
}
```

That’s all it took to fix the memory leak. Now when the cache hits 200MB, it automatically evicts the least recently used embeddings to make room for new ones, so memory usage stays stable.

**Personal Gotcha:** The lesson from that long weekend is that you have to test memory usage over time. I only tested my server with 50 requests locally, which filled the cache with 50 embeddings. That’s nothing. I never let it run for a few days under real traffic to see what would happen. The leak was slow, so it took 2 days to use up all 2GB of memory on the VM. If I had done a 24-hour test before deploying, I would have caught it before I left for the weekend.

The tradeoff here is between cache size and memory usage. If you set the max cache size too small, you get more cache misses, which means slower responses and more costs for your upstream embedding API. If you set it too big, you still risk OOM kills. I tested different sizes for my use case: 200MB gives me an 88% cache hit ratio, which is good enough for my needs, and it leaves plenty of extra memory for the server and other processes. I’d rather have a slightly lower hit ratio than risk another OOM crash.
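To get a feel for what a given cache size buys you, it helps to work out how many embeddings actually fit. The sketch below uses the same 8-bytes-per-number estimate as the `sizeCalculation` above and assumes 1536-dimensional vectors, which is the default output size of `text-embedding-3-small`. The function name is illustrative, and real memory use will be somewhat higher because keys and array overhead are not counted.

```typescript
// How many embedding vectors fit in a size-bounded cache, counting only
// the numbers themselves at bytesPerNumber each.
function embeddingsThatFit(
  maxCacheBytes: number,
  dimensions: number,
  bytesPerNumber: number = 8,
): number {
  return Math.floor(maxCacheBytes / (dimensions * bytesPerNumber));
}

const cacheBytes = 200 * 1024 * 1024; // 200MB, the limit used in the text
const embeddingDimensions = 1536;     // assumed vector length

console.log(embeddingsThatFit(cacheBytes, embeddingDimensions)); // 17066
```

So a 200MB limit holds roughly 17,000 documents' worth of embeddings under these assumptions, which is a useful number to compare against the size of your doc set when picking a limit.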

After fixing the leak, I also added a process manager like PM2 to restart the server automatically if it does crash, and set up an alert for high memory usage, so I get paged long before it runs out of memory. That’s a small extra step that saves a lot of headache.
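The memory logging from the debugging steps and the high-memory alert can be sketched together. I used a cron job for the logging; the version below is an in-process alternative, and every name in it (`MemorySample`, `growthMbPerHour`, `isAboveThreshold`, `takeSample`) is illustrative rather than taken from my server. The 80% default matches the alert threshold I use.

```typescript
interface MemorySample {
  timestampMs: number;
  rssBytes: number;
}

const MB = 1024 * 1024;

// Average growth in MB per hour between the first and last sample.
function growthMbPerHour(samples: MemorySample[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const hours = (last.timestampMs - first.timestampMs) / 3600000;
  if (hours <= 0) return 0;
  return (last.rssBytes - first.rssBytes) / MB / hours;
}

// True when usage is above the given fraction of the memory limit.
function isAboveThreshold(rssBytes: number, limitBytes: number, fraction: number = 0.8): boolean {
  return rssBytes > limitBytes * fraction;
}

// Samples the resident set size of the current Node.js process.
function takeSample(): MemorySample {
  return { timestampMs: Date.now(), rssBytes: process.memoryUsage().rss };
}

// Two samples, two hours apart, 100MB of growth
const exampleSamples: MemorySample[] = [
  { timestampMs: 0, rssBytes: 300 * MB },
  { timestampMs: 2 * 3600000, rssBytes: 400 * MB },
];
console.log(growthMbPerHour(exampleSamples)); // 50
```

In a running server you would call `takeSample()` on a timer (every 10 minutes in my case), keep the last few hours of samples, and alert when the growth rate stays positive or `isAboveThreshold` returns true.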

---

Failure 3: Mass Rate Limiting Outage During Peak Traffic

The third failure I hit happened about a month after I fixed the memory leak. We had a big sprint planning meeting, and all 25 of our product and engineering team were using the chatbot to pull context about old user stories and product specs at the same time. Within 10 minutes, almost every request started failing with 429 Too Many Requests errors. I looked at the logs, and we had hit rate limits on three different upstream APIs: OpenAI embeddings, Notion (where we store our product docs), and our internal product API. I had completely forgotten to account for rate limits when I built the server. I just let every request go directly to the upstream API immediately, no queueing, no throttling, nothing. When we had a spike of 25 concurrent requests, we blew through the default rate limits in minutes.

The fix here has four parts, and I’ve tweaked it over time to work well for my use case:

• **Add caching:** Caching reduces the number of requests you send to upstream APIs by 80-90% in most cases, which eliminates most rate limit issues before they start. This is a side benefit of the LRU cache I added for the memory leak fix.
• **Add concurrency limiting and request queuing:** I use the `p-queue` package for Node.js to limit the number of concurrent requests I send to any upstream API. If you have 25 concurrent requests, but you limit concurrency to 5, the other 20 queue up and wait for a slot, instead of all hitting the API at once and getting rate limited.
• **Add retries with exponential backoff for 429 errors:** Even with concurrency limiting, you’ll get occasional 429s. Retrying with backoff lets you automatically recover from transient rate limit errors without the user having to refresh.
• **Use multiple API keys for high throughput:** Where an API provider applies its rate limit per API key, splitting requests across multiple keys can double or triple your total rate limit for no extra cost. Check how your provider counts first: some enforce limits per account or organization, in which case extra keys don’t help.
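The queuing behavior in the second item can be illustrated with a small hand-rolled limiter. This is a stand-in written for the example, not `p-queue` itself and not my production code; with `p-queue` the equivalent is `new PQueue({ concurrency: 5 })` and `queue.add(() => ...)`. The class name `ConcurrencyLimiter` and its members are hypothetical.

```typescript
// Illustrative limiter: at most `limit` tasks run at once, the rest wait in FIFO order.
class ConcurrencyLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  get activeCount(): number {
    return this.active;
  }

  get queuedCount(): number {
    return this.waiting.length;
  }

  // Claims a slot now if one is free, otherwise queues the task until one opens.
  run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = (): void => {
        this.active++;
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .finally(() => {
            // Free the slot and hand it to the next waiting task, if any
            this.active--;
            const next = this.waiting.shift();
            if (next) next();
          });
      };
      if (this.active < this.limit) start();
      else this.waiting.push(start);
    });
  }
}

// 25 simultaneous requests with a limit of 5: 5 hold slots, 20 wait.
const limiter = new ConcurrencyLimiter(5);
for (let i = 0; i < 25; i++) {
  // A promise that never settles stands in for a slow upstream call
  void limiter.run(() => new Promise<void>(() => {}));
}
console.log(limiter.activeCount, limiter.queuedCount); // 5 20
```

One limiter per upstream API keeps a slow provider from eating the slots meant for another.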

**Tradeoffs:** Queuing adds a small amount of latency to requests during peak traffic. If you have 20 requests queued up, the last request has to wait a few seconds for its turn. But that’s way better than all 20 requests failing outright. I’ve tested with different concurrency limits: for OpenAI, I limit to 10 concurrent requests, which keeps me under the rate limit for the free tier, and the maximum wait time for a queued request is around 3 seconds, which most users don’t even notice. Another tradeoff: multiple API keys add a little management overhead, because you have to store multiple keys, rotate them when they expire, and so on. But most of that work is one-time setup, and it’s worth it for avoiding rate limits during peak traffic.
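The retry step from the list above can be sketched as follows. The delays (500ms base, doubling, capped at 8 seconds), the retry count, and the assumption that upstream errors carry an HTTP `status` field are all illustrative choices; check what your HTTP client or SDK actually throws. The function names are hypothetical.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Assumed error shape: upstream failures expose the HTTP status code.
interface HttpError {
  status?: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls fn, retrying only on 429 responses, up to maxRetries times.
// Any other error, or a 429 after the last retry, is rethrown.
async function retryOn429<T>(fn: () => Promise<T>, maxRetries: number = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as HttpError | undefined)?.status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}

// Delay schedule for the first six attempts, in milliseconds:
// 500, 1000, 2000, 4000, 8000, 8000
console.log([0, 1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt)));
```

Wrapping each queued upstream call in `retryOn429` combines both protections: the queue keeps the steady-state rate down, and the backoff absorbs the occasional 429 that still gets through. Adding a small random jitter to each delay is a common refinement so that many clients do not retry in lockstep.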

Since I implemented these changes, we haven’t had a mass rate limit outage in 10 months, even during our busiest sprint planning meetings.

---

Monitoring: What Metrics Actually Matter (No Fluff)

After three failures, I learned that you don’t need a ton of fancy monitoring for a production MCP server, but you do need to track the right metrics. Most MCP servers are I/O bound, not CPU bound, so a lot of the generic server metrics you get by default aren’t that useful. Here’s the short list of metrics I track, and nothing more:

• **95th percentile request latency per endpoint:** The most important thing for your users is how fast they get their context. The average latency hides slow requests, so I track the 95th percentile. If it goes over 2 seconds, I get an alert.
• **Error rate:** I track the percentage of requests that return 4xx or 5xx errors. I set an alert if error rate goes over 5% for 5 minutes, which catches outages like rate limiting or connection timeouts before most users even report it.
• **Memory usage (as % of limit):** After the memory leak incident, I always track this. I set an alert if it goes over 80% for 10 minutes, which gives me plenty of time to fix a leak before it causes an OOM kill.
• **Cache hit ratio:** This tells you if your cache is sized correctly. If it drops below 70%, your cache is too small, and you’re getting too many expensive cache misses. If it’s over 90%, you can probably shrink your cache to free up memory.
• **Open connection count:** This helps you catch connection leaks or misconfigured timeouts. If open connection count grows steadily over time, connections aren’t being closed properly, which will eventually lead to outages.
• **Upstream API rate limit remaining:** Most APIs return your remaining rate limit in response headers. I track this, and set an alert if it drops below 10% before the reset time. That lets you adjust your concurrency limit or add another API key before you hit the limit.
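To see why the first two metrics are worth computing yourself, here is a minimal sketch of a nearest-rank percentile and an error rate over a window of requests. The function names and the sample numbers are made up for illustration; in practice a metrics library or your monitoring backend will do this for you.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Fraction of responses with a 4xx or 5xx status code.
function errorRate(statusCodes: number[]): number {
  if (statusCodes.length === 0) return 0;
  const errors = statusCodes.filter((status) => status >= 400).length;
  return errors / statusCodes.length;
}

// 90 fast requests (200ms) and 10 slow ones (3000ms)
const latenciesMs: number[] = [
  ...new Array<number>(90).fill(200),
  ...new Array<number>(10).fill(3000),
];
const meanMs = latenciesMs.reduce((sum, v) => sum + v, 0) / latenciesMs.length;

console.log(meanMs);                      // 480
console.log(percentile(latenciesMs, 95)); // 3000
console.log(errorRate([200, 200, 429, 500])); // 0.5
```

With these numbers the average sits comfortably under a 2-second alert threshold while one request in ten takes 3 seconds, which is exactly the case the 95th percentile catches.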

I don’t track CPU usage at all, unless I see latency going up for no reason. For 99% of MCP servers, CPU usage stays under 10% even under load, so it’s just noise. The tradeoff here is that adding more metrics adds more overhead, and more false positive alerts that you have to ignore. I used to track 15 different metrics, and I got so many irrelevant alerts that I started ignoring them. Cutting it down to these 6 metrics eliminated all the noise, and I only get alerts when there’s actually a problem I need to fix.

---

My Current Production Setup + Docker Compose Config

After all these failures, I’ve settled on a production setup that’s been stable for 12 months, serving 25 team members, with less than 30 minutes of total downtime in that entire period. Here’s what it looks like:

• I run two MCP servers (one for internal product docs, one for customer support context) as Docker containers on AWS ECS Fargate.
• An Application Load Balancer sits in front and handles SSL termination and routing.
• Each container has 1 vCPU and 2GB of RAM, with an Nginx reverse proxy using the config I shared earlier.
• A managed RDS Postgres database stores all doc embeddings and metadata.
• LRU caching is enabled in each MCP server for frequent embeddings.
• Access control: the ALB only allows traffic from our company VPC, and the MCP server requires API key auth for all requests.
• Auto-scaling adds more containers when connection count goes over a threshold, so we can handle 10x normal load during peak times.
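The API key check mentioned in the access control item can be as small as the sketch below. It is a minimal illustration, not my production middleware: the `Authorization: Bearer <key>` header format and both function names are assumptions, and the expected key would come from an environment variable or secret store.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Extracts the key from an "Authorization: Bearer <key>" header value.
function bearerToken(header: string | undefined): string | undefined {
  const match = /^Bearer (.+)$/.exec(header ?? "");
  return match ? match[1] : undefined;
}

// Compares the presented key with the expected one in constant time.
// Hashing first gives both buffers the same length, which timingSafeEqual requires.
function isValidApiKey(presented: string | undefined, expected: string): boolean {
  if (!presented) return false;
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

console.log(isValidApiKey(bearerToken("Bearer secret-123"), "secret-123")); // true
console.log(isValidApiKey(bearerToken("Bearer wrong"), "secret-123"));      // false
console.log(isValidApiKey(bearerToken(undefined), "secret-123"));           // false
```

In an Express app this would sit in a middleware that reads `req.headers.authorization` and responds with 401 before any MCP handler runs when the check fails.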

For teams that want to run on a single VM instead of a managed container service, here’s the Docker Compose config I use for local testing and small VM deployments:

```yaml
# Written for Docker Compose v2 (Compose Specification)
services:
  mcp-server:
    build: .
    restart: always
    ports:
      - "127.0.0.1:8080:8080"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - NOTION_API_KEY=${NOTION_API_KEY}
      - MAX_CACHE_SIZE_MB=200
    depends_on:
      - postgres
    mem_limit: 1536m # 1.5GB

  nginx:
    image: nginx:alpine
    restart: always
    ports:
      - "80:80"
    volumes:
      # Inside this container, proxy_pass must point at http://mcp-server:8080
      # (the service name), because localhost here is the nginx container itself
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    depends_on:
      - mcp-server
    mem_limit: 100m

  postgres:
    image: postgres:15-alpine
    restart: always
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_USER=mcp
      - POSTGRES_DB=mcp
    mem_limit: 400m

volumes:
  postgres-data:
```

This config sets up automatic restart on crash, hard memory limits to prevent one service from starving the others, and all the dependencies you need to run a production MCP server on a single VM.

**Tradeoff:** My Fargate setup is about 3x more expensive than running everything on a single cheap VM, but it’s worth it for me. Fargate handles all maintenance, patching, backups, and high availability, so I don’t have to spend time keeping it running. If you’re running a non-critical MCP server for a small team, the Docker Compose on a cheap VM setup is totally fine and saves you a lot of money.


Last updated: April 5, 2026