Scaling MCP for Teams: Architecture Lessons
I started working with Model Control Plane (MCP) back in 2022 at a fintech, when it was just me and one other ML engineer building internal LLM tools for our business teams. We started with the standard single-user setup: I spun up a single g5.xlarge instance on AWS, checked out our MCP repo locally, added my IAM credentials to a .env file, and we were off. For 6 months, it worked perfectly. We tested prompts, fine-tuned our first custom model, and proved the business case for expanding the tool to the whole data science org.
Then, in the span of 8 weeks, we went from 2 users to 22: 12 data scientists, 6 product managers, 3 ML engineers, and a security auditor. That’s when our simple single-user setup broke in every way imaginable. Over the next 6 months, we rearchitected our MCP deployment to support 20+ active users across 5 different business teams, and learned more than a few hard lessons about what it takes to run MCP at enterprise scale. This post breaks down what worked, what didn’t, and the practical tradeoffs you’ll have to navigate when you outgrow your initial single-user setup.
Why Single-User MCP Setups Don’t Scale For Teams
When most teams start with MCP, they default to a single-user architecture: every user connects directly to one shared instance, with no abstraction layer between the user and the underlying infrastructure. Picture a simple hub-and-spoke: a single cloud instance (or local machine) at the center holds all your model weights, prompt versions, and runtime, and each user connects directly to that hub over SSH or a basic web interface. There’s no per-user auth, no isolation, and no orchestration, just a working MCP installation that you can spin up in an afternoon.
For 1-5 users, this setup is unbeatable. It has minimal operational overhead, you don’t have to worry about complex networking or access rules, and you can iterate quickly. But once you cross 10 users, every part of this architecture starts to crack. We hit five major pain points in our first month of scaling:
- **Dependency and version chaos**: Each of our users was installing the MCP SDK locally or on their own personal instances, and we had three different major versions of the SDK in use within two weeks. When one user saved a new prompt version with the v1.2 schema, another user running v0.8 couldn’t parse it—leading to broken workflows and hours of debugging version mismatches. We also had different CUDA versions, different Python dependencies, and no consistent way to share working configurations between users.
- **Uncontrolled resource contention**: Our single shared GPU instance had 24GB of VRAM, which was enough for two of us testing small prompts. When 12 users started running fine-tuning jobs and batch inference at the same time, we got constant out-of-memory errors and 10+ minute latency for any request. One user would kick off a 4-hour fine-tuning job, and everyone else got locked out until it finished. We tried adding a second instance, but that just created two silos of work with no way to share resources between them.
- **No reproducibility or collaboration**: In our single-user setup, users saved model weights and prompts to whatever S3 bucket or local folder they had access to. When the lead data scientist for our credit risk team went on vacation, no one else could find the latest version of his fine-tuned model that was being tested for production. We also had no way to review prompt changes before they were used for business workloads, leading to multiple cases where outdated, low-accuracy prompts were used to generate business reports.
- **Security and access gaps**: We were sharing the same IAM role for all users, which meant any user had full access to all models, prompts, and cloud resources. When we hired our first summer intern, we realized there was no way to give him read-only access to test prompts without giving him full access to all our confidential production assets.
- **No visibility into cost or performance**: No one was tracking how much each user or team was spending on GPU resources, and our monthly cloud bill jumped from $800 to $11,000 in two months with no explanation. We couldn’t tell which workloads were driving cost, or whether our GPU resources were being used efficiently.
None of these problems are impossible to fix, but they can’t be fixed by just adding more GPUs to your single shared instance. You need a fundamentally different architecture to support team-scale MCP.
Shared Server Architecture: Our Production Design
After a week of brainstorming and testing different patterns, we landed on a four-layer shared MCP architecture that’s scaled reliably for 20+ users for over a year now. From the user edge down to the data layer, it looks like this:
```
[User Devices (SSO-authenticated)]
↓
[Edge Layer: Routing + Auth Middleware]
↓
[Control Plane: Stateless Orchestration, Prompt Registry, Model Registry]
↓
[Autoscaled Worker Pool: Isolated GPU/CPU Workers, per-workload resource limits]
↓
[Data Layer: Encrypted Object Store (model/prompt data) + Relational Metadata DB (access, versioning)]
```
This design separates concerns so you can scale each layer independently, which is the biggest improvement over the single-user hub-and-spoke. We chose a distributed control plane/worker architecture over a monolithic shared server, after testing both options. The monolithic approach puts control plane and worker workloads on the same server, which is faster to stand up, but it has two big downsides: it’s a single point of failure, and you can’t scale worker resources independently of the control plane. If you need more GPU for a batch of fine-tuning jobs, you have to scale the whole monolith, which wastes money on extra CPU you don’t need for the control plane.
Our control plane is completely stateless, runs on 2 cheap t3.large CPU instances behind a load balancer, and handles all the non-GPU work: user authentication, prompt versioning, model registry management, workload scheduling, and access control. All the GPU-intensive work (fine-tuning, batch inference, real-time inference testing) is offloaded to workers in a separate worker pool. Workers are spun up on demand when there’s work to do, and terminated when they’re idle.
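To handle autoscaling of the worker pool, we run everything on Kubernetes and use the Horizontal Pod Autoscaler to scale workers based on the queue length of pending workloads. This is the HPA config we use. It should give you the same behavior on any cluster that exposes a per-pod queue-length metric through the custom metrics API (for example via an adapter such as Prometheus Adapter); `k8s_pod_queue_length` is our metric name, so substitute your own: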
```yaml
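apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-gpu-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-gpu-worker
  minReplicas: 2    # always keep warm workers for sudden spikes
  maxReplicas: 15   # hard cap on worker count, and therefore on cost
  metrics:
    - type: Pods
      pods:
        metric:
          name: k8s_pod_queue_length   # per-pod pending-workload queue depth
        target:
          type: AverageValue
          averageValue: 5              # scale out above 5 queued jobs per worker on average
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down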
```
The tradeoffs here are intentional: minReplicas is 2 so there are always warm workers for sudden traffic spikes, maxReplicas is 15 to cap our maximum monthly cost, and the 5-minute scale-down stabilization window avoids thrashing (creating and terminating pods too frequently). We tested scaling on GPU utilization first, but queue length turned out to be a much better metric for MCP workloads. Fine-tuning jobs can run for hours at steadily high GPU utilization, so utilization stays pegged whether or not anyone is waiting. Queue length gives you an earlier signal that you need more workers to keep wait times low.
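To make the queue-length target concrete, here is a rough sketch of the replica calculation the Kubernetes documentation describes for the HPA, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min and max from the config above. The function name and the sample queue depths are made up for illustration, and the real controller also applies a tolerance band and the stabilization window, both of which this ignores.

```python
import math


def desired_replicas(current_replicas: int, avg_queue_per_pod: float,
                     target_avg: float = 5.0,
                     min_replicas: int = 2, max_replicas: int = 15) -> int:
    """Approximate the HPA result for a Pods metric with an AverageValue target."""
    raw = math.ceil(current_replicas * avg_queue_per_pod / target_avg)
    # Clamp to the minReplicas/maxReplicas bounds from the HPA spec
    return max(min_replicas, min(max_replicas, raw))


# 4 workers averaging 12 queued jobs each: ceil(4 * 12 / 5) = 10 workers
print(desired_replicas(4, 12.0))   # 10
# Empty queue: never drops below minReplicas
print(desired_replicas(2, 0.0))    # 2
# Large backlog: capped at maxReplicas
print(desired_replicas(10, 40.0))  # 15
```

The last case is the cost cap at work: the backlog would justify 80 workers, but the pool stops at 15 and the queue absorbs the rest.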
All requests flow through our edge layer, which handles routing and authenticates users before they hit the control plane. We added tenant isolation middleware to ensure every request is tagged with the user’s team and organization ID, so all downstream operations can enforce access control. This is the runnable FastAPI middleware we use for this:
```python
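from typing import Callable

import jwt  # PyJWT; RS256 verification also needs the `cryptography` package
from fastapi import Request
from fastapi.responses import JSONResponse

JWT_PUBLIC_KEY = "-----BEGIN PUBLIC KEY-----\nYOUR_ORG_PUBLIC_KEY_HERE\n-----END PUBLIC KEY-----\n"


# Register with: app.middleware("http")(tenant_isolation_middleware)
async def tenant_isolation_middleware(request: Request, call_next: Callable):
    # Return responses directly: an HTTPException raised inside HTTP middleware
    # bypasses FastAPI's exception handlers and surfaces as a 500.
    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        return JSONResponse(status_code=401, content={"detail": "Missing valid token"})
    token = auth_header.split(" ", 1)[1]
    try:
        # Pin the algorithm so the token cannot choose a weaker one
        payload = jwt.decode(token, JWT_PUBLIC_KEY, algorithms=["RS256"])
    except jwt.InvalidTokenError:
        return JSONResponse(status_code=403, content={"detail": "Invalid or expired token"})
    tenant_id = payload.get("org_id")
    if tenant_id is None:
        # A token without a tenant claim must not reach tenant-scoped handlers
        return JSONResponse(status_code=403, content={"detail": "Token has no org_id claim"})
    # Downstream handlers read tenant context from request.state, never from user input
    request.state.tenant_id = tenant_id
    request.state.user_id = payload.get("user_id")
    return await call_next(request)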
```
This middleware pulls user and tenant context directly from the SSO-issued JWT, so we never have to trust user-supplied input for access control. Which brings me to access control, and the biggest gotcha I hit during this entire project.
Access Control Patterns: Lessons From A 2AM Outage
Access control is the most overlooked part of scaling MCP for teams, and it’s where we made our most expensive mistake. When we first rearchitected, we knew we needed to add multi-tenant access control, so we added tenant ID filtering to our create and update endpoints. We tested it: created a test prompt as user A, tried to update it as user B, got a 403, and called it done. That was a huge mistake.
Two weeks after we launched the new architecture, I got a 2 a.m. Slack message from our head of risk: a junior analyst on the retail product team had been searching for prompt templates in the MCP dashboard, and got back a full list of confidential prompts from the credit risk team, including prompts built on customer PII and financial data. I jumped out of bed, pulled up the code, and found the problem immediately: we’d added tenant ID filtering to write operations, but forgotten to add it to our list and get endpoints for prompts and models. The middleware was correctly injecting the tenant ID into the request state, but we never added the `WHERE tenant_id = $1` clause to the metadata queries for reading existing resources.
That was a 2-hour outage while we revoked all user access, patched the code, and ran a full audit to confirm no other data was exposed. It was embarrassing, it took a full week of follow-up with the security team, and it taught me a hard lesson: access control isn’t something you bolt on to a few endpoints—it has to be baked into every read and write operation at the data layer, not just the API layer.
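One way to bake the filter into the data layer is to make it impossible to call a query without a tenant. The sketch below is not our production code: `PromptStore`, the `prompts` table, and the team names are hypothetical, and it uses the standard-library `sqlite3` module (with `?` placeholders rather than the `$1` style above) so that it runs anywhere.

```python
import sqlite3
from typing import Optional


class PromptStore:
    """Hypothetical data-access class: every method requires tenant_id,
    so no read or write can skip the tenant filter."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS prompts ("
            "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, name TEXT NOT NULL)"
        )

    def create(self, tenant_id: str, name: str) -> int:
        cur = self.conn.execute(
            "INSERT INTO prompts (tenant_id, name) VALUES (?, ?)", (tenant_id, name)
        )
        return cur.lastrowid

    def list_prompts(self, tenant_id: str) -> list:
        # The list query carries the same tenant filter as the writes
        rows = self.conn.execute(
            "SELECT name FROM prompts WHERE tenant_id = ? ORDER BY id", (tenant_id,)
        )
        return [row[0] for row in rows]

    def get(self, tenant_id: str, prompt_id: int) -> Optional[str]:
        # Filter on both id and tenant, so a guessed id from another tenant finds nothing
        row = self.conn.execute(
            "SELECT name FROM prompts WHERE id = ? AND tenant_id = ?",
            (prompt_id, tenant_id),
        ).fetchone()
        return row[0] if row else None


store = PromptStore(sqlite3.connect(":memory:"))
risk_id = store.create("credit-risk", "pd-model-explainer")
store.create("retail", "faq-summarizer")
print(store.list_prompts("retail"))  # ['faq-summarizer']
print(store.get("retail", risk_id))  # None
```

Because the tenant is a required argument rather than something each endpoint remembers to add, a new list or get endpoint cannot be written without it.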
After that incident, we settled on two access control patterns that work for our 20+ user org, with clear tradeoffs:
- **Hybrid RBAC-ABAC for permissions**: For base permissions (who can create fine-tuning jobs, who can deploy to production), RBAC is simple and easy to manage. We have four core roles: Viewer (read-only access to approved resources), Contributor (can create and test workloads in non-production), Owner (can edit team resources), and Admin (full access). For cross-team resource access, we add ABAC with attributes for team, model sensitivity, and environment. A common rule we use is: *"A user can access a model only if their team matches the model’s team attribute, or the model is marked as organization-shared"*. This gives us fine-grained control without the overhead of managing hundreds of individual roles. The tradeoff is that ABAC adds a small amount of complexity to your authorization logic, but it’s well worth it once you have more than two teams using the platform.
- **Hybrid multi-tenancy for infrastructure**: For non-production workloads (which make up 80% of what our 20+ users do daily), we use soft multi-tenancy: shared infrastructure, with row-level isolation in the metadata database. This is much cheaper than running separate infrastructure per team, and it’s secure enough for non-production as long as your access control is correctly implemented at the data layer. For production workloads that handle PII or confidential business data, we use hard multi-tenancy: separate worker nodes and separate databases for each business unit’s production workloads. This adds some operational overhead, but it meets our security requirements and reduces risk of cross-tenant data exposure. The tradeoff is that hard multi-tenancy costs about 2x as much as soft multi-tenancy for production, but it’s a small price to pay for compliance with our data security policies.
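The hybrid RBAC-ABAC check can be sketched as a single function: the role decides which actions are allowed at all, and the attributes decide which resources they apply to. Everything here is illustrative. The role-to-action mapping, the class names, and the choice to make organization-shared models read-only across teams are assumptions, not a description of our actual policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    team: str
    role: str  # "viewer", "contributor", "owner", or "admin"


@dataclass(frozen=True)
class Model:
    name: str
    team: str
    org_shared: bool = False


# RBAC layer: actions each role may perform (illustrative mapping)
ROLE_ACTIONS = {
    "viewer": {"read"},
    "contributor": {"read", "create"},
    "owner": {"read", "create", "edit"},
    "admin": {"read", "create", "edit", "delete"},
}


def can_access(user: User, model: Model, action: str) -> bool:
    # RBAC: the role must allow the action in the first place
    if action not in ROLE_ACTIONS.get(user.role, set()):
        return False
    # ABAC: the user's team must match the model's team attribute...
    if user.team == model.team:
        return True
    # ...or the model is organization-shared (assumed read-only across teams)
    return model.org_shared and action == "read"


analyst = User("u1", "retail", "contributor")
risk_model = Model("pd-v3", "credit-risk")
shared_model = Model("embeddings-base", "platform", org_shared=True)

print(can_access(analyst, risk_model, "read"))      # False
print(can_access(analyst, shared_model, "read"))    # True
print(can_access(analyst, shared_model, "create"))  # False
```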
Monitoring and Logging For Multi-User MCP
When you have a single user, debugging a broken workload is easy: you just check the local logs. When you have 20+ users running dozens of concurrent workloads on ephemeral workers, you can’t debug without centralized monitoring and logging. We learned this the hard way when we spent 4 hours tracking down a broken fine-tuning job that failed because a user uploaded a corrupted dataset, and we had no logs of the error because the ephemeral worker got terminated immediately after the failure.
We ended up implementing three core monitoring and logging practices that work for us:
- **End-to-end distributed tracing**: Every request that comes into the edge layer gets a unique trace ID, which is propagated all the way through the control plane to the worker and back. This lets us see exactly where a request failed: whether it was a timeout at the edge, an authorization error in the control plane, or an out-of-memory error on the worker. We use OpenTelemetry to collect traces and send them to our existing observability stack, so we don’t have to manage any new tools. The tradeoff is that adding tracing adds about 2-5ms of latency per request, which is negligible for almost all MCP workloads, and the debug time savings more than make up for it.
- **Centralized PII-redacted logging**: We collect all logs from the control plane and workers into a centralized logging service, but we automatically redact any PII from prompts and model inputs before they’re saved to logs. This meets our compliance requirements, and still gives us enough context to debug errors. We don’t redact error messages or infrastructure logs, just user-provided prompt content. The tradeoff is that redaction adds a small amount of processing overhead, but it eliminates the risk of storing PII in logs, which is a non-negotiable for our enterprise compliance.
- **Two tiers of alerting**: We have infrastructure alerts (for out of memory, worker failures, control plane downtime) and business alerts (for anomalous cost spikes, unexpected drops in model throughput, increases in error rate). We used to only have infrastructure alerts, until we had a fine-tuning job left running over a 3-day holiday weekend, racking up $1,200 in unnecessary GPU costs. Now we have an alert that triggers if any 24-hour period has cost 2x higher than the 30-day average for that team, which lets us catch idle or runaway jobs early.
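The cost alert rule is simple enough to sketch in a few lines. This assumes you already have a per-team series of daily costs from your billing export; the function name and the sample figures are hypothetical.

```python
from statistics import mean


def cost_spike(daily_costs: list, latest_cost: float, factor: float = 2.0) -> bool:
    """Return True when the latest 24-hour cost exceeds `factor` times
    the average of the trailing 30 days for the same team."""
    window = daily_costs[-30:]
    if not window:
        return False  # no baseline yet, so nothing to compare against
    return latest_cost > factor * mean(window)


history = [100.0] * 30  # a team that normally spends about $100 a day
print(cost_spike(history, 250.0))  # True
print(cost_spike(history, 150.0))  # False
```

A ratio against the team's own baseline keeps the alert meaningful for both small and large teams, where a fixed dollar threshold would not.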
Cost Management At Scale
Cost is one of the biggest surprises when you scale MCP from 2 users to 20 users. Our monthly cloud bill went from $800 when we had 2 users to $14,000 after we hit 22 users, and 40% of that cost was going to idle resources and unallocated workloads that no one was using. After we implemented these four cost management practices, we cut our monthly bill to $9,000 without impacting user experience:
- **Mandatory workload tagging**: Every workload (fine-tuning, batch inference, notebook, deployment) has to have a user ID, team ID, and environment tag before it can be scheduled. We enforce this at the API level: any workload without valid tags is rejected immediately. This lets us allocate 100% of our MCP costs back to individual teams, which gives teams incentive to clean up their idle resources, because their budget is on the hook. The tradeoff is that mandatory tagging adds a small amount of friction for users, who have to fill out an extra field when they start a workload. But we found that after the first month, users got used to it, and the cost transparency more than makes up for the small friction.
- **Tiered instance pricing by environment**: We use 100% spot instances for all non-production workloads. Spot instances are 60-70% cheaper than on-demand instances, and we added automatic checkpointing every 5 minutes for all fine-tuning and batch jobs, so if a spot instance gets terminated, the job automatically resumes from the last checkpoint on a new instance. For our steady-state production inference workloads, we use a mix of 3-year reserved instances (for 40% savings over on-demand) and on-demand instances for spiky traffic. This balance gives us the best of both worlds: low cost for non-production, and reliability for production. The tradeoff is that spot instances can be terminated, so you have to build in checkpointing and job resumption, which takes a little bit of engineering work. But for 99% of non-production workloads, a 5-minute delay from a termination is totally acceptable, and the cost savings are massive.
- **Automatic idle resource cleanup**: We run a cron job every 2 hours that scans for idle resources: workers that haven’t processed a request in more than 2 hours, notebook instances that are running but no one has logged in for 12 hours, and stopped workloads that haven’t been cleaned up. It automatically terminates any idle resources that meet the criteria, and sends a notification to the user who created the workload. This alone cut our monthly cost by 28% when we implemented it. The tradeoff is that we’ve had a couple of cases where a long-running test job was incorrectly marked as idle and terminated, but we added an exception flag that users can set to keep a resource running, so that solves the problem.
- **Resource quotas per team**: We set soft and hard GPU quotas for each team, based on their budget. If a team hits 80% of their monthly quota, we send a warning. If they hit 100%, we block new workloads from being scheduled until they request a quota increase. This prevents any single team from accidentally burning through the entire monthly MCP budget with a few runaway jobs. The tradeoff is that quotas can block user work if a team forgets to request an increase for a large project, but we’ve found that the warning system gives teams enough time to adjust, and it’s a small inconvenience compared to the risk of massive overspending.
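Mandatory tagging and quotas both come down to an admission check that runs before a workload is scheduled. The sketch below shows one possible shape for it. The names are hypothetical, and it assumes quotas are tracked in GPU-hours per month, which is only one way to express a budget.

```python
from dataclasses import dataclass

REQUIRED_TAGS = ("user_id", "team_id", "environment")


@dataclass
class TeamQuota:
    monthly_gpu_hours: float  # assumed unit for the team's quota
    used_gpu_hours: float


def admit(tags: dict, quota: TeamQuota) -> tuple:
    """Return (accepted, message) for a workload submission."""
    # Reject anything that is missing a required tag
    missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
    if missing:
        return False, "rejected: missing tags " + ", ".join(missing)
    # Hard limit: block new workloads at 100% of quota
    if quota.used_gpu_hours >= quota.monthly_gpu_hours:
        return False, "rejected: team is at 100% of its quota"
    # Soft limit: accept, but warn past 80%
    if quota.used_gpu_hours >= 0.8 * quota.monthly_gpu_hours:
        return True, "accepted with warning: team is past 80% of its quota"
    return True, "accepted"


tags = {"user_id": "u42", "team_id": "retail", "environment": "dev"}
print(admit({"user_id": "u42"}, TeamQuota(100, 10)))
# (False, 'rejected: missing tags team_id, environment')
print(admit(tags, TeamQuota(100, 85)))
# (True, 'accepted with warning: team is past 80% of its quota')
print(admit(tags, TeamQuota(100, 100)))
# (False, 'rejected: team is at 100% of its quota')
```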
Onboarding New Team Members At Scale
When you have a single-user setup, onboarding is just "send the user the IAM credentials and the repo link". When you have 20+ users, with new people joining every month, you need a scalable onboarding process that doesn’t require the MCP admin team to spend 4 hours per new user. We implemented three changes that cut our average onboarding time from 3.5 hours per new user to 20 minutes:
- **SSO-first access, no shared API keys**: We integrated MCP with our corporate SSO provider, so any user with an active company account can log into the MCP dashboard with one click. Access is automatically revoked when a user leaves the company or changes teams, which eliminates the security risk of stale API keys floating around in Slack DMs. We do issue long-lived API keys for automated workloads (like CI/CD pipelines for prompts), but those are rotated automatically every 90 days, and they’re tied to a service account that belongs to a specific team. The tradeoff is that SSO integration takes about a week of engineering work to set up correctly, especially if you have to align with your company’s existing security policies. But it eliminates 90% of onboarding/offboarding toil, which saves the MCP team multiple hours a month.
- **Pre-built workload templates for common use cases**: We curate a library of pre-approved, pre-configured workload templates for the most common use cases our users need: RAG prompt tuning, Llama 3 fine-tuning, text classification batch inference, etc. Each template has all the dependencies pre-configured, includes sample prompts and documentation, and can be spun up with one click. A new user doesn’t have to spend 3 days figuring out how to configure the MCP SDK, set up the right IAM permissions, and install the correct CUDA dependencies—they just click "Use Template" and start working. We found that 70% of our users use these templates for their daily work, which cuts down on configuration errors and version mismatches. The tradeoff is that we have to spend a few hours a month updating the templates when we release new versions of the MCP SDK or new base models. But that’s a fixed cost that benefits all new users, so it’s well worth it.
- **Shared, documented registry for prompts and models**: Instead of having users save their prompts and models in personal silos, we have a shared organization-level registry that all approved prompts and models are added to. Every entry in the registry has a description, owner, performance metrics, and changelog, so new users can see exactly what a prompt or model does, who to contact with questions, and how it’s changed over time. This lets new users build on existing work instead of starting from scratch, which speeds up their first project by days. The tradeoff is that we require users to fill out the metadata for any entry they add to the shared registry, which adds a little bit of friction. But most users appreciate that their own work is easier to find and reuse by other team members, so it’s a net positive.
Key Lessons From Scaling To 20+ Users
After 18 months of running our multi-user MCP deployment, supporting 20+ active users across 5 teams, we’ve learned a handful of key lessons that we wish we’d known when we started:
First, don’t overbuild early, but bake in scalability fundamentals from day one. We were right to start with a simple single-user setup to prove the business case before investing in a full rearchitecture. But we were wrong to put off adding basic things like tagging and access control until after we already had 20 users. If we’d added basic tenant ID and tagging to our data model from the start, we wouldn’t have had the data exposure incident that we had. You don’t need to build a full distributed architecture for 5 users, but you do need to design your data model to support multi-tenancy and access control from day one.
Second, prioritize user productivity over perfect architecture. We spent two weeks debating whether we should build full hard multi-tenancy for all workloads, including non-production, before we launched. We eventually decided to stick with soft multi-tenancy for non-production, which was good enough, and saved us two weeks of engineering work that we could spend on improving user experience. The goal of a team MCP is to let your users build better LLM workflows faster, not to have a perfect architecture that meets every possible edge case up front.
Third, the vast majority of your cost is in workers, not control plane. We spent a lot of time optimizing control plane cost early on, and realized that the control plane only makes up about 5% of our total monthly MCP cost. 95% of the cost is in GPU workers, so that’s where you should focus all your cost optimization effort. It doesn’t matter if you save 50% on control plane cost if you’re wasting thousands on idle GPU workers.
Fourth, access control is only as good as your testing. After our data exposure incident, we now do a quarterly access audit where we create two test users from different teams, and test that user A can’t access user B’s resources, that roles are enforced correctly, and that all endpoints (read, write, list, delete) have access control applied. We’ve caught three missing access control filters in these audits over the past year, which would have led to more data exposure if we hadn’t found them. Never assume you got access control right the first time—test it regularly.
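An access audit like this is easy to automate as a loop over operations. The sketch below probes read, list, write, and delete as a second tenant and reports which ones get through. `LeakyRegistry` is a hypothetical in-memory stand-in with a deliberately unfiltered list operation; in practice you would point the same probes at your real API with two test accounts.

```python
class LeakyRegistry:
    """Hypothetical in-memory registry whose list operation forgets the tenant filter."""

    def __init__(self):
        self._items = {}  # item_id -> (tenant_id, value)

    def write(self, tenant_id, item_id, value):
        existing = self._items.get(item_id)
        if existing is not None and existing[0] != tenant_id:
            raise PermissionError("cross-tenant write")
        self._items[item_id] = (tenant_id, value)

    def read(self, tenant_id, item_id):
        owner, value = self._items[item_id]
        if owner != tenant_id:
            raise PermissionError("cross-tenant read")
        return value

    def list_ids(self, tenant_id):
        # Deliberate bug for the demo: tenant_id is ignored
        return sorted(self._items)

    def delete(self, tenant_id, item_id):
        owner, _ = self._items[item_id]
        if owner != tenant_id:
            raise PermissionError("cross-tenant delete")
        del self._items[item_id]


def denied(call) -> bool:
    """Return True if the call was blocked with PermissionError."""
    try:
        call()
    except PermissionError:
        return True
    return False


def audit(registry, owner="team-a", intruder="team-b"):
    """Return the operations through which `intruder` can reach `owner`'s resource."""
    registry.write(owner, "p1", "confidential prompt")
    leaks = []
    if not denied(lambda: registry.read(intruder, "p1")):
        leaks.append("read")
    if "p1" in registry.list_ids(intruder):
        leaks.append("list")
    if not denied(lambda: registry.write(intruder, "p1", "overwritten")):
        leaks.append("write")
    if not denied(lambda: registry.delete(intruder, "p1")):
        leaks.append("delete")
    return leaks


print(audit(LeakyRegistry()))  # ['list']
```

Running a check like this in CI, not just quarterly, would catch a missing filter before it ships.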
Actionable Next Steps
If you’re currently running a single-user MCP setup and are getting ready to scale to 10+ users, here’s what you should do next, in order:
- This week, audit your current architecture and document all the pain points you’re already hitting: version mismatches, resource contention, unexpected cost spikes, or access gaps. Prioritize the top three pain points to address first.
- Add mandatory workload tagging to all new workloads, even if you’re still running a small setup. This costs very little engineering time, and it will give you cost visibility early, and make it much easier to add access control and multi-tenancy later.
- Test your current access control: create two test users from different teams, and confirm that a user from team A can’t access confidential resources from team B. If you find gaps, fix them before you add more users.
- Implement automatic idle resource cleanup, even for small setups. This will cut your cost immediately, and it’s a one-time engineering investment that pays for itself in the first month.
- If you’re already at 10+ users and still on a single-user setup, schedule a half-day workshop to map out your four-layer shared architecture (edge, control plane, worker pool, and data layer).
Last updated: April 5, 2026