Harnessing the power of AI to meet your business goals is an appealing proposition, but also a very costly one. LLMs, in particular, consume compute resources the way kids consume Halloween candy. Right now, AI is heavily subsidized, and cost management is still a major challenge. If you're going to continue making profitable use of AI in the future, you need to figure out how to optimize your implementation now.
You’ve likely heard of Anthropic’s Model Context Protocol (MCP). It’s one of many strategies for managing the cost and improving the performance of generative AI. An MCP server can make your LLM integrations more productive, especially if you’re looking at building an autonomous AI agent. But everything to do with generative AI is new, and it can be hard to forecast the long-term impacts of technical choices.
One often-overlooked factor in LLM costs is token processing. Tokens are the basic building material of LLMs, but often don’t come up in conversations about cost and performance optimization. In this article, we’ll look deeper at how tokens work, how understanding tokens can help you understand MCP servers, and how that knowledge can help you get better results from your AI implementation.
What is an MCP Server?
An MCP server is a system that provides structured, on-demand access to tools and data for large language models. It helps the model retrieve only the information it needs, when it needs it, rather than loading everything into the prompt up front. This allows developers to extend an LLM’s capabilities without overloading its context window, reducing token usage and improving performance. Think of it as a smart interface between an LLM and external resources like APIs, databases, or file systems.
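To make that concrete, here is a minimal sketch of an MCP server exposing a single tool, using the FastMCP helper from the official Python MCP SDK. The tool name and return value are hypothetical placeholders, not a real integration:

```python
# Minimal MCP server sketch using the official Python SDK's FastMCP helper.
# The tool and its data source are hypothetical placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-lookup")

@mcp.tool()
def get_order_status(order_id: str) -> str:
    """Return the current status of a customer order."""
    # A real server would query a database or API here; the LLM only ever
    # sees this tool's description and the result it returns.
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

The point is that the model never loads the whole order database into its prompt; it calls the tool only when it needs that specific piece of information.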
How do LLM tokens work?
Tokens are the building blocks that large language models (LLMs) use to understand and generate text. Instead of processing whole words or sentences, LLMs break input text into smaller units called tokens, which often represent word parts. These tokens are converted into numbers the model can process, allowing it to analyze input and generate coherent responses. Understanding how tokens work is key to managing cost and performance when working with LLMs.
Tokens represent word parts. When you enter a text prompt into an LLM like ChatGPT or Claude, the model converts that text into tokens. Tiktokenizer is an app that shows you how LLM prompts are converted into tokens and what the models see under the hood:

Tokens go through several processing steps that help the model understand what you're asking it to do. The model generates a response in the form of a new string of tokens. Once that's complete, the model decodes the token string and outputs it as human-readable text.
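If you'd rather count tokens programmatically than through Tiktokenizer's UI, a tokenizer library gives you the same view. Here's a minimal sketch using OpenAI's tiktoken library, assuming a recent release that recognizes the gpt-4o model name:

```python
# Count and inspect tokens with tiktoken (assumes: pip install tiktoken,
# and a version recent enough to know the "gpt-4o" encoding).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Harnessing the power of AI to meet your business goals")
print(tokens)              # the integer token IDs the model actually sees
print(len(tokens))         # how many tokens this prompt costs as input
print(enc.decode(tokens))  # decoding the IDs returns the original text
```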
The cost of tokens adds up
Most LLMs have pricing structures based on token usage, with the two main billing dimensions being prompt length (input tokens) and generated text length (output tokens). The LLM examines and references these tokens through a context window, which serves as the model's working memory.
A larger context window can hold more tokens at once, which allows for more complex prompts and outputs. A smaller context window may prevent the model from handling longer prompts or maintaining consistency over extended conversations. That's why models with larger context windows tend to consume more resources, which also translates to higher costs.
Every interaction your application has with an LLM can lead to significant token usage. Understanding the role of tokens within these models and how the context window works helps you manage your token costs and improve the performance of your AI-powered application.
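To make that concrete, here's a rough, hypothetical cost estimate. The per-token prices and request volumes below are placeholders, not any provider's actual rates:

```python
# Back-of-the-envelope LLM cost estimate. Prices are placeholders;
# substitute your provider's current per-token rates.
INPUT_PRICE_PER_1M = 2.50    # assumed $ per 1M input tokens
OUTPUT_PRICE_PER_1M = 10.00  # assumed $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M

# 2,000 requests a day, each with a 3,000-token prompt and a 500-token reply
daily = estimate_cost(2_000 * 3_000, 2_000 * 500)
print(f"${daily:.2f} per day")  # $25.00/day at these placeholder rates
```

Trimming a few hundred tokens from every prompt doesn't look like much in isolation, but multiplied across thousands of daily requests it shows up directly on the bill.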
A closer look at managing context
The “C” in MCP stands for “Context” – the purpose of MCP servers is to make better use of the context window, and, ideally, to reduce token costs. To set up an effective MCP server, you need to understand the context window.
Nearly every LLM measures this working memory in tokens, and the total available context window varies for each LLM. But just because a context window has a maximum number of tokens doesn’t mean you need to hit that maximum, and it may be possible to get high-quality results with less input.
Maximum Context Windows for Popular LLMs
The way you design and implement an MCP server impacts what goes in and out of that context window. When you configure an MCP server, you provide a set of resources that can include API response formats, database records, image files, event logs, and more. If you’re not thoughtful about what you include, you may wind up feeding unnecessary or redundant data to the LLM.
The more data and tools you expose to an LLM via your MCP server, the more tokens the model has to process in its working memory. An excessive number of tokens in the context window can cause performance and latency issues.
It can be tempting to just dump everything into the server in hopes that it will translate to more accurate outputs — but keep in mind that every token you input through an MCP has a cost, and the tiny improvements in accuracy may not be worth what you spend.
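One way to gauge that cost is to measure how many tokens your tool definitions alone consume before the model does any real work. Here's a rough sketch, assuming tiktoken is installed and your tool schemas can be exported as JSON (the tools.json file is hypothetical):

```python
# Estimate how many context-window tokens your MCP tool definitions consume
# on every request. Assumes tiktoken and a JSON export of your tool schemas.
import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def tool_overhead(tools: list[dict]) -> int:
    """Approximate token cost of the serialized tool names, descriptions,
    and JSON schemas that get injected into the prompt."""
    return sum(len(enc.encode(json.dumps(tool))) for tool in tools)

with open("tools.json") as f:  # hypothetical export of your tool list
    tools = json.load(f)
print(f"{tool_overhead(tools)} tokens of every request go to tool definitions")
```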
Real-world lessons in MCP token limits
A growing number of developers are building prototype MCP servers to learn about Anthropic’s model context protocol. They’ve uncovered critical lessons about how implementation impacts LLM costs and performance.
Optimizing an MCP server to reduce LLM token costs
Senior engineer Craig Walls built an experimental custom MCP server using the ThemeParks.wiki API, initially paired with a GPT-4o model. His first attempt exposed each API endpoint as a tool. Sometimes the server worked, but other times it failed because it exceeded OpenAI's tokens-per-minute (TPM) rate limit.
He concluded that the "ThemeParks.wiki API isn't designed to be optimally used directly as the underpinning of an MCP server" and requires optimization for this purpose. Because he doesn't control the API's design, he optimized how the server uses the API instead. He later switched to GPT-4o-mini, which costs less than GPT-4o and has a much higher TPM limit.
This MCP experiment shows that the data returned by an API can impact model token usage, which could lead to a huge LLM token bill. The changes Walls made to the implementation resulted in the server working faster and remaining below the maximum TPM.
Restricting tool availability to reduce documentation-related token usage
The team at Unstructured started with a prototype MCP server, which they later developed into an official Unstructured MCP Server implementation for interacting with the Unstructured API. Their server offers 20+ tools to list sources and workflows and supports multiple connectors. It has also been integrated with Firecrawl, a web crawler that outputs data optimized for LLMs.
During development, the team realized that mapping API functionality one-to-one onto MCP tools would not be a good idea:
- Too many available tools can “overwhelm and confuse” the LLM, making it difficult for the model to find the exact tool for the immediate task.
- The documentation for each tool could reduce or deplete the available context window space that the LLM can use.
Restricting the number of tools the MCP server offers makes it easier for the LLM to find the tools needed for each task, and reduces documentation-related token usage. The team also abstracted all the connector management functionality so the MCP did not need unique details for each source and destination connector. That “resulted in a slashing of the context window usage by 5000 tokens.”
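Here's a sketch of that consolidation pattern, using the FastMCP helper from the official Python MCP SDK. The tool name, connector types, and config shape are hypothetical illustrations, not the actual Unstructured implementation:

```python
# One generic connector-management tool instead of a dozen near-identical
# per-connector tools. Names and parameters here are hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("connectors")

SUPPORTED_TYPES = {"s3", "postgres", "gdrive"}

@mcp.tool()
def create_source_connector(connector_type: str, config: dict) -> str:
    """Create a source connector. connector_type must be one of:
    s3, postgres, gdrive. config holds connector-specific settings."""
    if connector_type not in SUPPORTED_TYPES:
        return f"Unsupported connector type: {connector_type}"
    # A single shared code path means the LLM reads one tool description
    # instead of one per connector, shrinking the context it must carry.
    return f"Created {connector_type} source connector"

if __name__ == "__main__":
    mcp.run()
```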
These examples illustrate several methods for minimizing token usage and enhancing MCP server performance. Both teams built their MCP servers from scratch on top of an existing API, but that isn't the only way to build one.
Best practices when building an MCP server
If you're going to build your own MCP server, you'll need to manage token usage to get the best server performance and cost efficiency. The examples above hold a few lessons:
- Choose an LLM with a context window that's large enough to meet your MCP server's operational requirements. Maximum context window sizes vary across models by as much as 10x.
- If you control the API format, optimize it to work with your MCP server. If you can’t, optimize how the server uses the API. Ideally, do both.
- Ensure the MCP server provides the right volume of data to the LLM – be cautious not to supply it with too many tools or redundant data.
Below are some additional best practices to better manage LLM token costs and performance. Most of these recommendations apply whether you use an automated tool or an LLM to generate an MCP server, leverage an SDK, or build one from scratch.
How to build an MCP server from Github in minutes with Blackbird
With Blackbird’s hosted MCP server capability, you can skip the messy local setup and spin up a fully functional server in the cloud in just minutes. No more restarting every 30 minutes, wrestling with dependencies, or melting your laptop. Whether you’re sharing endpoints with teammates, integrating into CI/CD pipelines, or just trying to get something working fast, Blackbird lets you jump straight to productivity—no configuration headaches required. From GitHub to AWS KB and Slack, our catalog of ready-to-deploy MCP servers is growing fast. Go here to create one.
Make sure the OpenAPI specification is complete
If you can utilize OpenAPI specs, do so. The format has already been optimized to eliminate redundancy and improve clarity, and if you need to supply the LLM with information about an API, an OpenAPI spec is a complete and precise standard for doing it. The specification should include descriptions that clearly explain the purpose of each relevant endpoint. Example responses and request parameters are especially valuable because they help reduce wasted tokens.
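A quick script can flag missing descriptions before you hand a spec to your MCP tooling. This is a minimal sketch, assuming a YAML spec at ./openapi.yaml and the PyYAML package:

```python
# Flag operations and parameters in an OpenAPI spec that lack descriptions.
# Assumes: pip install pyyaml, and a spec file named openapi.yaml.
import yaml

with open("openapi.yaml") as f:
    spec = yaml.safe_load(f)

for path, path_item in spec.get("paths", {}).items():
    for method, op in path_item.items():
        if not isinstance(op, dict):  # skip path-level keys like "parameters"
            continue
        if not op.get("summary") and not op.get("description"):
            print(f"Missing description: {method.upper()} {path}")
        for param in op.get("parameters", []):
            if not param.get("description"):
                name = param.get("name", "?")
                print(f"Missing parameter description: {method.upper()} {path} -> {name}")
```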
Prune as many generated tools as possible
The more tools an LLM has to choose from, the more tokens it needs to decide which one to use; offer it fewer tools and it will need fewer tokens. Remove the unneeded MCP tools automatically generated from an OpenAPI spec, keeping only the operations and capabilities the LLM might actually need to perform its tasks. Make sure every remaining MCP tool has a description that specifies when the LLM should use it, including examples.
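In practice, pruning can be as simple as filtering generated tools against an explicit allowlist. The generator output and operation names below are hypothetical; most spec-to-MCP generators offer an equivalent filtering hook:

```python
# Keep only the operations the LLM actually needs; drop everything else
# that was auto-generated from the OpenAPI spec. Names are hypothetical.
ALLOWED_OPERATIONS = {"listOrders", "getOrderStatus"}

def prune_tools(generated_tools: list[dict]) -> list[dict]:
    """Filter auto-generated MCP tools down to an explicit allowlist."""
    return [tool for tool in generated_tools
            if tool["name"] in ALLOWED_OPERATIONS]
```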
Filter out irrelevant data & endpoints
No one makes use of every feature of every API, and your LLM integrations won't need to, either. You can reduce token usage by optimizing the information your MCP server provides about APIs. Think carefully about which endpoints the LLM actually needs to understand, and exclude the rest. Even useful responses can contain a lot of extraneous JSON text, especially responses from GraphQL APIs. Set up your MCP server to keep irrelevant data and unnecessary tokens out of the LLM context window, whether they come from unused endpoints or extraneous JSON.
Craig Walls filtered the responses from the ThemeParks.wiki API to remove unwanted data and reduce the amount of JSON sent to the LLM prompt context window. These actions decreased the number of prompt tokens by roughly 93%-98%.
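The same idea can be expressed as a simple field allowlist applied to each response before it reaches the prompt. The field names below are hypothetical, not the ThemeParks.wiki schema:

```python
# Trim API responses down to the fields the LLM actually needs before they
# enter the context window. Field names are hypothetical.
KEEP_FIELDS = {"id", "name", "status", "waitTime"}

def slim_response(records: list[dict]) -> list[dict]:
    """Strip each record to an allowlist of fields to cut prompt tokens."""
    return [{k: v for k, v in record.items() if k in KEEP_FIELDS}
            for record in records]
```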
Implement caching techniques
Caching reduces the number of tokens needed for MCP servers and AI agents to complete operations. It can also improve LLM performance. Consider implementing multiple techniques, such as prompt caching, response caching, semantic caching, and MCP in-memory caching. But note that adding more context (for APIs or tools) may lead to higher token usage despite caching.
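As a starting point, even a simple in-memory response cache keyed by the tool call's arguments avoids paying the same API and token costs twice. This is a minimal sketch; a production server would add a TTL and likely an external cache such as Redis:

```python
# Minimal in-memory response cache for MCP tool calls, keyed by a hash of
# the tool name and its arguments. Sketch only: no TTL, no eviction.
import hashlib
import json

_cache: dict[str, str] = {}

def cached_call(tool_name: str, args: dict, fetch) -> str:
    """Return a cached result if we've seen this exact call before."""
    key = hashlib.sha256(
        f"{tool_name}:{json.dumps(args, sort_keys=True)}".encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = fetch(**args)  # only pay the API (and token) cost once
    return _cache[key]
```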
Use an LLM observability tool
LLM observability tools primarily provide continuous monitoring and analytics for LLMs. Most include dashboards that provide real-time insights into various aspects of AI models, such as token usage, model latency, errors, and user activity.
You can't reduce your MCP server's LLM token usage if you don't know how many tokens are being generated and why. An LLM observability tool is critical to figuring that out.
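If you're not ready for a full observability platform, most provider SDKs already report per-request usage that you can log yourself. Here's a minimal sketch using the OpenAI Python SDK; other providers expose similar usage fields on their responses:

```python
# Log token usage per request while you evaluate full observability tooling.
# Assumes the OpenAI Python SDK (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize today's error logs."}],
)
usage = response.usage
print(f"prompt={usage.prompt_tokens} "
      f"completion={usage.completion_tokens} "
      f"total={usage.total_tokens}")
```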
Implement an AI gateway
Like LLM observability tools, AI gateways typically provide prebuilt dashboards and AI-specific analytics. However, they also offer other capabilities, such as response caching, semantic caching, and automatic request routing. All these capabilities improve the performance of the MCP server and help you save on LLM tokens.
These are just a few of the best practices to follow when building an MCP server. You should also mind the security gaps in MCP, and implement the recommended security mechanisms.
Always optimize for LLM tokens
Understanding the role of tokens in LLMs is critical to getting the most out of AI integrations. When you connect an MCP server to an LLM, you’re signing on to generate a larger volume of tokens, which can rapidly spike your costs. It can also radically improve the quality of the responses you get, so managing those tokens is a crucial factor in getting value out of your LLM integrations.
Design and optimize your MCP server implementation with LLM tokens in mind. Apply best practices so the AI model generates fewer tokens and works efficiently. Ideally, you’ll find that providing a moderate amount of precise, detailed information yields higher quality results at lower costs.