Cody Input and Output Token Limits
For all models, Cody allows up to 4,000 tokens of output, which is approximately 500-600 lines of code.
For Claude 3 Sonnet or Opus models, Cody tracks two separate token limits:
- @-mention context is limited to 30,000 tokens (~4,000 lines of code) and can be specified using the @-filename syntax. This context is explicitly defined by the user and is used to provide specific information to Cody.
- Conversation context is limited to 15,000 tokens and includes user questions, system responses, and automatically retrieved context items. Apart from user questions, this context is generated automatically by Cody.
All other models are currently capped at 7,000 tokens of shared context between @-mention context and chat history.
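Cody handles this bookkeeping for you, but as a rough sketch of how the two budgets behave (an illustration only, not Cody's implementation; the function and limit names below are hypothetical, with the numbers taken from the table that follows):

```python
# Illustrative budgets; "shared" means chat history and @-mentions draw from one pool.
TOKEN_LIMITS = {
    "claude-3 Sonnet": {"conversation": 15_000, "at_mention": 30_000, "shared": False},
    "gpt-4-turbo":     {"conversation": 7_000,  "at_mention": 7_000,  "shared": True},
}

def within_budget(model: str, conversation_tokens: int, at_mention_tokens: int) -> bool:
    """Hypothetical check that a request fits the model's input budgets."""
    limits = TOKEN_LIMITS[model]
    if limits["shared"]:
        # Shared-budget models: both kinds of context compete for the same tokens.
        return conversation_tokens + at_mention_tokens <= limits["conversation"]
    # Claude 3 Sonnet / Opus: each bucket is tracked separately.
    return (conversation_tokens <= limits["conversation"]
            and at_mention_tokens <= limits["at_mention"])
```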
Here's a detailed breakdown of the token limits by model:
| Model | Conversation Context | @-mention Context | Output |
|---|---|---|---|
| gpt-3.5-turbo | 7,000 | shared | 4,000 |
| gpt-4-turbo | 7,000 | shared | 4,000 |
| gpt-4o | 7,000 | shared | 4,000 |
| claude-2.0 | 7,000 | shared | 4,000 |
| claude-2.1 | 7,000 | shared | 4,000 |
| claude-3 Haiku | 7,000 | shared | 4,000 |
| claude-3 Sonnet | 15,000 | 30,000 | 4,000 |
| claude-3.5 Sonnet | 15,000 | 30,000 | 4,000 |
| mixtral 8x7B | 7,000 | shared | 4,000 |
| mixtral 8x22B | 7,000 | shared | 4,000 |
| Google Gemini 1.5 Flash | 7,000 | shared | 4,000 |
| Google Gemini 1.5 Pro | 7,000 | shared | 4,000 |
For more information on how Cody builds context, see our documentation.
What is a Context Window?
A context window in large language models refers to the maximum number of tokens (words or subwords) that the model can process at once. This window determines how much context the model can consider when generating text or code.
Context windows exist due to computational limitations and memory constraints. Large language models have billions of parameters, and processing extremely long sequences of text can quickly become computationally expensive and memory-intensive. By limiting the context window, the model can operate more efficiently and make predictions in a reasonable amount of time.
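To make this concrete, you can count tokens yourself before sending a prompt. The sketch below uses the tiktoken library, which implements the tokenizers for OpenAI models (other model families, such as Claude, use different tokenizers, so treat the counts as approximations there); the window and output sizes are taken from the tables in this document:

```python
import tiktoken

def fits_in_context(text: str, model: str = "gpt-3.5-turbo",
                    context_window: int = 16_385, output_limit: int = 4_096) -> bool:
    """Check whether a prompt leaves enough room in the context window for the response."""
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(text))
    # The prompt plus the reserved output budget must fit inside the context window.
    return prompt_tokens + output_limit <= context_window

print(fits_in_context("def hello():\n    print('hi')\n"))  # True for a tiny prompt
```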
What is an Output Limit?
Output Limit refers to the maximum number of tokens that a large language model can generate in a single response. This limit is typically set to ensure that the model's output remains manageable and relevant to the given context.
When a model generates text or code, it does so token by token, predicting the most likely next token based on the input context and its learned patterns. The output limit determines when the model should stop generating further tokens, even if it could potentially continue.
The output limit keeps generated text focused, concise, and manageable: it prevents the model from going off-topic or producing excessively long responses, and it ensures the output can be processed and displayed efficiently by downstream applications and user interfaces while keeping computational costs in check.
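In practice the output limit is usually exposed as a request parameter. For example, with the OpenAI Python client the `max_tokens` parameter caps how many tokens the model may generate (shown purely as an illustration of the concept; Cody sets these limits for you):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    max_tokens=4096,  # generation stops once this many output tokens have been produced
)
print(response.choices[0].message.content)
```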
Current Foundation Model Limits
Here is a table with the context window sizes and output limits for each of our supported models.
| Model | Context Window | Output Limit |
|---|---|---|
| gpt-3.5-turbo | 16,385 tokens | 4,096 tokens |
| gpt-4 | 8,192 tokens | 4,096 tokens |
| gpt-4-turbo | 128,000 tokens | 4,096 tokens |
| claude instant | 100,000 tokens | 4,096 tokens |
| claude-2.0 | 100,000 tokens | 4,096 tokens |
| claude-2.1 | 200,000 tokens | 4,096 tokens |
| claude-3 Haiku | 200,000 tokens | 4,096 tokens |
| claude-3 Sonnet | 200,000 tokens | 4,096 tokens |
| claude-3 Opus | 200,000 tokens | 4,096 tokens |
| mixtral 8x7B | 32,000 tokens | 4,096 tokens |
Tradeoffs: Size, Accuracy, Latency and Cost
So why doesn't Cody use the full available context window for each model? There are a few tradeoffs we need to consider: context size, retrieval accuracy, latency, and cost.
- Context Size: A larger context window allows Cody to consider more information, potentially leading to more coherent and relevant outputs. However, in RAG-based systems like Cody, the value of increasing the context window depends on the precision and recall of the underlying retrieval mechanism. If the relevant files can be retrieved with high precision and added to an existing context window, expanding that window may not actually improve response quality. Conversely, some queries require a vast array of documents to synthesize the best possible response, so a larger context window would be beneficial. We work to balance these nuances against the latency and cost tradeoffs that come with longer inputs.
- Retrieval Accuracy: Not all context windows are created equal. Research shows that an LLM's ability to retrieve a fact from its context can degrade dramatically as the context window grows. This makes it important to pack the relevant information into as few tokens as possible, so as not to confuse the underlying LLM. As foundation models continue to improve, within-context retrieval is getting better, meaning that large context windows are becoming more viable. We are excited to bring these improvements to Cody.
- Latency: With a larger context window, the model needs to process more data, which can increase latency. End users experience this as "time to first token": how long they wait before output starts to stream. In some cases longer latency is a worthwhile tradeoff for higher accuracy, but our research shows that this is highly use-case and user dependent.
- Computational Cost: Finally, the cost of processing a prompt scales linearly with the context window size, as the back-of-the-envelope sketch below illustrates. To provide a high-quality response at a reasonable cost to the user, Cody leverages our expertise in code-based RAG to drive down generation costs while maintaining output quality.
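A back-of-the-envelope sketch of that linear scaling, using purely hypothetical per-token prices (real prices vary by provider and model):

```python
# Hypothetical pricing: illustrative numbers only, not any provider's actual rates.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # $15 per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request; the input term grows linearly with prompt size."""
    return input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN

# The same 4,000-token answer with 7,000 vs. 200,000 tokens of context:
print(f"${request_cost(7_000, 4_000):.4f}")    # $0.0810
print(f"${request_cost(200_000, 4_000):.4f}")  # $0.6600
```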