2 Context Engineering
LLMs are trending toward longer context windows, capable of processing upwards of millions of tokens. This is generally regarded as beneficial, since larger contexts allow a model to process more information and solve more complex problems. However, research from Chroma showed that model performance degrades as context length increases, a phenomenon their research team coined “context rot” (Hong, Troynikov, and Huber 2025). Specifically, they observed this phenomenon by evaluating LLMs on the Needle in a Haystack (NIAH) task, in which the LLM is instructed to answer a question whose answer (i.e. the needle) is embedded in a larger, unrelated body of text (i.e. the haystack). Traditionally, the needle for a question can be identified via lexical matching, i.e. exact matching on words or phrases. For example:
Question: Which book sparked my interest in AI?
Needle: The book which sparked my interest in AI is “The Worlds I See”.
A needle can also be identified through semantic matching, or via an LLM’s world knowledge, neither of which involves lexical matching. For example:
Question: Which professor had prior experience in theoretical research?
Needle: Ten years ago, Ms. Carter spent a decade at the Institute for Advanced Study.
In the above example, identifying the needle requires the LLM to draw on its world knowledge that the Institute for Advanced Study is a center for theoretical research, as well as to recognize semantically that the needle is relevant to the question. There are no overlapping words between the question and the needle in this case.
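To make the setup concrete, here is a minimal sketch of how a single NIAH trial might be assembled: a needle sentence is spliced into a haystack of unrelated text at a chosen relative depth, and the resulting context plus question forms the prompt. The haystack contents, prompt wording, and helper name below are illustrative assumptions, not Chroma’s actual harness.

```python
# Minimal sketch of assembling one Needle-in-a-Haystack (NIAH) trial.
# The haystack here is a toy stand-in; the real benchmark uses a large
# corpus of unrelated text.

def build_niah_prompt(haystack_sentences: list[str],
                      needle: str,
                      question: str,
                      depth: float = 0.5) -> str:
    """Splice the needle into the haystack at a relative depth in [0, 1]."""
    position = int(len(haystack_sentences) * depth)
    sentences = (haystack_sentences[:position]
                 + [needle]
                 + haystack_sentences[position:])
    context = " ".join(sentences)
    return (f"{context}\n\n"
            f"Answer based only on the text above.\n"
            f"Question: {question}")

haystack = ["The weather in the valley shifts quickly in spring."] * 2000
prompt = build_niah_prompt(
    haystack,
    needle='The book which sparked my interest in AI is "The Worlds I See".',
    question="Which book sparked my interest in AI?",
    depth=0.5,  # place the needle in the middle of the haystack
)
# `prompt` would then be sent to the LLM under evaluation, and its answer
# checked against the needle.
```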
Using the NIAH task, the Chroma team carried out a series of controlled experiments to show that performance generally degrades with longer contexts. The key findings are:
- When task complexity is held constant by keeping the question-needle embedding cosine similarity fixed (see the sketch after this list), model performance degrades with context length. The decline tends to be more rapid for question-needle pairs that are more dissimilar.
- The decline in performance is also influenced by distractors (i.e. text chunks that are topically related to the needle but do not answer the question), the content of the haystack, and the structural coherence of the haystack (i.e. whether its sentences are randomly shuffled).
- There was no evidence that the needle’s position in the haystack influences performance.
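The cosine-similarity control in the first finding is straightforward to compute. Below is a minimal sketch with toy vectors; in a real experiment, both vectors would come from running the question and the needle through the same embedding model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors for illustration only; real embeddings are much higher-dimensional.
q_vec = np.array([0.1, 0.9, 0.2])  # question embedding
n_vec = np.array([0.2, 0.8, 0.1])  # needle embedding
print(cosine_similarity(q_vec, n_vec))  # ~0.99: a highly similar pair
```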
For a broader discussion of the types of long-context problems, such as poisoning, distraction, confusion, and clash, refer to this blog post by Drew Breunig.
Together, these findings suggest that managing the LLM’s context is important for effectiveness: to be maximally effective, the context window should be filled with content relevant to the task at hand, but no more. This context-management effort has been coined “context engineering”, commonly defined as “the art of providing all the context for the task to be plausibly solvable by the LLM”. Context engineering treats everything that is input into an LLM as context, including memory, prompts, information retrieved for RAG, and so on.
A LangChain blog post summarizes common context engineering patterns: writing context to offload information for later use, selecting information for the context window, compression, and isolation. A few concrete tactics include performing RAG on tool descriptions to shrink the tool-selection space to a smaller, more relevant set of tools; periodically summarizing past conversations (implemented by Claude Code and ChatDev); and multi-agent architectures where subagents own isolated contexts so that the overall agentic system effectively uses an expanded context window.
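As an example of the first tactic, retrieval over tool descriptions might look like the sketch below: each tool’s description is embedded, and only the top-k most similar tools are exposed to the model for a given query. The tool catalog and the `embed` placeholder are illustrative assumptions, not a specific library’s API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real text-embedding model (an assumption here)."""
    raise NotImplementedError

# Illustrative tool catalog; real agent systems may register far more tools.
TOOLS = {
    "search_web": "Search the web for up-to-date information.",
    "run_sql": "Execute a SQL query against the analytics database.",
    "send_email": "Send an email to a specified recipient.",
}

def select_tools(user_query: str, k: int = 3) -> list[str]:
    """Return the k tools whose descriptions best match the query."""
    names = list(TOOLS)
    vecs = np.stack([embed(TOOLS[n]) for n in names])  # cache these in practice
    q = embed(user_query)
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return [names[i] for i in np.argsort(sims)[::-1][:k]]

# Only the selected tools' schemas are placed in the model's context,
# instead of the full catalog.
```

Shrinking the tool list this way both saves context tokens and reduces the chance the model picks an irrelevant tool.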
Context compression is a key tactic ChatDev uses to implement multi-agent collaboration for software development (Qian et al. 2023): the conversation history between two subagents in each software development phase is summarized into a solution, which then serves as the starting context for the next phase of subagent-to-subagent dialogue. This makes intuitive sense because only the solutions to subproblems are needed to solve the bigger problem; the process of arriving at those solutions is usually irrelevant.
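A rough sketch of this compress-then-seed loop is shown below, assuming a hypothetical `chat` function for LLM calls and illustrative phase names. ChatDev’s actual phases involve role-played multi-turn exchanges between two agents, so this is a simplification of the pattern, not the paper’s implementation.

```python
def chat(prompt: str) -> str:
    """Placeholder for an LLM call (an assumption here)."""
    raise NotImplementedError

SUMMARIZE_PROMPT = (
    "Summarize the following dialogue into only the final agreed-upon "
    "solution (decisions, code, artifacts). Omit the back-and-forth "
    "that led there.\n\n{dialogue}"
)

def run_phase(task: str, start_context: str) -> str:
    """Run one phase of subagent dialogue, then compress it to a solution."""
    dialogue = chat(f"{start_context}\n\nTask: {task}")
    return chat(SUMMARIZE_PROMPT.format(dialogue=dialogue))

context = ""
for phase in ["design", "coding", "testing", "documentation"]:
    # Each phase starts from the previous phase's summarized solution,
    # not from the full conversation history.
    context = run_phase(phase, start_context=context)
```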