Pinpointing the Comprehension Gap: How LLMs Struggle with Large Documents
Summary
- Problem: As document sizes grow, even state-of-the-art LLMs struggle with comprehension, leading to mistakes and decreased accuracy for Q&A tasks.
- Diagnosis: We pinpoint these deficiencies with a new Q&A benchmark, which allows us to independently control document length and task complexity.
- Solution: We present a new approach—leveraging our custom LLM (CAMeL) to augment the memory of a frontier model—that improves Q&A accuracy for large documents.
The Challenge of Document Comprehension
Long-context large language models (LLMs) have opened new opportunities for question-answering tasks on documents spanning hundreds of thousands of tokens or more. This capability is especially promising in high-stakes fields like finance, law, and healthcare, where document understanding can significantly enhance domain knowledge, improve answer accuracy, and reduce hallucinations. Yet research indicates that as documents grow longer and more complex, even state-of-the-art models struggle to correctly apply the information provided. This raises a natural question: how do we systematically identify and isolate these failure modes, so we can better understand how different models and methods perform—and ultimately address their underlying limitations?
To tackle this challenge, we introduce a new benchmark designed to capture the complexity of real-world documents. Such documents often involve intricate concept definitions and webs of cross-references, both explicit and implicit (see Fig. 1). We posit that for an AI system to function effectively in these scenarios, it must navigate these interconnections and accurately apply its understanding when answering user queries. We refer to this critical capability as comprehension.

DNF2TEXT: A Benchmark for Comprehension
To evaluate how well models can interpret complex documents, we developed DNF2TEXT (Disjunctive Normal Form to Text). This benchmark generates documents with logically coherent, natural-language clauses whose truth depends on a tree of interconnected conditions. For example, one clause may only be true if certain other clauses are satisfied, each of which may themselves depend on further conditions.
By adjusting the number of clauses, number of conditions per clause, and the depth of the dependency tree, we can control document complexity. A key advantage of our approach is that we can extend context length without altering the fundamental structure simply by adding irrelevant clauses that are logically distinct from the core content.
Each document is designed to be logically consistent, allowing us to systematically generate queries with verifiable true or false answers. Crucially, we can set the query depth, which defines how many nested clauses must be traversed to reach the correct answer (see Fig. 2). At a query depth of one, the assigned and evaluated conditions all sit within the same clause, so in our experiments we focus on a query depth of two to guarantee at least one level of indirection.
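To make the construction concrete, below is a minimal generator sketch. It is a hypothetical illustration, not the benchmark's actual implementation: the `Clause` structure, the parameter names, and the single-disjunct tree shape are our own simplifications.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Clause:
    """A clause is true iff every condition in at least one disjunct holds (DNF)."""
    name: str
    disjuncts: list = field(default_factory=list)  # lists of prerequisite clause names
    value: bool | None = None                      # leaf clauses state a truth value directly

def generate_document(conds_per_clause: int = 2, depth: int = 2,
                      num_distractors: int = 4, seed: int = 0):
    """Build a dependency tree of clauses plus logically unrelated padding."""
    rng = random.Random(seed)
    clauses: dict[str, Clause] = {}
    counter = 0

    def new_name() -> str:
        nonlocal counter
        counter += 1
        return f"clause_{counter}"

    def make_clause(level: int) -> Clause:
        name = new_name()
        if level >= depth:
            clause = Clause(name, value=rng.choice([True, False]))  # assigned condition
        else:
            children = [make_clause(level + 1) for _ in range(conds_per_clause)]
            clause = Clause(name, disjuncts=[[c.name for c in children]])
        clauses[name] = clause
        return clause

    root = make_clause(0)
    # Extending context length without changing the core structure:
    # distractors are valid clauses that the root's dependency tree never references.
    for _ in range(num_distractors):
        name = new_name()
        clauses[name] = Clause(name, value=rng.choice([True, False]))
    return clauses, root
```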

Comprehension as Context Length Increases
To gauge how frontier models handle increasing document sizes, we synthesized a DNF2TEXT dataset with 100 queries of depth two, evenly split between true and false answers. To answer a query of this depth, an LLM must navigate to a set of clauses one "hop" away, linking a target clause (where the variable in question resides) to a set of assigned clauses (containing the assigned conditions). More details can be found in our repository.
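As an illustration of this hop (with invented clause names, not actual dataset content), a depth-two query resolves a target clause by checking assigned conditions stated in other clauses:

```python
# Illustrative depth-2 query: the target clause is one "hop" away from the
# assigned clauses that state their conditions directly. Names are made up.
clauses = {
    "target": [["assigned_1", "assigned_2"]],  # one disjunct: both conditions must hold
    "assigned_1": True,
    "assigned_2": False,
}

def evaluate(name: str) -> bool:
    node = clauses[name]
    if isinstance(node, bool):  # an assigned condition, stated directly
        return node
    # DNF: true iff every condition in at least one disjunct evaluates true
    return any(all(evaluate(dep) for dep in group) for group in node)

print(evaluate("target"))  # False: assigned_2 fails, so the correct answer is "false"
```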
This dataset exposes critical deficiencies in popular frontier models:
- All tested models exhibit significant performance drops well before reaching their advertised maximum context lengths.
- o3-mini, which is optimized for reasoning, shows strong performance at shorter contexts but rapidly deteriorates as the context grows.
Future work will explore how increasing query depth further impacts performance.

CAMeL: Towards Better Comprehension
We hypothesize that frontier models' impaired comprehension stems from an inability to focus on the key sections of the context window.
To address this shortcoming, we built CAMeL (Content Associative Memory for Language), a language model specialized for retrieving relevant sections from text given a query. Unlike embedding-based approaches, CAMeL interprets sections within their context through its attention mechanism, and it does not require any pre-embedding of the data. Crucially, CAMeL can be used as a retrieval tool or memory store by other LLMs.
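In practice, the pairing could follow an ordinary retrieve-then-reason loop. The sketch below is hypothetical: `CamelMemory`, its word-overlap scoring, and the `answer` helper are stand-ins of ours, since CAMeL's actual interface (which scores sections via attention rather than any lexical or embedding match) is not public.

```python
# Hypothetical sketch of CAMeL-style memory augmentation for a reasoning model.
# CamelMemory and its toy word-overlap scoring are illustrative placeholders.
class CamelMemory:
    def __init__(self, document: str):
        # The whole document sits in the retriever's (large) context window.
        self.sections = document.split("\n\n")

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Return the k sections that best match the query (toy scoring)."""
        words = set(query.lower().split())
        ranked = sorted(self.sections,
                        key=lambda s: len(words & set(s.lower().split())),
                        reverse=True)
        return ranked[:k]

def answer(query: str, document: str, reasoning_llm) -> str:
    memory = CamelMemory(document)
    relevant = memory.retrieve(query)  # focus step: pull only the key clauses
    prompt = "\n\n".join(relevant) + f"\n\nQuestion: {query}\nAnswer true or false."
    return reasoning_llm(prompt)       # reasoning step over a small context
```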

When we combined CAMeL with o3-mini, the resulting system rivaled top frontier models on this benchmark at short context lengths and outperformed them as context size increased. Notably, this setup lets a reasoning model maintain strong comprehension at contexts far beyond its native capacity, because CAMeL's context window supports millions of tokens.

Coming Soon
- We are working to run DNF2TEXT on the recently released Gemini 2.5 and Llama 4 models to see how they stack up.
- We are excited to demonstrate CAMeL's ability to enable comprehension at millions of tokens of context.
Access to CAMeL
API access to CAMeL is currently in early trials. Please reach out if you have Q&A tasks that are not well served by current long context and RAG systems. Book a demo to learn more about our capabilities.
Acknowledgements
The CAMeL augmentation system is built with Meta's open-source Llama model (https://www.llama.com/).