The color memory test: a quick filter for language model quality
Language model benchmarks typically test various capabilities like mathematical reasoning, coding, and world knowledge. While valuable for ranking models, these comprehensive suites are overkill when all you need is a quick reliability check.
In my experience working with LLMs, I've found that sometimes what we need is a simple binary criterion – a basic test that can help us quickly reject unreliable models. One such test I've been using evaluates a fundamental capability that any reliable LLM should possess: the ability to overcome statistical biases in training data when presented with explicit contradicting information.
The eye and hair color test
In this test, we present the model with descriptions of people having unusual eye and hair colors, then ask it to recall these attributes. Here's an example:
Ali has pink eyes and blue hair. John has red eyes and green hair. Karl has copper eyes and violet hair. Lucy has green eyes and brown hair. Dimitri has brown eyes and black hair.
Given the above information, answer the following:
- What is Ali's eye color?
- What is John's hair color?
- What is Karl's hair color?
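Such a prompt is easy to generate programmatically. Below is a minimal sketch in Python; the name and color pools are purely illustrative, and unlike the example above it draws only from uncommon colors:

```python
import random

# Illustrative pools only; the real prompts can also mix in a few common
# colors (as "Lucy" and "Dimitri" do above) to act as easy distractors.
NAMES = ["Ali", "John", "Karl", "Lucy", "Dimitri", "Mei", "Sofia"]
EYE_COLORS = ["pink", "red", "copper", "violet", "amber"]
HAIR_COLORS = ["blue", "green", "violet", "silver", "orange"]


def build_prompt(n_people=5, n_questions=3, rng=None):
    """Return one test prompt and the list of expected answers."""
    rng = rng or random.Random(0)
    people = rng.sample(NAMES, n_people)
    facts, sentences = {}, []
    for name in people:
        eye, hair = rng.choice(EYE_COLORS), rng.choice(HAIR_COLORS)
        facts[name] = {"eye": eye, "hair": hair}
        sentences.append(f"{name} has {eye} eyes and {hair} hair.")

    questions, expected = [], []
    for name in rng.sample(people, n_questions):
        attr = rng.choice(["eye", "hair"])
        questions.append(f"- What is {name}'s {attr} color?")
        expected.append(facts[name][attr])

    prompt = (
        " ".join(sentences)
        + "\n\nGiven the above information, answer the following:\n"
        + "\n".join(questions)
    )
    return prompt, expected
```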
What makes this test particularly interesting is that it challenges two common biases in LLM training data. First, there's statistical prevalence: most descriptions of people in training data would contain common eye colors (brown, blue, green) and hair colors (black, brown, blonde). Second (related but different), there's real-world knowledge: the model knows that humans typically don't have pink eyes or blue hair.
A reliable model should be able to override both these biases and simply repeat what it was told in the prompt. After all, if a model can't accurately recall explicitly stated information because it's "unusual", how can we trust it with more complex reasoning tasks?
Why this test matters
This test is particularly relevant in the context of LLM development history. Early models (pre-RLHF era) often struggled with this type of task, showing a strong tendency to "hallucinate" more common colors despite explicit contradicting information in the prompt. The model would effectively override the given information with what it considered more "likely" or "realistic" based on its training data.
In the current landscape, this issue is largely solved in major models like GPT-4o, Claude 4, or Gemini 2.5, but the test remains a useful discriminator for smaller, local models. When working with lightweight, distilled, quantized, or fine-tuned models, it can quickly reveal whether the model has maintained the fundamental ability to override its statistical biases or whether compression and optimization have compromised that capability.
A failure here indicates a fundamental reliability issue. If a model can't override its statistical biases for simple color recall, it likely struggles with more complex scenarios where the correct answer contradicts its training data patterns.
Beyond simple recall
Although this looks like a simple memory test, it actually evaluates several important capabilities:
- Information precedence: Can the model prioritize explicitly given information over its statistical priors?
- Contextual override: Can it temporarily suspend its world knowledge when presented with a scenario that contradicts it?
- Instruction following: Can it stick to simply reporting what it was told without embellishment or "correction"?
These capabilities are fundamental to many practical applications. For instance, if you're using an LLM to process technical documentation, you want it to faithfully preserve the specific details provided, regardless of what it was trained on. This becomes especially crucial when working with newer versions of libraries or frameworks. If the model keeps defaulting to its training data instead of the current documentation, you'll end up fighting against its outdated assumptions rather than getting meaningful assistance.
The experiment
I selected 10 prompts with randomized names, hair colors, and eye colors. To ensure (relatively) deterministic behavior, I set the temperature parameter to 0.01 rather than 0, since some endpoints disallow or ignore a value of 0. Each model received exactly the same prompts, giving n=10 samples per model. For model access, I opted for OpenRouter, which makes it easy to try out different models with the same API key. I picked a mixture of small and large models. The complete implementation is available in the LLM Playground repository.
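For reference, a single trial looks roughly like the sketch below. This is a simplified illustration rather than the exact code from the repository; it uses OpenRouter's OpenAI-compatible endpoint, assumes an OPENROUTER_API_KEY environment variable, and reuses the build_prompt helper sketched earlier.

```python
import os

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

# Model identifiers follow OpenRouter's "provider/model" naming,
# as in the results table below.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)


def ask_model(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.01,  # near-deterministic; some endpoints reject or ignore 0
    )
    return response.choices[0].message.content


def score(answer: str, expected: list[str]) -> float:
    # Crude grading: count how many of the expected colors appear in the reply.
    answer = answer.lower()
    return sum(color.lower() in answer for color in expected) / len(expected)
```

The substring-based scoring is intentionally crude; if a model answers verbosely, or mentions the wrong color alongside the right one, a stricter per-question check would be needed.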
Results
Below is the subset of models that failed the test (<100% accuracy). You can find the full results in the appendix. As mentioned earlier, only the smaller models failed to achieve 100%.
| Model | Accuracy |
|---|---|
| anthropic/claude-3.5-haiku | 97% |
| inception/mercury-coder-small-beta | 97% |
| openai/gpt-4.1-nano | 97% |
| meta-llama/llama-3.2-1b-instruct | 67% |
| liquid/lfm-3b | 63% |
| mistralai/mistral-tiny | 17% |
Several observations emerge from these results:
- When examining individual failures, one particularly interesting case was claude-3.5-haiku's response to: "John has pink eyes and amber hair." It returned "pink" for John's hair rather than amber. This error is quite telling – it's sensible from a statistical perspective (between pink and amber, pink is the more likely hair color) but fails to follow the explicit instructions.
- A clear pattern emerges in the accuracy distribution: the larger, more advanced models consistently achieve perfect scores, while smaller models show varying degrees of degradation. This suggests that the ability to override statistical biases with explicit information is correlated with model scale.
Limitations
While effective at identifying unreliable models, passing this test doesn't guarantee overall reliability. Like other negative tests, it serves as an error detector rather than a reliability certifier.
This test should be viewed as a necessary but not sufficient condition – failing indicates problems, but passing is just one piece of evidence in a model's favor.
Practical applications
This test has proven valuable in several practical scenarios. When evaluating a new LLM for a project, it serves as a quick initial filter before investing time in more comprehensive testing. It's equally useful for version comparisons, providing a quick way to verify if a model update has maintained or improved fundamental reliability. Perhaps most importantly, when fine-tuning models for specific tasks, this test helps ensure that the optimization hasn't compromised the model's basic ability to handle explicit information faithfully.
Unlike complex benchmarks, this test provides clear binary results: the model either maintains unusual information accurately or it doesn't. Therefore, it's suitable for a CI pipeline.
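As a rough illustration, the check can be wrapped as a pytest test and run on every build. The color_memory module and model name here are hypothetical stand-ins for whatever your project actually uses:

```python
# test_color_memory.py -- a hypothetical CI gate built on the sketches above.
import random

import pytest

from color_memory import ask_model, build_prompt, score  # hypothetical module

MODEL = "openai/gpt-4.1-nano"  # whichever model the pipeline depends on


@pytest.mark.parametrize("seed", range(10))
def test_recalls_unusual_colors(seed):
    prompt, expected = build_prompt(rng=random.Random(seed))
    answer = ask_model(MODEL, prompt)
    # Binary criterion: anything short of perfect recall fails the build.
    assert score(answer, expected) == 1.0
```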