Deep Dive Series: Using Tools of Cognitive Science to Decipher Modern Artificial Intelligence

By Anil Ananthaswamy

We know that even the best large language models can hallucinate — a euphemism for when LLMs produce inaccurate answers to user queries. If that weren’t problematic enough, today’s frontier language models have also been shown to actively engage in deception. For example, a GPT-4-based agent, operating as a stock trader in a simulated setting, not only carried out trades based on insider information but also hid its intent when reporting the trade to its manager, according to the UK-based Apollo Research. The researchers called it “the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.”

Ensuring the safety and accuracy of LLMs, even as we leverage their remarkable problem-solving abilities, is no easy task. LLMs are massive artificial neural networks, composed of computational units called artificial neurons. The largest LLMs from OpenAI, Anthropic and Google can have more than half-a-trillion parameters that encode the strengths of the connections between neurons. Once trained on internet-scale data, these artificial intelligence (AI) models and their parameters can be inscrutable, and making sense of their behavior can be daunting.

But this is a familiar problem for cognitive scientists. Our brain—the most complex system we know of—has long posed a similar headache. “The situation that we’re in [with AI] is very much the one that cognitive scientists have been in for the last 70 years, which is trying to make sense of a complex system based just on its behavior,” said Tom Griffiths, director of Princeton’s AI Lab and professor of psychology and computer science at Princeton University.

Over the decades, cognitive scientists have built numerous tools to understand why the brain produces the behavior it does. These include methods for analyzing the similarity that a system perceives between different stimuli, methods for analyzing the kinds of errors that systems produce, and ways of thinking about systems in terms of the goals those systems might have. “We’re increasingly using those tools to try and understand how these AI systems work,” said Griffiths.

Learning From Limitations

Take, for example, an analysis of a particular type of failure exhibited by multimodal language models, which can handle inputs in different formats such as text and images. The task is to examine the following two images and find the green dot amid the red dots.

For humans, the task is rather easy in both cases. The green dot is the target, and the red dots are distractors. It doesn’t matter how many distractors we encounter; finding the target takes about the same amount of time. That’s because the distractors and the target share one feature (they are both circles) but differ distinctly in another (their color), so color alone is enough to single out the target. When we examine the images, we perform what cognitive scientists call a disjunctive search to find the target.

Next, try your hand at the above problem. Find the target—a green “L”—amid the distractors (either a green “T” or a red “L”). Now the task becomes harder when there are more distractors. That’s because the target shares one feature with each type of distractor: the type of letter (“L”) with one set of distractors and the color (green) with the other. No single feature uniquely distinguishes the target from both types of distractor. We perform, in cognitive science lingo, a conjunctive search to find the target. Humans resort to serially going over each element or region of the image, and hence it takes longer, on average, to find the target amid numerous distractors. If we are told to answer quickly, which prevents a serial search, our performance degrades precipitously as the number of distractors increases.
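To make the contrast concrete, here is a minimal Python sketch of the two search regimes. The timing parameters and the deadline are purely illustrative assumptions, not fitted to human data; the point is only that a parallel (disjunctive) search is flat in the number of distractors, while a serial (conjunctive) search slows down, and cutting it short with a deadline trades time for accuracy.

```python
import random

def disjunctive_search_time(n_distractors, base=0.4):
    # Parallel "pop-out" search: time is roughly constant in the number of distractors.
    return base + random.gauss(0, 0.02)

def conjunctive_search_time(n_distractors, base=0.4, per_item=0.05):
    # Serial search: on average, about half the items are inspected before the target.
    inspected = random.randint(1, n_distractors + 1)
    return base + per_item * inspected

def conjunctive_accuracy_under_deadline(n_distractors, deadline=0.8,
                                        base=0.4, per_item=0.05):
    # With a response deadline, the chance of reaching the target before time runs
    # out shrinks as the number of distractors grows.
    max_inspected = max(0, int((deadline - base) / per_item))
    return min(1.0, max_inspected / (n_distractors + 1))

random.seed(0)
for n in (4, 8, 16, 32):
    print(f"{n:2d} distractors: "
          f"disjunctive ~{disjunctive_search_time(n):.2f}s, "
          f"conjunctive ~{conjunctive_search_time(n):.2f}s, "
          f"deadline accuracy ~{conjunctive_accuracy_under_deadline(n):.2f}")
```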

It turns out that multimodal language models suffer the same fate. Declan Campbell of the Princeton Neuroscience Institute, Griffiths, and colleagues at Princeton, Dartmouth College, Microsoft Research and EPFL in Lausanne, Switzerland, tested four such models—GPT-4v, GPT-4o, Gemini Ultra 1.5, and Claude Sonnet 3.5—and found that the models showed no drop in performance with an increasing number of distractors when the search was disjunctive, but lost accuracy rapidly with an increasing number of distractors when the search was conjunctive.

This is an example of using observed behavior to gain insights into what might be happening inside the models. In this case, the researchers hypothesize that multimodal LLMs fail—just as humans do—when the task involves a conjunctive search through a large number of distractors because of the so-called binding problem. Binding involves associating one feature of an object (say, its color) with its other features (such as its shape or location). The binding problem arises when the features of different objects interfere, making the target difficult to discern; in our example, the target “L” shares its color with the green “T” distractors and its shape with the red “L” distractors, so color and shape must be correctly bound to the same item to pick out the target. Multimodal LLMs, by design, take in the entire image in one go and thus do not perform a serial search. So, just like humans forced to forgo a serial search, these models fail badly at finding the target amid a large number of distractors—a finding that suggests multimodal LLMs cannot adequately deal with the binding problem.

Attention Heads as Functional Units

But coming up with a hypothesis to explain the results is not the same as being able to identify components or circuits within large language models responsible for some observed behavior. Again, cognitive scientists and neuroscientists have for decades faced this exact problem when it comes to understanding the brain and have developed useful techniques for gaining insight, one of which is to study the brain at a suitable level of abstraction. “We think of the brain as composed of functionally separated regions, rather than just a collection of a bunch of different neurons,” said Andrew Nam, a postdoctoral researcher with the Princeton AI Lab’s Natural and Artificial Minds (NAM) research initiative, which is co-directed by professors Sarah-Jane Leslie and Tania Lombrozo. “We wanted to think about how we can apply the same perspective to a language model.”

Today’s large language models are based on the transformer architecture. From a bird’s-eye view, an LLM takes as input a sequence of tokens—essentially text broken up into pieces of information suitable for further computation—and first turns the tokens into a sequence of vectors, or embeddings. These are then processed by a series of transformer layers, each of which processes the embeddings to make them increasingly contextual, such that each embedding knows more and more about its relation to those that precede it in the sequence. This contextualization is implemented using the attention mechanism, or what’s called an attention head. Once the information has passed through the transformer layers, the last embedding of the final sequence of embeddings can be used to predict the next token. Seen from the perspective of words, this means predicting the next word given some input sequence of words.
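A stripped-down sketch of that attention step, written in NumPy with toy dimensions and random weights (a single head, no trained model), shows the basic mechanics: every position computes similarity scores against the positions before it, turns them into weights, and mixes in their values to produce a more contextual embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                           # toy dimensions, purely for illustration

x = rng.normal(size=(seq_len, d_model))            # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv                   # queries, keys, values
scores = q @ k.T / np.sqrt(d_model)                # similarity between every pair of positions

# Causal mask: a position may only attend to itself and earlier positions.
causal = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[causal] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the attended positions

contextualized = weights @ v                       # each embedding mixes in its context
print(weights.round(2))                            # row i: how position i attends to positions <= i
# In a full model, the contextualized embedding of the last position feeds the
# next-token prediction head.
```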

An essential architectural detail is that each transformer layer can have multiple attention heads, to capture multiple relationships between embeddings simultaneously, enabling massively parallel computations. For example, the Llama 3.1 8-billion-parameter model has 32 attention heads per transformer layer and 32 layers, for a total of 1,024 attention heads. These attention heads and their functions can serve as a higher level of abstraction for understanding LLM behavior, rather than the activations of individual artificial neurons.
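The bookkeeping behind that abstraction is simple enough to write down. The sketch below (plain Python, no model weights) just enumerates the (layer, head) pairs quoted above and treats an ablation mask, discussed next, as a labeling of each pair as active or ablated.

```python
n_layers, n_heads_per_layer = 32, 32     # figures quoted above for Llama 3.1 8B
heads = [(layer, head)
         for layer in range(n_layers)
         for head in range(n_heads_per_layer)]
print(len(heads))                        # 1024 candidate functional units

# A binary "ablation mask" over these units marks each head as active (1) or ablated (0).
mask = {unit: 1 for unit in heads}
```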

Given this level of abstraction, one way to interpret what’s happening inside a language model is to hypothesize that an individual attention head has some specific function, such as relating a pronoun to its noun in a sentence. The experimenter would then design experiments to either confirm or refute the hypothesis. For instance, one might ablate an attention head (akin to creating a lesion in biological tissue to disrupt function) and determine its effect. An ablation can be done using a mask: a vector whose elements determine which attention heads are active and which are inactive across the entire model. An experiment would, in effect, have to determine the probability of performing a task correctly given some mask, or P(correct | mask, task). This allows experimenters to assess the effects of specific ablation patterns on task performance. But this is a tedious task. “It’s what Allen Newell called playing 20 Questions with nature,” said Nam, referring to the renowned computer scientist, cognitive psychologist and Turing Award winner, who warned that one cannot win by playing such games with nature. “Let’s not do that,” said Nam.
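Here is a hedged sketch of what such an ablation looks like, using a self-contained toy multi-head attention function in PyTorch rather than a real LLM. The weights are random and the “lesion” is simply one zeroed entry of the head mask; an actual experiment would score task answers under many such masks to estimate P(correct | mask, task).

```python
import torch

torch.manual_seed(0)
seq_len, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads

x = torch.randn(seq_len, d_model)
Wq, Wk, Wv = (torch.randn(n_heads, d_model, d_head) for _ in range(3))

def attention_with_ablation(x, head_mask):
    """Toy multi-head self-attention; setting head_mask[h] = 0 silences head h."""
    outs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        att = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
        outs.append(head_mask[h] * (att @ v))       # an ablated head contributes zero
    return torch.cat(outs, dim=-1)

full_mask = torch.ones(n_heads)
lesioned = full_mask.clone()
lesioned[2] = 0.0                                   # "lesion" head 2

diff = (attention_with_ablation(x, full_mask) -
        attention_with_ablation(x, lesioned)).abs().max()
print(diff)                                         # how much ablating head 2 changes the output

# An experiment would instead score the model's answers on a task under many such
# masks, estimating P(correct | mask, task) as the fraction answered correctly.
```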

Bayes to the Rescue

So, Nam and Leslie, alongside Jonathan Cohen, the Robert Bendheim and Lynn Bendheim Thoman Professor in Neuroscience at Princeton, and colleagues are taking a different approach. “Can we look at every possible [attention head] configuration and see how that affects our model performance?” said Nam. “Because only when we do that can we actually make contact with Bayesian statistics, which is the gold standard on how we think about information.”

The Bayesian approach involves calculating a posterior probability distribution over ablation masks, given some prior belief about such a distribution and observed data. The posterior distribution allows one to determine which attention heads are most needed to correctly solve a given task, i.e., P(mask | task, correct). In general, this is an intractable computation, given the astronomically large number of possible configurations even for the relatively small Llama 3.1 8B model, with its 1,024 attention heads.
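For a toy model with only four heads, the posterior can be computed exactly by Bayes’ rule, P(mask | task, correct) being proportional to P(correct | mask, task) times P(mask). The sketch below uses a made-up likelihood in which two heads matter; it also makes the intractability plain, since with 1,024 heads there are 2^1024 possible masks.

```python
import itertools
import numpy as np

n_heads = 4                        # toy model; 2**4 = 16 possible masks

def p_correct(mask):
    # Hypothetical likelihood P(correct | mask, task): here we pretend heads 0 and 2
    # are the ones the task needs, and accuracy collapses when either is ablated.
    return 0.9 * mask[0] * mask[2] + 0.1

masks = list(itertools.product([0, 1], repeat=n_heads))
prior = np.full(len(masks), 1 / len(masks))            # uniform prior over masks

likelihood = np.array([p_correct(m) for m in masks])
posterior = likelihood * prior
posterior /= posterior.sum()                           # Bayes' rule, normalized

for m, p in sorted(zip(masks, posterior), key=lambda t: -t[1])[:4]:
    print(m, round(float(p), 3))

# With 1,024 heads there are 2**1024 masks -- far too many to enumerate,
# which is why the posterior has to be approximated.
```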

One way to approximate such a posterior distribution is to use Markov chain Monte Carlo (MCMC) methods, which are algorithms that can estimate parameters of a posterior distribution under some assumptions. But again, the sheer number of head configurations makes MCMC untenable. So, the NAM team is looking at using generative flow networks (GFlowNets)—a technique for estimating such distributions developed by Yoshua Bengio’s group at Mila, a Montreal-based artificial intelligence research institute. Briefly, a GFlowNet is an algorithm that trains a neural network, using ideas from reinforcement learning, to sample from a distribution (in this case, a distribution over ablation masks for the LLM’s attention heads) given some reward or score. The end-to-end training involves getting the LLM to solve tasks with some heads ablated; the LLM’s outputs, and their accuracy, serve as data for training the GFlowNet.
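The sketch below is a toy illustration of the trajectory-balance idea behind GFlowNets, not the NAM team’s implementation. A small policy network learns to build binary masks one position at a time, and the reward function is made up, standing in for the LLM’s task accuracy under a given ablation mask.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N_HEADS = 6                                     # toy number of heads, to keep training fast

def reward(mask):
    # Made-up reward standing in for task accuracy under this ablation mask:
    # it is high only when heads 1 and 4 are kept.
    return 0.05 + 0.95 * mask[1] * mask[4]

# Policy network: given the partially built mask (and which position is next),
# output logits for setting that position to 0 (ablate) or 1 (keep).
policy = nn.Sequential(nn.Linear(2 * N_HEADS, 64), nn.ReLU(), nn.Linear(64, 2))
log_z = nn.Parameter(torch.zeros(()))           # learned log of the partition function
opt = torch.optim.Adam(list(policy.parameters()) + [log_z], lr=1e-2)

def encode(partial):
    t = len(partial)
    x = torch.zeros(2 * N_HEADS)
    if t:
        x[:t] = torch.tensor(partial, dtype=torch.float)
    x[N_HEADS + t] = 1.0                        # one-hot marker for the current position
    return x

def sample_mask():
    partial, log_pf = [], torch.zeros(())
    for _ in range(N_HEADS):
        probs = torch.softmax(policy(encode(partial)), dim=-1)
        a = torch.multinomial(probs, 1).item()
        log_pf = log_pf + torch.log(probs[a])
        partial.append(a)
    return partial, log_pf

for step in range(2000):
    mask, log_pf = sample_mask()
    # Trajectory-balance objective: log Z plus the log-probability of building this
    # mask should match the log of its reward (the backward term vanishes here,
    # because each mask has exactly one construction path).
    loss = (log_z + log_pf - torch.log(torch.tensor(float(reward(mask))))) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    samples = [sample_mask()[0] for _ in range(5)]
print(samples)   # after training, most samples should keep heads 1 and 4
```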

Once trained, you can sample directly from the GFlowNet. Each sample is a mask that tells you which attention heads, when ablated, will most affect performance on the task—even if that mask never appeared in the training data. The GFlowNet thus provides a method for determining the efficacy of individual attention heads, or configurations of attention heads, for a given task. “This gives us the causality,” said Nam. “This will be in the language of Bayes.”

While the NAM team is still working on the GFlowNet-enabled sampling, they have already demonstrated proofs of principle of this approach.

In one study, they used this Bayesian approach to define a probability distribution over masks for ablating representations in neural networks trained to simultaneously predict both an animal’s type (one of 350) and its features. (A representation here refers to the activations, or outputs, of some set of artificial neurons in the neural network.) The method helps make sense of the representations learned by neural networks and their causal effects on tasks.

In another study, the NAM researchers developed a method called Causal Head Gating, which uses soft gates (a learned, matrix-based mechanism for emphasizing certain attention heads and ignoring others) to tag attention heads as facilitating, interfering, or irrelevant for a given task. “It doesn’t integrate with the Bayesian method just yet (which is much more computationally challenging), but it helps set the stage for why we might need it,” said Nam.
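As a rough illustration of the idea behind learned soft gates (not the paper’s exact recipe), the toy example below learns one sigmoid gate per “head” while minimizing a task loss. The head behaviors are contrived: one head is wired to carry the signal, one injects noise, and two contribute almost nothing.

```python
import torch

torch.manual_seed(0)
n_heads, n_items = 4, 256
x = torch.randn(n_items)
y = 2.0 * x                                        # the "task": predict 2x

def head_output(h):
    # Contrived per-head contributions: head 0 carries the signal, head 3 injects
    # noise, and heads 1 and 2 contribute (almost) nothing.
    if h == 0:
        return 2.0 * x                             # facilitating
    if h == 3:
        return 3.0 * torch.randn(n_items)          # interfering
    return 0.01 * torch.randn(n_items)             # irrelevant

gate_logits = torch.nn.Parameter(torch.zeros(n_heads))   # one learnable soft gate per head
opt = torch.optim.Adam([gate_logits], lr=0.1)

for _ in range(300):
    gates = torch.sigmoid(gate_logits)             # soft gates in [0, 1]
    pred = sum(gates[h] * head_output(h) for h in range(n_heads))
    loss = ((pred - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.sigmoid(gate_logits).detach())
# Expected pattern: gate 0 pushed toward 1 (facilitating), gate 3 toward 0
# (interfering), gates 1 and 2 left near their starting value (irrelevant).
```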

Roll Over Bayes

The Bayesian approach is just one way to do statistics. The other is what’s called frequentist statistics. While Bayesian statistics relies on calculating a posterior probability distribution given some prior distribution (one’s belief about the data) and observed data, frequentist statistics uses only observations, and the frequency with which events occur in those observations, to arrive at estimates of distributions.

“Frequentist [statistics] is also really useful, and really simple, and a lot of times the real tool of the trade for most psychologists and cognitive scientists,” said Nam.

The researchers want to use this approach for evaluating LLMs. Currently, teams evaluating LLMs often compare the performance of, say, one instance of GPT-4 with that of one instance of DeepSeek V1. “It’s like you are comparing two entirely different populations,” said Nam. Such comparisons don’t mean much, because there is no notion of how confident one should be in the findings. To address such concerns, the NAM researchers are looking at a frequentist approach: take, for example, 30 trained versions of the same model, each initialized with a different random seed, and evaluate their performance on some tasks. This could, for instance, show that different instances of the same LLM are anywhere between 50 and 90 percent accurate on the tasks. It would allow researchers to establish error bars for models and to attach confidence intervals to claims that compare LLMs. “This is tapping into the tools of statistics that psychology has developed for really understanding opaque systems, humans, and bringing that statistical rigor to LLMs,” said Nam.
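In code, that style of analysis is straightforward. The sketch below uses made-up accuracies for 30 hypothetical training runs (not real measurements) and reports their mean with a 95 percent confidence interval.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies of 30 instances of the same model, each trained from a
# different random seed and evaluated on the same tasks. Stand-in data, not real results.
rng = np.random.default_rng(7)
accuracies = rng.uniform(0.5, 0.9, size=30)

mean = accuracies.mean()
sem = stats.sem(accuracies)                        # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(accuracies) - 1,
                                   loc=mean, scale=sem)

print(f"mean accuracy {mean:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
# Comparing two model families would then use these intervals (or a t-test across
# seeds) rather than a single run of each model.
```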
