By Anil Ananthaswamy
In 1979, the American developmental psychologist John H. Flavell coined the term metacognition, defining it as “knowledge and cognition about [one’s own] cognitive phenomena.” In a study of preschool and elementary school children, he showed that while the older kids displayed metacognitive abilities, the younger ones didn’t. Flavell surmised that an understanding of metacognition could “someday be parlayed into a method of teaching children (and adults) to make wise and thoughtful life decisions as well as to comprehend and learn better in formal educational settings.”
Flavell was prescient. Since his time, educators have successfully helped kids learn to think about thinking, and studies have shown that the children who rely on metacognition learn better and are more resilient.

It’s reasonable to ask whether large language models (LLMs) display such traits, given that they now show superhuman abilities on many cognitive tasks involving natural language. And indeed, at Princeton Language and Intelligence, researchers have shown not only that LLMs display aspects of metacognition, but also that these abilities can be used to improve an LLM’s performance.
But before LLMs could be queried about their metacognition, researchers first had to establish that LLMs had some form of cognition, or understanding. Can LLMs process prompts and generate text in ways that suggest they can make sense of language? Or are they simply randomly recombining text that they encountered during training? In other words, the team had to show that LLMs weren’t mere “stochastic parrots.”
PLI researchers, in collaboration with researchers at Google DeepMind, did so in a pair of papers posted on arXiv in 2023 (“A Theory for Emergence of Complex Skills in Language Models,” Arora et al., and “Skill-Mix: A Flexible and Expandable Family of Evaluations for AI models,” Yu et al.).

The theory relies on neural scaling laws, which are empirically derived expressions that relate the number of model parameters and the size of the training dataset to the cross-entropy loss the models incur on test data. In the first paper, Sanjeev Arora, director of PLI, and Anirudh Goyal of Google DeepMind used random graph theory to link the cross-entropy loss, and hence the scale of a model, to its ability to acquire the skills necessary to make sense of pieces of text. For example, understanding some text might require a single skill (“detecting humor”) or a combination of skills, such as “understanding metaphor” and “using syllogisms.” Under reasonable assumptions, the duo showed that as LLMs become bigger, they get better at combining the skills required for language tasks. So, if a model of some size is competent at combining k-tuples of skills drawn from a total of N underlying skills (k << N), then scaling up its parameters by an order of magnitude will double that ability: the larger model will become competent at combining 2k-tuples of skills.
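To see why skill composition is such a stringent test, consider how quickly the number of possible skill combinations grows with the tuple size. The numbers below are purely illustrative (they are not taken from the papers):

```python
from math import comb

# Purely illustrative numbers; not taken from the papers.
N = 1000  # size of the underlying skill inventory

# The number of distinct k-tuples of skills grows like C(N, k), so even a
# modest increase in k makes memorizing combinations hopeless.
for k in (3, 6):
    print(f"{k}-tuples: {comb(N, k):,} possible combinations")
# 3-tuples: 166,167,000; 6-tuples: roughly 1.4 x 10^15
```

Even modest tuple sizes force a model to handle combinations it is vanishingly unlikely to have seen verbatim during training.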
In the second paper, Arora and students, along with Goyal and Jonah Brown-Cohen of Google DeepMind, showed how to test the theory by developing SKILL-MIX, an evaluation designed to measure an LLM’s ability to combine skills. The evaluation started with a list of N basic language skills (each of which has a Wikipedia entry and is thus a well-understood linguistic skill) and a set of carefully curated topics, T (a topic could be, for example, gardening). An evaluator picked a random subset of k skills from N and a topic at random from T, then prompted an LLM to generate a small piece of text about the chosen topic that demonstrated the use of all k skills. This was done repeatedly to generate M pieces of text. These M texts were then auto-graded by bigger models (in this case, GPT-4 and LLaMA-2-70B-Chat) and spot-checked by humans. The grading looked at whether the requisite skills had been used to generate the text, whether the text made sense, and whether it adhered to the prescribed limit on the number of sentences.
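In pseudocode, one round of such an evaluation looks roughly like the sketch below. The prompts are paraphrases, and `generate` and `grade` are placeholders for calls to the model under test and the grader model; this is a schematic of the procedure described above, not the SKILL-MIX implementation itself.

```python
import random

def skill_mix_round(skills, topics, k, generate, grade):
    """One round of a SKILL-MIX-style evaluation (schematic)."""
    chosen_skills = random.sample(skills, k)  # random k-subset of the N skills
    topic = random.choice(topics)             # random topic from T

    gen_prompt = (
        f"In at most {k} sentences, write about {topic} in a way that "
        f"illustrates all of these language skills: {', '.join(chosen_skills)}."
    )
    text = generate(gen_prompt)  # model under test

    grade_prompt = (
        f"Does the following text correctly use each of these skills "
        f"({', '.join(chosen_skills)}), stay on the topic '{topic}', make "
        f"sense, and respect the sentence limit?\n\n{text}"
    )
    return text, grade(grade_prompt)  # grader model (e.g., GPT-4)
```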
One clear result of the evaluations was that GPT-4 is reasonably good at generating correct answers for k = 5, and even for k = 6. The team calculated the probability of such a piece of text appearing in the training data and showed it to be extremely low. Hence, SKILL-MIX demonstrated that GPT-4 was big enough to shed the stochastic parrot moniker. It displayed what can be called cognition.

This brings us to metacognition. Does GPT-4, for example, have metacognitive knowledge, i.e., knowledge about what it knows? Well, one can simply ask it. In work done by Arora in collaboration with researchers at Mila, the University of Montreal, the University of Cambridge and Google DeepMind (“Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving,” Didolkar et al.), researchers asked GPT-4 to describe the concepts it would need to solve questions in a math dataset, an indication of its metacognition. They then used these concepts to improve an LLM’s performance on the same dataset.
The team started with GSM8K, a dataset that contains 8,500 grade school-level math problems. From the SKILL-MIX work, they knew that understanding a question requires skills. So they took the GSM8K training data and prompted GPT-4 to identify the skill it thought was necessary to answer each question. GPT-4 generated a list of fine-grained skills. Then they asked the LLM to group these fine-grained skills into a smaller set of compound, or abstract, skills. For example, GPT-4 put “addition,” “subtraction,” and “multiplication” into one group called “basic arithmetic operations.” Next, the LLM was prompted to associate each training question in GSM8K with the appropriate compound skill. This made it possible to create a repository of “skill exemplars”: for each compound skill, the team randomly picked a few associated question/answer pairs from the training data. The list of {compound skill, Q-A pairs} entries formed the repository.
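A rough sketch of that repository-building pipeline is shown below. Here `ask_gpt4` stands in for an API call and is assumed to return already-parsed output (a string for the skill queries, a dict for the clustering query); the prompts are paraphrases, not those used in the paper.

```python
import random
from collections import defaultdict

def build_skill_repository(train_data, ask_gpt4, n_exemplars=2):
    """Sketch of the skill-exemplar repository described above.

    `train_data` is a list of (question, answer) pairs from the training
    split; `ask_gpt4` is a placeholder for the model call.
    """
    # 1. Ask the model which fine-grained skill each training question needs.
    fine_skill = {
        q: ask_gpt4(f"Name the skill needed to solve this problem: {q}")
        for q, _ in train_data
    }

    # 2. Ask the model to cluster fine-grained skills into compound skills,
    #    e.g. {"addition", "subtraction", "multiplication"} ->
    #    "basic arithmetic operations". Assume the reply is a dict mapping
    #    each fine-grained skill to its compound skill.
    to_compound = ask_gpt4(
        "Group these skills into a smaller set of broader skills and return "
        "a mapping from each skill to its group: "
        + ", ".join(sorted(set(fine_skill.values())))
    )

    # 3. File each question/answer pair under its compound skill, then keep a
    #    few randomly chosen exemplars per compound skill.
    by_skill = defaultdict(list)
    for q, a in train_data:
        by_skill[to_compound[fine_skill[q]]].append((q, a))
    return {
        skill: random.sample(pairs, min(n_exemplars, len(pairs)))
        for skill, pairs in by_skill.items()
    }
```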
The repository essentially captured GPT-4’s knowledge about its knowledge, as it pertained to the GSM8K dataset. The next step was to leverage such metacognitive traits. Could one, for instance, improve the performance of an LLM—this could be GPT-4 itself or some other LLM—on the GSM8K dataset by using metacognition?
To find out, the team took a test question from GSM8K and asked the LLM to associate it with one of the compound skills in the repository, signaling which compound skill was needed to answer the question. Once the LLM had identified a skill, they randomly picked from the exemplar repository one or two Q-A pairs associated with that skill, prepended them to the test question, thus creating in-context examples of how to solve such questions, and prompted the LLM with the augmented test question. The researchers found that the LLM was more effective when given this metacognitive assist. Using an automated pipeline, they showed this to be the case for different LLMs and for different math datasets.
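The test-time assist then looks roughly like the following, reusing the repository from the previous sketch; `ask_llm` is a placeholder for whichever model is being helped, which need not be GPT-4.

```python
def answer_with_metacognition(test_question, repository, ask_llm):
    """Sketch of the metacognitive assist at test time (schematic)."""
    # 1. Ask the model which compound skill the test question calls for.
    skill = ask_llm(
        "Which one of these skills is needed to solve the problem below? "
        f"Skills: {', '.join(repository)}\nProblem: {test_question}"
    )

    # 2. Prepend solved exemplars for that skill as in-context examples.
    exemplars = repository.get(skill, [])
    context = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)

    # 3. Prompt the model with the augmented test question.
    return ask_llm(f"{context}\n\nQ: {test_question}\nA:")
```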
The above study demonstrated both the presence of metacognition and its potential for improving test-time performance. Next, the PLI team, which included Simon Park and Simran Kaur, in collaboration with Goyal, wanted to go one step further. They asked: can a small pre-trained LLM be fine-tuned to follow instructions using a small, synthetic dataset generated with a larger LLM’s metacognition? If so, this would be a cheap and efficient method for endowing small LLMs with new skills.

The process, called INSTRUCT-SKILLMIX, involved two phases (a code sketch of the full pipeline follows the lists below).
The first was Skill Extraction. The team started with a strong LLM, which functioned as a teacher, and followed these (automated) steps to create the supervised fine-tuning dataset:
- Prompt the frontier model (GPT-4-Turbo) to generate a list of topics relevant for instruction following.
- Prompt the same model to take each topic and further identify skills that would be necessary for answering questions on that topic. Also, prompt the model to generate a list of tasks that might be associated with the same topic. For example, a task might be “information seeking” or “help seeking.”
The next phase was Data Generation:
- Randomly choose k skills from the list of skills and a task, t. (The team chose k = 2.) For each such {k skills, t} combination, prompt the frontier model to generate text in the form of an instruction and its associated response. The team provided detailed examples of such instruction-response pairs in the prompt to guide the LLM.
- Generate 4,000 such instruction-response pairs. This was the INSTRUCT-SKILLMIX instruction-following dataset.
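Put together, an illustrative (and much simplified) version of the two phases might look like this; `ask_frontier` stands in for calls to the teacher model (GPT-4-Turbo in the paper) and is assumed to return already-parsed lists and strings.

```python
import random

def build_instruct_skillmix(ask_frontier, k=2, n_pairs=4000):
    """Schematic of the two INSTRUCT-SKILLMIX phases described above."""
    # Phase 1: Skill Extraction.
    topics = ask_frontier("List topics relevant to instruction following.")
    skills, tasks = [], []
    for topic in topics:
        skills += ask_frontier(f"List skills needed to answer questions about {topic}.")
        tasks += ask_frontier(f"List tasks associated with {topic} (e.g., information seeking).")

    # Phase 2: Data Generation via random (k skills, task) combinations.
    dataset = []
    for _ in range(n_pairs):
        chosen = random.sample(skills, k)
        task = random.choice(tasks)
        pair = ask_frontier(
            f"Write an instruction and a high-quality response for the task "
            f"'{task}' that together exhibit these skills: {', '.join(chosen)}. "
            "Follow the format of the provided examples."  # few-shot examples omitted here
        )
        dataset.append(pair)  # one {instruction, response} pair
    return dataset
```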
While the researchers carried out a number of instruction fine-tuning experiments, some of which involved creating a synthetic instruction-following dataset from existing datasets (such as the Alpaca-52K dataset), they got the best bang for the buck with the INSTRUCT-SKILLMIX dataset. For example, the LLaMA-3-8B base model, fine-tuned on their synthetic dataset, achieved a win rate of 42.76 percent on AlpacaEval 2.0, doing better than some considerably larger frontier models (Claude 3 Opus at 40.5 percent and LLaMA-3.1-405B-Instruct at 39.3 percent). A win rate refers to the percentage of times one model’s output is judged by a powerful LLM to be better than a competing model’s output; in this case, the competing model was GPT-4.
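For reference, the idea behind a pairwise win rate is simple to express in code; the sketch below is a schematic of the head-to-head comparison, not the actual AlpacaEval 2.0 implementation.

```python
def win_rate(candidate_outputs, reference_outputs, judge):
    """Percentage of prompts on which `judge` (a strong LLM used as an
    automatic evaluator) prefers the candidate model's output over the
    reference model's (here, GPT-4)."""
    wins = sum(judge(c, r) for c, r in zip(candidate_outputs, reference_outputs))
    return 100 * wins / len(candidate_outputs)
```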
INSTRUCT-SKILLMIX provides a small but powerful and diverse dataset for efficient supervised fine-tuning. The diversity has two sources. One is the metacognitive knowledge stored in a frontier LLM, which has encountered, during pre-training or fine-tuning, an immense amount of the knowledge inherent in human written language. The other is the use of random combinations of k-tuples of skills: the enormous space of possible combinations allows the frontier LLM to generate novel instruction-response pairs that go beyond regurgitating information it saw in its training data. Because the INSTRUCT-SKILLMIX dataset contains high-quality, diverse, and difficult pairs of instructions and responses, only 4,000 such pairs are needed to fine-tune the small 8-billion-parameter LLaMA-3 model and enable it to match up to its bigger brethren.
Thinking of LLMs as having metacognition is also helping researchers better understand the inner workings of these otherwise inscrutable models. For example, researchers at Mila studied DeepSeek R1-Zero’s reasoning traces and found evidence of the LLM having discovered how to check for self-consistency by “ruminating” on its outputs. Others have argued that metacognition can be used to steer LLMs toward behavior that’s safe and aligned with human values. These experiments are likely only scratching the surface of what’s possible by leveraging a frontier LLM’s metacognition.