The Book of Life: Moving from a Sociology of Variables to a Sociology of Events

By Anil Ananthaswamy

In December 1946, a paper titled “Record Linkage” appeared in the pages of the American Journal of Public Health. It began with these words: “Each person in the world creates a Book of Life. This Book starts with birth and ends with death.” The author, biostatistician Halbert Dunn, then the head of the U.S. National Office of Vital Statistics, coined the phrase “record linkage” to describe the process of putting together such a Book of Life, whose pages would describe the salient trajectory of a person’s life.

Dunn further wrote: “Events of importance worth recording in the Book of Life are frequently put on record in different places since the person moves about the world through-out his lifetime. This makes it difficult to assemble this Book into a single compact volume.”

Not anymore. It has taken almost 80 years, but Princeton University researchers are bringing Dunn’s idea to fruition, thanks to two major advances: the availability of myriad forms of data capturing the behavior of people over their lifetimes, and the invention of large language models (LLMs), which can be trained on such data and then used to answer queries about someone’s life.

Matthew J. Salganik, the Alexander Stewart 1886 Professor of Sociology at Princeton University, and colleagues in the Department of Sociology and the Center for Information Technology Policy at Princeton have developed a toolkit for creating and analyzing an individual’s Book of Life, using data from Statistics Netherlands, a Dutch government agency tasked with compiling information about the country’s inhabitants. “The Netherlands is the only place I know that has this rich registry data with clear legal and ethical guardrails and access to the registry data in a high-performance computing environment,” said Salganik.

The eventual goal—though the researchers are some ways from achieving it—is to train foundation language models using data from millions of Books of Life (some of which might go from birth to death), to help accurately predict an individual’s life outcomes. “In theory, we [should be able to] write out everything about your medical records, your social history and the environment that you’re in, and then have [the LLM] continue generating what comes next, including medical outcomes,” said Salganik.

Matt Salganik, speaking about his research at the AI Lab End-of-Year celebration, held on the Princeton University campus in May 2025. Photo by Sameer Khan.

Inspiration From Netflix

Sociologists have long sought to develop frameworks to predict life outcomes for individuals, given the current trajectories of their lives. Will someone have a child? Will someone slide into poverty? Will they get heart disease? Answers to such questions have obvious social, healthcare and other ramifications. But the task has proven somewhat intractable.

In fact, the Book of Life project grew in part out of Salganik’s disappointment with the results of a challenge designed to test the predictive power of existing methods, using some of the best sociological data available. “The fact that we have what we think of as a large, high-quality dataset and we can’t predict basic life outcomes, that was troubling,” said Salganik.

The challenge had its roots in a longitudinal study started in 1998 by sociologists Sara McLanahan of Princeton University and Irwin Garfinkel of Columbia University, called The Fragile Families and Child Wellbeing Study and later renamed The Future of Families and Child Wellbeing Study. The researchers randomly selected hospitals in 20 U.S. cities, and then, on randomly chosen days, went to those hospitals, chose—again randomly—a sample of babies born on those days, interviewed the parents to learn about the beginnings of the family, and then followed more than 4,000 such families over time. By about 2015, these children and their families had been surveyed in six waves: first at birth and then at ages 1, 3, 5, 9 and 15. The result was rich data, with each child characterized by more than 12,000 different variables.

When the project was set to release the sixth wave of data, Salganik, McLanahan and others realized that they were sitting on a treasure trove. “This is like a magical moment,” said Salganik, something that happens with every longitudinal study that collects data about the same cohort of participants over an extended length of time. The moment opens up a unique opportunity: can one use the data from prior years to predict the data that is about to be unveiled? Sociologists normally don’t think along these lines, so no one had ever taken on such a challenge.

As it happened, Salganik was aware of related efforts in artificial intelligence and machine learning built around what is known as the common task method. For example, in 2010, Fei-Fei Li of Stanford University and her team instituted the ImageNet Large Scale Visual Recognition Challenge. Competitors had to train their image recognition algorithms on about 1.2 million images of objects belonging to 1,000 categories and submit the algorithms for evaluation on images that had been kept out of the training dataset. The algorithm that best categorized the images in the held-out data was deemed the winner. In 2012, for the first time, a deep neural network, called AlexNet, won the ImageNet challenge, triggering the ongoing revolution in AI.

Salganik was inspired by the ImageNet challenge, and also by the Netflix Prize, which ran from 2006 to 2009. Netflix provided data on about half a million users (identified only by numbers), more than 17,000 films, and about 100 million ratings those users had given the films. Contestants had to build machine learning algorithms that, trained on this data, predicted ratings that had been held out; the held-out ratings were used to judge each algorithm’s accuracy. In 2009, Netflix awarded $1 million to a team that improved upon the company’s internal algorithm by 10 percent.

It was along these lines that Salganik and McLanahan, along with Ian Lundberg and Alex Kindel, who were then PhD students at Princeton, fashioned their challenge. The task: use the data collected about families until the children were 9 years old (the first five waves) to predict six outcomes about the families when the children turned 15 (the sixth wave). These outcomes were: a child’s GPA; the child’s grit, a psychological measure of passion and perseverance; the extent of material hardship experienced by the household; whether the primary caregiver lost a job; whether a primary caregiver participated in job training; and whether the household faced eviction.

A total of 160 research teams participated. The teams used any and all available machine learning algorithms: linear regression, logistic regression, random forests, support vector machines, even neural networks. No one succeeded, even though the data was of the highest quality for this kind of study. “The best predictions were not very accurate and were only slightly better than those from a simple benchmark model,” wrote Salganik and colleagues. For some sociologists, though, this wasn’t surprising. Predicting what’s going to happen in life is hard.
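To make the common task pattern concrete, here is a minimal sketch in scikit-learn with synthetic data. The variables, sample sizes and models are stand-ins, not the challenge’s actual data or benchmark, and in the real challenge the held-out evaluation was run by the organizers rather than by a local train/test split.

```python
# A rough sketch of the common-task evaluation pattern, using synthetic data
# in place of the actual Future of Families variables.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_families, n_features = 4000, 200            # stand-ins for families and survey variables
X = rng.normal(size=(n_families, n_features))
# The outcome is weakly related to a couple of features plus a lot of noise,
# loosely mimicking how hard the real outcomes turned out to be to predict.
y = 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(size=n_families)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

benchmark = LinearRegression().fit(X_train[:, :4], y_train)   # few-variable benchmark
flexible = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("benchmark R^2:", r2_score(y_test, benchmark.predict(X_test[:, :4])))
print("flexible  R^2:", r2_score(y_test, flexible.predict(X_test)))
```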

But “why is predicting these life outcomes hard? What is the mechanism that leads to that?” said Salganik. “I would be fine with us not being able to predict very accurately if we could explain why.”

Sociological Dark Matter

To understand the origins of such unpredictability, Salganik, Lundberg (now at the University of California, Los Angeles) and colleagues analyzed data collected via in-depth, qualitative interviews with 40 families whose data had been part of the Fragile Families Challenge, the prediction challenge built on the Future of Families and Child Wellbeing Study (FFCWS) data. Their intent was to find the “dark matter” that wasn’t in the data, a phrase borrowed from astronomy, where it refers to unseen matter thought to be out there in the universe. Could such dark matter explain the variability in the prediction of life outcomes?

For example, they looked at the grades, at age 15, of a group of specially selected children (identified using the best predictive model from the Fragile Families Challenge). “Our hope was that we could find out what is helping some of these kids do better than expected, what is causing some of these kids to not do as well,” said Salganik. “We wanted to just go find the dark matter.”

But in the end, not unlike in astronomy, they came up empty-handed.

Instead, the researchers arrived at a mathematical formulation for the errors in the prediction of social outcomes, showing that the error can be decomposed into two components. The first component is an irreducible error, which depends on the outcome or feature being predicted (in this case, the grades): it is the “average squared difference between each person’s outcome and the true (but unknown) mean among people who are observationally identical to them.” This error is irreducible because it does not depend on the estimates made by any model that has learned to make predictions, only on the data itself. The second component is the learning error, which depends on both the predictions a model has learned to make and the actual outcomes. “We can’t actually measure which of these components is bigger or smaller, because we don’t know how to measure one of them. But we know that these two components exist,” said Salganik.
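In generic notation (a sketch using standard symbols, not necessarily the ones in the paper), write $Y$ for a person’s outcome, $\mu(X)$ for the true but unknown mean among people who are observationally identical, and $\hat{Y}$ for a model’s prediction based only on the observed features. The decomposition then takes the familiar form

$$
\underbrace{\mathbb{E}\big[(Y-\hat{Y})^2\big]}_{\text{total error}}
\;=\;
\underbrace{\mathbb{E}\big[(Y-\mu(X))^2\big]}_{\text{irreducible error}}
\;+\;
\underbrace{\mathbb{E}\big[(\mu(X)-\hat{Y})^2\big]}_{\text{learning error}},
$$

where the first term is fixed by the world and by which features were measured, and only the second can shrink as models and training data improve.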

It was clear that the data from their longitudinal studies, even the impressively large Future of Families dataset, wasn’t enough to reduce total prediction error. They speculated about “what would happen as the number of cases goes to infinity, as the number of features goes to infinity, or both,” said Salganik. “We couldn’t do that with the [Future of Families] data. We realized we needed a fundamentally new data source, if we wanted to study these limits.”

Going Dutch

That’s where Statistics Netherlands comes in. The organization has created a system of datasets that “contains a wealth of information on persons, households, jobs, benefits, pensions, education, hospitalizations, crime reports, dwellings, vehicles and more.” Given the nature of the data, it is well protected: researchers can only work with these datasets inside Statistics Netherlands’ secure computing environment, and even then, access is heavily restricted. Salganik and his colleagues were able to work with the data as part of a prediction challenge organized by Gert Stulp, Elizaveta Sivak, and other researchers at the University of Groningen and ODISSEI, a Dutch national research organization focused on the social sciences.

“The Dutch population registry is amazing, but it is by far the most complex data I’ve ever used. It’s a series of different files collected by different parts of the government that are all stored and organized slightly differently. Some of them have one row for each person in each year. Some of them have one row for each household. Some of them are timestamped events. They’re all not exactly the same,” said Salganik. “All of this data is there, but it’s not put into a uniform representation.”

Traditionally, sociology research has involved organizing such data into one large matrix. Think of each row of the matrix as representing one person, and each column representing some feature or variable about the person. Associated with each row is some life outcome. The task is to learn to predict the outcome given the row. “That paradigm doesn’t feel natural for [studying] life trajectories,” said Salganik.
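In code, that paradigm looks roughly like this (a sketch with invented variables, not the actual survey or registry columns):

```python
import pandas as pd

# "Sociology of variables": one row per person, one column per variable,
# plus a life outcome to be predicted from the row.
people = pd.DataFrame(
    {
        "person_id": [1, 2, 3],
        "age": [34, 52, 29],                       # hypothetical features
        "years_of_education": [16, 12, 18],
        "household_income": [41000, 28000, 55000],
        "evicted_next_year": [0, 1, 0],            # hypothetical outcome
    }
)

X = people.drop(columns=["person_id", "evicted_next_year"])
y = people["evicted_next_year"]
# A model learns a mapping from each row of X to the corresponding entry of y.
```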

The idea of using sequences of life events for sociological research isn’t novel. In 1995, sociologist Andrew Abbott of the University of Chicago wrote a paper titled “Sequence Analysis: New Methods for Old Ideas.” Abbott wrote: “A quiet revolution is underway in social science. We are turning from units to context, from attributes to connections, from causes to events. The change has many antecedents: the exhaustion of our old paradigm, our inherent desire for change, the new powers of computers. It also has many consequences: new areas for empirical work, new methodologies, rediscovery of important old theories.”

Abbott was ahead of his time. Sociologists in the 1990s had neither the data nor the computing power necessary for sequence analysis. Not so anymore. “We can move from a sociology of variables to a sociology of events,” said Salganik.

The central idea is to take data about individuals—such as credit card events, health events, demographic and educational events and so on—that are stored in different databases and organize them into a time series of events representing a person’s life trajectory. This representation can also take into account contextual information, such as the events of everyone in a person’s family, or events happening in their neighborhood. The intent is to capture a person in context over time. Salganik and colleagues went one step further: they turned each life event into a description in natural language (in this case, English). “So, what we did with the Dutch registry is we took all of these different records stored in different ways, at different units of analysis, and tried to put them together into a book that’s written out in language,” said Salganik.

This connects directly back to Halbert Dunn’s 1946 idea of “record linkage,” so the researchers named this way of representing an individual’s data The Book of Life. For now, they have done something relatively simple: each Book of Life is organized as [key: value] pairs. If a key is “City of Birth,” for example, the corresponding value might be “Amsterdam.” Nonetheless, it represents a shift from a quantitative matrix of numbers to something qualitative, written in natural language.
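A minimal sketch of that representation (the attribute names, events and sentence templates are invented for illustration; they are not the actual Statistics Netherlands fields):

```python
from typing import Dict, List

def book_of_life(attributes: Dict[str, str], events: List[Dict[str, str]]) -> str:
    """Render one person's records as a natural-language 'Book of Life'."""
    lines = [f"{key}: {value}" for key, value in attributes.items()]
    for event in sorted(events, key=lambda e: e["date"]):   # chronological order
        lines.append(f"On {event['date']}, {event['description']}.")
    return "\n".join(lines)

# Hypothetical, de-identified records for a single person.
attributes = {"Year of birth": "1986", "City of Birth": "Amsterdam"}
events = [
    {"date": "2010-09-01", "description": "enrolled in a university program"},
    {"date": "2015-03-15", "description": "started a full-time job in Utrecht"},
    {"date": "2019-06-01", "description": "moved to a rented apartment in Rotterdam"},
]

print(book_of_life(attributes, events))
```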

Reading the Book of Life

This shift allowed the researchers to use one of the most powerful tools of modern AI: large language models. LLMs such as OpenAI’s ChatGPT, Anthropic’s Claude and Google’s Gemini are trained, given a sequence of words as input, to predict the next word (or token, a sub-unit of a word, but “word” suffices here). During inference—the phase when a user interacts with the trained LLM—an algorithm appends the predicted word to the user’s input and feeds the result back to the LLM, which then predicts the word that follows the augmented sequence. This process, called autoregression, continues until the LLM predicts an end-of-text token or hits a length limit. LLMs pre-trained on almost all of the freely available text on the internet (whether natural language, scientific notation or programming code) show surprising behavior: the output in response to a user’s prompt can resemble a reasoned answer to the question posed. Pre-trained models are often further trained, or fine-tuned, for specific tasks—such as engaging in chats or answering math questions—using datasets curated for the task at hand.
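A bare-bones version of that loop, written with the Hugging Face Transformers library (GPT-2 is used here only because it is small and public; production LLMs add more elaborate sampling and stopping rules):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for any open-weight causal LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Year of birth: 1986. City of Birth: Amsterdam."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Autoregression: predict the next token, append it, and feed the longer
# sequence back in, until an end-of-text token appears or a cap is reached.
for _ in range(40):
    with torch.no_grad():
        logits = model(input_ids).logits      # scores for every possible next token
    next_token = logits[0, -1].argmax()       # greedy choice of the most likely token
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(input_ids[0]))
```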

In the Princeton team’s case, the curated dataset comprised millions of Books of Life. Given the constraints of Statistics Netherlands’ secure computing facility, the team could not use the biggest available LLMs, which require sending data to external servers hosted by companies such as OpenAI. Instead, the team had to bring the model to the data, restricting them to open-weight models that could run on servers within the secure facility. They chose Llama 3.1 8B Instruct, an 8-billion-parameter open-weight model, and fine-tuned it on their Book of Life dataset, which contained information on about 6 million people; about 4.1 million were used for training and the rest were held out for testing. The fine-tuned model was trained to predict whether or not an individual had a child between 2021 and 2023. While this approach beat a simple machine learning model that used only age and sex as features, it could not outperform a stronger, classic machine learning algorithm. But the researchers believe that LLMs fine-tuned on Books of Life have a lot more headroom to improve: bigger pretrained foundation models, more data from different domains of a person’s life, and more high-performance compute for fine-tuning.
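The broad shape of such a fine-tuning run can be sketched as follows. This is not the team’s pipeline, which ran entirely inside Statistics Netherlands’ secure environment on the actual registry data; the prompt format, hyperparameters and the use of LoRA adapters here are assumptions made to keep the example small.

```python
# Sketch of supervised fine-tuning on Book-of-Life text that ends with the outcome.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # the open-weight model named above
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = get_peft_model(model, LoraConfig(r=8, target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Hypothetical, de-identified training examples: each book ends with the label.
books = [
    {"text": "Year of birth: 1986. City of Birth: Amsterdam. ...\n"
             "Had a child between 2021 and 2023: yes"},
    {"text": "Year of birth: 1990. City of Birth: Utrecht. ...\n"
             "Had a child between 2021 and 2023: no"},
]
dataset = Dataset.from_list(books).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=1024)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bol-finetune",
                           per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```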

“If it still doesn’t work, I think that’s very interesting. And if it does work, that’s very interesting,” said Salganik. “We just have to try it.”

For now, Salganik is optimistic. “One of the things we want to do next is make this more autoregressive. The book would just have a bunch of words, and then [the LLM] should just predict the next words and write out the rest of your life.”

All this, of course, raises questions about ethics and privacy. As a society, we need laws governing access to such integrated data, said Salganik. The Dutch have shown that it can be done. They have clear guardrails in place: all data is de-identified, all computing is carried out in a secure environment disconnected from the outside world, and there is strong oversight of anyone accessing the data. Dutch law also prohibits predictions from models built on this data from being used in real-world decisions. “This is 100 percent science. We’re not trying to influence anything. We need some basic science before we even think about how to do it in the real setting,” said Salganik. “But, even if we never use these predictions to guide decisions about individual people, I expect that this research will lead to new scientific insights about how life trajectories develop.”
