Stanford University researchers paid 1,052 people $60 each to read the first two lines of The Great Gatsby into an app. After that, an AI that looked like a 2D sprite from an SNES-era Final Fantasy game asked the participants to tell it their life story. The researchers took those interviews and turned them into AI agents that they say mimic the participants’ behavior with 85% accuracy.
The study, titled “Generative Agent Simulations of 1,000 People,” is a joint venture between Stanford and scientists working for Google’s DeepMind AI research lab. The pitch is that creating AI agents based on random people will help policymakers and entrepreneurs better understand the public. Why use focus groups or public polls when you can talk to people once, spin up an LLM based on that conversation, and then have their thoughts and opinions forever? Or, at least, as close an approximation of those thoughts and feelings as an LLM can reproduce.
“This work provides a foundation for new tools that can help investigate individual and collective behavior,” the paper’s abstract states.
“How, for example, would a different set of individuals respond to new public health policies and messages, react to product launches, or respond to major shocks?” the paper continues. “When simulated individuals are integrated into collectives, these simulations can help pilot interventions, develop complex theories that capture nuanced causal and contextual interactions, and expand our understanding of structures such as institutions and networks across domains such as economics, sociology, organizations, and political science.”
All of those possibilities rest on a two-hour interview fed into an LLM that then answers questions mostly like its real-life counterpart would.
Most of the process was automated. The researchers contracted Bovitz, a market research firm, to gather participants. The goal was a broad sample of the U.S. population, or as broad as possible when limited to 1,000 people. To complete the study, participants signed up for an account in a purpose-built interface, created a 2D sprite avatar, and started talking to an AI interviewer.
The interview questions and style are a modified version of those used in the American Voices Project, a joint project of Stanford and Princeton University that interviewed people across the country.
Each interview began with the participants reading the first two lines of The Great Gatsby (“In my younger and more vulnerable years my father gave me some advice that I’ve been turning over in my mind ever since. ‘Whenever you feel like criticizing any one,’ he told me, ‘just remember that all the people in this world haven’t had the advantages that you’ve had.’”) as a way to calibrate the audio.
According to the paper, “The interview interface features a 2-D sprite avatar representing the interviewer’s agent at the center, with the participant’s avatar displayed underneath, walking towards a goal post to indicate progress. When the AI interviewer agent speaks, it is signaled by a pulsing animation in the center circle with the interviewer’s avatar.”
The two-hour interviews produced transcripts that averaged 6,491 words in length. They covered race, sex, politics, income, social media use, stress at work, and the makeup of participants’ families. The researchers published the interview script and the questions the AI asked.
Those transcripts, each less than 10,000 words, were fed into another LLM that the researchers used to spin up generative agents meant to imitate the participants. The researchers then put both the participants and their AI clones through a battery of questions and economic games to see how they compared. “When an agent is questioned, the entire interview transcript is injected into the model prompt, instructing the model to imitate the person based on their interview data,” the paper said.
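The paper doesn’t publish its exact prompt, but the mechanic described in that quote is straightforward to sketch: dump the whole transcript into the prompt, then ask the model to answer in character. Here’s a minimal illustration in Python, assuming an OpenAI-style chat API; the model name, prompt wording, and `ask_agent` helper are my own placeholders, not the researchers’ code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_agent(transcript: str, question: str) -> str:
    """Pose one survey question to a generative agent built by injecting a
    participant's full interview transcript into the model prompt."""
    system_prompt = (
        "You are imitating a specific person. Below is the full transcript "
        "of a two-hour interview with them. Answer every question exactly "
        "as this person would, consistent with their views, background, "
        "and manner of speaking.\n\n"
        f"INTERVIEW TRANSCRIPT:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; not the model setup the paper used
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```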
This part of the process was about as controlled as possible. The researchers used the General Social Survey (GSS) and the Big Five Personality Inventory (BFI) to test how closely the LLMs matched their human counterparts. They then ran the participants and the LLMs through five economic games to see how they compared.
The results were mixed. The AI agents answered about 85% of the GSS questions the same way their real-world counterparts did, and matched 80% on the BFI. The numbers fell once the agents started playing economic games, however. The researchers offered the real-life participants cash prizes to play games like the Prisoner’s Dilemma and the Dictator Game.
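The survey half of that comparison is, at bottom, agreement counting, with one wrinkle: my reading of the paper is that an agent’s raw score is normalized against how consistently the human replicates their own answers when re-asked two weeks later. A rough sketch of that arithmetic, with hypothetical numbers:

```python
def agreement(a: list[str], b: list[str]) -> float:
    """Fraction of questions answered identically across two answer sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def normalized_accuracy(human_t1: list[str], human_t2: list[str],
                        agent: list[str]) -> float:
    """Agent-vs-human agreement, scaled by the human's own test-retest
    agreement -- an assumption about how the headline number is computed."""
    return agreement(human_t1, agent) / agreement(human_t1, human_t2)

# Hypothetical example: the agent matches 68 of 100 answers, and the human
# self-replicates 80 of 100 two weeks later -> 0.68 / 0.80 = 0.85, or 85%.
```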
In the Prisoner’s Dilemma, participants can choose to cooperate, so both succeed, or to betray their partner for a chance at a bigger payoff. In the Dictator Game, participants decide how to allocate resources among other participants. The real-life players stood to earn money on top of their original $60 by playing.
Faced with these economic games, the AI clones didn’t mimic their real-world counterparts nearly as well. “On average, the generative agents achieved a normalized correlation of 0.66,” the paper said, or about 66%.
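Because the games produce continuous decisions (dollars kept, amounts shared) rather than categorical answers, the comparison becomes a correlation instead of an agreement rate. A sketch with NumPy, under the same assumption that scores are normalized by the participants’ own test-retest consistency:

```python
import numpy as np

def normalized_correlation(human_t1: np.ndarray, human_t2: np.ndarray,
                           agent: np.ndarray) -> float:
    """Pearson correlation between agent and human game decisions, scaled
    by the humans' own test-retest correlation -- one plausible reading of
    the paper's reported 'normalized correlation' of 0.66."""
    r_agent = np.corrcoef(human_t1, agent)[0, 1]
    r_retest = np.corrcoef(human_t1, human_t2)[0, 1]
    return r_agent / r_retest
```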
The full paper is worth reading if you’re interested in how academics and the public are thinking about AI agents. It didn’t take the researchers long to boil a human being’s personality down into an LLM that behaves similarly. Given more time and energy, they could probably bring the two even closer together.
This worries me. Not because I don’t want to see the ineffable human spirit reduced to a spreadsheet, but because I know this kind of technology will be used for ill. We’ve already seen dumber LLMs trained on public recordings trick grandmothers into giving bank information to an AI relative after a quick phone call. What happens when those machines come with a script? What happens when they have access to purpose-built personalities based on social media activity and other publicly available information?
What happens when a corporation or a politician decides that the public wants and needs something based not on their stated will, but on an estimate of it?