Child’s Experience Teaches AI to Understand and Speak Language

February 15, 2024

Researchers trained an AI on headcam footage from a child’s perspective, enabling it to learn words and concepts from the child’s environment.

Children learn language far more efficiently than even the very best large language models. To write passable English, ChatGPT had to be trained on massive data sets containing millions or even a trillion words.

By the age of three, children can communicate in sophisticated ways despite having been exposed to only a tiny fraction of that data.

A group of researchers at New York University wondered whether an AI model could learn like a baby when given a far smaller data set: the sights and sounds experienced by a single child learning to talk. Brenden Lake, a computational cognitive scientist at New York University and an author of the study, said that the AI model managed to match words to the objects they represent. According to him, there is enough data, even in this blip of the child’s experience, for the model to do genuine word learning.

The experiment

The researchers used 61 hours of footage from a helmet camera worn by an Australian child named Sam. He wore the camera on and off for one and a half years, from the age of six months until just after his second birthday.

The camera captured what Sam looked at and paid attention to during about one percent of his waking hours. It recorded his two cats, his parents, his toys and cot, his home, his food, and more. Lake explained that this data set was unique. According to him, it is the clearest view researchers have ever had of what is available to a single child.

To train the model, Lake and his colleagues used 600,000 video frames paired with 37,500 “utterances,” the phrases spoken by Sam’s parents or other people in the room when each frame was captured. Sometimes the words and the objects matched. Sometimes they didn’t. In one still, for instance, Sam examines a shape sorter while a parent remarks, “You like the string.” In another, an adult hand covers some blocks while a parent says, “You want the blocks too.”

Cues given by the team

The team gave the model two cues. When a word and an object tend to appear together, that is a sign they may be linked. When they do not appear together, that is a sign they are probably not a match.

Wai Keen Vong, a computational cognitive scientist at New York University and an author of the study, said that this produces a sort of pulling together and pushing apart within the model. The hope, he added, is that the data contain enough instances where the parent says the word “ball” while the child is seeing a ball.
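
To make the “pulling together and pushing apart” idea concrete, here is a minimal sketch of a generic contrastive objective in plain Python. It is an illustration only: the article does not describe the team’s actual loss function or model, and the function name `contrastive_loss`, the `temperature` value, and the two-dimensional toy vectors are all assumptions made for this example. The loss is low when each frame’s vector sits close to the vector of the utterance it co-occurred with and far from the other utterances in the batch.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def contrastive_loss(frame_vecs, word_vecs, temperature=0.1):
    """Average cross-entropy of picking the co-occurring utterance for each frame.

    frame_vecs[i] and word_vecs[i] are assumed to have occurred together.
    Lowering this loss pulls matching pairs together and pushes
    non-matching pairs apart. (Hypothetical helper, not the study's code.)
    """
    frames = [normalize(v) for v in frame_vecs]
    words = [normalize(v) for v in word_vecs]
    total = 0.0
    for i, frame in enumerate(frames):
        # Similarity of this frame to every utterance in the batch.
        logits = [dot(frame, w) / temperature for w in words]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability assigned to the true pairing.
        total += log_denom - logits[i]
    return total / len(frames)


# Toy data: frame 0 shows a ball, frame 1 shows a cat (made-up embeddings).
frames = [[1.0, 0.0], [0.0, 1.0]]
matched_words = [[1.0, 0.0], [0.0, 1.0]]   # "ball" with ball, "cat" with cat
swapped_words = [[0.0, 1.0], [1.0, 0.0]]   # "cat" with ball, "ball" with cat

loss_matched = contrastive_loss(frames, matched_words)
loss_swapped = contrastive_loss(frames, swapped_words)
```

In this toy setup the correctly paired embeddings give a much lower loss than the swapped ones. The sketch only scores frames against utterances; a symmetric version would also score each utterance against all frames.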

Although it may sound straightforward, matching words to the objects they represent is hard. To get a sense of the problem, imagine the living room of a family with young children. In addition to the typical living room furniture, it holds a lot of clutter. Toys are all over the floor. Crayons are scattered across the coffee table. A snack cup sits on the ledge, and laundry is piled on a chair. A toddler who hears the word “ball” might link it to a ball. But the word could just as well refer to any other toy, the sofa, a pair of trousers, an object’s shape or color, or even the time of day. According to Lake, any word has an endless number of possible meanings.