
At the Okinawa Institute of Science and Technology in Japan, a robotic arm reaches out, grasps a red block, and moves it to the left. It’s a simple task, but the machine isn’t just following instructions. It’s learning what the words “red,” “move,” and “left” mean—not as abstract symbols, but as concepts tied to action and experience. This robot, powered by a brain-inspired AI, is taking its first steps toward understanding language the way humans do.
Large language models like ChatGPT can generate fluent text, but they don’t grasp the meaning behind the words. They rely on patterns in data, not real-world experience. Humans, on the other hand, learn language by interacting with the world. We know what “hot” means because we’ve felt heat. We understand “fall” because we’ve stumbled. Now, a team of researchers is trying to teach AI the same way, taking inspiration from how infants acquire their first words.
“The inspiration for our model came from developmental psychology. We tried to emulate how infants learn and develop language,” Prasanna Vijayaraghavan, lead researcher and a graduate student at OIST, told Ars Technica.
Currently, the robot can learn only five nouns and eight verbs. Even so, it shows that an AI can start forming connections between words and their meanings, a first step toward machines that don’t just recognize language but truly comprehend it.
Making algorithms understand human language
Developmental psychology suggests that babies constantly interact with their environment, and that this interaction plays a crucial role in their cognitive and language development. Physical interaction helps them build a mental model of how things work, and of how language describes those actions. An AI, on the other hand, is a software system built from algorithms and data, with no sensory machinery of its own to gather and interpret that kind of information.
The researchers came up with an interesting solution to this challenge. They integrated their AI model into a robot that could interact with and respond to objects in its surroundings. The robot had an arm with a gripper to pick up and move objects. It was also equipped with a basic, low-resolution RGB camera (64×64 pixels) to see what was around it.
Next, they positioned the robot so its camera faced a white table on which they had arranged green, yellow, red, purple, and blue blocks. Then they gave the robot verbal instructions like “move blue right” or “put red on blue”, and it had to move the blocks accordingly.
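To make that setup concrete, here is a minimal Python sketch of how such commands could be turned into inputs for a model. The colors and actions echo the ones in the article, but the trimmed vocabulary, the one-hot encoding scheme, and the function names are illustrative assumptions, not the authors’ implementation.

```python
# Hypothetical sketch: encoding verbal commands as one-hot token sequences.
# The vocabulary mirrors the article's colors and actions; the encoding
# itself is an assumption, not the paper's actual scheme.
import numpy as np

VOCAB = ["move", "put", "on", "left", "right",
         "green", "yellow", "red", "purple", "blue"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode_command(command: str) -> np.ndarray:
    """Turn a command like 'put red on blue' into a (seq_len, vocab) one-hot array."""
    tokens = command.lower().split()
    onehots = np.zeros((len(tokens), len(VOCAB)), dtype=np.float32)
    for t, tok in enumerate(tokens):
        onehots[t, TOKEN_TO_ID[tok]] = 1.0
    return onehots

print(encode_command("move blue right").shape)   # (3, 10)
print(encode_command("put red on blue").shape)   # (4, 10)
```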
While picking up and manipulating objects sounds like an easy task for any robot, the real challenge was getting the AI to process the words and understand their meaning. In the researchers’ words, they wanted to test whether the robot could develop compositionality.
“Humans excel at applying learned behavior to unlearned situations. A crucial component of this generalization behavior is our ability to compose/decompose a whole into reusable parts, an attribute known as compositionality,” the study authors note.
As Vijayaraghavan put it: “The compositionality phase is when children learn to combine words to explain things. They initially learn the names of objects, and the names of actions, but those are just single words. When they learn this compositionality concept, their ability to communicate kind of explodes.”
The test was a success: the robot’s behavior suggested it was developing compositionality. The AI model learned the concept of directional movement, such as shifting objects left or right, or stacking one item on top of another. It even combined words in new ways to carry out actions it hadn’t been trained on, like placing a red block on a blue one.
What was happening inside the AI brain?
In their study, Vijayaraghavan and his colleagues also explained the internal mechanism that allowed their AI model to learn words and their meanings. The AI is based on a two-decade-old theory called the free energy principle, which suggests that the human brain constantly makes predictions about the world and adjusts them based on new experiences.
This is how we plan actions, like reaching for a cup of tea, and make quick adjustments when needed, like stopping if the cup is too hot. It sounds like a simple action, but it involves a series of carefully coordinated steps.
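In computational terms, the free energy principle is often cast as predictive coding: keep a running prediction, compare it with what the senses report, and nudge the internal estimate in proportion to the prediction error. The toy loop below illustrates that general idea only; it is not code from the study.

```python
# Toy predictive-coding loop: an internal estimate is repeatedly corrected
# by the prediction error, i.e. the gap between prediction and observation.
# Purely illustrative; the study's model is far more elaborate.

def predictive_update(estimate: float, observation: float, lr: float = 0.2) -> float:
    error = observation - estimate   # prediction error
    return estimate + lr * error     # adjust the belief toward the evidence

estimate = 0.0
for observation in [1.0, 1.0, 0.9, 1.1, 1.0]:  # a noisy sensory stream
    estimate = predictive_update(estimate, observation)
    print(f"updated estimate: {estimate:.3f}")
```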
The robot uses four interconnected neural networks that work through a series of steps so that the AI can learn simple words. One network processes images from the camera, enabling the robot to identify objects. Another helps the robot track its own position and movements, ensuring it can adjust as needed. A third breaks down spoken commands into a format the AI can understand. The last network combines everything: vision, movement, and language, so the AI can predict the right action.
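As a rough sketch of how four such modules could be wired together, here is a hedged PyTorch outline. The module boundaries follow the article’s description (vision, body state, language, and an integrator), but every layer size, name, and design choice is an assumption made for illustration, not the authors’ architecture.

```python
# Hypothetical PyTorch outline of four cooperating modules, mirroring the
# article's description: vision, body state, language, and an integrator
# that fuses all three to predict the next action. All sizes are arbitrary.
import torch
import torch.nn as nn

class VisionNet(nn.Module):                      # processes 64x64 RGB frames
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 14 * 14, 64),
        )
    def forward(self, img):                      # img: (batch, 3, 64, 64)
        return self.conv(img)

class BodyNet(nn.Module):                        # tracks arm position/movement
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(input_size=7, hidden_size=64, batch_first=True)
    def forward(self, joints):                   # joints: (batch, time, 7)
        out, _ = self.rnn(joints)
        return out[:, -1]                        # last hidden state

class LanguageNet(nn.Module):                    # encodes the command tokens
    def __init__(self, vocab=10):
        super().__init__()
        self.rnn = nn.GRU(input_size=vocab, hidden_size=64, batch_first=True)
    def forward(self, tokens):                   # tokens: (batch, time, vocab)
        out, _ = self.rnn(tokens)
        return out[:, -1]

class Integrator(nn.Module):                     # fuses all three streams
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(64 * 3, 128), nn.ReLU(),
                                  nn.Linear(128, 7))  # next joint command
    def forward(self, v, b, l):
        return self.head(torch.cat([v, b, l], dim=-1))
```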
After learning a set of commands, it could apply that understanding to new situations. For example, if it knew how to “move red left” and “put blue on red,” it could figure out how to “put red on blue” without explicit training. This was compositionality in action.
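A common way to probe this kind of generalization, in the spirit of the test described above, is to hold out a few object-action combinations during training and evaluate on them afterward. The snippet below sketches such a split; the trimmed lists and exact commands are illustrative, not the authors’ protocol.

```python
# Illustrative train/test split for probing compositionality: train on most
# object-action combinations and hold a few out as unseen compositions.
# Lists are trimmed for brevity; the real study used five nouns and eight verbs.
from itertools import product

objects = ["green", "yellow", "red", "purple"]
actions = ["move left", "move right", "put on blue"]

all_pairs = list(product(objects, actions))                   # every combination
held_out = {("red", "put on blue"), ("purple", "move left")}  # never trained

train_pairs = [p for p in all_pairs if p not in held_out]
test_pairs = sorted(held_out)                                 # unseen compositions
print(f"{len(train_pairs)} training pairs, {len(test_pairs)} held-out pairs")
```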
This entire setup allowed the AI to understand verbal instructions, connect words to objects and actions, and perform the required tasks, much like humans do. Future research will focus on scaling the system up and expanding its capabilities.
“We want to scale the system up. We have a humanoid robot with cameras in its head and two hands that can do way more than a single robotic arm. So that’s the next step: using it in the real world with real-world robots,” Vijayaraghavan added.
The study is published in the journal Science Robotics.