Humans are still better than today’s AI systems at describing and interpreting social interactions unfolding in moving scenes—an ability considered crucial for self-driving cars, assistive robots, and other technologies that must operate safely around people.
New research led by scientists at Johns Hopkins University suggests that many artificial intelligence models struggle to grasp the social dynamics and contextual cues that humans use automatically. The team argues that the limitations may be rooted in how these systems are built, not just in the amount of data they have seen.
“AI for a self-driving car, for example, would need to recognize the intentions, goals, and actions of human drivers and pedestrians,” said lead author Leyla Isik, an assistant professor of cognitive science at Johns Hopkins. “You would want it to know which way a pedestrian is about to start walking, or whether two people are in conversation versus about to cross the street. Any time you want an AI to interact with humans, you want it to be able to recognize what people are doing—and this shows these systems can’t right now.”
Co-first author Kathy Garcia, who worked in Isik’s lab during the study, is set to present the findings at the International Conference on Learning Representations on April 24.
To compare AI performance with human perception, the researchers asked participants to watch three-second video clips and rate, on a scale from one to five, features that matter for understanding social situations. The clips showed people interacting with each other, doing activities side by side, or acting independently.
The team then tested more than 350 AI models across language, video, and image categories. The models were asked to predict how humans would rate the clips and how viewers’ brains would respond to them. For large language models, the researchers provided short, human-written captions describing the videos.
Overall, human participants largely agreed with one another across the questions. The AI systems did not—regardless of model size or training data. Video models often failed to accurately describe what people were doing, while image models given sequences of still frames could not reliably determine whether people were communicating. The researchers also found a split in strengths: language models were better at predicting human judgments, while video models more closely matched patterns of neural activity.
The findings stand in contrast to AI’s strong performance on still-image tasks, the researchers noted.
“It’s not enough to just see an image and recognize objects and faces. That was the first step, which took us a long way in AI,” Garcia said. “But real life isn’t static. We need AI to understand the story that is unfolding in a scene. Understanding relationships, context, and the dynamics of social interactions is the next step, and this research suggests there might be a blind spot in AI model development.”
One possible explanation, the team said, is that many neural networks were inspired by brain systems that process static images—while the brain regions that interpret dynamic social scenes work differently.
“The big takeaway is none of the AI models can match human brain and behavior responses to scenes across the board, like they do for static scenes,” Isik said. “There’s something fundamental about the way humans are processing scenes that these models are missing.”
