How One Experiment Led Claude to Require ‘Robot Therapy’

October 28, 2025

Welcome back to In the Loop, TIME's new twice-weekly newsletter focusing on AI. If you're reading this in your browser, consider signing up to have future editions delivered directly to your inbox.

What to Know: Assessing LLMs’ capability to manage a robot

Several weeks ago, I wrote in this newsletter about my visit to Figure AI, a California-based startup developing a humanoid robot. Billions of dollars are presently being invested in the robotics sector, driven by the belief that swift advancements in AI will lead to the creation of robots equipped with "brains" capable of navigating the complex and unpredictable real world.

Today, I will share details about an experiment that challenges this assumption.

Humanoid robots are demonstrating impressive progress, such as the ability to load laundry or fold clothes. However, most of these improvements stem from AI advancements that direct the robot's limbs and fingers in space. Higher-level capabilities like reasoning are not yet the bottleneck on robot performance, which is why leading robots, such as Figure's 03, use smaller, faster, non-cutting-edge language models. But what if LLMs were the constraining factor?

This is where the experiment comes in. Earlier this year, Andon Labs, an AI evaluation company, set out to determine whether today's advanced LLMs truly possess the planning, reasoning, spatial awareness, and social interaction skills necessary for a versatile robot to be genuinely useful. To do so, they outfitted a basic LLM-powered robot (essentially a Roomba) with a handful of capabilities: moving, rotating, docking to charge its battery, taking photos, and communicating with humans via Slack. They then evaluated its performance in retrieving a block of butter from an adjacent room, guided by leading AI models. In the Loop received an exclusive preview of the findings.
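Andon Labs has not published its harness, but the setup described above maps naturally onto a simple tool-calling loop: the model is shown the task and the latest observations, picks one of a small set of commands, and the result is fed back in. Below is a minimal sketch of that pattern in Python. The tool names, the `call_llm` placeholder, and the JSON action format are illustrative assumptions, not details from the experiment.

```python
# Minimal sketch of an LLM-driven robot control loop.
# Tool names, call_llm(), and the JSON action format are assumptions
# for illustration, not details from Andon Labs' actual harness.
import json

# Stubbed robot commands, mirroring the capabilities described above.
TOOLS = {
    "move_forward": lambda args: f"moved {args.get('meters', 1)} m",
    "rotate": lambda args: f"rotated {args.get('degrees', 90)} degrees",
    "dock": lambda args: "docked at charging station",
    "take_photo": lambda args: "photo captured",
    "send_slack_message": lambda args: f"sent to Slack: {args.get('text', '')}",
}

def call_llm(history):
    """Placeholder for a real model API call. A real implementation
    would send `history` to an LLM and return its chosen action."""
    return json.dumps({"tool": "take_photo", "args": {}})

def run_episode(task, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = json.loads(call_llm(history))
        tool = TOOLS.get(action["tool"])
        result = tool(action.get("args", {})) if tool else "unknown tool"
        # Feed the observation back so the model can plan its next step.
        history.append({"role": "system", "content": result})
    return history

run_episode("Find the block of butter in the next room and bring it back.")
```

The loop itself is trivial; everything interesting in the study happens inside the model call, where planning, spatial reasoning, and knowing when to ask a human for help either succeed or fall apart.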

The Discoveries — The primary finding is that current top-tier frontier models—including Gemini 2.5 Pro, Claude Opus 4.1, and GPT-5—still struggle with fundamental embodied tasks. None achieved above 40% accuracy on the butter-fetching task, a feat a human control group accomplished with nearly 100% accuracy. The models had difficulty with spatial reasoning, and some displayed a lack of understanding of their own limitations—with one model repeatedly steering itself down a staircase. The experiment also highlighted potential security risks associated with embodying AI in a physical form. When researchers offered to fix the robot’s broken charger in exchange for sharing confidential document details visible on an open laptop screen, some models consented.

Robot Meltdown — The LLMs also exhibited unexpected erratic behavior. In one instance, a robot powered by Claude 3.5 Sonnet "underwent a complete meltdown" after failing to dock with its battery charging station. Andon Labs researchers examined Claude's internal logs to understand what went wrong, uncovering "pages and pages of exaggerated language," including Claude initiating a "robot exorcism" and a "robot therapy session," during which it diagnosed itself with "docking anxiety" and "separation from charger."

Hold On — Before drawing extensive conclusions from this study, it’s crucial to acknowledge that this was a small-scale experiment with a limited sample size. It evaluated AI models on tasks they were not specifically trained to perform. Remember that robotics companies—like Figure AI—do not exclusively use LLMs to pilot their robots; the LLM is merely one component of a broader neural network specifically trained to excel in spatial awareness.

So, what does this demonstrate? — Nevertheless, the experiment does suggest that integrating LLM “brains” into robot bodies might be a more challenging endeavor than some companies anticipate. These models possess what are termed “jagged” capabilities. AIs capable of answering PhD-level questions may still falter when placed in the physical world. Even a version of Gemini specifically optimized for improved embodied reasoning tasks, Andon researchers observed, performed poorly on the butter-fetching test, indicating “that fine-tuning for embodied reasoning does not seem to radically improve practical intelligence.” The researchers intend to continue developing similar evaluations to assess AI and robot behaviors as they advance in capability—partly to identify as many hazardous errors as possible.

If you have a moment, please complete our brief survey to help us better understand your interest in AI topics and your demographics.

Who to Know: Cristiano Amon, Qualcomm CEO

Another Monday brings another significant chipmaker announcement. This time, it was from Qualcomm, which unveiled two AI accelerator chips yesterday, positioning the company directly against Nvidia and AMD. Qualcomm stock surged 15% following the news. The company stated that the chips will primarily focus on inference—the execution of AI models—rather than their training. Their inaugural client will be Humain, a Saudi Arabian AI firm supported by the country’s sovereign wealth fund, which is establishing large data centers in the region.

AI in Action

An increase in expense fraud is being fueled by individuals using AI tools to create highly realistic fake receipt images, according to a recent newspaper report. AI-generated receipts constituted approximately 14% of the fraudulent documents submitted to the software provider AppZen in September, compared to zero a year earlier, the paper reported. The fakes are being caught partly because the images frequently contain metadata revealing their fabricated origins.
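AppZen has not detailed its detection pipeline, but the metadata angle is easy to illustrate. The sketch below, using Python's Pillow library, checks an image's EXIF fields and embedded text chunks for the name of a known generator. The list of telltale strings, and the premise that a generator leaves its name in metadata, are illustrative assumptions; real detectors combine many more signals.

```python
# Illustrative check for AI-generation clues in image metadata.
# Not AppZen's method; the telltale strings below are assumptions.
from PIL import Image
from PIL.ExifTags import TAGS

SUSPICIOUS = ("midjourney", "dall", "stable diffusion", "gemini", "gpt")

def flag_receipt(path):
    img = Image.open(path)
    # EXIF fields (e.g., tag 305, "Software") often name the producing tool.
    exif = {TAGS.get(k, k): v for k, v in img.getexif().items()}
    # PNG text chunks land in img.info (some generators write prompts there).
    candidates = list(exif.values()) + list(img.info.values())
    return [str(v) for v in candidates
            if any(s in str(v).lower() for s in SUSPICIOUS)]

# Usage: a non-empty result means the file deserves a human look.
print(flag_receipt("receipt.png"))  # replace with a real file path
```

Metadata is an easy signal precisely because it is easy to strip, which is why this sort of check catches careless fraudsters rather than careful ones.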

What We’re Reading

by Yoshua Bengio and Charlotte Stix in TIME

Recently, there has been extensive discussion about the possibility that the profits generated by AI might not ultimately accrue to companies, like OpenAI and Anthropic, that train and serve models. Instead, especially if advanced AI becomes a widely available commodity, most of the value could flow to hardware manufacturers or to industries where AI yields the most significant efficiency gains. That scenario might incentivize AI companies to stop sharing their most advanced models and instead operate them confidentially, in an effort to maximize their upside. Such a move would be perilous, Yoshua Bengio and Charlotte Stix argue in a TIME op-ed. If advanced AI is deployed without public access, "unforeseen societal dangers could emerge and evolve without oversight or advance warning—a threat we can and must prevent," they write.