Huawei unveils Claw-Anything benchmark, revealing AI agents’ limitations in personal assistant tasks
Even GPT-5.5 managed only a 34.5% success rate on Huawei's new test, exposing just how far AI personal assistants are from being truly useful.
Your AI assistant can summarize a PDF and set a timer. Ask it to manage your actual digital life across multiple devices, services, and days of accumulated context, and things fall apart fast. That’s the uncomfortable conclusion from Huawei’s new Claw-Anything benchmark, which simulates the messy reality of being a human with a phone, a laptop, and too many apps.
The benchmark, published as a preprint on arXiv on May 25, was developed by Huawei researchers alongside teams from Beijing Institute of Technology, Peking University, and the Chinese Academy of Sciences’ Institute of Automation. Its purpose is straightforward: test whether AI agents can function as always-on personal assistants in environments that actually resemble real life.
The results are humbling
GPT-5.5, currently among the most capable large language models available, scored a 34.5% pass@1 rate on Claw-Anything. In English: when given one shot at completing a realistic personal assistant task, the model failed roughly two out of every three times.
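For readers unfamiliar with the metric, pass@1 is simply the share of tasks an agent completes on its first and only attempt. The sketch below shows that arithmetic in Python. The task count (200) and the number of passes (69) are hypothetical figures chosen to reproduce a 34.5% rate; the benchmark's actual task count and scoring harness are not described here.

```python
def pass_at_1(first_attempt_results: list[bool]) -> float:
    """Return the share of tasks solved on the first (and only) attempt."""
    if not first_attempt_results:
        raise ValueError("need at least one task result")
    return sum(first_attempt_results) / len(first_attempt_results)


# Hypothetical outcomes: 200 tasks, 69 solved on the first try.
# These counts are illustrative assumptions, not figures from the paper.
results = [True] * 69 + [False] * 131

rate = pass_at_1(results)
print(f"pass@1 = {rate:.1%}, failure rate = {1 - rate:.1%}")
```

With a 65.5% failure rate, "roughly two out of every three" is a fair shorthand.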
That number sits near the bottom of the range top models have posted on more constrained benchmarks. Previous evaluations like ClawBench, which tested AI agents on 153 everyday online tasks, saw top models scoring between 33% and 44%. And those tests were simpler, more isolated, and less reflective of how people actually use digital tools.
Claw-Anything raises the bar considerably. It evaluates AI agents across three dimensions: long-horizon user activity histories (think weeks of accumulated digital behavior), multi-service backend dependencies (where one task’s completion depends on another service responding correctly), and integrated multi-device interactions spanning both graphical and command-line interfaces.
A training pipeline that actually helps
Alongside the benchmark itself, the research team built an automated data generation pipeline that produces 2,000 training environments. The idea is to give developers a way to fine-tune their models against more realistic conditions rather than the sanitized datasets that have become standard.
The results from this pipeline are encouraging, at least relatively. The Qwen3.5-27B model showed a 23.7% improvement in successful task completions after being fine-tuned using these generated environments.
The benchmark sits within a broader ecosystem of “Claw”-branded open-source AI agent projects that gained momentum starting in 2025. OpenClaw, a related platform, has attracted hundreds of thousands of GitHub stars and serves as a foundation for several evaluation frameworks. WildClawBench, another project in this family, uses the OpenClaw environment but operates independently from Huawei’s Claw-Anything effort.
What this means for the AI and crypto landscape
This benchmark doesn’t have a token. There’s no governance DAO, no staking mechanism, and no airdrop. Huawei’s initiative is firmly rooted in traditional AI research and evaluation, with no alignment to decentralized or token-driven frameworks.
That distinction matters because the AI-crypto intersection has become one of the hottest narratives in digital asset markets over the past two years. Projects promising autonomous AI agents that can trade, manage portfolios, or interact with DeFi protocols have attracted billions in combined market capitalization.
Claw-Anything suggests the gap between current AI capabilities and the autonomous agent future that many crypto projects are selling is wider than token prices might imply. If the best available models manage only a 34.5% success rate on a simulated personal assistant workflow, that doesn’t inspire confidence in an AI agent managing DeFi positions across multiple chains.
Still, the fine-tuning pipeline demonstrates that meaningful improvement is possible with better training data and more realistic environments. The 23.7% gain from Qwen3.5-27B after fine-tuning suggests that the problem isn’t an architectural dead end but rather the quality and realism of training conditions.