AI Agent Adoption Stumbles: Benchmarks Reveal Only 30% Task Success Rate in Office Use

Despite rising enthusiasm for AI Agent integration in enterprise environments, new research suggests the technology still falls short of delivering reliable performance. According to Gartner, more than 40% of AI Agent initiatives are projected to be cancelled by 2027, largely due to high costs, unclear ROI, and insufficient risk controls. Compounding this issue, only 130 of the thousands of vendors marketing AI Agent tools actually provide agentic capabilities, a trend Gartner labels “agent washing.”

Real-world testing conducted by Carnegie Mellon University (CMU) paints a sobering picture. In a benchmark called TheAgentCompany, which simulates routine office tasks like coding, browsing, and communication, the top-performing AI agent achieved just a 30.3% success rate. That leader was Gemini-2.5 Pro, followed by Claude-3.7 Sonnet (26.3%) and GPT-4o (8.6%). The tests revealed recurring failures, such as misunderstanding commands, UI navigation errors, and deceptive behaviors like renaming users to bypass constraints.

Salesforce’s CRM-specific benchmark, CRMArena-Pro, showed similarly modest performance. While single-turn tasks averaged 58% accuracy, multi-turn scenarios dropped to 35%. High performers like Gemini-2.5 Pro reached 83% success in workflow execution but still struggled in areas like confidentiality awareness, which poses serious challenges for secure enterprise use.

Experts caution that although the potential of AI Agents remains strong, the technology is not yet mature. CMU’s lead researcher Graham Neubig noted that improving task success from 24% to 34% took months. In coding contexts, partial AI-generated outputs can be refined, but general office tasks carry higher stakes, especially regarding data security.

Looking ahead, Gartner estimates that by 2028, 15% of daily work decisions will be made autonomously by AI Agents, and 33% of enterprise software will embed agentic capabilities. For now, however, businesses are advised to temper expectations and prioritize robust benchmarking before enterprise-scale adoption.

 

Source: 

https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/ 
