
Researchers at Alibaba are targeting one of the most persistent problems in modern AI agents: knowing when to rely on built-in knowledge and when to call external tools. Their answer is a new reinforcement learning framework, Hierarchical Decoupled Policy Optimisation (HDPO), and a multimodal model called Metis trained with this approach.
According to the researchers, Metis can slash redundant tool calls, such as unnecessary web searches or code execution, from 98% to just 2%. At the same time, it achieves new state-of-the-art reasoning accuracy on key industry benchmarks, suggesting that cutting tool use does not have to mean sacrificing quality.
The work is designed to address what the Alibaba team describes as a “profound metacognitive deficit” in current agentic models. Today’s large language model–based agents often struggle with a basic decision: should they answer from internal (parametric) knowledge, or should they reach out to an external API or tool?
Because many systems are trained to prioritize completing the task at all costs, they routinely default to calling tools even when the user’s prompt already contains enough information. That can mean invoking web search, code execution, or other utilities without genuine need.
This “trigger-happy” behaviour has several consequences for real-world deployments:
- Latency bottlenecks: Every external tool call typically runs in sequence with the model’s reasoning steps. When most calls are unnecessary, these serial bottlenecks accumulate and slow the system down.
- Higher API and infrastructure costs: External calls often translate directly into billable API usage or extra compute cycles. Excessive tool use can quickly inflate operating costs.
- Degraded reasoning from noise: Tool outputs can introduce additional environmental noise. When models depend on these noisy signals even when they don’t need them, reasoning quality can suffer.
HDPO tackles this by explicitly training agents to balance two objectives: execution efficiency and task accuracy. Instead of blindly optimizing for successful completion, the framework encourages models to learn when abstaining from tool use is the better choice.
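To make that trade-off concrete, the sketch below shows one way a reward could combine task accuracy with a penalty on redundant tool calls. The article does not spell out HDPO's actual objective, so the function, the `tool_calls_needed` label and the weights here are purely illustrative assumptions.

```python
# Hypothetical sketch only: HDPO's real objective and hyperparameters are not
# described in the article. The names and weights below are illustrative.

def shaped_reward(answer_correct: bool,
                  tool_calls_made: int,
                  tool_calls_needed: int,
                  accuracy_weight: float = 1.0,
                  redundancy_penalty: float = 0.2) -> float:
    """Return a scalar reward for one episode.

    answer_correct    -- whether the agent's final answer was correct
    tool_calls_made   -- how many external tools the agent invoked
    tool_calls_needed -- how many calls the task actually required (assumed label)
    """
    # Reward task success first.
    reward = accuracy_weight * (1.0 if answer_correct else 0.0)

    # Penalise every tool call beyond what the task required, so the
    # policy learns that abstaining can be the better choice.
    redundant_calls = max(0, tool_calls_made - tool_calls_needed)
    reward -= redundancy_penalty * redundant_calls
    return reward


# A correct answer that used two unnecessary web searches scores lower
# than a correct answer that used none.
print(shaped_reward(True, tool_calls_made=2, tool_calls_needed=0))  # 0.6
print(shaped_reward(True, tool_calls_made=0, tool_calls_needed=0))  # 1.0
```

Under a reward like this, blindly completing the task with extra tool calls is no longer the highest-scoring strategy, which is the behavioural shift the researchers describe.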
Metis is the multimodal model Alibaba trained using the HDPO framework, and it is the source of the reported figures above: redundant tool invocations drop from 98% to just 2% while reasoning accuracy reaches new state-of-the-art levels, though the specific benchmarks and scores are not detailed in the available summary.
The results suggest that with the right reinforcement learning setup, AI agents can become more selective about when to reach for external tools. Rather than being “trigger-happy,” Metis aims to make tool calls only when they meaningfully contribute to solving the task, and rely on internal knowledge when that’s sufficient.
For developers and organisations building AI systems, this kind of behaviour has clear potential benefits: more responsive user experiences, lower tool and API bills, and agents that are less prone to being thrown off by noisy external data.