When AI Agents Go Wrong: The Emerging Science of Agentic Safety

In May 2025, security researchers disclosed a vulnerability in AutoGPT that allowed attackers to steal GitHub OAuth tokens through a simple redirect exploit. The fix took 48 hours. In a separate incident, a developer using ChatGPT Search inadvertently incorporated a code snippet from a scam GitHub page returned by the search tool, exposing a private API key and losing approximately $2,500. These are not hypothetical scenarios — they are documented consequences of deploying AI agents that interact with real-world tools without adequate safety mechanisms.

Beyond Chatbot Jailbreaks

The safety challenges facing AI agents are qualitatively different from those facing conversational chatbots. When an LLM is limited to generating text, a jailbreak produces harmful words. When an agent can execute code, send emails, transfer money, or access databases, a safety failure produces harmful actions. This distinction — from language risk to action risk — demands fundamentally different safety frameworks.

Sha, Tian, Xu et al. (2025) present what they describe as a unified safety-alignment framework for tool-using agents, addressing two distinct threat channels that prior work has largely treated in isolation.

The first is user-initiated threats: adversarial prompts designed to coerce agents into invoking sensitive tools or performing harmful actions. The second, and perhaps more insidious, is tool-initiated threats: compromised or malicious tools that embed covert instructions in their outputs, steering agents toward unauthorized actions even when the user's intent is entirely benign.

The framework introduces a tri-modal taxonomy that classifies both user prompts and tool responses into three categories: benign (execute immediately), malicious (refuse categorically), and sensitive (pause and request explicit user verification). A sandboxed reinforcement learning environment trains agents to internalize these behavioral rules, simulating real-world tool execution with calibrated rewards.

Evaluations on Agent SafetyBench, InjecAgent, and BFCL demonstrate that the safety-aligned agents resist security threats while preserving strong utility on benign tasks — showing that safety and effectiveness need not be in tension.

The Search Agent Threat Surface

Dong, Guo, Wang et al. (2025) expose a particularly concerning vulnerability in search-augmented agents. When LLMs connect to the open internet through search tools, they inherit all the reliability problems of the web itself — content farms, sponsored misinformation, adversarial SEO, and deliberately manipulated pages.

SafeSearch, their automated red-teaming framework, systematically evaluates this threat across 300 test cases spanning five risk categories: indirect prompt injection, harmful output generation, bias induction, sponsored advertisement promotion, and misinformation propagation. The framework tests three representative search agent architectures across 17 different LLMs, including both proprietary and open-source models.

The results are sobering. The globally highest attack success rate reached 90.5% when using GPT-4.1-mini in a search workflow setting. Across risk types, susceptibility varied markedly, with hard-to-verify misinformation posing the greatest threat. The study also found that common defensive measures, such as reminder prompting, offer limited protection — revealing what the authors call a "knowledge-action gap" where models can identify safety concerns in principle but fail to act on them in practice.

A particularly striking finding involved health-related queries. Across 1,000 health queries, unreliable search results caused the agent's binary stance (safe/unsafe recommendation) to shift in 46 cases — a concerning rate for a domain where incorrect advice can have direct physical consequences.

The Dual-Channel Defense Challenge

What emerges from these studies is a picture of agent safety as a fundamentally dual-channel problem. Agents must defend against threats arriving through the input channel (adversarial users) and the output channel (compromised tools or unreliable data sources) simultaneously. A framework that addresses only one channel leaves the other exposed.

Sha et al.'s tri-modal classification offers a principled approach: rather than binary allow/deny decisions, the sensitive category creates a controlled escalation path where the agent pauses execution and seeks human confirmation. This mirrors the principle of least privilege in computer security — agents should not execute sensitive operations without explicit authorization, regardless of whether the instruction came from a user or a tool.

Open Questions

The field faces several unresolved challenges. How do we define "sensitive" in a way that is comprehensive enough to catch genuine threats without creating so many verification prompts that users disable the safety mechanism? Can reinforcement learning from sandboxed environments transfer effectively to the combinatorial complexity of real-world tool ecosystems? And as agents become more autonomous and operate over longer time horizons, how do we maintain human oversight without reducing agents to mere executors of pre-approved action lists?

Perhaps the most fundamental tension is economic: comprehensive safety testing is expensive, and the competitive pressure to deploy capable agents quickly creates incentives to cut corners on safety evaluation. SafeSearch's finding that simple design choices — like the number of search results shown to the agent — implicitly affect safety underscores how deeply these concerns are embedded in agent architecture.

Looking Forward

The progression from identifying agent-specific threats to building trainable safety alignment frameworks represents meaningful progress. But the gap between laboratory benchmarks and production deployment remains wide. As agents are deployed in healthcare, finance, and critical infrastructure, the tolerance for safety failures approaches zero — a standard that current systems are far from meeting.