Welcome back to This Week in AI Security. If last week was about a single model going dark and coming back with a rulebook, this week is about the fences everyone is building around agentic AI turning out to be lower than advertised. Two separate research efforts found ways to walk agents out of their sandboxes, OpenAI previewed a frontier model explicitly built for offensive cyber work and locked it down accordingly, and the United Nations tried to get ahead of all of it with a new governance body stacked with the same executives making these decisions.
New Attack Surfaces
Researchers escaped Claude Cowork’s sandbox — and Anthropic doesn’t think it counts. Security firm Armadin disclosed an attack chain against Claude Cowork, Anthropic’s agentic tool for knowledge work, that runs commands as root inside its containment layer and then breaks out of it entirely. The chain starts with DLL sideloading against the Cowork client itself: a malicious USERENV.dll dropped next to claude.exe loads inside the signed process before the real system library does. From there, researchers manipulated a “resume flag” passed through Cowork’s VM service to skip the creation of an unprivileged session, landing as root inside the virtual machine, then used nsenter to escape into the wider host. A second, separate flaw let them override the domain allowlist and strip out network filtering entirely. Armadin reported the chain in March; Anthropic’s position, per SiliconANGLE’s reporting, is that it doesn’t qualify as a security issue because it requires an attacker to already have code execution on the host. Armadin’s counterpoint is the more interesting part: Cowork quietly puts a full virtual machine on non-technical employees’ laptops, and that VM is a blind spot for the endpoint security tools those employees’ companies already rely on. Whether or not you buy Anthropic’s threat model, the disagreement itself is worth noting — it’s the same “how bad is this, really” argument the jailbreak severity framework from last week’s edition was designed to settle, playing out in real time on a different product.
A second team found the same class of problem across ten other coding agents. Independently, Adversa AI disclosed GuardFall, a bypass affecting 10 of 11 popular open-source AI coding and computer-use agents, including Aider, Cline, Goose, opencode, Open Interpreter, OpenHands, Plandex, Roo-Code, and SWE-agent. The problem is architectural, not a single bug: these agents run a safety filter against the plain text of a command an AI model wants to execute, but the shell that actually runs the command rewrites that text first by expanding variables, stripping quotes, and resolving substitutions. A filter sees r''m and waves it through as gibberish; bash strips the empty quotes and runs rm. The same gap between what the filter reads and what the shell runs can be exploited with base64 encoding, $IFS expansion, and a handful of other decades-old shell tricks. Only one surveyed tool, Continue, held up, because it parses commands the way a shell actually evaluates them instead of pattern-matching on raw text. Combined with the Cowork disclosure, the throughline for the week is that agent sandboxes and command guards are being built faster than they’re being tested against how the underlying systems (VMs, shells) actually behave.
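To make the filter-versus-shell mismatch concrete, here is a minimal Python sketch. It is an illustration only, not Adversa AI's proof of concept and not Continue's actual implementation. The DENYLIST contents and both function names are hypothetical. The stricter check relies on the standard library's shlex.split, which applies POSIX-style quote removal but does not perform variable expansion, command substitution, or globbing, so the sketch simply refuses any command containing those metacharacters rather than trying to emulate bash.

```python
import base64
import os
import re
import shlex

# Hypothetical denylist, for illustration only.
DENYLIST = {"rm", "mkfs", "dd"}

# Characters that make the shell rewrite or chain a command in ways
# shlex does not model (expansion, substitution, pipes, globs, ...).
SHELL_META = re.compile(r"[$`;|&<>(){}*?\[\]~\n]")


def naive_filter_allows(command: str) -> bool:
    """Pattern-match on the raw text, as the bypassed guards reportedly do."""
    return not any(
        re.search(rf"\b{re.escape(name)}\b", command) for name in DENYLIST
    )


def tokenized_filter_allows(command: str) -> bool:
    """Check the command name after shell-style quote removal."""
    if SHELL_META.search(command):
        return False  # cannot reason about it safely, so refuse
    try:
        argv = shlex.split(command)  # POSIX quoting rules: r''m -> rm
    except ValueError:
        return False  # unbalanced quotes
    if not argv:
        return False
    return os.path.basename(argv[0]) not in DENYLIST


if __name__ == "__main__":
    encoded = base64.b64encode(b"rm -rf /tmp/x").decode()
    samples = [
        "ls -la",
        "rm -rf /tmp/x",
        "r''m -rf /tmp/x",
        f"echo {encoded} | base64 -d | sh",
    ]
    for cmd in samples:
        print(
            f"{cmd!r}: naive={naive_filter_allows(cmd)} "
            f"tokenized={tokenized_filter_allows(cmd)}"
        )
```

The raw-text filter lets both the empty-quote form and the base64 pipeline through, while the tokenizing check rejects them. A denylist is still a weak control even when applied to parsed tokens; an allowlist of known-safe commands, or no shell at all (passing an argument vector straight to the process API), would likely be the sturdier design.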
Lab Releases & Research
OpenAI previewed its most cyber-capable model yet, and kept it on a short leash. OpenAI began a limited preview of GPT-5.6 Sol on June 27, alongside two related variants (Terra and Luna), restricted to a small group of trusted partners working with the U.S. government. Per The Hacker News, Sol is explicitly positioned as OpenAI’s strongest model to date for long-horizon cybersecurity work: vulnerability research, exploitation, and the kind of multi-step technical tasks that used to require a skilled human operator. That capability is exactly why it isn’t broadly available: OpenAI paired the release with what it calls its most robust safety stack yet, including refusal training against disguised cyber requests, real-time misuse classifiers, and automated red-teaming that reportedly consumed over 700,000 A100-equivalent GPU hours hunting for universal jailbreaks. General availability is planned for the “coming weeks,” gated on government sign-off for additional trusted partners. Read alongside Anthropic’s Fable 5 saga from the past three weeks, a pattern is forming: frontier labs are now routinely shipping their most capable models behind a government-approval gate before the public ever gets a turn.
Regulatory & Governance Moves
The UN put AI lab CEOs and heads of state on the same governance body. On July 1–2, the International Telecommunication Union and the government of Rwanda launched the AI for Good Global Commission, co-chaired by Rwandan President Paul Kagame and Salesforce CEO Marc Benioff, with ITU Secretary-General Doreen Bogdan-Martin as vice-chair. Its 44 founding members mix heads of state with the people actually building frontier models: Amazon’s Andy Jassy, Nvidia’s Jensen Huang, Microsoft’s Brad Smith, Anthropic co-founder Jack Clark, and Cohere’s Aidan Gomez, among others. The Commission’s stated mandate — “practical pathways to strengthen trust, support responsible innovation, and deliver broad-based economic and social benefits” — is deliberately broad, and its first real test is procedural: whether a body where the regulated and the regulators are the same 44 people can produce anything with teeth. It holds its inaugural meeting July 7–10 in Geneva.
Back home, the AI executive order’s first deadline came and mostly went quietly. Executive Order 14409, signed June 2, gave CISA 30 days, until July 2, to issue binding directives expediting federal cyber defense and expanding access to AI tools for agencies and critical-infrastructure operators. CISA’s answer so far is BOD 26-04, a risk-based vulnerability remediation directive that replaces flat CVSS-driven patch deadlines with a tiered model, with deadlines as short as three days for the most dangerous, actively exploited flaws. The directive is explicitly justified in part by AI’s narrowing effect on the gap between patch release and exploitation. What it doesn’t do is touch the EO’s other big ask: facilitating “covered frontier model” access for state, local, and critical-infrastructure operators. That piece, along with the classified benchmarking process for what even counts as a covered frontier model, runs on a longer 60-day clock due August 1, and multiple outlets, including Gizmodo and TipRanks, reported this week that the White House, OpenAI, Anthropic, and Google are in advanced talks to land a voluntary testing-and-release framework as early as next week.
What to Watch
- Whether Anthropic revisits the Cowork severity call. “Requires local code execution” is a defensible bar, but it’s the same argument security teams have lost before when attacker tooling made the prerequisite trivial to obtain. Watch for a patch that arrives quietly regardless of the public stance.
- Patch velocity across the 10 agents named in GuardFall. This is a shared architectural flaw, not a single vendor’s bug — worth checking whether maintainers converge on Continue’s structural-parsing approach or ship narrower, easier-to-bypass patches instead.
- GPT-5.6 Sol’s path to general availability. The “coming weeks” timeline depends on government sign-off for more trusted partners — a preview of how the EO’s frontier-model access provisions might work in practice before they’re formally in effect.
- The White House’s voluntary frontier-model standards deal. If it lands next week as reported, it’s the first concrete output of EO 14409’s harder 60-day deadline, and worth reading against what the AI for Good Global Commission is promising at the same moment on the world stage.
- Whether the AI for Good Global Commission produces anything beyond a Geneva communiqué. Its inaugural meeting (July 7–10) is the first chance to see if a body of regulators and the regulated can agree on more than a mission statement.