This Week in AI Security: Jailbreaks Get a Severity Score, Brussels Redraws the Clock

Welcome to This Week in AI Security. This week delivered a rare thing: a full incident-response cycle for a frontier AI model, from jailbreak discovery to government intervention to redeployment — and out of it, the first serious attempt at an industry-wide severity standard for jailbreaks. Meanwhile, regulators on both sides of the Atlantic redrew their timelines, in opposite directions.

The Lead: A Jailbreak Severity Framework Is Born

Anthropic’s Fable 5 is back online, with a “CVSS for jailbreaks” attached. Here’s the short version of an 18-day saga: on June 12, the US government placed export controls on Anthropic’s most capable models, Claude Fable 5 and Mythos 5, after researchers at Amazon discovered prompts that bypassed Fable 5’s safeguards and turned it into an effective tool for finding software vulnerabilities. Because Anthropic couldn’t verify user nationality in real time, it suspended access to the models globally rather than risk violating the controls. On June 30, the government lifted the restrictions and Anthropic redeployed both models with reworked safeguards.

The lasting story isn’t the outage — it’s the rubric that came out of it. Alongside the redeployment, Anthropic, Amazon, Microsoft, and Google announced a shared framework for scoring the severity of AI jailbreaks, deliberately modeled on the role CVSS plays for software vulnerabilities. Each jailbreak gets assessed on four axes: capability gain (how far beyond publicly available tools the jailbreak takes an attacker), breadth (how many distinct offensive tasks it unlocks), ease of weaponization (how much human effort converts it into a working attack), and discoverability (how likely others are to find it independently). If you’ve ever triaged CVEs, the logic is familiar: not every finding is an emergency, and a shared score lets defenders — and now AI labs — prioritize consistently and communicate risk to governments in a common language. Anthropic says the most severe class of jailbreak, such as one enabling attacks on critical infrastructure, will trigger immediate fixes, and it has opened a HackerOne program for researchers to report new cyber jailbreaks in Fable 5.
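For readers who think in code, here is a minimal sketch of how a four-axis rubric like this could be recorded and rolled up for triage. The axis names come from the announcement as described above; everything else is an assumption made for illustration: the 0 to 4 scale per axis, the unweighted average, the 0 to 10 output range, and the bucket thresholds. The framework's actual scoring math is not described here, so treat this as a thinking aid, not the published method.

```python
from dataclasses import dataclass

# Hypothetical scale: each axis is rated 0 (negligible) to 4 (severe).
# The real framework's scale and weighting are not given in this newsletter.
MAX_AXIS = 4


@dataclass(frozen=True)
class JailbreakFinding:
    capability_gain: int        # how far beyond public tools the attacker gets
    breadth: int                # how many distinct offensive tasks it unlocks
    ease_of_weaponization: int  # how little human effort a working attack needs
    discoverability: int        # how likely others are to find it independently

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 0..MAX_AXIS range.
        for name, value in vars(self).items():
            if not 0 <= value <= MAX_AXIS:
                raise ValueError(f"{name} must be between 0 and {MAX_AXIS}")

    def score(self) -> float:
        """Unweighted average of the four axes, scaled to 0-10 (an assumption)."""
        total = sum(vars(self).values())
        return 10 * total / (4 * MAX_AXIS)


def triage_bucket(finding: JailbreakFinding) -> str:
    """Map a score to a hypothetical triage label; thresholds are illustrative."""
    s = finding.score()
    if s >= 8:
        return "critical"
    if s >= 5:
        return "high"
    return "moderate"


if __name__ == "__main__":
    finding = JailbreakFinding(
        capability_gain=4, breadth=4, ease_of_weaponization=4, discoverability=4
    )
    print(finding.score(), triage_bucket(finding))
```

An unweighted average is the simplest possible roll-up. A real standard would likely weight the axes differently, which is exactly the kind of decision independent governance would need to settle.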

The framework grew out of Project Glasswing, and that context matters. Project Glasswing is the coalition Anthropic launched in April with AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks to point frontier-model capabilities at defending critical software before attackers wield the same capabilities offensively. Anthropic reported the effort has surfaced thousands of previously unknown critical vulnerabilities, including flaws in every major operating system and browser. The Fable 5 incident is the other side of that coin: the same capability that finds bugs for defenders finds them for attackers the moment safeguards slip. That dual-use tension — not any single jailbreak — is what the severity framework exists to manage.

Regulatory Moves

The EU gave final approval to delaying its high-risk AI rules. On June 29, the Council of the EU gave its final green light to the AI Act simplification package (part of the “Omnibus VII” agenda), following the European Parliament’s endorsement earlier in June. The headline change: obligations for high-risk AI systems, originally due August 2, 2026, are pushed back to December 2, 2027 for stand-alone systems and August 2, 2028 for AI embedded in regulated products. The same package adds new prohibitions, banning AI-generated sexual deepfakes and AI-generated child sexual abuse material outright.

But August 2, 2026 is still a real deadline. The deferral covers high-risk obligations — not the Act’s transparency rules. Per the AI Act’s implementation timeline, Article 50 obligations still apply from August 2: users must be told when they’re interacting with an AI system, and AI-generated content must be disclosed. If your organization ships a customer-facing chatbot in the EU, next month’s deadline didn’t move. The practical takeaway: the omnibus bought time for the heavyweight conformity-assessment work, not for basic disclosure hygiene.
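If you track compliance dates in code, the post-omnibus timeline described above can be captured in a small lookup. This is a sketch only: the three dates are the ones reported in this section, while the dictionary keys and the `days_until` helper are hypothetical names chosen for illustration, not terms from the AI Act.

```python
from datetime import date

# Dates as reported above; the key names are illustrative labels.
DEADLINES = {
    "transparency_article_50": date(2026, 8, 2),   # not deferred
    "high_risk_standalone": date(2027, 12, 2),     # deferred from 2026-08-02
    "high_risk_embedded_product": date(2028, 8, 2),  # deferred from 2026-08-02
}


def days_until(obligation: str, today: date) -> int:
    """Return calendar days from `today` to the obligation's start date."""
    return (DEADLINES[obligation] - today).days


if __name__ == "__main__":
    today = date.today()
    # Print obligations in deadline order with the days remaining.
    for name, deadline in sorted(DEADLINES.items(), key=lambda kv: kv[1]):
        print(f"{name}: {deadline.isoformat()} ({days_until(name, today)} days)")
```

A negative result means the date has already passed, which is worth surfacing loudly in any real compliance dashboard.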

In the US, Executive Order 14409’s first deadlines came due this week. The order, signed June 2, set a 30-day window, expiring this week, for action on several fronts. CISA is to issue binding operational directives expediting federal cyber defense and expanding AI-enabled defensive tooling; facilitate access to “covered frontier models” for federal agencies, states, and critical-infrastructure operators; and stand up (with Treasury and NSA) a voluntary AI cybersecurity clearinghouse to coordinate vulnerability discovery and patching. Federal News Network reported in late June that CISA was close to issuing the new directive. The 60-day milestones are worth marking now: classified benchmarking of AI cyber capabilities, and a voluntary framework for labs to share frontier models with the government for evaluation before public release. That last item is a notable shift, given that the Fable 5 episode above showed the government intervening after release.
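To put the 30-day and 60-day milestones on a calendar, a few lines of date arithmetic suffice. Two assumptions are baked in: the year is 2026 (inferred from the EU dates elsewhere in this issue, since the order's year is not stated), and the windows are counted in plain calendar days from the signing date. The order's own counting rule may differ.

```python
from datetime import date, timedelta

# Assumption: the order was signed on June 2 of 2026 (year inferred from context).
SIGNED = date(2026, 6, 2)


def milestone(days_after_signing: int) -> date:
    """Calendar date a given number of days after signing (calendar-day counting assumed)."""
    return SIGNED + timedelta(days=days_after_signing)


if __name__ == "__main__":
    for window in (30, 60):
        print(f"{window}-day milestone: {milestone(window).isoformat()}")
```

Under those assumptions the 30-day window closes on July 2 and the 60-day window on August 1, one day before the EU transparency obligations apply.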

Lab Releases & Research

Claude Sonnet 5 launched with prompt-injection resistance as a selling point. Anthropic released Sonnet 5 on June 30, positioning it as a cheaper mid-size model for running autonomous agents — it plans, browses, and uses terminals on its own. What makes it newsletter-relevant is the security framing: per the system card and Axios’s coverage, Sonnet 5 shows lower rates of cooperating with misuse and deception than its predecessor, and better resistance to prompt-injection hijacks — the attack class where malicious instructions hidden in web pages or documents commandeer an agent. Labs marketing agent models on injection resistance, not just capability, is a shift worth noticing: it signals that agentic deployment is now gated by security properties, and that buyers are asking.

What to Watch

Whether the jailbreak severity framework gets independent governance. A rubric written by four vendors scoring their own products is a start, not a standard. Watch for whether OWASP, NIST, or CISA adopt or formalize it — that’s the difference between a press release and a CVSS.
CISA’s binding operational directives. The EO’s 30-day window closed this week; the directives’ actual scope (and whether “frontier model access” reaches critical-infrastructure operators in practice) will land in the coming days.
August 2 EU transparency compliance. Chatbot and AI-content disclosure obligations take effect in under a month, and the omnibus did nothing to delay them. Expect a scramble — and possibly the first enforcement test cases.
Copycat jailbreak research on redeployed models. Fable 5’s new classifiers are deliberately tuned to over-block as a safety margin. Researchers will probe the new boundaries immediately; the HackerOne program’s first public findings will show whether the defense-in-depth story holds.
