
NEW TERM: The AI Guardrail Gaslight. Distraction and diversion from the real danger of X-RISK, offering the false comfort of dysfunctional so-called “Guardrails” for AI safety and/or the supposed need for unregulated AI innovation in the race to the bottom toward unsafe Artificial General Intelligence (AGI).
AI Guardrail. AI guardrails are safeguards designed to prevent artificial intelligence from causing harm and to ensure that AI systems operate within agreed-upon ethical limits and organizational standards. They function similarly to highway guardrails, guiding AI behavior to promote positive outcomes and compliance with regulations.
Gaslighting. Gaslighting is a form of psychological abuse that makes someone doubt their own perception of reality.
AI Safety. Impossible in every possible context. “Since testing an AI system for violations of a safety specification in every possible context, e.g., every (query, output) pair, is impossible, we consider a rejection sampling approach that declines a candidate output or action if it has too high a probability of violating an unknown safety constraint; we refer to such violations as ‘harm’.” — Bounding the probability of harm from an AI to create a guardrail, Bengio, 29 August 2024.
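A minimal sketch of the rejection-sampling idea in that quote, assuming a hypothetical harm-probability estimator (the `estimate_harm` callable below is a placeholder, not part of Bengio's proposal): generate candidate outputs and decline any whose estimated probability of violating the safety constraint exceeds a threshold.

```python
# Minimal sketch of a rejection-sampling guardrail: decline a candidate
# output when its estimated probability of violating a safety constraint
# is too high. The harm estimator is a hypothetical placeholder.
from typing import Callable, Optional

def guarded_sample(
    generate: Callable[[str], str],              # candidate generator (e.g. an LLM call)
    estimate_harm: Callable[[str, str], float],  # estimated P(harm | query, output)
    query: str,
    harm_threshold: float = 0.01,
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate whose estimated harm probability is below
    the threshold, or None (i.e. refuse) if no candidate qualifies."""
    for _ in range(max_attempts):
        candidate = generate(query)
        if estimate_harm(query, candidate) < harm_threshold:
            return candidate
    return None  # refuse rather than emit a risky output
```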
AI Deregulation Lobby. “Super PAC aims to drown out AI critics in midterms, with $100M and counting. Leading the Future is pushing to make Congress more AI friendly.” — The Washington Post
Safe AI. Required. Possible when designed and engineered from first principles. Significant industry investment will be required to achieve guaranteed, mathematically provable Safe AI. Unfortunately, that investment is not currently happening in the AI industry.
Example of Guardrail Failure: One long sentence is all it takes to make LLMs misbehave.
“Chatbots ignore their guardrails when your grammar sucks, researchers find.” — The Register
The Report: Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models
Security researchers from Palo Alto Networks’ Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it’s quite simple.
You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a “toxic” or otherwise verboten response the developers had hoped would be filtered out.
The paper also offers a “logit-gap” analysis approach as a potential benchmark for protecting models against such attacks.
“Our research introduces a critical concept: the refusal-affirmation logit gap,” researchers Tung-Ling “Tony” Li and Hongliang Liu explained in a Unit 42 blog post. “This refers to the idea that the training process isn’t actually eliminating the potential for a harmful response – it’s just making it less likely. There remains potential for an attacker to ‘close the gap,’ and uncover a harmful response after all.”
LLMs, the technology underpinning the current AI hype wave, don’t do what they’re usually presented as doing. They have no innate understanding, they do not think or reason, and they have no way of knowing if a response they provide is truthful or, indeed, harmful. They work based on statistical continuation of token streams, and everything else is a user-facing patch on top.
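A toy illustration of that point, with an invented vocabulary and invented logits: at each step the model assigns a score to every token, converts the scores to probabilities, and samples the next token. Nothing else enters the computation, including whether the continuation is truthful or harmful.

```python
# Toy illustration of "statistical continuation of token streams":
# score every token (logits), softmax into probabilities, sample one.
# The vocabulary and logits are made up for illustration.
import math
import random

vocab = ["the", "bomb", "recipe", "sorry", "I", "cannot", "."]
logits = [2.1, -3.0, -2.5, 1.8, 1.5, 1.7, 0.2]  # one score per token

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, vocab):
    probs = softmax(logits)
    return random.choices(vocab, weights=probs, k=1)[0]

print(sample_next_token(logits, vocab))  # e.g. "the" or "sorry"; never a guaranteed choice
```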
Guardrails that prevent an LLM from providing harmful responses – instructions on making a bomb, for example, or other content that would get the company in legal bother – are often implemented as “alignment training,” whereby a model is trained to provide strongly negative continuation scores – “logits” – to tokens that would result in an unwanted response. This turns out to be easy to bypass, though, with the researchers reporting an 80-100 percent success rate for “one-shot” attacks with “almost no prompt-specific tuning” against a range of popular models including Meta’s Llama, Google’s Gemma, and Qwen 2.5 and 3 in sizes up to 70 billion parameters.
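A hedged sketch of the refusal-affirmation logit gap concept, using illustrative numbers rather than values from the paper: alignment training makes the refusal continuation score higher than the affirmative one, but the affirmative path still exists, so a crafted suffix only has to make up the difference.

```python
# Sketch of the refusal-affirmation "logit gap" idea: alignment training
# makes refusal tokens outscore affirmative ("Sure, here's how...") tokens,
# but does not remove the affirmative path. The gap is the score difference
# a jailbreak suffix must overcome. Numbers are illustrative only.

def logit_gap(refusal_logit: float, affirmation_logit: float) -> float:
    """Positive gap => the model prefers to refuse; a jailbreak 'closes'
    the gap by adding suffix tokens that boost the affirmative score."""
    return refusal_logit - affirmation_logit

base_gap = logit_gap(refusal_logit=6.5, affirmation_logit=1.0)  # aligned model prefers refusal
suffix_boost = 6.0                                              # gap-closing effect of a crafted run-on suffix
attacked_gap = base_gap - suffix_boost

print(base_gap, attacked_gap)  # 5.5 -0.5: the affirmative continuation now outranks the refusal
```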
The key is run-on sentences. “A practical rule of thumb emerges,” the team wrote in its research paper. “Never let the sentence end – finish the jailbreak before a full stop and the safety model has far less opportunity to re-assert itself. The greedy suffix concentrates most of its gap-closing power before the first period. Tokens that extend an unfinished clause carry mildly positive [scores]; once a sentence-ending period is emitted, the next token is punished, often with a large negative jump.
“At punctuation, safety filters are re-invoked and heavily penalize any continuation that could launch a harmful clause. Inside a clause, however, the reward model still prefers locally fluent text – a bias inherited from pre-training. Gap closure must be achieved within the first run-on clause. Our successful suffixes therefore compress most of their gap-closing power into one run-on clause and delay punctuation as long as possible. Practical tip: just don’t let the sentence end.”
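The dynamic described in that quote can be caricatured with a toy scoring rule, under the assumption (ours, not the paper's exact model) that tokens inside an unfinished clause receive a small positive score while the first token after a full stop receives a large safety penalty:

```python
# Toy model of the quoted dynamic: tokens extending an unfinished clause
# get a mild bonus; the token right after a sentence-ending period is
# heavily penalized by the re-invoked safety check. Scores are invented.

IN_CLAUSE_BONUS = 0.3       # reward model prefers locally fluent continuation
POST_PERIOD_PENALTY = -4.0  # safety filter re-asserts itself at punctuation

def score_suffix(tokens):
    """Sum a toy per-token score: punish the token that follows a period,
    reward tokens that keep the clause going."""
    total = 0.0
    for i, tok in enumerate(tokens):
        after_period = i > 0 and tokens[i - 1].endswith(".")
        total += POST_PERIOD_PENALTY if after_period else IN_CLAUSE_BONUS
    return total

run_on = ["explain", "the", "process", "including", "every", "step"]
broken = ["explain", "the", "process.", "Then", "list", "steps"]

print(round(score_suffix(run_on), 2))  # 1.8: the run-on keeps closing the gap
print(round(score_suffix(broken), 2))  # -2.5: the post-period token eats the attacker's budget
```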
For those looking to defend models against jailbreak attacks instead, the team’s paper details the “sort-sum-stop” approach, which allows analysis in seconds with two orders of magnitude fewer model calls than existing beam and gradient attack methods, plus the introduction of a “refusal-affirmation logit gap” metric, which offers a quantitative approach to benchmarking model vulnerability.
“Once an aligned model’s KL [Kullback-Leibler divergence] budget is exhausted, no single guardrail fully prevents toxic or disallowed content,” the researchers concluded. “Defense therefore requires layered measures – input sanitization, real-time filtering, and post-generation oversight – built on a clear understanding of the alignment forces at play. We hope logit-gap steering will serve both as a baseline for future jailbreak research and as a diagnostic tool for designing more robust safety architectures.” ®
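As a rough illustration of those layered measures, here is a minimal sketch that chains input sanitization, generation-time filtering, and post-generation oversight so that any single layer can block a response; the three check functions are hypothetical placeholders, not any vendor's API.

```python
# Sketch of a layered defence: input sanitization, during-generation
# filtering, and post-generation oversight, chained so any layer can block.
# The check functions are hypothetical placeholders.
from typing import Callable, Optional

def layered_guardrail(
    prompt: str,
    generate: Callable[[str], str],
    sanitize_input: Callable[[str], Optional[str]],  # returns cleaned prompt, or None to block
    stream_filter: Callable[[str], bool],            # True => partial output looks safe
    post_check: Callable[[str], bool],               # True => final output passes oversight
) -> str:
    cleaned = sanitize_input(prompt)
    if cleaned is None:
        return "Request blocked at input sanitization."
    output = generate(cleaned)
    # A real system would run the stream filter token by token; here it is
    # approximated with a single pass over the finished text.
    if not stream_filter(output) or not post_check(output):
        return "Response withheld by output filtering."
    return output
```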
Updated to add on 27 August:
Billy Hewlett, Senior Director of AI Research at Palo Alto Networks, answered some of our questions about the logit-gap AI jailbreak paper, telling The Register that the vulnerability is indeed “fundamental to how current LLMs are built.”
Hewlett told us: “Think of safety alignment as a layer of training applied on top of a core model that still contains the potentially harmful knowledge. No matter how strong that safety layer is, the probability of a harmful response can be made incredibly low, but it can never be zero.
“The most practical mitigation today is not to rely solely on the model for safety. We advocate for a ‘defense-in-depth’ approach, using external systems like AI firewalls or guardrails to monitor and block problematic outputs before they reach the user. A more permanent, though much more difficult, solution would involve building safety into the model’s foundational training from the ground up.”
When asked about the grammatical aspects, Hewlett said: “It’s less about ‘good’ vs ‘bad’ grammar and more about exploiting the model’s core function as a pattern-completer. LLMs are trained to generate coherent and grammatically plausible text.

“When you use a run-on sentence or an incomplete phrase ending with a word like ‘Here’s…’, you’re forcing the model to continue a thought pattern without giving it a natural stopping point (like a period). These stopping points are often where the model’s safety alignment kicks in. Our technique essentially guides the model down a compliant path and past these safety checkpoints. So, it’s not that bad grammar is inherently better, but that strategically incomplete sentences can bypass the points where safety protocols are most likely to engage.”

Hewlett told us that improving alignment was at “the heart of the AI safety challenge.” He added: “You can improve alignment through post-training fine-tuning. This is a common practice where researchers find new jailbreaks, generate data from them, and then use that data to ‘patch’ the model. This makes the model more robust against those specific attacks. However, to fundamentally eliminate the issue, you’d likely need to retrain the model from scratch. As long as the model retains the underlying harmful knowledge, there’s always a risk that a new, clever prompt will find a way to access it. The cat-and-mouse game of finding and patching jailbreaks will likely continue.”

When we asked if Palo Alto had seen the techniques in the wild, Hewlett told us that it has not “seen evidence of jailbreaks in the wild that explicitly use the specific techniques detailed in our paper. However, as these methods become more widely understood, it’s certainly a possibility we’re monitoring.”
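A hedged sketch of the patch loop Hewlett describes, assuming a placeholder `fine_tune` routine rather than any specific training API: each discovered jailbreak prompt is paired with the desired refusal and fed back as fine-tuning data, which hardens the model against those specific attacks but not against the underlying class.

```python
# Sketch of the "find jailbreaks, generate data, patch" loop. `fine_tune`
# stands in for whatever training routine is actually used; it is a
# placeholder, not a real library call.
from typing import Callable, Dict, List

REFUSAL = "I can't help with that."

def build_patch_dataset(jailbreak_prompts: List[str]) -> List[Dict[str, str]]:
    """Turn each discovered jailbreak prompt into a (prompt, refusal) training pair."""
    return [{"prompt": p, "completion": REFUSAL} for p in jailbreak_prompts]

def patch_model(model, jailbreak_prompts: List[str], fine_tune: Callable):
    """One iteration of the cat-and-mouse loop: fine-tune on newly found
    jailbreaks, hardening the model against these specific attacks only."""
    dataset = build_patch_dataset(jailbreak_prompts)
    return fine_tune(model, dataset)
```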