Security News New TokenBreak Attack Bypasses AI Moderation with Single-Character Text Changes

Parkinsond · 2025-06-13T01:41:59-0500

Cybersecurity researchers have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.

"The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent," Kieran Evans, Kasimir Schulz, and Kenneth Yeung said in a report shared with The Hacker News.

Tokenization is a fundamental step that LLMs use to break down raw text into their atomic units – i.e., tokens – which are common sequences of characters found in a set of text. To that end, the text input is converted into their numerical representation and fed to the model.

LLMs work by understanding the statistical relationships between these tokens, and produce the next token in a sequence of tokens. The output tokens are detokenized to human-readable text by mapping them to their corresponding words using the tokenizer's vocabulary.

Examples include changing "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These subtle changes cause different tokenizers to split the text in different ways, while still preserving their meaning for the intended target.

To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and checking that tokenization and model logic stays aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.

Search

Security News New TokenBreak Attack Bypasses AI Moderation with Single-Character Text Changes

Parkinsond

Level 18

Similar threads