Security News New TokenBreak Attack Bypasses AI Moderation with Single-Character Text Changes

Parkinsond

Level 18
Thread author
Dec 6, 2023
887
Cybersecurity researchers have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.

"The TokenBreak attack targets a text classification model's tokenization strategy to induce false negatives, leaving end targets vulnerable to attacks that the implemented protection model was put in place to prevent," Kieran Evans, Kasimir Schulz, and Kenneth Yeung said in a report shared with The Hacker News.

Tokenization is a fundamental step that LLMs use to break down raw text into their atomic units – i.e., tokens – which are common sequences of characters found in a set of text. To that end, the text input is converted into their numerical representation and fed to the model.

LLMs work by understanding the statistical relationships between these tokens, and produce the next token in a sequence of tokens. The output tokens are detokenized to human-readable text by mapping them to their corresponding words using the tokenizer's vocabulary.

Examples include changing "instructions" to "finstructions," "announcement" to "aannouncement," or "idiot" to "hidiot." These subtle changes cause different tokenizers to split the text in different ways, while still preserving their meaning for the intended target.

To defend against TokenBreak, the researchers suggest using Unigram tokenizers when possible, training models with examples of bypass tricks, and checking that tokenization and model logic stays aligned. It also helps to log misclassifications and look for patterns that hint at manipulation.
 

About us

  • MalwareTips is a community-driven platform providing the latest information and resources on malware and cyber threats. Our team of experienced professionals and passionate volunteers work to keep the internet safe and secure. We provide accurate, up-to-date information and strive to build a strong and supportive community dedicated to cybersecurity.

User Menu

Follow us

Follow us on Facebook or Twitter to know first about the latest cybersecurity incidents and malware threats.

Top