Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment
Researchers have discovered that large language models are vulnerable to safety bypasses when prompts use character-level perturbations that exploit how Byte Pair Encoding tokenization fragments specific words. By breaking safety-critical terms into disjointed sub-word units, these manipulations can cause models to ignore alignment guardrails while remaining understandable to human users. This finding highlights a fundamental structural weakness in current tokenization methods that potentially allows attackers to circumvent safety training protocols.
Covered by 1 source
- arXiv CS.AI: Tung-Ling Li, Hongliang Liu, Yuhao Wu