Research · 5h ago
kNNGuard: Turning LLM Hidden Activations into a Training-Free Configurable Guardrail
Researchers have introduced kNNGuard, a method that uses the hidden activations of a large language model to flag unsafe or off-topic prompts without any additional model training. It compares a prompt's internal representation against stored examples of known problematic content, which, as the name suggests, points to a k-nearest-neighbor lookup. The result is a configurable guardrail that avoids the computational cost and limitations of fine-tuning an external classifier.
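The general idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the "kNN" in the name refers to a k-nearest-neighbor vote, it picks cosine similarity as the comparison, and it substitutes hand-written toy vectors for real hidden activations. The names `KNNGuardSketch`, `add_example`, `check`, `k`, and `threshold` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class KNNGuardSketch:
    """Hypothetical training-free guardrail: a k-nearest-neighbor vote over stored vectors."""

    def __init__(self, k=3, threshold=0.5):
        self.k = k                  # number of neighbors that vote
        self.threshold = threshold  # block when the unsafe share reaches this value
        self.bank = []              # list of (activation_vector, is_unsafe) pairs

    def add_example(self, activation, is_unsafe):
        # "Configuring" the guard only means storing another labeled vector.
        self.bank.append((list(activation), bool(is_unsafe)))

    def check(self, activation):
        """Return (blocked, unsafe_fraction) for one activation vector."""
        if not self.bank:
            return False, 0.0
        # Rank stored examples by similarity to the query, most similar first.
        ranked = sorted(
            self.bank,
            key=lambda item: cosine_similarity(activation, item[0]),
            reverse=True,
        )
        neighbors = ranked[: self.k]
        unsafe_fraction = sum(1 for _, unsafe in neighbors if unsafe) / len(neighbors)
        return unsafe_fraction >= self.threshold, unsafe_fraction


# Toy 3-dimensional vectors standing in for LLM hidden activations (assumption).
guard = KNNGuardSketch(k=3, threshold=0.5)
for vec in ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]):
    guard.add_example(vec, is_unsafe=False)
for vec in ([0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]):
    guard.add_example(vec, is_unsafe=True)

print(guard.check([0.05, 0.95, 0.0]))  # query close to the unsafe cluster
print(guard.check([0.95, 0.05, 0.0]))  # query close to the safe cluster
```

In a sketch like this, changing the guard's behavior means adding or removing stored examples, or adjusting `k` and `threshold`. No model weights are updated, which is the sense in which such an approach is training-free and configurable.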
Covered by 1 source
- arXiv CS.AI: Mahmoud Abdelfattah, Hamid Nasiri, Peter Garraghan (5h ago)