TLDR: A new framework called Learning to Detect (LoD) has been developed to accurately and efficiently identify unknown jailbreak attacks in Large Vision-Language Models (LVLMs). Unlike previous methods, which struggle with either generalization or accuracy, LoD learns task-specific parameters using only safe and unsafe inputs (no attack data), employing a Multi-modal Safety Concept Activation Vector (MSCAV) for safety representation and a Safety Pattern Auto-Encoder (SPAE) for unsupervised attack classification. Experiments show LoD significantly outperforms existing methods in detecting diverse, unseen attacks while being more computationally efficient.
Large Vision-Language Models (LVLMs) are powerful AI systems that can understand and process both images and text. However, despite significant efforts to make them safe, these models remain vulnerable to what are known as ‘jailbreak attacks’. These attacks trick LVLMs into generating unsafe or undesirable content, posing serious safety risks.
Traditional methods for detecting these attacks often face a dilemma: they either learn specific patterns of known attacks, which makes them ineffective against new, unseen attacks, or they rely on general rules that, while applicable to unknown attacks, tend to be less accurate and efficient. This creates a significant challenge in ensuring the long-term safety of LVLMs.
To address these limitations, researchers have introduced a novel framework called Learning to Detect (LoD). This framework shifts the focus from learning about specific attacks to learning the fundamental task of detection itself. The goal is to accurately identify unknown jailbreak attacks without ever having seen them during training.
The LoD framework is built upon two key modules:
Multi-modal Safety Concept Activation Vector (MSCAV)
This module is responsible for learning safety-oriented representations of the input. When the LVLM processes an input, the MSCAV module analyzes the model’s internal ‘thoughts’ (activations) at different layers to estimate how likely the model considers the input to be unsafe. It is designed to filter out irrelevant information and focus only on signals related to safety. Crucially, this module is trained using only safe and genuinely unsafe inputs, not inputs that have been modified by jailbreak attacks. This ensures that it learns a pure understanding of safety without being biased by attack-specific patterns.
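To make this step concrete, here is a minimal sketch of how such a safety-oriented representation could be assembled: one linear probe per layer, fit on activations of safe versus genuinely unsafe inputs, with the per-layer unsafe probabilities stacked into a single vector. The function names, the choice of logistic-regression probes, and the assumption that per-layer activations are already extracted are illustrative assumptions, not the paper’s exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch of an MSCAV-style representation built from per-layer
# LVLM activations. Inputs are assumed to be precomputed activation arrays,
# one per transformer layer.

def fit_layer_probes(safe_acts, unsafe_acts):
    """Fit one linear probe per layer on safe (label 0) vs. unsafe (label 1) inputs.
    safe_acts / unsafe_acts: lists over layers of arrays, shape (n_samples, hidden_dim)."""
    probes = []
    for layer_safe, layer_unsafe in zip(safe_acts, unsafe_acts):
        X = np.vstack([layer_safe, layer_unsafe])
        y = np.concatenate([np.zeros(len(layer_safe)), np.ones(len(layer_unsafe))])
        probes.append(LogisticRegression(max_iter=1000).fit(X, y))
    return probes

def mscav(probes, layer_acts):
    """Stack each layer's predicted unsafe probability into one safety vector."""
    return np.array([p.predict_proba(a.reshape(1, -1))[0, 1]
                     for p, a in zip(probes, layer_acts)])
```

The key design point this illustrates is that the probes never see attack-modified inputs, only plainly safe and plainly unsafe ones, so the resulting vector reflects the model’s internal notion of safety rather than any attack signature.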
Safety Pattern Auto-Encoder (SPAE)
Once the MSCAV module generates a safety-oriented representation, the SPAE module takes over for attack classification. This module is trained exclusively on the MSCAVs of *safe* inputs. It learns to recognize the ‘normal’ patterns of safe inputs. During testing, if an input’s MSCAV deviates significantly from these learned safe patterns, the SPAE flags it as an anomaly, indicating a potential jailbreak attack or an unsafe input. This approach leverages anomaly detection principles, allowing it to identify attacks it has never encountered before.
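As a rough illustration of this anomaly-detection step, the sketch below trains a small auto-encoder on the MSCAVs of safe inputs only and uses reconstruction error as the attack score. The architecture, layer sizes, training loop, and scoring rule are assumptions chosen for clarity, not the paper’s implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an SPAE-style detector: an auto-encoder fit only on
# safe MSCAVs; inputs it reconstructs poorly are flagged as anomalous.

class SafetyPatternAE(nn.Module):
    def __init__(self, dim, hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_spae(safe_mscavs, dim, epochs=200, lr=1e-3):
    """safe_mscavs: tensor of shape (n_safe, dim) holding MSCAVs of safe inputs."""
    model = SafetyPatternAE(dim)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(safe_mscavs), safe_mscavs)
        loss.backward()
        opt.step()
    return model

def anomaly_score(model, mscav):
    """Reconstruction error of one MSCAV; higher means more likely an attack."""
    with torch.no_grad():
        recon = model(mscav)
    return torch.mean((recon - mscav) ** 2).item()
```

Because the auto-encoder only ever models safe behavior, anything that falls outside that learned pattern, including attack types never seen during training, produces a high score.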
Extensive experiments were conducted on three different LVLMs (LLaVA-1.6-vicuna-7B, CogVLM-chat-v1.1, and Qwen2.5-VL-7B-Instruct) and five diverse jailbreak attack methods, including both prompt manipulation and adversarial perturbation techniques. The results showed that the LoD framework consistently achieved significantly higher detection accuracy (AUROC) compared to existing state-of-the-art methods. For instance, it improved the minimum AUROC across diverse unknown attacks by up to 62.31% on LLaVA. Beyond accuracy, LoD also demonstrated superior computational efficiency, reducing detection time by up to 62.7% compared to the best baselines.
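For readers unfamiliar with the metric, AUROC measures how well the detector’s scores rank attacked or unsafe inputs above safe ones (1.0 is perfect ranking, 0.5 is chance). The snippet below, with made-up example values rather than results from the paper, shows how such a score can be computed from the anomaly scores described above.

```python
from sklearn.metrics import roc_auc_score

# Illustrative only: ground-truth labels (1 = attack/unsafe, 0 = safe) and
# hypothetical reconstruction errors used as detection scores.
labels = [0, 0, 1, 1, 1]
scores = [0.02, 0.05, 0.40, 0.70, 0.90]
print(roc_auc_score(labels, scores))  # close to 1.0 means strong detection
```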
An ablation study confirmed that both the MSCAV and SPAE modules are essential for LoD’s high performance, with removing either leading to a substantial drop in accuracy. The framework also proved to be robust to different parameter settings, maintaining stable detection performance over a wide range.
While LoD represents a significant step forward in detecting unknown jailbreak attacks, the researchers acknowledge that its performance can still depend on the specific LVLM architecture. Future work will explore how to make it even more resilient against highly adaptive attacks that might specifically target internal representations. For more technical details, you can read the full paper here.