⛓‍💥 Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

Warning: This repo contains examples of harmful language and images, and reader discretion is recommended.

👻 Installation

pip install -r requirements.txt

📰 News

Date	Event
2024/06/12	🎁 We have posted our paper on arXiv.

💡 Abstract

With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), ensuring their safety has become increasingly critical. To achieve this objective, it requires us to proactively discover the vulnerabilities of MLLMs by exploring attack methods. Thus, structure-based jailbreak attacks, where harmful semantic content is embedded within images, have been proposed to mislead the models. However, previous structure-based jailbreak methods mainly focus on transforming the format of malicious queries, such as converting harmful content into images through typography, which lacks sufficient jailbreak effectiveness and generalizability. To address these limitations, we first introduce the concept of “Role-play” into MLLM jailbreak attacks and propose a novel and effective method called Visual Role-play (VRP). Specifically, VRP leverages Large Language Models to generate detailed descriptions of high-risk characters and create corresponding images based on the descriptions. When paired with benign role-play instruction texts, these high-risk character images effectively mislead MLLMs into generating malicious responses by enacting characters with negative attributes. We further extend our VRP method into a universal setup to demonstrate its generalizability. Extensive experiments on popular benchmarks show that VRP outperforms the strongest baselines, Query relevant and FigStep, by an average Attack Success Rate (ASR) margin of 14.3% across all models.

🛠️ Installation

We use Qwen-VL-Chat and Llava-v1.6-Mistral-7b-hf to showcase our attacks.

1. Set up the environment

pip install -r requirements.txt

2. Prepare the weights for Qwen-VL-Chat and Llava-v1.6-Mistral-7b-hf

To access the model checkpoints, first log in to Hugging Face.

huggingface-cli login

🚀 Query-specific Visual Role-play

In the query-specific setting, VRP generates a character targeting each malicious query.

1. Generate Query-specific Characters

cd query_specific
bash scripts/generation.sh

2. Attack Qwen-VL-Chat and Llava-v1.6-Mistral-7b-hf

bash scripts/attack_qwen.sh
bash scripts/attack_llava.sh

3. Evaluate

bash evaluate.sh
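
The abstract reports results as an Attack Success Rate (ASR) and as an average ASR margin across models. As a rough illustration of how such numbers are aggregated, here is a minimal sketch. It is not the repo's `evaluate.sh`: the function names are hypothetical, it assumes each response has already been judged as a boolean, and the values in the usage lines are placeholders, not results from the paper.

```python
from typing import Dict, Iterable


def attack_success_rate(judgements: Iterable[bool]) -> float:
    """Return the percentage of responses judged as successful attacks.

    Assumes each item is True when a judge flagged the response as harmful.
    """
    flags = list(judgements)
    if not flags:
        raise ValueError("no judgements to aggregate")
    return 100.0 * sum(flags) / len(flags)


def average_margin(ours: Dict[str, float], baseline: Dict[str, float]) -> float:
    """Mean per-model ASR difference, in percentage points.

    Only models present in both dictionaries are compared.
    """
    models = sorted(ours.keys() & baseline.keys())
    if not models:
        raise ValueError("no models in common")
    return sum(ours[m] - baseline[m] for m in models) / len(models)


# Placeholder values for illustration only.
asr = attack_success_rate([True, False, True, True])
margin = average_margin(
    {"model_a": 50.0, "model_b": 30.0},
    {"model_a": 40.0, "model_b": 10.0},
)
print(f"ASR: {asr:.1f}%  average margin: {margin:.1f} points")
```

Averaging the per-model differences, rather than pooling all responses, keeps a model with a larger test set from dominating the margin.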

🌏 Universal Visual Role-play

In the universal setting, VRP leverages the optimization capabilities of LLMs to generate candidate characters universally, then selects the best universal character.

1. Generate Candidate Characters

cd universal
bash scripts/train_qwen.sh
bash scripts/train_llava.sh

2. Select Best Universal Characters

bash scripts/valid_qwen.sh
bash scripts/valid_llava.sh

3. Evaluate

bash scripts/test_qwen.sh
bash scripts/test_llava.sh

❌ Disclaimers

This dataset contains offensive content that may be disturbing. This benchmark is provided for educational and research purposes only.

📲 Contact

Siyuan Ma: siyuan.ma.jasper@outlook.com

SiyuanMaCS/VisualRoleplay