POATE Attack

This repository contains the code for our paper "Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions". We provide the code for the following tasks:

Generating jailbreak prompts using our proposed method.
Implementation of baseline attack and defense methods for LLMs.
Metric for evaluating the safety of LLMs on jailbreak prompts.

Abstract: Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model’s defense against adversarial exploits.

Contact person: Rachneet Sachdeva

UKP Lab | TU Darmstadt

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

Our Framework

🚀 Getting Started 🚀

# create a virtual environment (e.g. conda)
conda create -n llm-safety python=3.10
conda activate llm-safety

# install the requirements
pip install -r requirements.txt

Generate Jailbreak prompts

Polar opposite generation

python poate_attack/attacks/jailbreak/potee/polar_opposite_generation.py

Template generation

python poate_attack/attacks/jailbreak/potee/attack.py \
--dataset "advbench" \
--target_model "Mistral_7b_instruct"   # we use Mistral for template generation

Attack methods

GCG Attack

### gcg Attack
python poate_attack/attacks/jailbreak/gcg/nano_gcg_hf.py \
--dataset "advbench" \
--target_model "gemma2_9b_it"

DeepInception and POATE Attack

### DeepInception and POATE Attack

MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)


DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/base.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'none' \
    --attack potee \
    --dataset ${dataset} \
    --sample
  done
done

Generation Exploitation Attack

MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)  

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)


for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python poate_attack/attacks/jailbreak/generation_exploitation/gen_exploitation_optim.py \
    --model $model \
    --tune_temp \
    --tune_topp \
    --tune_topk \
    --n_sample 1 \
    --dataset $dataset
  done
done

Puzzler Attack

MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/puzzler/main.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'none' \
    --attack puzzler \
    --dataset ${dataset}
  done
done

Defense methods

Perplexity-based defense

MODELS=(
"gemma2_9b_it"
"Llama_2_70b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/defenses/ppl_calculator.py" \
    --model_name ${model} \
    --dataset ${dataset}
  done
done

Self-refinement, In-context defense, Paraphrase, and System prompt

MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)


DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/base.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'paraphrase' \  # sr or ic or sys_prompt or paraphrase or none
    --attack potee \
    --dataset ${dataset} \
    --sample
  done
done

Safe-decoding defense

python poate_attack/defenses/safedecoding/main.py

SmoothLLM defense

MODELS=(
"gemma2_9b_it"
"Llama_2_7b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)

DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/defenses/SmoothLLM/main.py" \
    --results_dir ./results \
    --target_model ${model} \
    --attack Potee \
    --dataset ${dataset} \
    --attack_logfile "./data/auto_potee_attack_harmbench_classifier_${model}_${dataset}_sampled_outputs.csv" \
    --smoothllm_pert_type RandomSwapPerturbation \
    --smoothllm_pert_pct 10 \
    --smoothllm_num_copies 10
  done
done

Chain-of-thought defenses

MODELS=(
"gemma2_9b_it"
"Llama_2_70b_chat_hf"
"Llama_3.1_8b_instruct"
"phi_3_mini_4k"
)


DATASETS=(
"advbench"
"xstest"
"malicious_instruct"
)
for model in "${MODELS[@]}"
do
  for dataset in "${DATASETS[@]}"
  do
    python "${BASE_PATH}poate_attack/attacks/jailbreak/base.py" \
    --target_model ${model} \
    --exp_name main \
    --defense 'reverse_thinking_cot' \     # reverse_thinking_cot or intent_alignment_prompt
    --attack puzzler \
    --dataset ${dataset} \
    --sample
  done
done

ASR evaluation

python poate_attack/attacks/evaluators/harmbench_classifier.py

Cite

Please use the following citation:

@misc{sachdeva2025turninglogicprobing,
      title={Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions}, 
      author={Rachneet Sachdeva and Rima Hazra and Iryna Gurevych},
      year={2025},
      eprint={2501.01872},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01872}, 
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.github		.github
data		data
docs		docs
figures		figures
poate_attack		poate_attack
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE.txt		NOTICE.txt
README.md		README.md
mkdocs.yml		mkdocs.yml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

POATE Attack

Our Framework

🚀 Getting Started 🚀

Generate Jailbreak prompts

Attack methods

Defense methods

ASR evaluation

Cite

Disclaimer

About

Releases

Packages

Contributors 2

Languages

License

UKPLab/arxiv2025-poate-attack

Folders and files

Latest commit

History

Repository files navigation

POATE Attack

Our Framework

🚀 Getting Started 🚀

Generate Jailbreak prompts

Attack methods

Defense methods

ASR evaluation

Cite

Disclaimer

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages