December 6, 2023
Anay Mehrotra, Amin Karbasi (Yale University, Robust Intelligence), Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer (Robust Intelligence)
Abstract
In empirical evaluations, we observe that Tree of Attacks with Pruning (TAP) generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts, using only a small number of queries. This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.