December 6, 2023
Anay Mehrotra, Amin Karbasi (Yale University, Robust Intelligence), Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer (Robust Intelligence)
Abstract
In empirical evaluations, we observe that Tree of Attacks with Pruning (TAP) generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts, using only a small number of queries. This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.