28 Nov 2023
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, Daphne Ippolito, Christopher A. Choquette-Choo, Katherin Lee (Google DeepMind), A. Feder Cooper (Cornell), Eric Wallace (U.C. Berkeley), Florian Tramer (ETH Zurich)
Abstract
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMa or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly.
Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.