
arXiv: 2404.12590
The New York Times's copyright lawsuit against OpenAI and Microsoft alleges OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, providing a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (user intentionally causes a model to generate a near-exact copy), from "regurgitation" (model generates a near-exact copy, regardless of user intentions), and from "reconstruction" (the near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data that a model memorizes is a consequence of choices made in training. (7) Whether or not a model that has memorized actually regurgitates depends on overall system design. In a very real sense, memorized training data is in the model--to quote Zoolander, the files are in the computer.
FOS: Computer and information sciences, Computers and Society (cs.CY), Computers and Society
FOS: Computer and information sciences, Computers and Society (cs.CY), Computers and Society
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
