Ep. 272: The Bill is Due: AI Training and Intellectual Property

Episode summary: In this episode, Herman Poppleberry and Corn dive deep into the "accountability phase" of artificial intelligence, exploring the legal and technical fallout of models trained on "pillaged" data. As we move into 2026, the era of consequence-free web scraping has ended, replaced by high-stakes lawsuits and a frantic search for remediation. The duo discusses the massive shift in the publishing industry, where AI training clauses are becoming as standard as movie rights, and the technical hurdles of "machine unlearning": the near-impossible task of removing specific data from a pre-trained model. From the "data poisoning" tactics of Nightshade to the architectural promise of the SISA framework, Herman and Corn break down how creators are fighting to protect their intellectual property. They also examine the rise of licensed datasets and the potential for a collective licensing model similar to the music industry. Whether you're an author concerned about your digital twin or a developer navigating the new Data Provenance Initiative, this episode offers a comprehensive look at the front lines of the AI copyright war.

Show Notes

As the calendar turns to early 2026, the "Wild West" era of artificial intelligence development has officially come to a close. In this episode, Herman Poppleberry and Corn discuss the transition into what they call the "accountability phase" of AI. For years, major machine learning labs operated under a "scrape first, ask for forgiveness later" mentality, utilizing massive repositories like Common Crawl to build the foundations of modern large language models (LLMs). However, as Herman and Corn explain, the bill for that data is finally coming due, and the legal and technical ramifications are staggering.

### The Myth of Fair Use and the Common Crawl Problem

The discussion begins with the core of the conflict: the data itself.
For over a decade, Common Crawl has served as a non-profit repository of the web, a resource that researchers and AI labs treated as a public buffet. The problem, as Corn points out, is that Common Crawl is inherently chaotic. It does not distinguish between a public forum post and a pirated copy of a best-selling novel.

Herman notes that early AI development relied heavily on the "fair use" argument: the idea that AI isn't copying text but rather learning the patterns of language in a transformative way. However, by 2026, the courts have begun to view this differently. The landmark rulings of 2025 regarding "non-expressive use" have shifted the landscape. When an AI can regurgitate specific passages or perfectly mimic an author's style, the argument that the use is "transformative" begins to crumble. This is especially true when the AI begins to compete directly with the very creators whose data it ingested.

### The "Soup" Problem: Can You Untrain a Model?

One of the most compelling parts of the conversation focuses on "remediation," the process of fixing models that have already been trained on copyrighted work. Corn asks a fundamental question: Is it possible to untrain a model?

Herman uses a vivid analogy to explain the difficulty. He compares a large language model to a giant vat of soup. Once you've added salt, pepper, carrots, and onions and cooked the broth, you cannot simply reach in and remove the salt. The "flavor" of the copyrighted data is baked into the billions of parameters (the weights) of the neural network.

In traditional databases, you can simply delete a record. In a neural network, the information is distributed; there is no single "file" for a specific book to delete. To truly remove data, companies historically had to retrain the entire model from scratch, a process that can cost upwards of a hundred million dollars. Herman and Corn discuss emerging alternatives, such as the SISA (Sharded, Isolated, Sliced, and Aggregated) framework.
SISA allows developers to split the training data into smaller "shards" and train a separate sub-model on each, so that if a piece of data needs to be removed, only the sub-model that saw it needs to be retrained. While efficient, this requires architectural foresight that the monolithic models currently in production simply don't have.

### Muzzles vs. True Unlearning

The hosts also explore "negative fine-tuning," a method where the model is essentially trained to stay silent about certain topics. Herman likens this to putting a muzzle on a dog. The dog still knows how to bite, but it's being conditioned not to. However, this is a fragile solution. "Jailbreaking" and clever prompting can often bypass these muzzles, leaving companies legally vulnerable if the copyrighted data remains in the underlying weights.

Another technical solution discussed is "vector database filtering." This acts as a gatekeeper at the input and output stages. If a user tries to prompt the AI for copyrighted material, or if the AI generates a response that too closely matches a known copyrighted work, the system blocks the interaction. While effective for preventing blatant piracy, much like YouTube's Content ID system, it fails to address the more abstract problem of an AI mimicking an author's unique style or world-building.

### The Rise of "Data Poisoning" and Digital Twins

As the legal battle intensifies, creators are beginning to fight back with technical tools of their own. Herman and Corn discuss "Nightshade" and "Glaze," tools developed by researchers at the University of Chicago. These tools allow authors and artists to "poison" their data. By making invisible changes to pixels or characters, they can confuse an AI, making it see a "cat" as a "toaster." This "digital scorched earth policy" is a desperate but increasingly common move by creators who feel they are being forced to provide the raw materials for their own replacement.
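To make the sharding idea behind SISA more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the framework as published: a toy word-count classifier stands in for a neural network, documents are assigned to shards round-robin, the "slicing" (checkpointing) step is left out, and the names `ShardModel` and `ShardedEnsemble` are hypothetical. What it does show is the core property the hosts describe: removing one document triggers a retrain of a single shard, while the other sub-models are left untouched.

```python
from collections import Counter


class ShardModel:
    """Toy sub-model: per-label word counts learned from one shard only."""

    def __init__(self):
        self.docs = {}          # doc_id -> (text, label): this shard's private data
        self.counts = {}        # label -> Counter of words seen under that label
        self.times_trained = 0  # bookkeeping so retrains are visible

    def train(self):
        # Retrain from scratch on whatever documents the shard currently holds.
        self.counts = {}
        for text, label in self.docs.values():
            self.counts.setdefault(label, Counter()).update(text.lower().split())
        self.times_trained += 1

    def predict(self, text):
        words = text.lower().split()
        scores = {label: sum(c[w] for w in words) for label, c in self.counts.items()}
        best = max(scores, key=scores.get, default=None)
        # Abstain (return None) when no known word matched.
        return best if best is not None and scores[best] > 0 else None


class ShardedEnsemble:
    """Hypothetical SISA-style wrapper: isolated shards, aggregated by vote."""

    def __init__(self, num_shards):
        self.shards = [ShardModel() for _ in range(num_shards)]
        self.location = {}      # doc_id -> index of the shard that holds it

    def fit(self, docs):
        # Round-robin assignment is an arbitrary choice made for this sketch.
        for i, (doc_id, (text, label)) in enumerate(docs.items()):
            idx = i % len(self.shards)
            self.shards[idx].docs[doc_id] = (text, label)
            self.location[doc_id] = idx
        for shard in self.shards:
            shard.train()

    def unlearn(self, doc_id):
        # Drop the document and retrain only the shard that contained it.
        idx = self.location.pop(doc_id)
        del self.shards[idx].docs[doc_id]
        self.shards[idx].train()
        return idx

    def predict(self, text):
        # Aggregate by majority vote over the shards that did not abstain.
        votes = Counter(p for p in (s.predict(text) for s in self.shards) if p is not None)
        return votes.most_common(1)[0][0] if votes else None


corpus = {
    "doc_a": ("dragons and wizards", "fantasy"),
    "doc_b": ("starships and robots", "scifi"),
    "doc_c": ("wizards cast spells", "fantasy"),
    "doc_d": ("robots build starships", "scifi"),
}
ensemble = ShardedEnsemble(num_shards=2)
ensemble.fit(corpus)
retrained = ensemble.unlearn("doc_b")  # only the shard holding doc_b is retrained
print(retrained, [s.times_trained for s in ensemble.shards])
```

The trade-off is visible even at this scale: each sub-model sees only a fraction of the data, so the cheap removal comes at the cost of weaker individual models, and the sharding has to be in place before training starts.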
The conversation also touches on the "No AI Fraud Act" and the emergence of "Right of Publicity" laws for an author's "digital twin." This legal evolution suggests that even if an AI doesn't use an author's exact words, using a model specifically trained to mimic their style could require compensation.

### Toward a Sustainable Future: Collective Licensing

So, where does the industry go from here? Herman and Corn point toward the "Data Provenance Initiative" and certification programs like "Fairly Trained." These initiatives help companies prove they have a clean "chain of title" for their data, using only licensed or opt-in materials.

The ultimate solution may lie in a collective licensing model, similar to how the music industry operates with organizations like ASCAP and BMI. In this scenario, AI companies would pay into a central fund that is then distributed to creators based on the "influence" their work has on the model's output. While measuring that influence remains a complex technical challenge, it offers a more sustainable path forward than a decade of endless litigation.

In closing, Herman and Corn emphasize that the "accountability phase" is just beginning. The tension between technological progress and intellectual property rights is the defining conflict of the AI era, and the solutions we build today, whether they are legal frameworks or technical "unlearning" protocols, will shape the future of human creativity.

Listen online: https://myweirdprompts.com/episode/ai-copyright-data-remediation

My Weird Prompts is an AI-generated podcast. Episodes are produced using an automated pipeline: voice prompt → transcription → script generation → text-to-speech → audio assembly. Archived here for long-term preservation. AI CONTENT DISCLAIMER: This episode is entirely AI-generated. The script, dialogue, voices, and audio are produced by AI systems. While the pipeline includes fact-checking, content may contain errors or inaccuracies. Verify any claims independently.

Related Organizations

DeepMind (United Kingdom)

Keywords

ai-generated, my weird prompts, ai-copyright-law, data-provenance, machine-unlearning, podcast
