Does generative AI really present a gaggle of copyright brainteasers? Or are the answers relatively simple, based on old law? Below are the three key questions roiling the industry and the courts – along with their answers, IMHO.
1. Do generative AI operators infringe copyright when they train their models (machine learning) on copyrighted content freely available to the public? NO.
2. Do gen-AI operators infringe copyright when they use facts, ideas, and styles from copyrighted training data to produce original content? NO.
3. Do gen-AI operators infringe copyright when they produce output substantially similar to copyrighted training data? YES.
My answer to question 3 will be unpopular with many gen-AI providers. My answers to questions 1 and 2, on the other hand, support their arguments. But I come at the problem from a different direction (at least, different from the arguments I’ve seen). It all boils down to a comparison between machine learning and the human brain.
I don’t infringe your copyright when I store your words in my brain.
Let’s say you write an article protected by copyright, and I read it – but I don’t keep an electronic or hard copy. I do, however, retain a lot of your words … in my brain. I have a good memory, so I could probably write a similar article using roughly the same words. And if I have a photographic memory, I retain all your words and could write a nearly identical article. (Either article would infringe your copyright, but more on that later.)
How do I retain so much of your expression in my brain? A neuroscientist might say something about new or stronger neural pathways – and possibly throw in terms like “synaptic plasticity” and “action potential.” But science has a weak grasp on the process. Do my neural pathways save a mental photocopy of your words? Does a copy exist in my brain before I try to remember your words, or are they reassembled only when needed? We don’t know. (The Queensland Brain Institute offers a short summary of what we do know, or at least some of it: “How are memories formed?”)
Fortunately, we don’t need to know for copyright purposes. England’s Queen Elizabeth I once said, “I have no desire to make windows into men’s souls.” Copyright law follows her wise example. It does not make windows into our mental data storage systems. Whatever “copying” system my brain uses for your words – or your song, painting, photo, etc. – it doesn’t infringe your copyright. (Coincidentally, modern copyright law began with another of England’s few reigning queens. In 1710, just over a century after QE-1 died, Parliament passed The Statute of Anne, under Elizabeth’s cousin many times removed: Queen Anne of Great Britain and Ireland.)
Machine learning doesn’t infringe either.
It’s just about as hard to explain how a large language model or other machine learning system uses training data. Does it make a copy? Some AI engineers will say “no.” But keep asking questions and you’ll end up dubious … or confused.
I tried to get an answer from ChatGPT (3.5). ChatGPT also said “no” to copying training data, sort of. It said its model “doesn’t store explicit information about particular sentences” (from an article in training data). Rather, “[t]he retention is more abstract and focused on general patterns ….” Confused, I kept asking questions. ChatGPT eventually conceded that it could reproduce a human-written article word-for-word as a result of “the model memorizing or overfitting to that specific content during training.” So the model does not store explicit information, but it can “memorize” or “overfit” content and reproduce it word-for-word.
What? If you’re confused, good: you’re paying attention. (Click here to read my full conversation with ChatGPT. And for more concerns about whether gen-AI copies training data, see my earlier article, “Watermarks on Generative AI Art … and Copyright.”)
Why can’t an engineer or a gen-AI system give clear answers about whether machine learning copies training data? Maybe we shouldn’t be surprised, since we can’t answer that question about our brains. Perhaps the vocabulary doesn’t exist yet.
Why do we care, at least from a copyright point of view? Whatever weird, baffling thing machine learning does with training data, it has no more impact on the original author/artist than my learning process does on you when my brain absorbs your article. Assuming I get legal access to your work – it’s published online, for example, or I buy a copy – I don’t measurably alter your market by learning from your words, however I may mentally record them. The same goes for gen-AI. So machine learning is not copyright infringement. (Call it “neural fair use”?)
Illegal publication is a whole other story
What about training gen-AI on copies of your work published or distributed illegally? What if an evil corporation scans your book and publishes it online, without a license? The evil corporation has infringed your copyright. And that illegal copy could represent a source of liability for a gen-AI company that uses it for training. If the company encourages the publication – or even knowingly exploits unlicensed copies – it may have committed contributory infringement of your copyright.
To put it another way, freely available content is fine. But if I have to pay to read your book – or view your painting, hear your song, etc. – the gen-AI vendor probably has to pay to use it for training.
Illegal publication, however, gets very complicated, and it’s not my core topic. Let’s turn back to training data online or otherwise freely available for public use.
When I use your ideas to create unique content, I still don’t infringe your copyright – and neither does gen-AI.
If I like your online article, I can write my own using your ideas but not your words. That’s not copyright infringement. (I should throw you a cite or two, but copyright doesn’t require it.) If you write a song or paint an image of something unique – like a coyote driving a golf cart or a McDonald’s on Pluto – I can create my own song or painting on the exact same subject matter. If I don’t copy your lyrics, notes, lines, or shades, I don’t infringe. Copyright protects expression, not ideas.
The same logic should apply to AI. A machine learning system can take the facts and ideas in its training data and repackage them, using different words, lines, notes, etc. For instance, it can absorb a news article and then generate output reporting the same news, using different words. That output doesn’t infringe anyone’s copyright.
You might point out that it’s hard to draw a line between expression and idea. What if the AI uses my ideas and some of my words – or notes, lines, shades, etc.? Yes, that raises hard issues. But copyright law already has rules to separate idea from expression. We just have to apply them, as per usual.
You might also argue that gen-AI doesn’t work with facts or ideas, only words, lines, and notes. So any similarity results from reproducing copyrightable stuff – words, lines, and notes – not ideas. I think that’s a philosophical dead end. We don’t understand sophisticated machine learning or human brains enough to explain what either system really records. And we don’t need to. We have perfectly good copyright laws that haven’t ever needed answers to those questions.
When I distribute your words, lyrics, notes, image, etc., I do infringe your copyright – and the same goes for gen-AI.
Regardless of how I “recorded” your article in my brain, I do infringe your copyright when I distribute, publish, or otherwise reproduce your words – or the notes of your song, the lines and shades of your painting, etc.
The law doesn’t care about the route your words took from your article through my brain’s mysterious storage system to my own article. It asks (a) did I have access to your work and (b) did I create something substantially similar (to protected elements of your work)? If so, I’ve infringed your copyright.
The same goes for gen-AI … or should. Hard questions about the route from training data to AI output don’t matter. The law should ask the same two questions: (a) did the gen-AI have access to the copyrighted content, as training data, and (b) is the output substantially similar? If so, the AI has infringed copyright. Or to be more precise, the AI’s operator has infringed. (Liability probably lands on both the AI company and the user – the latter possibly a humble consumer. The copyright plaintiff will undoubtedly sue the deep pocket.)
This isn’t computer law; it’s people law.
Keep in mind, no one (sane) sues a computer. We sue people for copyright infringement. We sue the operators of the generative AI system.
So we’re talking about humans copying ideas, expression, or both, with AI serving as a tool. (It’s not a sentient decision-maker like R2-D2, WALL-E, or Ultron.) Copyright law already covers human infringement. Why do we need new legal principles?
After all, we dreamed up market-share liability.
If I’m right that a gen-AI output can infringe copyright in some cases, the industry has a problem. It will get solved. I don’t know how the law will compensate creators for infringement, but there is no way we’re going to shut down the incredibly valuable gen-AI business.
I don’t think we need new laws on infringement. But we may need new ways to manage copyright remedies. Courts and lawyers have solved problems like that in the past. During the 1980s, for instance, they came up with “market-share liability”: a new, counter-intuitive, brilliant way to manage liability and remedies in mass-tort cases. (See LSDefine for a simple explanation of market-share liability.) I hope similar creativity will soon give us a solution for generative AI and copyright.
* * *
© 2024 by Tech Contracts Academy, LLC. All rights reserved.
The opinions in this post do not reflect the views of any legal client of the author or of his law firm, and they are not based on information disclosed by any such client.
- Artificial Intelligence Brain Think, by geralt, courtesy of Pixabay
- Portrait of Elizabeth I of England – the Armada Portrait, Anonymous, c. 1588
- Colorful portrait of a woman’s mind, by tommyvideo, filtered for color, courtesy of Pixabay
- Man author, by FreeFunArt, courtesy of Pixabay