Contracts about artificial intelligence raise most of the same issues as other software agreements. But they raise unique issues too, particularly when they involve generative AI, like ChatGPT. This is the first of three posts describing those issues. This series provides an issue-spotter for lawyers, contract managers, and other contract-drafters.
Let’s start by defining key terms:
- Artificial intelligence or AI refers to computer systems able to perform tasks that usually require a human brain. It’s a pretty vague concept. (For a good general explanation, see “What Is Artificial Intelligence? Everything You Need to Know,” G2.)
- Generative AI means artificial intelligence that can create new assets, like text, audio, images (including deep fakes), and software code. Examples include “large language models” like ChatGPT and Claude from Anthropic, as well as art-creation systems like NightCafe. Generative AI is trained through machine learning (specifically, deep learning). (See “Generative AI,” Techopedia. See also ChatGPT from OpenAI, Claude from Anthropic, and NightCafe.)
- Machine learning refers to creation of AI systems by “teaching” them, rather than just programming them. Machine learning systems can improve through training, unlike most computer systems, which are limited to their programming. (Deep learning is a sophisticated, complex version of machine learning. See “What Is Machine Learning and How Does It Work?,” G2.)
- Output refers to an AI system’s response to a prompt – e.g., an answer to a question or an illustration created on request.
- Prompt refers to the question or other request a user submits to an AI system.
- Input data and similar terms refer to data the AI analyzes but that isn’t used for training. Input often comes from the customer, but it could come from the provider. In some cases, the customer provides input data with its prompt, though we might just call that a very long prompt.
- Training data means a dataset used to teach a machine learning system how to do its job – how to generate outputs. The AI provider supplies most training data, in some cases by pulling together massive datasets, like much of the Internet. But customers sometimes add additional training data to help the AI address the customer’s particular needs. Prompts can also become training data, particularly with generative AI. In rare cases, outputs can become training data too, as the system essentially teaches itself. (See “What Is Training Data? How It’s Used in Machine Learning,” G2.)
The rest of this post covers (A) ownership and control of customer training data and prompts. The two future posts in this series will cover: (B) ownership and control of outputs; (C) trade secrets and other confidential information related to AI; (D) liability related to errors in outputs; (E) liability to third parties related to AI inputs and outputs (IP, privacy, defamation, etc.); and (F) security, responsible use, and other special issues related to artificial intelligence.
A. Ownership and Control of Customer Training Data and Prompts
- General Problems with “Ownership” of Data: Contrary to common belief, you can’t own data or other information in any meaningful sense. Information can’t be patented, and copyright protects only expression, not the information expressed. So what does it mean to “own” data? Not much, but not nothing. Your data could include trade secrets, and ownership terms might help you protect them (though not as much as confidentiality terms). You could also claim a weak sort of copyright in the compilation of your dataset. You’d have limited rights to keep others from copying the compilation. But your copyright would not keep anyone from copying smaller points of information within the dataset. And it might not keep anyone from using your data. In other words, claim data ownership, for what it’s worth, but don’t rely on it. For real protection, turn to contract terms restricting use of data, addressed below and throughout this series. For ownership itself, here’s suggested language — protecting the owner but still relatively balanced between the two parties: “Party A claims ownership of the Data, and this Agreement does not transfer to Party B any title or other ownership rights in or to Data. Party B recognizes and agrees that: (a) the Data is Party A’s valuable property; (b) the Data includes Party A’s trade secrets; (c) the Data is an original compilation pursuant to the copyright law of the United States and other jurisdictions; and (d) Party A has dedicated substantial resources to collecting, managing, and compiling the Data.” A less balanced version — more favorable to the owner — would replace the first sentence with, “The Parties recognize and agree that Party A owns the Data.” That creates problems for Party B, discussed below in bullet 3. (For more on data ownership, see “Ten Key Data Terms in IT Contracts: An Issue-Spotter (Updated)”, bullet 4 — and The Tech Contracts Handbook, 3rd ed., Ch. II.J.1.)
- Prompts and Customer Training Data & Input Data – Customer Ownership & Control: If possible, the customer should claim ownership of its employee and contractor prompts and of any training data and input data it provides. Use the language above in bullet 1, replacing “Data” with “Prompts” and/or something like “Customer Data,” and replacing “Party A” and “Party B” with “Customer” and “Provider,” respectively. More importantly, the customer should add contract terms restricting use of prompts and training data. The following addresses data not used to train the AI: “Provider shall not access or use any Prompt or Customer Input Data other than as necessary to provide the System to Customer.” And the following addresses both prompts and training data that are used to train the customer’s copy or instance of the AI: “Provider shall not access or use Customer Training Data (including without limitation Prompts) for any purpose other than to train and maintain Customer’s model or copy of the System.” (This assumes you’ve defined “Customer Training Data” to include Prompts; if so, you could actually dispense with the “including …” language. See bullet 4 below for training data used on the provider’s separate products or services.)
- Prompts & Customer Training Data – Provider’s Rights: For the provider, customer ownership of prompts and customer data doesn’t necessarily create a problem. Neither does accepting limits on control. However, prompts and customer data could include information the provider also receives from a third party and needs to use. And they could include the provider’s own information: trade secrets, copyrighted text, and even patents or patent applications, assuming the customer’s staff has access to those materials. So the provider should take two precautions. First, it should distinguish between assigning ownership and merely accepting that the deal doesn’t give it ownership rights, and avoid the former. The first example above in bullet 1 achieves that in its first sentence. Second, the provider should clarify that any customer ownership does not extend to prompts or other data it independently receives or develops. (NDAs draw the same line.) “Customer’s rights in Section __ (Ownership & Restricted Use of Prompts and Customer Data) do not restrict Provider’s ownership of or other rights to information Provider independently (a) develops or (b) receives from a third party. Provider does not assign or license to Customer any right, title, or interest in or to Prompts or Customer Data.”
- Customer Training Data and Prompts Used to Train Provider’s Products – Provider’s License: What if the AI uses customer training data, prompts, or both to train the provider’s separate products and services? That’s particularly likely with generative AI. The provider still should not necessarily object to customer ownership or control. But it needs protections like those listed in bullet 3 above. The provider also needs clear rights to that training data. Provider-friendly language might read: “Customer hereby grants Provider a perpetual, irrevocable, worldwide, royalty-free license to reproduce, modify, and otherwise access and use Customer Training Data (including without limitation Prompts) to train and otherwise modify the System and other Provider products or services, with the right to sublicense any such right to Provider’s contractors supporting such training or modification. Provider has no obligation to report on use of Customer Training Data, except as set forth in Section __ (Nondisclosure).” (The example uses “license” terms because of the potential for a compilation copyright, which could be licensed. But the fit between licensing language and data remains questionable, so consider replacing “license” with “authorization” or something similar.)
- Provider Training Data – Provider Control w/o Customer Rights: The customer generally gets no access to training data supplied by the provider. And the customer doesn’t need rights to use that data — at least, not in addition to its right to the AI itself. So the provider can draft simple, protective language: “This Agreement does not transfer to Customer any ownership of Provider Training Data or any right to access or use Provider Training Data.”
The next post in this series addresses IP in outputs and confidentiality related to AI, including trade secrets. The third post will address liability related to privacy, defamation, and IP, as well as responsible use of AI and other issues.
© 2023 by Tech Contracts Academy, LLC. All rights reserved.
Illustrations created by the author using generative AI from NightCafe.