Contracts about artificial intelligence raise most of the same issues as other software agreements. But they raise unique issues too, particularly when they involve generative AI (like ChatGPT) and other forms of machine learning. This is the first of three posts describing those issues. This series provides an issue-spotter for lawyers, contract managers, and other contract-drafters.
Let’s start by defining key terms:
- Artificial intelligence or AI refers to computer systems able to perform tasks that usually require a human brain. It’s a pretty vague concept. [Footnote 1: For a good general explanation, see “What Is Artificial Intelligence? Everything You Need to Know,” G2.]
- Fine Tuning refers to further training of a machine learning system through data provided by the customer – a.k.a. “fine tuning data.” In other words, the customer helps train the AI, usually just for the customer’s own use.
- Generative AI or gen-AI means artificial intelligence that can create new assets, like text, audio, images (including deep fakes), and software code. Examples include large language models like ChatGPT and Claude from Anthropic, as well as art-creation systems like NightCafe. Generative AI is trained through machine learning (specifically, deep learning). [Footnote 2: See “Generative AI,” Techopedia. See also ChatGPT from OpenAI, Claude from Anthropic, and NightCafe.]
- Machine learning refers to creation of AI systems by “training” them, rather than just programming them. Machine learning systems can improve through training, unlike most computer systems, which are limited to their programming. (Deep learning is a sophisticated, complex version of machine learning.) [Footnote 3: See “What Is Machine Learning and How Does It Work?,” G2.]
- Output refers to an AI system’s response to a prompt – e.g., an answer to a question, or content like an image or software created on request.
- Prompt refers to the question or other request a user submits to an AI system.
- Input data and similar terms generally refer to information provided by the customer to guide the AI. Input data could come in the form of a prompt or an attachment to a prompt. It could also come in the form of special training data provided by the customer (fine tuning data). (That’s a general definition. The industry doesn’t use “input data” consistently.)
- Training data means a dataset used to teach a machine learning system how to do its job – how to generate outputs. The AI provider supplies most training data, in some cases by pulling together massive datasets, like much of the Internet. But customers sometimes add additional training data through fine tuning (above). Prompts can also become training data, particularly with generative AI. In rare cases, outputs can become training data too, as the system essentially teaches itself. [Footnote 4: See “What Is Training Data? How It’s Used in Machine Learning,” G2.]
The rest of this post addresses contracts for purchase and sale of AI, with a particular focus on gen-AI. It covers (A) ownership and control of customer training data and prompts. The two future posts in this series will cover: (B) ownership and control of outputs; (C) trade secrets and other confidential information related to AI; (D) liability related to errors in outputs; (E) liability to third parties related to AI inputs and outputs (IP, privacy, defamation, etc.); and (F) security, responsible use, and other special issues related to artificial intelligence.
A. Ownership and Control of Customer’s Fine-Tuning Data and Prompts
- General Problems with “Ownership” of Data: Contrary to common belief, you can’t own data or other information in any meaningful sense. Information can’t be patented, and copyright protects only expression, not the information expressed. So what does it mean to “own” data? Not much, but not nothing. Your data could include trade secrets, and ownership terms might help you protect them (though not as much as confidentiality terms). You could also claim a weak sort of copyright in the compilation of your dataset. You’d have limited rights to keep others from copying the compilation. But your copyright would not keep anyone from using your data, if they had a copy. In other words, claim data ownership, for what it’s worth, but don’t rely on it. For real protection, turn to contract terms restricting use of data, addressed below and throughout this series. For ownership itself, here’s suggested language protecting the owner: “Party B does not contest Party A’s claim to own the Data, and this Agreement does not transfer to Party B any title or other ownership rights in or to Data. Party B recognizes and agrees that: (a) the Data is Party A’s valuable property; (b) the Data includes Party A’s trade secrets; (c) the Data is an original compilation pursuant to the copyright law of the United States and other jurisdictions; and (d) Party A has dedicated substantial resources to collecting, managing, and compiling the Data.” A less balanced version, more favorable to the owner, would replace the first sentence with, “The Parties recognize and agree that Party A owns the Data.” That creates problems for Party B, discussed below in bullet 3. (For more on data ownership, see “Ten Key Data Terms in IT Contracts: An Issue-Spotter (Updated)”, bullet 4, and The Tech Contracts Handbook, 3rd ed., Ch. II.J.1.)
- Prompts and Customer Training Data & Input Data – Customer Ownership & Control: If possible, the customer should claim ownership of its employee and contractor prompts and of any training data and input data it provides. Use the language above in bullet 1. [Footnote 5: Replace “Data” with “Prompts” and/or something like “Customer Data,” and also replace “Party A” and “Party B” with “Customer” and “Provider,” respectively.] More importantly, the customer should add contract terms restricting use of prompts and training data. The following addresses data not used to train the AI: “Provider shall not access or use any Prompt or Customer Input Data other than as necessary to provide the System to Customer.” And the following addresses both prompts and training data that are used to train the customer’s copy or instance of the gen-AI: “Provider shall not access or use Customer Training Data (including without limitation Prompts) for any purpose other than to train and maintain Customer’s model or copy of the System.” [Footnote 6: This assumes you’ve defined “Customer Training Data” to include Prompts. If so, you could actually dispense with the “including …” language here.] (See bullet 4 below for training data used on the provider’s separate products or services.)
- Prompts & Customer Training Data – Provider’s Rights: For the provider, customer ownership of prompts and customer data doesn’t necessarily create a problem. Neither does accepting limits on control. However, prompts and customer data could include information the provider also receives from a third party and needs to use. And they could include the provider’s own information: trade secrets, copyrighted text, and even patents or patent applications, assuming the customer’s staff has access to those materials. So the provider should consider two precautions. First, it should distinguish between assigning ownership and merely accepting that the deal doesn’t give it ownership rights, and it should avoid the former. The first example above in bullet 1 achieves that, in the first sentence. Second, the provider should clarify that any customer ownership does not extend to prompts or other data it independently receives or develops. (NDAs draw the same line.) “Customer’s rights in Section __ (Ownership & Restricted Use of Prompts and Customer Data) do not restrict Provider’s ownership of or other rights to information Provider independently (a) develops or (b) receives from a third party. Provider does not assign or license to Customer any right, title, or interest in or to Prompts or Customer Data.”
- Customer Training Data and Prompts Used to Train Provider’s Products – Provider’s License: What if a generative AI system uses customer training data, prompts, or both to train the provider’s separate products and services? The provider still should not necessarily object to customer ownership or control. But it needs protections like those listed in bullet 3 above. The provider also needs clear rights to that training data. Provider-friendly language might read: “Customer hereby grants Provider a perpetual, irrevocable, worldwide, royalty-free license to reproduce, modify, and otherwise access and use Customer Input Data (including without limitation Prompts) to train and otherwise modify the System and other Provider products or services, with the right to sublicense any such right to Provider’s contractors supporting such training or modification.” [Footnote 7: The example uses “license” terms because of the potential for a compilation copyright, which could be licensed. But the fit between licensing language and data remains questionable, so consider replacing “license” with “authorization” or something similar.] Also, if the provider uses customer inputs to train the AI for use of third parties – not just for the customer’s use – the provider should consider a clarification: “Customer recognizes and agrees that, as a result of such use of Customer Input Data, the System could reproduce elements of such data in outputs provided to Provider’s other customers.”
- Provider Training Data – Provider Control w/ Limited Customer Use: The customer generally gets no access to training data supplied by the provider. And the customer doesn’t need rights to use that data, at least, not in addition to its right to the AI itself – with one exception. If the provider’s training data includes content the provider owns – subject to its copyright – the customer should have rights to use it to the extent it shows up in outputs (which may be rare). Each party then could benefit from some clarifications on that front: “This Agreement does not transfer to Customer any ownership of Provider Training Data or any right to access or use Provider Training Data, except as set forth in the next sentence. To the extent (if any) that the System includes in outputs content that Provider owns or otherwise has rights to reproduce, Provider hereby grants Customer a nonexclusive license to reproduce, distribute, modify, publicly perform, publicly display, and otherwise exploit such content, with the right to sublicense each and every such right.”
The next post in this series addresses IP in outputs and confidentiality related to AI, including trade secrets. The third post will address liability related to privacy, defamation, and IP, as well as responsible use of AI and other issues.
© 2023, 2024 by Tech Contracts Academy, LLC. All rights reserved.