Copyright and generative AI
A short review of legal and ethical issues around generative AI, focusing on copyright
Copyright is all about text, images, audio, video, and other forms of creative expression. Most current generative AI tools are -also- all about text, images, audio, and video - so it's not surprising that there are a lot of copyright issues related to the generative AI field!
There are copyright issues around training AI models, sharing materials with existing AI tools, and creating new content with AI tools. Most of these issues are very unsettled right now, and are likely to continue evolving rapidly over the next few years. There are also related legal and ethical issues that are evolving just as quickly.
Definitions
"AI" is used to mean a lot of things, but for legal discussion it's important to be clear about meanings. The definitions below are what we will be using within this page. They're a little simplistic and may be incomplete, but hopefully they will help increase clarity.
generative AI - tools that produce new materials like text, images, audio, or video in response to user prompts.
general AI tools - tools that are trained on big, broad sets of inputs in order to produce a very wide range of outputs. Usually consumer-grade, and available to the general public or to subscribers. Examples include ChatGPT, MidJourney, CoPilot, Gemini, etc. Many tools available for more targeted purposes use general AI tools as their underlying basis, such as chatbots that aim to replace humans in customer service or therapy interactions.
specialized AI tools - tools that are trained on a focused set of materials, in order to perform specific functions. These are not usually available to the public, even for a fee. They are usually used by researchers, and/or to perform very specific tasks. Some specialized AI tools are not really "generative", in that they don't produce new materials - they may be used more for analysis and synthesis within existing sets of materials. Examples include tools for screening medical imaging, and some corporate knowledge bases.
LLMs (large language models) - statistical algorithms developed through analysis of a very large set of text inputs, which can usually produce (and respond to) natural human language. LLMs form the basis of a lot of the general consumer AI tools today.
Image/audio/video generation models - statistical algorithms that are developed through analysis of sets of related media and text inputs (such as labeled images or captioned video or audio), which can produce new images, audio, or video. Most of these rely on LLMs for the language processing tasks involved.
Ownership and AI
Copyright ownership
In the US, copyright ownership for generative AI outputs is already fairly settled. Human creativity is required for a copyright to exist, so most outputs of generative AI tools are not eligible for copyright on their own. That is, most generative AI outputs do not have any copyright. However, when you add human creative expression, that creative expression can have a copyright - and many generative AI outputs are then manipulated by people in ways that can create a copyright.
For example, a series of images produced by an image generator is not eligible for a copyright on its own. (That is, the images do not have a copyright; there is no copyright owner.) But if human authors use that same series of images to make a comic book, the process of selecting and arranging the images is human creative expression and the whole book is eligible for a copyright. (That is, the book as a whole has a copyright, and a copyright owner. The pictures individually still don't.)
Contracts about ownership and use
Although there may be no copyright owner for many AI outputs (which usually means anyone can use a creative work), there may be contracts that claim to establish ownership, or that limit how the outputs of AI systems can be used. For example, most general AI tools have terms of use that say who can use the output, and for what purposes. In many cases, you can't use the output commercially without a paid subscription.
Using existing works to train generative AI tools
Copyright issues in training
Training most generative AI tools involves use of enormous sets of text, images, audio, video, and other types of copyrightable creative works. While a few providers of AI tools have trained their models using only materials which are in the public domain, or for which the training company has a relevant license, most training is done on in-copyright materials.
Most companies and other organizations training on in-copyright materials believe that this training is legal under the United States copyright provision for fair use. Their legal argument is derived from principles that were developed in cases about search engines (which copied web content to make it findable), and about book scanning projects (which copied books to make it possible to search inside them.) Broadly speaking, this kind of copying often does not involve humans having access to the entirety of the original creative works, and does involve using the works for an unusual (or "transformative") purpose, which both tend to strengthen fair use claims. (It is also slightly more likely to be fair use for non-profit researchers to do this, than for commercial companies.)
Currently, numerous copyright lawsuits are pending in the United States and elsewhere over the legality of copying existing materials without permission for training AI systems. There is a lot of marketing rhetoric from general AI companies that suggests there are no copyright issues with using existing materials as training data, and that's not really correct. The fact that the suits are underway does indicate that the arguments both for -and- against fair use are strong - if it were immediately and obviously fair use, the cases would have wrapped up by now. However, some of the early verdicts in these cases are affirming fair use even in commercial applications, especially where the original works were legally acquired (rather than pirated).
Non-profit research projects using existing materials to train AI systems are even more likely to be able to rely on fair use than commercial actors. Similarly, due to their more strongly transformative purpose, researchers developing specialized AI tools may be more able to rely on fair use. However, researchers should pay careful attention to developing case law, as some unexpected rulings may have implications even there.
Contract issues in training
While the copyright law on using existing materials as training data is evolving, many existing content owners are taking action to prevent their materials from being used this way via contracts, such as terms of use or subscription contracts. For example, most subscribers to streaming TV and movie services have agreed to terms of use that say you cannot use those materials for anything but personal, noncommercial viewing. And the research journals and databases that the University Libraries subscribe to on behalf of campus also often only allow for noncommercial research use. In these cases, a contract provides a legal limit on using the materials as training data, even though the copyright law question is still open.
If you violate a contract, the most likely result is that your business relationship with the other parties to the contract will end - although lawsuits are still possible, they're rare. So if you violate your streaming media terms of service, you will likely find yourself cut off from that service. If you violate the subscription terms for Libraries subscription resources, all campus users may find themselves cut off from access. Some of our subscription resources -do- have options for using them with AI, but please read more about that, or get in touch, if you are interested in doing so.
Contractual limitations on using existing materials to train AI tools apply regardless of whether the developers have commercial or non-profit intentions, and often even regardless of research or scholarly intentions - they depend on the specific terms of each contract, so researchers should be careful in this area.
Occasionally, a contract or license may expand the possibilities of using existing materials to train AI tools - an open license such as a Creative Commons license pre-approves some kinds of uses. Note that Creative Commons licenses often prohibit commercial use, and it's unclear how far down the chain of a training process those prohibitions might persist. Note also that most variations on Creative Commons licenses require attribution as a condition of use - it's currently unsettled how this interacts with training uses (since most LLMs and related systems are not very capable of true attribution.)
Creative Commons licenses also explicitly preserve fair use and other copyright exceptions, so if at some point it becomes settled law that fair use covers training uses, a lot of these considerations specific to Creative Commons (or other open licenses) may become irrelevant. But at the moment, CC licenses and other open licenses can be one way to create certainty in training sets.
"Style" and substitution - not well handled by current law
One thing that many people find concerning about generative AI tools is their ability to generate new materials in the style of a human artist, or with the "voice" of a specific human writer or speaker (living or dead.) To many creative artists, making a new work that is in the style of an existing work feels like a copyright violation. However, copyright does not protect "styles"; it protects specific creative works. (Trademark law may offer some degree of protection for certain stylistic elements, but even that does not typically provide complete protection or bar all other uses of that style.)
It's a longstanding artistic tradition to try to produce things in the style of other artists - or even to copy existing works to learn from them. It certainly feels different at the scale that an AI tool can do this, rather than how a human artist can copy from another human - and there may yet be some legal remedies for stylistic copying, especially under legal regimes with different approaches from the US. Rights of artistic integrity, for example, exist in most European copyright law, and may limit the stylistic copying AI tools provide.