OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling’s Harry Potter series::A new research paper laid out ways in which AI developers should try and avoid showing LLMs have been trained on copyrighted material.
You kind of do. Fair use protects reverse engineering, indexing for search engines, and other forms of analysis that create new knowledge about works or bodies of works. These models are meant to be used to create new works, which is where the “generative” part of generative models comes in, and because the models consist only of original analysis of the training works in comparison with one another, they are protected as your tool.
https://en.wikipedia.org/wiki/Fair_use
Fair use only works if what you create is meant to reflect on the original and not to supersede it. For example, if ChatGPT gobbled up a work on the reproduction of fireflies and, when you ask it a question about the topic, it just answers, that’s not fair use, since it made the original material redundant. If it did what a search engine would do and just told you “here’s where you can find it, you might have to pay for it”, that’s fair use. This is of course US law, so it may be different elsewhere, and US law is weird, so the courts may say anything.
That’s the gist of it: fair use is fine as long as you are only creating new information and only use as much of the copyrighted old work as is absolutely necessary for your new information to make sense, and even then, you can’t use so much of the copyrighted work that it takes away from its value.
Otherwise if I pirated a movie and put subtitles on it, I could argue it’s fair use since it’s new information and transformative. If I released the subtitles separately, that would be a strong argument for fair use. If I included a 10 sec clip in it to show my customers what the thing is like in action, then that may be argued. If it’s the pivotal 10 seconds that spoils the whole movie, that’s not fair use, since I took away from the value of the original.
ChatGPT ate up all of these authors’ works and, for some, it may take away from the value they have created. It’s telling that OpenAI is trying to be shifty about it as well. If they had a strong argument, they’d want to settle it as soon as possible, as this is a big stormcloud over their company’s IP value. And yeah, it sucks that people created something that may turn out to not be legal because some people have a right to profit from some pieces of capital assets, but that’s the story of the world the past 50 years.
First of all, fair use is not as simple or clear-cut a concept as you make it out to be, and it can’t be applied uniformly to all cases. It’s flexible and context-dependent, resting on careful analysis of four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use upon the potential market. No one factor is more important than the others, and it is possible to have a fair use defense even if not every factor weighs in your favor.
Generative models create new and original works based on their weights, such as poems, stories, code, essays, songs, images, video, celebrity parodies, and more. These works may have their own artistic merit and value, and may be considered transformative uses that add new expression or meaning to the original works. Providing your own explanation of the reproduction of fireflies neither makes the original redundant nor reproduces it, so it’s likely fair use. Plenty of competing works explaining the same thing exist, and they’re not invalid because someone else got to it first or because they’re based on the same sources.
Your example about subtitling a movie doesn’t meet the criteria for fair use because subtitling a movie isn’t a transformative use. It doesn’t add any expression or meaning; you merely reproduce the original work in a different language, and it isn’t commentary, criticism, or parody. Subtitling a movie also involves using the entire work, which again weighs against fair use. The more of the original you use, the less likely it’s fair use. This might also have a negative effect on the potential market for the original, since it could reduce demand for the original or its authorized translations. Now, subtitling a short clip from a movie to illustrate a point in an educational video or a review would likely fly.
Finally, uses that can result in lost sales in already established markets tend to be found not to be fair use by the courts. This doesn’t mean that every use that affects the market is unfair. That would mean you wouldn’t be able to create a parody movie or use snippets of a work for a review. These can be considered fair use because they comment on or criticize the original work, unlike uploading a full movie, song, or translated script. Though I could be getting the wrong read here, since you didn’t explain how you came to any of your conclusions.
I think you’re being too narrow and rigid with your interpretation of fair use, and I don’t think you understand the doctrine that well. I recommend reading this article by Kit Walsh, who’s a senior staff attorney at the EFF, a digital rights group, who recently won a historic case: border guards now need a warrant to search your phone. I’d like to hear your thoughts.
I am not a lawyer by the way, I don’t even live in the US, so what I write is just my opinion.
But fair use seems a ridiculous defense when we talk about the Github Copilot case, which is the first tangible lawsuit about it that I know of. The plaintiffs lay out the case of a book for Javascript developers as their example. The objective of the book is to give you exercises in Javascript development; I would get the book if I wanted to do Javascript exercises. The book is copyrighted under a share-alike, attribution-required licence. The defendants, Github and OpenAI, don’t honour the licence with Copilot and Codex. They claim fair use.
So with the four factors:
the purpose and character of your use: Well, they present their Javascript exercises as original work while it’s obvious they are not; they are reproducing the task they want letter by letter. It is even missing critical context that makes it hard to understand without the book, so their work does not even stand on its own. Also, they do this for monetary compensation while not respecting the original licence, which, if someone were giving commentary or criticism covered by fair use, would be as trivial as providing a citation of the book. They are also not producing information beyond what’s available in the book. Quite funnily, the plaintiffs mention that the “derivative” work is also not very valuable, as the model answered a question about how to determine if a number is even with an example from a “what’s wrong with this, can you fix it?” section.
the nature of the copyrighted work: It’s freely available; the licence only requires that, if you republish it, you provide proper attribution. It is not impossible to provide use cases based on fair use while honouring the licence. There is no monetary or other barrier.
the amount and substantiality of the portion taken: All of it, and it is reproduced verbatim.
the effect of the use upon the potential market: Github Copilot is in the same market as the original work and is competing with it, namely in showing people how to use Javascript.
And again, I feel this is one layer. Copyright enforcement has never been predictable, and US courts are not predictable either. I think anything can come of this now that it’s big tech on the defendant side, and they have the resources to fight, unlike random Joe Schmoes caught with bootleg DVDs. Maybe they abolish copyright? Maybe they get an exception? Since US courts have such wide jurisdiction and can effectively make laws, it is still a toss-up. That said, the Github Copilot class action case is the one to watch, and so far the judge has declined to dismiss the case, so it may go either way.
Also by the way, the EU has no fair use protections, it only allows very specific exceptions for public criticism and such, none of which fits AI. Going by the example of Copilot, this would mean that EU users can’t use Copilot, and also that anything that was produced with the assistance of Copilot (or ChatGPT for that matter) is not marketable in the EU.
I am not a lawyer either, or a programmer for that matter, but the Copilot case looks pretty fucked. We can’t really get a look at the plaintiffs’ examples since they have to be kept anonymous. Generative models’ weights don’t copy and paste from their training data unless there’s been some kind of overfitting, and some cases of similar or identical code snippets might be inevitable given the nature of programming languages and common tasks. If the model was trained correctly, it should only ever see infinitesimally tiny parts of its training data. We also can’t tell how much of the plaintiffs’ code is being used, for the same reasons. The same is true of the plaintiffs’ claims about the “Suggestions matching public code”.
This case is still in discovery and mired in secrecy; we might never find out what’s going on, even once the proceedings have concluded.