Navigating the Complex Intersection of AI and Copyright Law: Analysis of the Authors Guild suit against OpenAI and Why Expertise Matters
By R Tamara de Silva
OpenAI Inc. is under scrutiny following allegations of copyright infringement by leading authors and the Authors Guild. The lawsuit (No. 1:23-cv-08292), filed in New York federal court, asserts that OpenAI used copyrighted works without authorization to train its AI product, ChatGPT. Prominent authors including John Grisham, George R. R. Martin, and Jodi Picoult have joined the complaint, alleging that OpenAI engaged in "systematic theft on a grand scale." The suit contends that OpenAI's success depends on, and would not have been possible without, mass copyright infringement, and that the models were trained without seeking the authors' permission or compensating them.
The Authors Guild has expressed concerns over the impact of AI tools like ChatGPT on writers, citing a significant decline in work opportunities for some and the emergence of low-quality AI-generated content impersonating known authors. The Guild had previously raised these concerns with OpenAI in an open letter. The lawsuit seeks to represent US fiction writers whose works have sold more than 5,000 copies, and it asks for damages for lost licensing opportunities and a halt to the use of copyrighted content in future AI training.
However, there are potential weaknesses in the case against OpenAI. The nature of large language models like ChatGPT is such that they're trained on vast datasets, making it challenging to ascertain if, or which, specific works from individual authors were used. The sheer volume of data involved means pinpointing particular copyrighted content in the training data could prove difficult. This could complicate the plaintiffs' efforts to conclusively demonstrate direct infringement by OpenAI.
Indeed, the assumption that infringement can readily be proven reflects a fundamental misunderstanding of how AI models are trained. Extracting individual data from a system like ChatGPT to prove infringement of a particular author's work is not only a daunting challenge; the attempt runs into several inherent weaknesses and difficulties. Here are some of the reasons why:
- Aggregated and Transformed Nature of Training Data: After training on vast datasets, the model's knowledge is stored in an aggregated and transformed form; the model does not retain specifics about individual texts. It learns patterns rather than particular passages, making it virtually impossible to reverse-engineer the model to extract individual works (the sketch after this list illustrates the point).
- No Direct Replication of Original Texts: While the model can generate text based on patterns it has learned, it doesn't simply "copy and paste" from its training data. Hence, proving that any specific generation from the model is a direct result of one specific piece of training data is problematic.
- Volume of Data: The vast amount of data used to train models like ChatGPT means that even if copyrighted material was included, it would be a minuscule fraction of the total data. Determining the impact or influence of any single work on the model's overall behavior would be a significant challenge.
- Data Sources: OpenAI, or other organizations, might not always have direct oversight or knowledge of every individual piece of content in their training datasets, especially if they use broad internet scrapes. This makes it difficult to ascertain with certainty what specific works were included.
- Ambiguity of "Original" Content: A major challenge would be to define what constitutes original content, especially given that many phrases, sentences, or ideas might be ubiquitous or generic. Drawing the line between general knowledge and proprietary content is complex.
- Technical Challenges: Even if one wanted to, dissecting a trained model to pinpoint specific training data is currently beyond our capabilities. The weights in the model do not correspond to specific texts or pieces of knowledge in any discernible way.
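To make the aggregation point concrete, consider the following minimal sketch. It is a toy bigram counter over three made-up "documents," not a description of how ChatGPT is actually built, but it illustrates the relevant property: training pools contributions from many texts into shared statistics that carry no record of which source produced them.

```python
from collections import Counter

# Three hypothetical training "documents" stand in for a training corpus.
documents = [
    "the dragon flew over the castle",
    "the lawyer read the brief over coffee",
    "the castle stood over the river",
]

# "Training" here is just counting adjacent word pairs (bigrams) across every
# document at once. Real LLM training is vastly more complex, but the relevant
# property is the same: contributions from all documents are pooled into
# shared parameters (here, a single table of counts).
bigram_counts = Counter()
for doc in documents:
    words = doc.split()
    bigram_counts.update(zip(words, words[1:]))

print(bigram_counts[("over", "the")])  # 2 -- pooled across multiple documents

# Nothing in the resulting "model" records which document contributed which
# count, so the table cannot be reverse-engineered to recover any single
# source text.
```

In a large language model the pooled quantities are billions of continuous weights rather than counts, which makes attributing any piece of the model to a specific work even harder than in this toy example.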
Given these factors, while there may be moral or ethical considerations regarding the training of large models on vast datasets, from a legal and technical perspective, proving direct infringement or extracting specific training data will be challenging.
At the De Silva Law Offices, we not only bring expertise in white-collar defense and financial markets but also possess a deep understanding of AI and machine learning, the product of years of study of these topics at two of the world's leading universities. If you find yourself facing litigation or any matter involving artificial intelligence, we urge you to reach out to us for a knowledgeable and strategic consultation.
R Tamara de Silva