The Nondescript Legality of Generative AI

22 August 2023

JUSTIN OKARA

Partner

Generative AI is the latest buzz in the tech domain; the technology can produce multiple types of content, including text, images, audio, video, and even synthetic data. Although the underlying technology is not necessarily new, the simplicity and ease of use it offers end-users have driven record-breaking adoption; for example, OpenAI's chatbot ChatGPT reached 100 million users within two months of its launch. The output of Generative AI relies on vast amounts of input data processed by highly complex algorithms, which has given rise to unique and intricate legal issues, especially around copyright and data protection.

This blog attempts to illustrate these issues from two ends: the legality of Generative AI's input and of its output. It explores the legal and ethical implications of using shadow libraries containing copyrighted material, examining training data acquired through text-based pools on the dark web and through image-scraping AI systems. Beyond the training phase, it questions whether the generated content can itself attract copyright protection. It also highlights how copyright law varies by jurisdiction, as observed in the US, UK, EU, and Kenya, and how this adds further complexity to the ownership of AI-generated content. Finally, the discussion touches on the evolving concept of fair use, particularly in the context of academia and non-profits utilising copyrighted material for AI training purposes.

Focusing on the legal challenges of Generative AI, the blog then explores potential compromises between AI companies and rightsholders, ranging from licensing agreements to stricter regulations on data usage. Ultimately, it underscores the impending legal challenges and transformations in copyright law that the nascent field of Generative AI is poised to bring in the years to come, urging creatives and AI users to stay abreast of these developments to safeguard intellectual property rights and avoid potential copyright infringement.

The Greyness of Input Data: Copyright-protected Data and Training AI Models

Given an input, such as a text prompt, an AI model generates a corresponding output, such as an image that expresses or mimics the prompt, or text feedback that reads as human-like. The training of AI has generated considerable debate about the legality of the activities a model needs to produce seemingly original works, as it necessitates scraping and extracting data from innumerable datasets, much of them copyright-protected.

The typical approach to handling copyrighted data involves obtaining consent from the rightsholder through a licence. However, this method can be costly and cumbersome, prompting many companies to explore alternative and controversial solutions. An illustration of this is the use of text-based data pools, generally referred to as shadow libraries, to train Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Bard. These shadow libraries provide access to copyrighted material, including books and articles, that would otherwise be restricted behind paywalls.

The crux of the matter lies in the legitimacy of specific shadow libraries. Case in point, Project Gutenberg features a collection of eBooks with expired copyrights, such as Romeo and Juliet and Moby Dick, making it less contentious. On the flip side, shadow libraries like Z-Library, the self-described "world's largest eBook library," operate on the dark web and illicitly appropriate some of their books through piracy. The use of unlawfully obtained data from such shadow libraries raises concerns about copyright infringement and the ethical implications of training AI models on it.

Similarly, AI systems that convert text to images, such as Stable Diffusion and Midjourney, scrape billions of images from diverse sources, ranging from personal blogs to stock imagery platforms like Getty Images and Shutterstock, and even art platforms like DeviantArt. These AI-generated images often mimic a specific artist's style based on a given text prompt while steering clear of directly duplicating the artist's existing work. Consequently, this practice has led to various lawsuits in multiple jurisdictions probing whether the training data employed by companies utilising Generative AI models amounts to copyright infringement. Traditionally, copyright owners possess an exclusive right to reproduce their work, safeguarding their compensation. The uncanny ability of AI to generate content from unlicensed or otherwise illegally obtained copyrighted material therefore poses considerable concerns and potential legal implications for rightsholders.

In jurisdictions deemed favourable, like the United States, the use of training data is argued to satisfy the fair use doctrine, which allows the use of copyrighted material without explicit permission from the copyright owner, thus promoting freedom of expression. Yet, given that Generative AI reproduces patterns found in code, text, music, art, and other human-created data when generating output from its training data, it is arguable that a text prompt instructing it to "create a painting in the style of…" could infringe copyright, as it raises legal and ethical questions about the expression of copyrighted works.

Meanwhile, within the European Union (EU), AI training activities are governed by Articles 3 and 4 of the Directive on Copyright and Related Rights in the Digital Single Market. These articles provide exceptions to copyright law specifically for text and data mining (TDM) purposes, covering both scientific and "commercial" applications. However, TDM activities for scientific purposes should not be published or made publicly available, and the results derived from such activities should not be employed to create commercial products or services for business purposes. It is essential to highlight that the exceptions granted for TDM are not absolute; they require a balance in which the activities do not unreasonably prejudice the legitimate interests of the copyright owner.

 

Furthermore, the flexibility of the fair use doctrine affords academia and non-profit organisations the ability to utilise copyrighted material without obtaining explicit permission, a fact well-recognised by companies. An illustrative example is LMU Munich, a German university that was pivotal in developing the underlying algorithm for the Stable Diffusion AI model. The university conducted the data collection and model training, effectively mitigating legal liabilities for the associated company. While this move reinforces the fair use defence, some experts have criticised it, labelling it "AI data laundering" for commercial companies.

 

The Output of AI Models and Copyright

While fair use and TDM exceptions could cover the training of AI models, the subsequent inquiry delves into a model's output and whether the generated content can be copyrighted. The first question is whether an individual can copyright the output of a Generative AI model and, if so, who holds the ownership rights. The second is whether those who own the rights to the training input data have a legal claim over the output produced by the model.

The answer to the first issue is complicated, as it varies across jurisdictions. In the United States, the stance is relatively straightforward. In 2019, the United States Copyright Office (USCO) clarified that copyright protection should not be extended to works created solely by "machines or other automated means," thereby limiting protection to works created by humans. However, the USCO may consider extending copyright protection if the creator can furnish sufficient evidence of substantial human input.

Concurrently, the EU - whose copyright law is based predominantly on the Berne Convention - takes a similar stance. EU copyright law does not explicitly provide copyright protection for computer-generated content. However, the European Court of Justice (ECJ) is of the opinion that computer-generated content may be eligible if it results from a human's "creative process," thus bringing the author's personality to the fore. With that in mind, it seems apparent that most output of Generative AI models, generated primarily from keyword prompts, will not fit the bill for copyright protection.

On the contrary, if one explores various prompts for image creation, fine-tunes the images, or employs seeds to further engineer the output, an argument for copyright protection becomes plausible. Such manipulation may introduce elements that demonstrate personality or intellectual involvement in the creative process.

The United Kingdom emerges as a viable jurisdiction for AI companies seeking copyright protection for their computer-generated works. However, the works must be original and the product of a human's creative process. In this context, human involvement entails making creative choices in the work, including the selection and arrangement of data, beyond the AI's apparent generative outcome. The legislation further specifies that the author is "the person by whom the arrangements necessary for the creation of the work are undertaken." Regardless, the debate lingers over the human aspect of authorship, as the legislation does not explicitly state whether this refers to the model's developer or its operator.


Shifting the focus to the Kenyan context, the copyright law parallels the UK's, granting copyright protection to the person responsible for the arrangements necessary for creating the work. However, this aspect is yet to be litigated in Kenyan courts, leaving room for interpretation of what amounts to the necessary arrangements. While it is presumed that such works are likely to be registrable as copyright works, the precise terms and limits of this possibility are anticipated to be tested in litigation across various jurisdictions.

Multiple rightsholders have initiated legal action against Generative AI companies, alleging copyright infringement and unfair competition. These lawsuits assert that the companies use their works in their models without obtaining proper permission.

Navigating the Copyright Conundrum in Generative AI

 

Some argue that using copyrighted training data is permissible, although generating content from it may be considered to infringe copyright. One of the principal protections that copyright gives artists and rightsholders is the safeguarding of their creative expression and the remuneration that flows from it. On that view, when a Generative AI model creates novel images, it is unlikely to amount to copyright infringement, as the transformation of the training data does not pose a threat to the market for the original work.

Generative AI models, often characterised as pattern-recognition software, excel at identifying patterns in how specific artists craft their works. Keyword prompts such as "draw an image in the style of..." may therefore cross the boundary into reproducing the protected expression of an idea, raising concerns about copyright infringement. Furthermore, if the resulting image is available for purchase, it introduces competition with the original work, potentially causing unfair harm to the interests of the rightsholder.

Inevitably, Generative AI companies need to strike a balance with rightsholders, and the most prominent but contentious solution involves licensing the training data. Licensing allows rightsholders to be compensated for the use of their works in training AI models. It may seem like the best way to split the difference, but how plausible is it to license every image, video, audio clip, or text in these extensive training datasets? Drawing a parallel with the once-illegal Napster, however, complex licensing deals can satisfy multiple rightsholders and legitimise the generated content.

Another approach involves implementing stricter laws and regulations on the collection and use of training data, especially data scraped from the web without consent. Furthermore, given that the data in existing training sets has already been collected and utilised, it becomes crucial for AI companies to publicly disclose any copyrighted data used in training their models.

In conclusion, the emergence of generative AI introduces nuanced challenges to traditional copyright paradigms, calling for a delicate balance between fostering technological innovation and protecting the rights of creatives. The ongoing legal debates around fair use, licensing, and data rights underscore the need for evolving legislation and regulations to address the unique challenges posed by AI. Additionally, stricter regulations on collecting and using training data and transparency measures such as public disclosure of copyrighted data usage emerge as potential strategies to navigate the evolving legal landscape. The ongoing lawsuits, particularly those involving rightsholders and Generative AI companies, highlight the urgency for legal clarity and industry standards.


The rising tide of generative AI presents a catalyst for legal evolution, with profound implications for copyright laws, regulations, and the broader intellectual property landscape. As this transformative technology continues to shape the creative and technological realms, stakeholders must stay abreast of legal developments to navigate potential infringements and safeguard intellectual property rights in the dynamic and evolving landscape of Generative AI.

 

  • The rise of Generative AI has sparked complex legal challenges around copyright law, particularly concerning the use of copyrighted material in training data. Companies are facing scrutiny over their use of "shadow libraries" and web-scraped content to train their AI models without explicit permission from rightsholders.

  • The question of copyright protection for AI-generated content varies significantly across jurisdictions. The US and EU generally require substantial human input, while the UK offers more flexible protection for computer-generated works. The debate continues over whether model developers or operators should be considered authors.

  • Various solutions are emerging to address these copyright challenges, including licensing agreements with rightsholders, stricter regulations on data collection and usage, and transparency requirements for AI companies. These solutions draw parallels with past digital copyright challenges like the Napster era and highlight the need for evolving legislation to balance innovation with creative rights.
