Study Claims OpenAI Trains AI Models on Copyrighted Data

A recent investigation by the AI Disclosures Project reports that OpenAI’s GPT-4o model shows strong recognition of copyrighted O’Reilly Media content, suggesting that paywalled material may have been used in training without authorization. The findings raise concerns about training-data transparency and compliance with intellectual property law.

A new study from the AI Disclosures Project has raised concerns about OpenAI’s training data practices, particularly the use of copyrighted materials. The research indicates that GPT-4o exhibits a “strong recognition” of paywalled O’Reilly Media books, suggesting that it may have been trained on proprietary content without authorization.

The AI Disclosures Project, led by technologist Tim O’Reilly and economist Ilan Strauss, investigates the societal impact of AI commercialization and advocates for transparency in corporate AI disclosures. Their latest working paper draws parallels between AI data practices and financial disclosure standards, arguing that increased transparency is essential for responsible AI development.

Key Findings from the Report

The study tested OpenAI’s models using a dataset of 34 legally obtained O’Reilly Media books, applying the DE-COP membership inference attack method to assess whether the models had prior exposure to the content.

Notable findings include:

GPT-4o exhibits a strong recognition of O’Reilly’s paywalled book content, with an AUROC (area under the receiver operating characteristic curve) score of 82%, whereas GPT-3.5 Turbo scored just above 50%, the level expected from random guessing.
GPT-4o shows stronger recognition of non-public O’Reilly books (82%) than of publicly available content (64%).
GPT-3.5 Turbo shows the opposite pattern, recognizing public O’Reilly content (64%) better than non-public samples (54%).
GPT-4o Mini, a smaller model, did not exhibit recognition of either public or non-public O’Reilly Media content, with an AUROC score of approximately 50%.
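
To make the AUROC figures above concrete: the score can be read as the probability that a randomly chosen passage the model has seen receives a higher recognition score than a randomly chosen passage it has not seen, so 50% corresponds to chance and 100% to perfect separation. The Python sketch below computes it by direct pairwise comparison. The function name `auroc` and the sample scores are illustrative assumptions, not code or data from the study.

```python
def auroc(member_scores, non_member_scores):
    """Fraction of (member, non-member) pairs in which the member scores higher.

    Ties count as half a win, which matches the usual AUROC definition.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical recognition scores (made up for illustration only).
seen = [0.9, 0.8, 0.7, 0.4]    # passages assumed to be in the training data
unseen = [0.6, 0.5, 0.3, 0.2]  # passages assumed not to be in the training data

print(auroc(seen, unseen))  # 0.875
```

In this toy example, 14 of the 16 pairs rank the seen passage higher, giving 0.875. A model whose scores carried no signal about exposure would land near 0.5, which is how the roughly 50% results for GPT-4o Mini should be read.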

The researchers suggest that some training data may have come from sources like LibGen, a database known for hosting pirated books, where all of the tested O’Reilly books were found.

Industry Implications and Regulatory Considerations

The report warns that AI companies using copyrighted materials without compensation could negatively impact professional content creation. If left unchecked, this practice could erode the diversity and quality of online content by undermining revenue streams for publishers and authors.

To address this, the AI Disclosures Project recommends:

Stronger transparency requirements for AI model training datasets.
The implementation of liability measures to ensure companies disclose data sources.
The establishment of formalized markets for training data licensing and creator remuneration.

The EU AI Act could play a pivotal role in enforcing disclosure standards, ensuring that content creators are informed when their work is used for AI model training.

The Future of AI Data Licensing

Despite the concerns raised, a legitimate market for AI training data is emerging. Companies like Defined.ai are already facilitating licensing agreements, enabling AI developers to obtain training data with proper permissions while ensuring privacy compliance.

The study concludes that OpenAI’s GPT-4o model was likely trained on non-public, copyrighted O’Reilly Media content, emphasizing the urgent need for regulatory oversight and ethical AI data practices.