OpenAI Media Manager will allow creators to block AI training


OpenAI has made a flurry of new updates today alone, but the biggest may be a new tool it is developing called “Media Manager,” due out in 2025, which will let creators choose which of their works, if any, can be scraped and used to train the company’s AI models.

Announced in a blog post on the OpenAI website, the tool is described as follows:

OpenAI is developing Media Manager, a tool that will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning research and training. Over time, we plan to introduce additional choices and features.

This will require cutting-edge machine learning research to build a first-ever tool of its kind to help us identify copyrighted text, images, audio, and video across multiple sources and reflect creator preferences.


We’re collaborating with creators, content owners, and regulators as we develop Media Manager. Our goal is to have the tool in place by 2025, and we hope it will set a standard across the AI industry.

No price has yet been listed for the tool, and I’m guessing it will be offered for free since OpenAI is using it to position itself as an ethical actor.

The tool seeks to offer creators protections against AI data scraping that go beyond adding a directive to the robots.txt file on their websites, a measure that OpenAI introduced back in August 2023 and that is shown below.
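Per OpenAI’s documentation for its GPTBot crawler, the opt-out is a two-line robots.txt entry; this version blocks the entire site:

User-agent: GPTBot
Disallow: /

Standard robots.txt syntax also allows narrower rules, so a creator who controls their site could, for example, write Disallow: /portfolio/ (a hypothetical path) to wall off only one directory while leaving the rest crawlable.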

After all, many creators post work on sites they don’t own or control — platforms such as DeviantArt or Patreon — where they would not be able to edit the robots.txt file. In addition, some creators may wish to exempt only certain works, not everything they post, from AI data scraping and training, so OpenAI’s proposed Media Manager would allow for that more granular control.

In addition, OpenAI notes that creators’ work can readily be screenshotted, saved, reshared, and otherwise reposted or redistributed across the web on domains that don’t carry the opt-out directive.

“We understand these are incomplete solutions, as many creators do not control websites where their content may appear, and content is often quoted, reviewed, remixed, reposted and used as inspiration across multiple domains. We need an efficient, scalable solution for content owners to express their preferences about the use of their content in AI systems.”

A response to strong and persistent criticism of AI data scraping

The moves come amid an ongoing wave of visual artists and creators objecting to AI model makers such as OpenAI and its rivals Anthropic, Meta, Cohere and others scraping the web for data to train on without their express permission, consent, or compensation.

Several creators have filed class-action lawsuits against OpenAI and other AI companies, alleging that this data scraping violates copyright in their images and other works.

OpenAI’s defense is that web crawling and scraping have been accepted, standard practices across the web for decades now, and it alludes to this argument again in today’s blog post, writing: “Decades ago, the robots.txt standard was introduced and voluntarily adopted by the Internet ecosystem for web publishers to indicate what portions of websites web crawlers could access.”

Indeed, many artists tacitly accepted the scraping of their data for indexing in search engines such as Google, yet object to generative AI training on it, because it competes more directly with their own work product and livelihoods.

OpenAI offers indemnification — guarantees of legal assistance and defense — for subscribers to its paid plans accused of copyright infringement, a bid to reassure its growing list of lucrative enterprise customers.

The courts have yet to rule decisively on whether AI companies and others can scrape copyrighted creative works without the express consent or permission of the creators. But clearly, regardless of how the question is settled legally, OpenAI wants to position itself as a cooperative and ethical entity with regard to creators and its data sources.

That said, creators are likely to view this move as “too little, too late,” since many of their works have presumably already been scraped and used to train AI models, and OpenAI has not suggested it could or would remove the portions of its models trained on such works.

In its blog post, OpenAI makes the argument that it does not preserve copies of scraped data wholesale, only “an equation that best describes the relationship among the words and the underlying process that produced them.”

As the company writes:

We design our AI models to be learning machines, not databases

Our models are designed to help us generate new content and ideas – not to repeat or “regurgitate” content. AI models can state facts, which are in the public domain. If on rare occasions a model inadvertently repeats expressive content, it is a failure of the machine learning process. This failure is more likely to occur with content that appears frequently in training datasets, such as content that appears on many different public websites due to being frequently quoted. We employ state-of-the-art techniques throughout training and at output, for our API or ChatGPT, to prevent repetition, and we’re continually making improvements with on-going research and development.
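OpenAI does not name those techniques. One widely published way to reduce this kind of memorization, offered here purely as an illustration rather than a description of OpenAI’s actual pipeline, is deduplicating the training corpus so that no passage appears many times. A minimal Python sketch of exact-match deduplication by content hash:

import hashlib

def dedupe(documents):
    # Hash each document after normalizing whitespace and case,
    # so trivially reformatted copies collapse to a single entry.
    seen = set()
    unique = []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# A frequently quoted passage appears twice with different formatting;
# only one copy survives, lowering the odds a model memorizes it verbatim.
docs = ["The quick brown fox.", "the  QUICK brown fox.", "An original essay."]
print(dedupe(docs))

Production pipelines typically rely on fuzzy matching (MinHash and similar techniques) rather than exact hashes, since near-duplicates are far more common on the web than byte-identical copies.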

At the very least, the Media Manager tool may be a more efficient and user-friendly way to block AI training than existing options such as Glaze and Nightshade, though since it comes from OpenAI, it is not yet clear whether creators will trust it, nor whether it will be able to block training by rival companies’ models.