
mark nottingham

Considerations for AI Opt-Out

Sunday, 21 April 2024

Tech Regulation

Creating a Large Language Model (LLM) requires a lot of content – as implied by the name, LLMs need voluminous input data to be able to function well. Much of that content comes from the Internet, and early models have been seeded by crawling the whole Web.

This now widespread practice of ingestion without consent is contentious, to put it mildly. Content creators feel that they should be compensated or at least have a choice about how their content is used; AI advocates caution that without easy access to input data, their ability to innovate will be severely limited, thereby curtailing the promised benefits of AI.

The Policy Context

In the US, the Copyright Office has launched an initiative to examine this and other issues surrounding copyright and AI. So far, they have avoided addressing the ingestion issue, but it has nevertheless come up repeatedly in their public proceedings:

“The interests of those using copyrighted materials for AI ingestion purposes must not be prioritized over the rights and interests of creators and copyright owners.” – Keith Kupferschmid, Copyright Alliance

“Training of AI language models begins with copying, which we believe has infringed our copyrights and has already deprived us of hundreds of millions of dollars in rightful revenues.  The additional violation of our moral right of attribution makes it impossible to tell which of our works have been copied to train AI and thus frustrates redress for either the economic infringement or the violation of our moral right to object to use of our work to train AI to generate prejudicial content. […] OpenAI, for example, has received a billion dollars in venture capital, none of which has been passed on to the authors of the training corpus even though, without that training corpus, chatGPT would be worthless.” – Edward Hasbrouck, National Writers Union

It’s uncertain when (or if) the Copyright Office will provide more clarity on this issue. Also relevant in the US are the outcomes of cases like Getty Images (US), Inc. v. Stability AI, Inc.

However, Europe has been more definitive about the ingestion issue. Directive 2019/790 says:

The [exception for copyright] shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.1

This is reinforced by the recently adopted AI Act:

Any use of copyright protected content requires the authorisation of the rightsholder concerned unless relevant copyright exceptions and limitations apply. Directive (EU) 2019/790 introduced exceptions and limitations allowing reproductions and extractions of works or other subject matter, for the purpose of text and data mining, under certain conditions. Under these rules, rightsholders may choose to reserve their rights over their works or other subject matter to prevent text and data mining, unless this is done for the purposes of scientific research. Where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightsholders if they want to carry out text and data mining over such works.

In other words, European law is about to require commercial AI crawlers to support an opt-out. However, it does not specify a particular mechanism: it only says that rights must be ‘expressly reserved in an appropriate manner.’

So, what might that opt-out signal look like?

Robots.txt as an Opt-Out

Since most of the publicly available content on the Internet is accessed over the Web, it makes sense to consider how an opt-out might be expressed there as a primary mechanism. The Web already has a way for sites to opt out of automated crawling: the robots.txt file, now specified by an IETF Standards-Track RFC.

At first glance, robots.txt intuitively maps to what’s required: a way to instruct automated crawlers on how to treat a site with some amount of granularity, including opting out of crawling altogether. Some LLM operators have already latched onto it; for example, OpenAI allows their crawler to be controlled by it.

There are a lot of similarities between gathering Web content for search and gathering it for an LLM: the actual crawler software is very similar (if not identical), crawling the whole Web requires significant resources, and both uses create enormous potential value not only for the operators of the crawlers, but also for society.

However, it is questionable whether merely reusing robots.txt as the opt-out mechanism is sufficient to allow rightsholders to fully express their reservation. Despite the similarities listed above, it is hard to ignore the ways that LLM ingestion is different.

That’s because Web search can be seen as a service to sites: it makes them more discoverable on the Web, and is thus symbiotic – both parties benefit. LLM crawling, on the other hand, offers no benefit to the content owner, and may be perceived as harming them.

Through the lenses of those different purposes and their associated power dynamics, a few issues become apparent.

1. Usability and Ecosystem Impact

Robots.txt allows sites to target directives to bots in two different ways: by path on the site (e.g., /images vs. /users) and by User-Agent. The User-Agent identifies the bot, allowing sites to specify things like “I allow Google to crawl my site, but not Bing.” Or, “I don’t allow any bots.”
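For illustration, a robots.txt expressing that first policy might look like this (Googlebot and Bingbot are those crawlers’ published User-Agent tokens; a crawler follows the most specific group that matches it):

User-Agent: Googlebot
Allow: /

User-Agent: Bingbot
Disallow: /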

That might be adequate for controlling how your site appears in search engines, but problematic when applied to AI. Let’s look at an example.

To stop OpenAI from crawling your site, you can add:

User-Agent: GPTBot
Disallow: /

However, that directive doesn’t apply to Google, Mistral, or any other LLM-in-waiting out there; you’d have to target each individual one (and some folks are already advising on how to do that).
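In practice, that means maintaining a list along these lines – a non-exhaustive sketch, using the tokens those crawlers have published:

User-Agent: GPTBot
Disallow: /

User-Agent: Google-Extended
Disallow: /

User-Agent: CCBot
Disallow: /

…and updating it every time a new crawler appears.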

If you miss one, that’s your fault – and your content will be in that model forever – so careful (or just frustrated) people might decide to just ban everything:

User-Agent: *
Disallow: /

But that has the downside of disallowing both AI and search crawlers – even though presence in search engines is often critical to sites. To avoid that, you would have to enumerate all of the search engines and other bots that you want to allow, creating more work.
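Such an allow-list might look something like this, again relying on crawlers following the most specific matching group:

User-Agent: Googlebot
Allow: /

User-Agent: Bingbot
Allow: /

User-Agent: *
Disallow: /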

Significantly, doing so could also have a negative effect on the Web ecosystem: if sites have a stronger incentive to disallow unknown bots thanks to AI, it would be much harder to responsibly introduce new crawler-based services to the Web. That would tilt the table even further in favour of already-established ‘big tech’ actors.

There are two easy ways to fix these issues. One would be to define a special User-Agent that applies to all AI crawlers. For example:

User-Agent: AI-Ingest
Disallow: /

The other approach would be to create a new well-known location just for AI – for example /.well-known/ai.txt. That file might have the same syntax as robots.txt, or its notoriously quirky syntax could be ditched for something more modern.
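Purely as a sketch, a /.well-known/ai.txt that kept the familiar syntax might look like this (the file name, and the idea that it applies only to AI crawlers, are hypothetical – nothing here has been standardised):

User-Agent: *
Disallow: /
Allow: /press/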

Either solution above would make it easy for a site to opt out of AI crawling of any sort without enumerating all of the potential AI crawlers in the world, and without impacting their search engine coverage or creating ecosystem risk.

I suspect that many have been assuming that one of these things will happen; they’re fairly obvious evolutions of existing practice. However, at least two more issues are still unaddressed.

2. Previously Crawled Content

Web search and LLMs also differ in how they relate to time.

A search engine crawler has a strong interest in assuring that its index reflects the current Web. LLM crawlers, on the other hand, are ravenous for content without regard to its age or its current availability on the Web. Once ingested, content forms part of a model and adds value to it for the lifetime of its use – and a model often persists for months or even years after the content was obtained. Furthermore, that content might be reused to create future models, indefinitely.

That means that a content owner who isn’t aware of the LLM crawler at crawl time doesn’t have any recourse. From the Copyright Office sessions:

We believe that writers should be compensated also for past training since it appears that the massive training that has already occurred for GPT and Bard to teach the engines to think and to write has already occurred[.] – Mary Rasenberger, The Authors Guild

This shortcoming could be addressed by a relatively simple measure: stating that the policy for a given URL applies to any use of content obtained from that URL at model creation time, regardless of when it was obtained.

A significant amount of detail would need to be specified to make this work, of course. It would also likely necessitate some sort of grandfathering or transition period for existing models.
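To make the mechanics concrete, here’s a minimal Python sketch of what a model-creation pipeline might do: re-fetch each origin’s current policy and filter previously crawled content against it. It assumes robots.txt semantics and the hypothetical AI-Ingest token from above:

import urllib.robotparser
from urllib.parse import urlsplit

def filter_corpus(docs):
    # Drop documents whose origin currently reserves its rights,
    # regardless of when each document was originally crawled.
    parsers = {}
    for url, text in docs:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in parsers:
            rp = urllib.robotparser.RobotFileParser(origin + "/robots.txt")
            rp.read()  # fetch the policy as it stands at model-creation time
            parsers[origin] = rp
        if parsers[origin].can_fetch("AI-Ingest", url):
            yield url, text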

Needless to say, the impact of this kind of change could be massive: if 90% of the sites in the world opt out in this fashion (à la App Tracking Transparency), it would be difficult to legally construct a new model (or at least market or use such a model in Europe, under the forthcoming rules).

On the other hand, if that many people don’t want to allow LLMs to use their content when offered a genuine chance to control it, shouldn’t their rights be honoured? Ultimately, if that’s the outcome, society will need to go back to the drawing board and figure out what it values more: copyright interests or the development of LLMs.

3. Control of Metadata

Another issue with reusing robots.txt is how that file itself is controlled. As a site-wide metadata mechanism, there is only one controller for robots.txt: the site administrator.

That means that on Facebook, Meta will decide whether your photos can be used to feed AI (theirs or others’), not you. On GitHub, Microsoft will decide how your repositories will be treated. And so on.

While robots.txt is great for single-owner sites (like this one), it doesn’t meet the needs of a concentrated Web: it hands a small number of platform owners the power to decide policy for all of their users.

Avoiding that outcome means that users need to be able to express their preference in the content itself, so that it persists no matter where it ends up. That means it’s necessary to be able to embed policy in things like images, videos, audio files, document formats like PDF, Office, and ePub, containers like ZIP files, file system paths for things like git repos, and so on. Assuming that a robots.txt-like approach is also defined, their relative precedence will also need to be specified.

Luckily, this is not a new requirement – our industry has considerable experience in embedding such metadata into file formats, for use cases like content provenance. It just needs to be specified for AI control.
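As one example of what that could look like, here’s a minimal Python sketch that embeds a reservation in a PNG using Pillow; the ai-policy key and its train=n value are illustrative assumptions, not an established standard:

from PIL import Image, PngImagePlugin

# Embed a (hypothetical) AI-ingestion reservation in the file itself,
# so the preference travels with the image wherever it is re-hosted.
img = Image.open("photo.png")
info = PngImagePlugin.PngInfo()
info.add_text("ai-policy", "train=n")  # assumed key and value
img.save("photo-tagged.png", pnginfo=info)

# A crawler could then check for the reservation before ingesting:
print(Image.open("photo-tagged.png").text.get("ai-policy"))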

What’s Next?

Policy decisions like the one Europe has just made might drive changes in LLM ingest practices, but I hope I’ve shown that the technical details of that ‘appropriate manner’ of opting out can significantly shift power between AI companies and content owners.

Notably, while the worldwide copyright regime is explicitly opt-in (i.e., you have to explicitly offer a license for someone to legally use your material, unless fair use applies), the European legislation changes this to opt-out for AI.2 Given that, offering content owners a genuine opportunity to opt out is important, in my opinion.

I’ve touched on a few aspects that influence that opportunity above; I’m sure there are more.3 As I implied at the start, getting the balance right is going to take careful consideration and, perhaps most importantly, sunlight.

However, it’s not yet clear where or how this work will happen. Notably, the standardisation request to the European Standardisation Organisations in support of safe and trustworthy artificial intelligence does not mention copyright at all. Personally, I think that’s a good thing – worldwide standards need to be in open international standards bodies like the IETF, not regionally fragmented.

In that spirit, the IETF has recently created a mailing list to discuss AI control. That’s likely the best place to follow up if you’re interested in discussing these topics.

  1. See also Recital 18

  2. And I suspect other jurisdictions might follow the same approach; time will tell. 

  3. For example, some of the input to the Copyright Office mentioned group licensing regimes. An opt-out mechanism could be adapted to support that.