
Implement the W3C TDM Reservation Protocol and enable a more standard opt-out mechanism #308

Open
llemeurfr opened this issue May 8, 2023 · 7 comments

@llemeurfr

The solution chosen by the author after a heated discussion in #293 was to support an opt-out expressed in HTTP headers, via the well-known values "noindex" and "noimageindex" plus the ad-hoc values "noai" and "noimageai".

This is already a good move: in Europe, any crawler associated with TDM and AI technologies MUST support opt-out, as stipulated by the European DSM Directive. You'll find more information about that legal requirement here. Because this software gathers images for AI training, it should not integrate into its dataset images whose authors have decided on an opt-out.

But "noai" and "noimageia" are not well known tokens (even if you're not alone trying them), there is nothing standard in them so far. And robots.txt is not only about http headers. Directives can be in a file stored at the root of the web site (and as html meta, but this is not interesting here). Therefore your move does not really help the community establishing trusted relationships between AI solutions and content providers (which is a requirement if you want content providers to see AI actors as partners, not enemies).

For this reason, a W3C Community Group made up of content providers and TDM actors decided two years ago to create an open specification, and released it as TDMRep (the TDM Reservation Protocol). The home page of the group is there; it has 42 participants.

For those wondering, this specification also covers AI solutions. And this group didn't use robots.txt for clear reasons.

Adding support for a new HTTP header property, called "tdm-reservation", and filtering images out when its value is 1 (a number), is a no-brainer. Adding support for a JSON file named tdmrep.json, hosted in the /.well-known directory of the web server on which the image is stored, is a bit more complex, but still easy in Python (it is analogous to processing a robots.txt file); and it is mandatory even if less performant.
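
For concreteness, here is a minimal sketch of what the two checks could look like. The function names are illustrative, not img2dataset's actual API, and the prefix match on location is a simplification of the draft spec, which as I read it stores an array of rules in tdmrep.json:

```python
# Illustrative sketch only; not img2dataset's actual code.
import json
from urllib.parse import urlparse

import requests


def tdm_reserved_by_header(headers) -> bool:
    # Per TDMRep, a "tdm-reservation: 1" response header reserves TDM rights.
    return headers.get("tdm-reservation", "").strip() == "1"


def tdm_reserved_by_file(image_url: str) -> bool:
    # tdmrep.json holds rules of the shape
    # {"location": "/photos/", "tdm-reservation": 1}; we match by path
    # prefix, which simplifies the spec's location semantics.
    parts = urlparse(image_url)
    well_known = f"{parts.scheme}://{parts.netloc}/.well-known/tdmrep.json"
    try:
        response = requests.get(well_known, timeout=5)
        if response.status_code != 200:
            return False  # no file: nothing reserved via this mechanism
        rules = response.json()
    except (requests.RequestException, ValueError):
        return False
    return any(
        parts.path.startswith(rule.get("location", "/"))
        and rule.get("tdm-reservation") == 1
        for rule in rules
    )
```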

@maathieu commented May 8, 2023

Hi @llemeurfr, the solution proposed in your documents is not an established standard; I would not suggest adopting it before it becomes official, as that would require a lot of time and expense from the developer of this repository, whereas robots.txt is an established practice that already covers the scenario of limiting scraping to authorized sections of a web site.

@llemeurfr (Author)

Hi @maathieu,
To what extent is X-Robots-Tag: noai more standard than tdm-reservation: 1? In both cases, a web server must be tuned to generate the property.

The evolution of is_disallowed() is far from complex. Would it help if contributors proposed a PR?
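
As a rough sketch of what that evolution might look like (the signature below is hypothetical, not img2dataset's actual function, and it omits the user-agent scoping a real implementation would need):

```python
# Hypothetical shape of the check; not img2dataset's actual function.
def is_disallowed(headers, disallowed=("noai", "noimageai", "noindex", "noimageindex")):
    # Existing behavior: honor opt-out tokens found in X-Robots-Tag.
    directives = headers.get("X-Robots-Tag", "").lower().split(",")
    if any(directive.strip() in disallowed for directive in directives):
        return True
    # Proposed addition: honor the TDMRep reservation header as well.
    return headers.get("tdm-reservation", "").strip() == "1"
```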

The main complexity will in any case be to handle a robots.txt or tdmrep.json file.

Note: the W3C Community Group is waiting for more feedback from implementers before going through the W3C Recommendation route.

@rom1504 (Owner) commented May 8, 2023

Adding a new header to the set of disallowed ones seems fine indeed.
Note that img2dataset did not introduce the noai and noimageai tags; they were suggested by DeviantArt, see #218.
It would be great to see a set of opt-out headers standardized instead of each website creating their own.

As for tdmrep.json, I am wondering if you considered specifying in the response headers that this file exists? If that were the case, it would be possible to check the file only for the (initially small) minority of websites providing it.

@llemeurfr (Author)

> It would be great to see a set of opt-out headers standardized instead of each website creating their own.

YES, this is why we would all like to have a worldwide standard for TDM and AI opt-out (and because a raw opt-out is not great for the future of AI, we're trying to allow deals to be made).

> As for tdmrep.json, I am wondering if you considered specifying in the response headers that this file exists?

No, we didn't consider that. A simple HEAD request on tdmrep.json that responds with a 404 is not a huge performance cost. PS: you will have to parse robots.txt anyway if you want to check its rules, which takes much more time.
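
For illustration, such a probe is a few lines with requests (a sketch; the helper name is made up):

```python
import requests


def has_tdmrep_file(origin: str) -> bool:
    # Cheap existence probe: one HEAD request per host, cacheable per crawl.
    try:
        response = requests.head(
            f"{origin}/.well-known/tdmrep.json", timeout=5, allow_redirects=True
        )
    except requests.RequestException:
        return False
    return response.status_code == 200
```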

@DavidJCobb commented May 17, 2023

> To what extent is X-Robots-Tag: noai more standard than tdm-reservation: 1?

Neither of them is "more standard." TDMRep is a draft specification; it (by necessity) explicitly states that it is not a W3C standard, and it has only one listed editor: you. I counted the participants on the mailing list by hand and saw 20 unique names, which isn't even enough for the "There are dozens of us! Dozens!" meme. It could be a very good idea for a standard; it could be very well designed; but right now it isn't standard, isn't widely implemented, and wouldn't address any of the concerns people have with this repo.

Using your not-a-standard wouldn't make the maintainer of this project any less inconsiderate or any less of a disingenuous clown.

@maathieu commented May 25, 2023

@DavidJCobb, good point; that was a nice strawman argument from @llemeurfr. Robots.txt does not require any server tuning: just place the file in the root directory. The scraper downloads the file once, then compares every URL it wishes to scrape against the rules in robots.txt. If there is no match, it can download; if there is a match, no download. There is no need to invent something new and convoluted to replace this established practice. A scraper is no different from a search-engine spider. Google has been doing "AI" for decades, and it has respected robots.txt on webmasters' websites. Please be good netizens. It's not because AI is the shiny new thing that all established practices must be abandoned.

Also, kindly remember that servers are physical resources with associated costs. Whatever the purpose of the scraping, you are doing it out of the goodwill of website owners and server administrators. Do it responsibly.
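
For reference, the flow described above maps directly onto Python's standard-library urllib.robotparser (the user-agent token and URLs here are only examples):

```python
from urllib.robotparser import RobotFileParser

# Fetched once per host, then reused for every candidate URL.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

if parser.can_fetch("img2dataset", "https://example.com/images/cat.jpg"):
    print("allowed to download")
else:
    print("disallowed by robots.txt")
```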

Edit: there is apparently work in progress to include robots.txt support in the scraper: #302. Looking forward :-)

@Padge91 commented Jul 11, 2023

We (Spawning) are maintaining datadiligence, a package responsible for filtering opt-outs when using img2dataset and similar tools. datadiligence currently supports the existing noai HTTP headers, the proposed TDMRep HTTP headers, and the Spawning API, with additional methods planned. While the package doesn't directly support the tdmrep.json or ai.txt files just yet, the Spawning API does.

> ...I would not suggest adopting it before it becomes official, as that would require a lot of time and expense from the developer of this repository

We made #312 to replace the opt-out logic in img2dataset with calls to the datadiligence package to help keep the maintainers of this repository focused. Accepting the changes in the PR would resolve this issue.
