How to archive public documentation of trackers? #36

Open
zner0L opened this issue Aug 31, 2023 · 26 comments


zner0L commented Aug 31, 2023

As we discussed in #3, we need some kind of archiving solution that can be trusted and that is good enough for archiving modern, JS-infested websites with potentially hidden content. We decided to use ArchiveBox, but while researching how to set it up, I stumbled upon https://webrecorder.net/tools, who offer very good archiving tools, which they also host at https://conifer.rhizome.org/; they also have a self-hosted service: https://github.com/webrecorder/browsertrix-cloud

Though I think we could also do this easily by just keeping a folder of WACZ files generated by https://archiveweb.page/ in a cloud somewhere. What do you think?

zner0L added the discussion and documentation labels Aug 31, 2023
@baltpeter (Member)

Interesting idea, I hadn't considered Webrecorder. When I tested it a while back, it did produce very high-fidelity results.

I'm definitely happy if we don't have to host our own stuff. But I do like the simplicity of having a screenshot/PDF. Pretty much anyone can understand/open that.

@baltpeter (Member)

Also… :|

[screenshot]


zner0L commented Aug 31, 2023

The Conifer service isn't much good though, or so it seems. My tests were all pretty bad. Webrecorder is really nice, still.

I do agree that screenshots and PDFs are also very accessible. But Webrecorder might be more accurate, and it seems to produce good results for hidden content, too.


zner0L commented Sep 1, 2023

How would we add links to public archives in our current schema for the reasoning? We can only add one link, and it would make sense to have this be the archived link, but I would like to have the original link there as well…

@baltpeter (Member)

I had two ideas:

  • An object ({ link: string, archived: string[] }), possibly with the script automatically replacing a URL string with that.
  • But then you have to do code mods in the script, which… ugh… So maybe just a CSV/SQLite DB that records date, adapter, URL, archived URLs? (Both options are sketched below.)
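To make both options a bit more concrete, here is a minimal sketch; the type names (`ArchivedReasoning`, `ArchiveDbRow`) are purely illustrative and not part of the existing schema:

```ts
// Option 1: the reasoning becomes an object instead of a plain URL string.
// (Hypothetical shape; the current schema only allows a string here.)
type ArchivedReasoning = {
    link: string; // original documentation URL
    archived: string[]; // one or more archived copies of that URL
};

// Option 2: keep the reasoning field as-is and record archived copies in a
// separate row-based database (CSV/SQLite), one row per archived snapshot.
type ArchiveDbRow = {
    date: string; // when the snapshot was taken, e.g. '2023-08-31'
    adapter: string; // which adapter references the link
    url: string; // original URL as it appears in the reasoning
    archivedUrl: string; // where the snapshot ended up
};
```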


zner0L commented Sep 1, 2023

Yes, an object seems like the obvious way to go, but this would also break the API for the reasoning field. I guess we could also have some kind of linktree-like page that takes the URL and provides a link to both the original and the archived version? Like https://sources.tweasel.org/https%3A%2F%2Fsupport.singular.net%2Fhc%2Fen-us%2Farticles%2F360037581952--UPDATED-Android-SDK-Integration-Guide%2325_Sending_the_User_ID_to_Singular_Optional or something similar?
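For reference, the lookup URL above is just the original documentation URL percent-encoded and appended to the (at this point still hypothetical) sources.tweasel.org host; a minimal sketch:

```ts
// Build a sources.tweasel.org lookup URL from an original documentation URL.
// Host and path scheme are taken from the example above; the function name
// is only illustrative.
const sourcesUrl = (originalUrl: string) => `https://sources.tweasel.org/${encodeURIComponent(originalUrl)}`;

console.log(
    sourcesUrl(
        'https://support.singular.net/hc/en-us/articles/360037581952--UPDATED-Android-SDK-Integration-Guide#25_Sending_the_User_ID_to_Singular_Optional'
    )
);
// -> https://sources.tweasel.org/https%3A%2F%2Fsupport.singular.net%2Fhc%2Fen-us%2F…
```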

@baltpeter (Member)

> I guess we could also have some kind of linktree-like page that takes the URL and provides a link to both the original and the archived version?

Yeah, I guess. That would work well with option 2.

@baltpeter (Member)

> The Conifer service isn't much good though, or so it seems. My tests were all pretty bad. Webrecorder is really nice, still.
>
> I do agree that screenshots and PDFs are also very accessible. But Webrecorder might be more accurate, and it seems to produce good results for hidden content, too.

Hm, if we can't use the hosted service and have to manually make snapshots using a Chrome extension/desktop app, I don't really see the value for us. I mean, high-fidelity archiving is nice and all, but we're not trying to archive Flash games here. If we already have to do manual work anyway, wouldn't a screenshot and/or PDF be much better for our use case?

@baltpeter (Member)

I've been using a local ArchiveBox instance to archive all interesting tracker documentation pages I've encountered. While doing that, I've discovered yet another annoying quirk. I've been using the CLI to add new pages to archive (reasoning).

I'm currently at 27 archived pages. A lot of them have extractors that failed (the archive.org extractor almost always fails with `Failed to find "content-location" URL header in Archive.org response.`). When I want to archive new URLs, it first goes through all previously failed snapshots and tries them again. Especially the archive.org extractor takes quite a long time. If you archive a PDF, it will also try all the extractors that will never work (screenshot, PDF, etc.) and each one will fail with `Extractor timed out after 600s.` every. single. time. And there is no parallelism, so this takes forever.

I added a few new URLs almost 45 minutes ago and it hasn't even started archiving those; it's still stuck (unsuccessfully) retrying old snapshots. And of course, this is only going to get (a lot) worse the more pages we archive.

At the same time, snapshots do also sometimes fail only temporarily, so it is important that they are retried. I'm just really not a fan of the architecture here. There should be a background job that retries failed snapshots automatically, without interfering with new snapshots (and with an exponential backoff, of course).


zner0L commented Sep 12, 2023

> Yeah, I guess. That would work well with option 2.

If that works for you, I'd start building something like this and maybe also build a plugin for ArchiveBox to automate sending the links there? I am currently thinking of a very simple server that just does an exact text-match lookup in the database and serves the links it finds. Links could be added using a REST API / a simple JS CLI.
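A minimal sketch of such a lookup server, assuming the CSV database suggested above with (date, adapter, URL, archived URL) columns and no handling of quoted commas; everything here (file name, port, markup) is illustrative, nothing like it exists yet:

```ts
import { createServer } from 'http';
import { readFileSync } from 'fs';

// Load the (assumed) archive.csv with columns date,adapter,url,archivedUrl.
const rows = readFileSync('archive.csv', 'utf-8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => line.split(','));

// Serve /<percent-encoded original URL> and list the original plus all
// archived copies found via an exact text match on the URL column.
createServer((req, res) => {
    const original = decodeURIComponent((req.url ?? '/').slice(1));
    const archived = rows.filter(([, , url]) => url === original).map(([, , , archivedUrl]) => archivedUrl);

    res.setHeader('Content-Type', 'text/html');
    if (archived.length === 0) {
        res.statusCode = 404;
        res.end('No archived copies recorded for this URL.');
        return;
    }
    res.end(
        `<p>Original: <a href="${original}">${original}</a></p>` +
            archived.map((a) => `<p>Archived: <a href="${a}">${a}</a></p>`).join('')
    );
}).listen(8080);
```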

@baltpeter (Member)

> also build a plugin for ArchiveBox

I'm really not convinced that AB is the way to go (see my comment above). It just has so many fundamental problems.

> I am currently thinking of a very simple server that just does an exact text-match lookup in the database and serves the links it finds.

Are you sure we want a server for that? I'd definitely want the archived links to be part of the TrackHAR repo, so there's no need for a PUT API. And if we're only GETing, you could just 'compile' the list to `_redirects`.
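A sketch of that 'compile to `_redirects`' idea, assuming the CSV columns from above and a Netlify/Cloudflare Pages-style `_redirects` format (one `/<source> <target> <status>` rule per line); the file names are only illustrative:

```ts
import { readFileSync, writeFileSync } from 'fs';

// Turn archive.csv (assumed columns: date,adapter,url,archivedUrl, no header,
// no quoted commas) into a static `_redirects` file that maps the
// percent-encoded original URL to its archived copy.
const rows = readFileSync('archive.csv', 'utf-8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => line.split(','));

const redirects = rows.map(([, , url, archivedUrl]) => `/${encodeURIComponent(url)} ${archivedUrl} 302`).join('\n');

writeFileSync('_redirects', redirects + '\n');
```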


zner0L commented Sep 12, 2023

> I'd definitely want the archived links to be part of the TrackHAR repo

And how would you save those then? As a big JSON file?


zner0L commented Sep 12, 2023

> I'm really not convinced that AB is the way to go (see my comment above). It just has so many fundamental problems.

True, but what is our alternative? We have more or less ruled out every archiving tool by now. So it seems like the only option would be to write our own toolchain :(

@baltpeter (Member)

> And how would you save those then? As a big JSON file?

I suggested CSV or SQLite above; those seem better suited for row-based entries. And how big do you expect that to get? So far, we typically have between 1 and at most 5 links per tracker.

> True, but what is our alternative? We have more or less ruled out every archiving tool by now. So it seems like the only option would be to write our own toolchain :(

I don't know. :(


zner0L commented Sep 12, 2023

Ok, so maybe let’s collect a list of requirements for our archiving toolchain, then:

  • Push links to public archives
  • Take screenshots
  • Create snapshots that are not just Cloudflare pages
  • Collaboration, but restricted to specific users
  • Bonus: push archived links into our link database

Anything to add, @baltpeter?


I found that Zotero is actually quite a usable solution if we add https://robustlinks.mementoweb.org/zotero/ as a plugin. However, it doesn't take screenshots (and I didn't find a plugin for that), and the free group sync storage is only 300 MB.

Otherwise, I think the only remaining option would be to chain https://github.com/oduwsdl/archivenow, pywb and some kind of screenshot script.


baltpeter commented Sep 12, 2023

I think the requirements sound about right.

But do we actually need an HTML capture like Zotero produces? I'd say a screenshot/PDF that actually contains the content we're interested in is sufficient. And if that's the case, I don't really see the value of using Zotero anymore.

Maybe we need to yet again scale back our expectations. How about:

  • We have a "daemon" that continuously checks all referenced links in the adapters and, for everything that isn't in the archive database yet, tries to use the Save Page Now API and adds it to the DB (a minimal sketch follows after this list).
  • We manually take screenshots and PDFs, and upload them to a Nextcloud folder or whatever. Yes, that's annoying and I initially said I really don't want to do this, but honestly, given the alternatives, that almost sounds like the best solution. And now that I've actually done a significant chunk of the adapter work, I know that we're maybe talking about 50 links in total (or let's say 100 to give a definite upper limit). Hardly sounds worth putting all that effort into automation, does it?
  • The linting script warns you on commit if some of the links have not been archived. You have to do those manually, likely using archive.today.
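A sketch of the first bullet's archiving step, using the simple GET form of the Wayback Machine's Save Page Now endpoint (there is also an authenticated SPN2 API with more options); the function name is only illustrative, and the database bookkeeping around it is left out:

```ts
// Ask the Wayback Machine to capture a URL via https://web.archive.org/save/<url>
// and return the snapshot URL, taken from the Content-Location header when the
// endpoint provides one. Note that this endpoint is rate-limited, so a real
// daemon would need to throttle its requests and handle retries.
const savePageNow = async (url: string) => {
    const res = await fetch(`https://web.archive.org/save/${url}`);
    if (!res.ok) throw new Error(`Save Page Now failed with status ${res.status} for ${url}`);

    const snapshotPath = res.headers.get('content-location');
    return snapshotPath ? `https://web.archive.org${snapshotPath}` : res.url;
};
```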


baltpeter commented Sep 12, 2023

Every automated solution for the second step is inevitably going to run into the same problems, and solving them ourselves isn't worth it given the relatively small amount of manual work it would replace, imo.

zner0L self-assigned this Sep 12, 2023

zner0L commented Sep 12, 2023

Do we want these linktree pages on a separate subdomain, or can they just live at docs.tweasel.org/sources/<link>?

@baltpeter (Member)

> or can they just live at docs.tweasel.org/sources/<link>?

Sounds good to me.


zner0L commented Sep 12, 2023

Ok, and would you rather keep only the original link in the reasoning field and have a file watcher try to archive it if it isn't in the database already, without touching the link in the reasoning? Or do you think we should rewrite the reasoning to docs.tweasel.org/sources/<link> in the adapter? Or just in the generated docs?

@baltpeter (Member)

> Ok, and would you rather keep only the original link in the reasoning field and have a file watcher try to archive it if it isn't in the database already, without touching the link in the reasoning? Or do you think we should rewrite the reasoning to docs.tweasel.org/sources/<link> in the adapter?

Well, the latter option would still require a file watcher, wouldn't it?

But I prefer the former either way. Seems redundant and unnecessary to always add the prefix. I'm not even sure whether I would link to that by default on the site (usually, it's an unnecessary extra click).


zner0L commented Sep 13, 2023

What kind of behavior do we want on errors for the archiving? I am a little worried that we will send too many requests to the IA if we re-check erroneous captures on every save. I'd suggest writing the errors to the database and checking for them in the pre-commit hook, except maybe for retryable errors like a timeout. I decided to use a CSV because I like the simplicity, so I don't want to implement some kind of capture tracker that counts how often we've tried to capture a particular page.

@baltpeter (Member)

Why not just add a "last tried" column to the CSV for errors and then do exponential backoff? Ok wait, two columns, you'd also need #tries for that.
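A minimal sketch of that backoff check, assuming the two extra columns are stored as an ISO timestamp and a retry count (the column and function names are only illustrative):

```ts
// Decide whether a previously failed capture is due for another attempt,
// doubling the wait after each failure: 1 h, 2 h, 4 h, … capped at one week.
const shouldRetry = (lastTried: string, tries: number, now = new Date()) => {
    const baseDelayMs = 60 * 60 * 1000; // 1 hour
    const maxDelayMs = 7 * 24 * 60 * 60 * 1000; // 1 week
    const delayMs = Math.min(baseDelayMs * 2 ** (tries - 1), maxDelayMs);
    return now.getTime() - new Date(lastTried).getTime() >= delayMs;
};

// Example: after three failed tries at noon, the next attempt is due at 16:00.
console.log(shouldRetry('2023-09-13T12:00:00Z', 3, new Date('2023-09-13T15:00:00Z'))); // false
```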

@baltpeter (Member)

Actually, I think it would be even better if we only stored that information locally (in a gitignored file, not in the CSV), since we expect such errors to be resolved manually before merging to main anyway.

@baltpeter (Member)

I'm not quite sure anymore what your status is on working on this, so you may have these on your TODO list already, but two things I noticed while reviewing your latest PRs:

  • How do we want to deal with links in our own research docs? Currently, these aren't handled at all, right? Iirc I did manually add an archived version of every link in parentheses when writing them. I guess that's enough for now (since we have already spent way too much time on this issue… :|). But it should definitely be documented in the README, then.
  • We said we wanted to manually grab a screenshot/PDF of each archived page. Did you start with that yet?
