How to archive public documentation of trackers? #36

Open
zner0L opened this issue Aug 31, 2023 · 26 comments


zner0L commented Aug 31, 2023

As we discussed in #3, we need some kind of archiving solution that can be trusted and that is good enough for archiving modern, JS-infested websites with potentially hidden content. We decided to use ArchiveBox, but while researching how to set it up, I stumbled upon https://webrecorder.net/tools, who offer very good archiving tools, which they also host at https://conifer.rhizome.org/; they also have a self-hosted service: https://github.com/webrecorder/browsertrix-cloud

Though I think we could also do this easily by just keeping a folder of WACZ files generated by https://archiveweb.page/ in a cloud somewhere. What do you think?

zner0L added the discussion and documentation labels Aug 31, 2023
@baltpeter (Member)

Interesting idea, I hadn't considered Webrecorder. When I tested it a while back, it did produce very high-fidelity results.

I'm definitely happy if we don't have to host our own stuff. But I do like the simplicity of having a screenshot/PDF. Pretty much anyone can understand/open that.

@baltpeter (Member)

Also… :|

[screenshot]


zner0L commented Aug 31, 2023

The Conifer service isn't much good though, or so it seems. My tests were all pretty bad. Webrecorder is really nice, still.

I do agree that screenshots and PDFs are also very accessible. But Webrecorder might be more accurate, and it seems to produce good results for hidden content, too.


zner0L commented Sep 1, 2023

How would we add links to public archives in our current schema for the reasoning? We can only add one link, and it would make sense to have this be the archived link, but I would like to have the original link there as well…

@baltpeter (Member)

I had two ideas:

  • An object ({ link: string, archived: string[] }), possibly with the script automatically replacing a URL string with that.
  • But then you have to do code mods in the script, which… ugh… So maybe just a CSV/SQLite DB that records date, adapter, URL, archived URLs? (Both options are sketched below.)
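To make both options a bit more concrete, here is a minimal sketch; the type names (`ArchivedReasoning`, `ArchiveDbRow`) are purely illustrative and not part of the existing schema:

```ts
// Option 1: the reasoning becomes an object instead of a plain URL string.
// (Hypothetical shape; the current schema only allows a string here.)
type ArchivedReasoning = {
    link: string; // original documentation URL
    archived: string[]; // one or more archived copies of that URL
};

// Option 2: keep the reasoning field as-is and record archived copies in a
// separate row-based database (CSV/SQLite), one row per archived snapshot.
type ArchiveDbRow = {
    date: string; // when the snapshot was taken, e.g. '2023-08-31'
    adapter: string; // which adapter references the link
    url: string; // original URL as it appears in the reasoning
    archivedUrl: string; // where the snapshot ended up
};
```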


zner0L commented Sep 1, 2023

Yes, an object seems like the obvious way to go, but this would also break the API for the reasoning field. I guess we could also have some kind of linktree-like page that takes the URL and provides a link to both the original and the archived version? Like https://sources.tweasel.org/https%3A%2F%2Fsupport.singular.net%2Fhc%2Fen-us%2Farticles%2F360037581952--UPDATED-Android-SDK-Integration-Guide%2325_Sending_the_User_ID_to_Singular_Optional or something similar?
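For reference, the lookup URL above is just the original documentation URL percent-encoded and appended to the (at this point still hypothetical) sources.tweasel.org host; a minimal sketch:

```ts
// Build a sources.tweasel.org lookup URL from an original documentation URL.
// Host and path scheme are taken from the example above; the function name
// is only illustrative.
const sourcesUrl = (originalUrl: string) => `https://sources.tweasel.org/${encodeURIComponent(originalUrl)}`;

console.log(
    sourcesUrl(
        'https://support.singular.net/hc/en-us/articles/360037581952--UPDATED-Android-SDK-Integration-Guide#25_Sending_the_User_ID_to_Singular_Optional'
    )
);
// -> https://sources.tweasel.org/https%3A%2F%2Fsupport.singular.net%2Fhc%2Fen-us%2F…
```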

@baltpeter (Member)

> I guess we could also have some kind of linktree-like page that takes the URL and provides a link to both the original and the archived version?

Yeah, I guess. That would work well with option 2.

@baltpeter (Member)

> The Conifer service isn't much good though, or so it seems. My tests were all pretty bad. Webrecorder is really nice, still.
>
> I do agree that screenshots and PDFs are also very accessible. But Webrecorder might be more accurate, and it seems to produce good results for hidden content, too.

Hm, if we can't use the hosted service and have to manually make snapshots using a Chrome extension/desktop app, I don't really see the value for us. I mean, high-fidelity archiving is nice and all, but we're not trying to archive Flash games here. If we already have to do manual work anyway, wouldn't a screenshot and/or PDF be much better for our use case?

@baltpeter (Member)

I've been using a local ArchiveBox instance to archive all interesting tracker documentation pages I've encountered. While doing that, I've discovered yet another annoying quirk. I've been using the CLI to add new pages to archive (reasoning).

I'm currently at 27 archived pages. A lot of them have extractors that failed (the archive.org extractor almost always fails with `Failed to find "content-location" URL header in Archive.org response.`). When I want to archive new URLs, it first goes through all previously failed snapshots and tries them again. Especially the archive.org extractor takes quite a long time. If you archive a PDF, it will also try all the extractors that will never work (screenshot, PDF, etc.) and each one will fail with `Extractor timed out after 600s.` every. single. time. And there is no parallelism, so this takes forever.

I added a few new URLs almost 45 minutes ago and it hasn't even started archiving those; it's still stuck (unsuccessfully) retrying old snapshots. And of course, this is only going to get (a lot) worse the more pages we archive.

At the same time, snapshots do also sometimes fail only temporarily, so it is important that they are retried. I'm just really not a fan of the architecture here. There should be a background job that retries failed snapshots automatically, without interfering with new snapshots (and with an exponential backoff, of course).


zner0L commented Sep 12, 2023

> Yeah, I guess. That would work well with option 2.

If that works for you, I'd start building something like this and maybe also build a plugin for ArchiveBox to automate sending the links there? I am currently thinking of a very simple server that just does an exact text-match lookup in the database and serves the links it finds. Links could be added using a REST API / a simple JS CLI.
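A minimal sketch of such a lookup server, assuming the CSV database suggested above with (date, adapter, URL, archived URL) columns and no handling of quoted commas; everything here (file name, port, markup) is illustrative, nothing like it exists yet:

```ts
import { createServer } from 'http';
import { readFileSync } from 'fs';

// Load the (assumed) archive.csv with columns date,adapter,url,archivedUrl.
const rows = readFileSync('archive.csv', 'utf-8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => line.split(','));

// Serve /<percent-encoded original URL> and list the original plus all
// archived copies found via an exact text match on the URL column.
createServer((req, res) => {
    const original = decodeURIComponent((req.url ?? '/').slice(1));
    const archived = rows.filter(([, , url]) => url === original).map(([, , , archivedUrl]) => archivedUrl);

    res.setHeader('Content-Type', 'text/html');
    if (archived.length === 0) {
        res.statusCode = 404;
        res.end('No archived copies recorded for this URL.');
        return;
    }
    res.end(
        `<p>Original: <a href="${original}">${original}</a></p>` +
            archived.map((a) => `<p>Archived: <a href="${a}">${a}</a></p>`).join('')
    );
}).listen(8080);
```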

@baltpeter (Member)

> also build a plugin for ArchiveBox

I'm really not convinced that AB is the way to go (see my comment above). It just has so many fundamental problems.

> I am currently thinking of a very simple server that just does an exact text-match lookup in the database and serves the links it finds.

Are you sure we want a server for that? I'd definitely want the archived links to be part of the TrackHAR repo, so there's no need for a PUT API. And if we're only GETing, you could just 'compile' the list to `_redirects`.
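A sketch of that 'compile to `_redirects`' idea, assuming the CSV columns from above and a Netlify/Cloudflare Pages-style `_redirects` format (one `/<source> <target> <status>` rule per line); the file names are only illustrative:

```ts
import { readFileSync, writeFileSync } from 'fs';

// Turn archive.csv (assumed columns: date,adapter,url,archivedUrl, no header,
// no quoted commas) into a static `_redirects` file that maps the
// percent-encoded original URL to its archived copy.
const rows = readFileSync('archive.csv', 'utf-8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => line.split(','));

const redirects = rows.map(([, , url, archivedUrl]) => `/${encodeURIComponent(url)} ${archivedUrl} 302`).join('\n');

writeFileSync('_redirects', redirects + '\n');
```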


zner0L commented Sep 12, 2023

> I'd definitely want the archived links to be part of the TrackHAR repo

And how would you save those then? As a big JSON file?


zner0L commented Sep 12, 2023

> I'm really not convinced that AB is the way to go (see my comment above). It just has so many fundamental problems.

True, but what is our alternative? We have more or less ruled out every archiving tool by now. So it seems like the only option would be to write our own toolchain :(

@baltpeter (Member)

> And how would you save those then? As a big JSON file?

I suggested CSV or SQLite above; those seem better suited for row-based entries. And how big do you expect that to get? So far, we typically have between 1 and at most 5 links per tracker.

> True, but what is our alternative? We have more or less ruled out every archiving tool by now. So it seems like the only option would be to write our own toolchain :(

I don't know. :(


zner0L commented Sep 12, 2023

Ok, so maybe let’s collect a list of requirements for our archiving toolchain, then:

  • Push links to public archives
  • Take screenshots
  • Create snapshots that are not just Cloudflare pages
  • Collaboration, but restricted to specific users
  • Bonus: push archived links into our link database

Anything to add, @baltpeter?


I found that Zotero is actually quite a usable solution if we add https://robustlinks.mementoweb.org/zotero/ as a plugin. However, it doesn't take screenshots (and I didn't find a plugin for that), and the free group sync storage is only 300 MB.

Otherwise, I think the only remaining option would be to chain https://github.com/oduwsdl/archivenow, pywb and some kind of screenshot script.


baltpeter commented Sep 12, 2023

I think the requirements sound about right.

But do we actually need an HTML capture like Zotero produces? I'd say a screenshot/PDF that actually contains the content we're interested in is sufficient. And if that's the case, I don't really see the value of using Zotero anymore.

Maybe we need to yet again scale back our expectations. How about:

  • We have a "daemon" that continuously checks all referenced links in the adapters and, for everything that isn't in the archive database yet, tries to use the Save Page Now API and adds it to the DB (a minimal sketch follows after this list).
  • We manually take screenshots and PDFs, and upload them to a Nextcloud folder or whatever. Yes, that's annoying and I initially said I really don't want to do this, but honestly, given the alternatives, that almost sounds like the best solution. And now that I've actually done a significant chunk of the adapter work, I know that we're maybe talking about 50 links in total (or let's say 100 to give a definite upper limit). Hardly sounds worth putting all that effort into automation, does it?
  • The linting script warns you on commit if some of the links have not been archived. You have to do those manually, likely using archive.today.
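A sketch of the first bullet's archiving step, using the simple GET form of the Wayback Machine's Save Page Now endpoint (there is also an authenticated SPN2 API with more options); the function name is only illustrative, and the database bookkeeping around it is left out:

```ts
// Ask the Wayback Machine to capture a URL via https://web.archive.org/save/<url>
// and return the snapshot URL, taken from the Content-Location header when the
// endpoint provides one. Note that this endpoint is rate-limited, so a real
// daemon would need to throttle its requests and handle retries.
const savePageNow = async (url: string) => {
    const res = await fetch(`https://web.archive.org/save/${url}`);
    if (!res.ok) throw new Error(`Save Page Now failed with status ${res.status} for ${url}`);

    const snapshotPath = res.headers.get('content-location');
    return snapshotPath ? `https://web.archive.org${snapshotPath}` : res.url;
};
```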


baltpeter commented Sep 12, 2023

Every automated solution for the second step is inevitably going to run into the same problems, and solving them ourselves isn't worth it given the relatively small amount of manual work it would replace, imo.

zner0L self-assigned this Sep 12, 2023

zner0L commented Sep 12, 2023

Do we want these linktree pages on a separate subdomain, or can they just live at docs.tweasel.org/sources/<link>?

@baltpeter (Member)

> or can they just live at docs.tweasel.org/sources/<link>?

Sounds good to me.


zner0L commented Sep 12, 2023

Ok, and would you rather keep only the original link in the reasoning field and have a file watcher try to archive it if it isn't in the database already, without touching the link in the reasoning? Or do you think we should rewrite the reasoning to docs.tweasel.org/sources/<link> in the adapter? Or just in the generated docs?

@baltpeter (Member)

> Ok, and would you rather keep only the original link in the reasoning field and have a file watcher try to archive it if it isn't in the database already, without touching the link in the reasoning? Or do you think we should rewrite the reasoning to docs.tweasel.org/sources/<link> in the adapter?

Well, the latter option would still require a file watcher, wouldn't it?

But I prefer the former either way. Seems redundant and unnecessary to always add the prefix. I'm not even sure whether I would link to that by default on the site (usually, it's an unnecessary extra click).


zner0L commented Sep 13, 2023

What kind of behavior do we want on errors for the archiving? I am a little worried that we will send too many requests to the IA if we re-check erroneous captures on every save. I'd suggest writing the errors to the database and checking for them in the pre-commit hook, except maybe for retryable errors like a timeout. I decided to use a CSV because I like the simplicity, so I don't want to implement some kind of capture tracker that counts how often we've tried to capture a particular page.

@baltpeter (Member)

Why not just add a "last tried" column to the CSV for errors and then do exponential backoff? Ok wait, two columns, you'd also need #tries for that.
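A minimal sketch of that backoff check, assuming the two extra columns are stored as an ISO timestamp and a retry count (the column and function names are only illustrative):

```ts
// Decide whether a previously failed capture is due for another attempt,
// doubling the wait after each failure: 1 h, 2 h, 4 h, … capped at one week.
const shouldRetry = (lastTried: string, tries: number, now = new Date()) => {
    const baseDelayMs = 60 * 60 * 1000; // 1 hour
    const maxDelayMs = 7 * 24 * 60 * 60 * 1000; // 1 week
    const delayMs = Math.min(baseDelayMs * 2 ** (tries - 1), maxDelayMs);
    return now.getTime() - new Date(lastTried).getTime() >= delayMs;
};

// Example: after three failed tries at noon, the next attempt is due at 16:00.
console.log(shouldRetry('2023-09-13T12:00:00Z', 3, new Date('2023-09-13T15:00:00Z'))); // false
```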

@baltpeter (Member)

Actually, I think it would be even better if we only stored that information locally (in a gitignored file, not in the CSV), since we expect such errors to be resolved manually before merging to main anyway.

@baltpeter (Member)

I'm not quite sure anymore what your status is on working on this, so you may have these on your TODO list already, but two things I noticed while reviewing your latest PRs:

  • How do we want to deal with links in our own research docs? Currently, these aren't handled at all, right? Iirc I did manually add an archived version of every link in parentheses when writing them. I guess that's enough for now (since we have already spent way too much time on this issue… :|). But it should definitely be documented in the README, then.
  • We said we wanted to manually grab a screenshot/PDF of each archived page. Did you start with that yet?
