How to archive public documentation of trackers? #36
Interesting idea, I hadn't considered Webrecorder. When I tested their tools a while back, they did produce very high-fidelity results. I'm definitely happy if we don't have to host our own stuff. But I do like the simplicity of having a screenshot/PDF. Pretty much anyone can understand/open that.
I don't think the hosted version is an option for us, unfortunately. The first link I tried didn't work at all. :/ With "Use Current Browser", it only showed me a Cloudflare page and produced a broken snapshot: https://conifer.rhizome.org/einbenni/tweasel-tracker-research/20230831125614/https://support.singular.net/hc/en-us/articles/4411780525979-Types-of-Device-IDs?__cf_chl_rt_tk=Y_8BpAkKD6JEL0fuG9C25M1j69rQvw_50xxYK.bQz0g-1693486574-0-gaNycGzNCyU Using one of their hosted browsers wasn't much better: https://conifer.rhizome.org/einbenni/tweasel-tracker-research/20230831130617$br:chrome:76/https://support.singular.net/hc/en-us/articles/4411780525979-Types-of-Device-IDs?__cf_chl_rt_tk=Ca3lMAxwRDNfMOPVQCw62S_Fkwq.nATI.xX8jT89XOY-1693487178-0-gaNycGzNCbs
The Conifer service isn't much good though, or so it seems. My tests were all pretty bad. Still, Webrecorder itself is really nice. I do agree that screenshots and PDFs are very accessible, too. But Webrecorder might be more accurate, and it also seems to produce good results for hidden content.
How would we add links to public archives in our current schema for the reasoning? We can only add one link, and it would make sense to have this be the archived link, but I would like to have the original link there as well…
I had two ideas:
Yes, an object seems like the obvious way to go, but this would also break the API for the reasoning field. I guess we could also have some kind of linktree-like thing that takes the URL and provides a link to both the original and the archived version? Like …
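Just to illustrate the object idea, here's a minimal sketch of what the type could look like (the shape is only a placeholder, not the actual TrackHAR schema, and the archived URL is made up):

```ts
// Hypothetical sketch: `reasoning` could stay a plain string (backwards-
// compatible) or become an object that carries both links.
type Reasoning =
    | string
    | {
          /** Link to the original documentation page. */
          original: string;
          /** Link to our archived copy, if we have one. */
          archived?: string;
      };

// Example (the archived URL is invented for illustration):
const reasoning: Reasoning = {
    original: 'https://support.singular.net/hc/en-us/articles/4411780525979-Types-of-Device-IDs',
    archived: 'https://archive.example.org/20230831/singular-device-ids',
};
```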
Yeah, I guess. That would work well with option 2.
Hm, if we can't use the hosted service and have to manually make snapshots using a Chrome extension/desktop app, I don't really see the value for us. I mean, high-fidelity archiving is nice and all, but we're not trying to archive Flash games here. If we already have to do manual work anyway, wouldn't a screenshot and/or PDF be much better for our use case?
I've been using a local ArchiveBox instance to archive all interesting tracker documentation pages I've encountered. While doing that, I've discovered yet another annoying quirk. I've been using the CLI to add new pages to archive (reasoning). I'm currently at 27 archived pages. A lot of them have extractors that failed (the archive.org extractor almost always fails with …).

I added a few new URLs almost 45 minutes ago and it hasn't even started archiving those; it's still stuck (unsuccessfully) retrying old snapshots. And of course, this is only going to get (a lot) worse the more pages we archive. At the same time, snapshots do also sometimes fail temporarily, so it is important that they are retried. I'm just really not a fan of the architectural design here. There should be a background job that retries failed snapshots automatically, without interfering with new ones (and with exponential backoff, of course).
If that works for you, I'd start building something like this, and maybe also build a plugin for ArchiveBox to automate sending the links there? I am currently thinking of a very simple server that just does an exact text-match lookup in the database and serves the links it finds. Links could be added via a REST API / a simple JS CLI.
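For illustration, a very rough sketch of what that lookup server could look like (the route, the port, and the in-memory map standing in for the database are all just assumptions):

```ts
import * as http from 'http';

// Hypothetical index: original URL -> archived URL. In the real setup this
// would be loaded from whatever database/CSV we end up using.
const archivedLinks = new Map<string, string>([
    ['https://example.com/tracker-docs', 'https://archive.example.org/20230831/tracker-docs'],
]);

// GET /lookup?url=<original URL> returns the archived link if we have one.
http.createServer((req, res) => {
    const requestUrl = new URL(req.url ?? '/', 'http://localhost');
    const archived =
        requestUrl.pathname === '/lookup' && archivedLinks.get(requestUrl.searchParams.get('url') ?? '');
    if (!archived) {
        res.writeHead(404);
        res.end('Not archived.');
        return;
    }
    res.writeHead(200, { 'content-type': 'application/json' });
    res.end(JSON.stringify({ archived }));
}).listen(3000);
```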
I'm really not convinced that AB is the way to go (see my comment above). It just has so many fundamental problems.
Are you sure we want a server for that? I'd definitely want the archived links to be part of the TrackHAR repo, so there's no need for a PUT API. And if we're only GETing, you could just 'compile' the list to …
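To make the 'compile' idea concrete, a minimal sketch assuming the archived links live in a CSV in the repo (the file names and two-column layout are hypothetical, and the comma split is deliberately naive):

```ts
import { readFileSync, writeFileSync } from 'fs';

// Hypothetical CSV with a header row: originalUrl,archivedUrl
// Naive parsing: fine as long as the URLs don't contain commas themselves.
const rows = readFileSync('archived-links.csv', 'utf-8')
    .trim()
    .split('\n')
    .slice(1) // Drop the header row.
    .map((line) => line.split(','));

// 'Compile' the CSV into a static JSON lookup (original URL -> archived URL)
// that the website can GET directly, so no server or PUT API is needed.
const lookup = Object.fromEntries(rows.map(([originalUrl, archivedUrl]) => [originalUrl, archivedUrl]));
writeFileSync('archived-links.json', JSON.stringify(lookup, undefined, 4));
```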
And how would you save those then? As a big JSON file?
True, but what is our alternative? We have more or less ruled out every archiving tool by now. So it seems like the only option would be to write our own toolchain :(
I suggested CSV or SQLite above; those seem more suited for row-based entries. And how big do you expect that to get? So far, we typically have between one and at most five links per tracker.
I don't know. :(
Ok, so maybe let’s collect a list of requirements for our archiving toolchain, then:
Anything to add, @baltpeter? I found that Zotero is actually quite a usable solution if we add https://robustlinks.mementoweb.org/zotero/ as a plugin. However, it doesn't take screenshots (and I didn't find a plugin for that), and the free group sync storage is only 300 MB. Otherwise, I think the only remaining option would be to chain https://github.com/oduwsdl/archivenow, pywb, and some kind of screenshot script.
I think the requirements sound about right. But do we actually need an HTML capture like Zotero produces? I'd say a screenshot/PDF that actually contains the content we're interested in is sufficient. And if that's the case, I don't really see the value of using Zotero anymore. Maybe we need to yet again scale back our expectations. How about:
Every automated solution for the second step is just inevitably going to run into the same problems, and solving them ourselves isn't worth it given the relatively small amount of manual work it would replace, imo.
Do we want these linktree pages on a separate subdomain, or can they just live at …
Sounds good to me.
Ok, and would you rather have only the original link in the …
Well, the latter option would still require a file watcher, wouldn't it? But I prefer the former either way. Seems redundant and unnecessary to always add the prefix. I'm not even sure whether I would link to that by default on the site (usually, it's an unnecessary extra click).
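For what it's worth, a minimal sketch of how such linktree pages could be generated statically (the archive/ output folder, the input shape, and the hash-based slug are all just assumptions for illustration):

```ts
import { mkdirSync, writeFileSync } from 'fs';
import { createHash } from 'crypto';
import { join } from 'path';

// Hypothetical input; in practice this would come from the CSV discussed above.
const entries = [
    {
        original: 'https://example.com/tracker-docs',
        archived: 'https://archive.example.org/20230831/tracker-docs',
    },
];

mkdirSync('archive', { recursive: true });
for (const { original, archived } of entries) {
    // A hash of the original URL gives a stable, file-system-safe page name.
    const slug = createHash('sha256').update(original).digest('hex').slice(0, 16);
    const html = `<!DOCTYPE html>
<html lang="en">
  <body>
    <h1>Archived reference</h1>
    <ul>
      <li><a href="${original}">Original page</a></li>
      <li><a href="${archived}">Archived copy</a></li>
    </ul>
  </body>
</html>`;
    writeFileSync(join('archive', `${slug}.html`), html);
}
```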
What kind of behavior do we want on errors during archiving? I am a little worried that we will send too many requests to the IA if we re-check failed captures on every save. I'd suggest writing the errors to the database and checking for them in the pre-commit hook, except maybe for retryable errors like a timeout. I decided to use a CSV because I like the simplicity, so I don't want to implement some kind of tracker for how often we've tried to capture a particular page.
Why not just add a "last tried" column to the CSV for errors and then do exponential backoff? Ok wait, two columns: you'd also need a "#tries" column for that.
Actually, I think it would be even better if we only stored that information locally (in a gitignored file, not in the CSV), since we expect such errors to be resolved manually before merging to …
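A quick sketch of what that could look like, assuming a gitignored JSON file for the per-URL retry state (the file name, shape, and backoff schedule are made up):

```ts
import { existsSync, readFileSync, writeFileSync } from 'fs';

// Hypothetical gitignored retry state: original URL -> { tries, lastTried }.
type RetryState = Record<string, { tries: number; lastTried: string }>;
const stateFile = '.archive-retry-state.json';
const state: RetryState = existsSync(stateFile) ? JSON.parse(readFileSync(stateFile, 'utf-8')) : {};

// Exponential backoff: wait 2^tries hours before retrying a failed capture.
export const shouldRetry = (url: string) => {
    const entry = state[url];
    if (!entry) return true;
    const backoffMs = 2 ** entry.tries * 60 * 60 * 1000;
    return Date.now() - new Date(entry.lastTried).getTime() > backoffMs;
};

// Record a failed capture attempt so the pre-commit hook can complain about
// unresolved errors before merging.
export const recordFailure = (url: string) => {
    state[url] = { tries: (state[url]?.tries ?? 0) + 1, lastTried: new Date().toISOString() };
    writeFileSync(stateFile, JSON.stringify(state, undefined, 4));
};
```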
I'm not quite sure anymore what your status is on working on this, so you may have these on your TODO list already, but two things I noticed while reviewing your latest PRs:
As we discussed in #3, we need some kind of archiving solution that can be trusted and that is good enough for archiving modern, JS-infested websites with potentially hidden content. We decided to use ArchiveBox, but while researching how to set it up, I stumbled upon https://webrecorder.net/tools, who offer very good archiving tools, which they also host at https://conifer.rhizome.org/, and they also have a self-hosted option: https://github.com/webrecorder/browsertrix-cloud
Though I think we could also do this easily by just having a folder of WACZ files generated by https://archiveweb.page/ in a cloud somewhere. What do you think?