-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add web archive as a final fallback [WIP] #40
base: dev
Are you sure you want to change the base?
Conversation
This implements the web archive as a potential fallback for when the archive is no longer available during a batch download. This does not allow for downloading removed archives if you do not already have the image metadata in your local database unless you add that data yourself.
Add ARCHIVE_ORG optional config to enable this functionality. If image does not exist on internet archive you will still get a 404. If they are down you will get a random different error. The functionality is disabled by default.
I'm wrapping the code in a config option in order to not add any extra stress to the server right now. The code exists but it is disabled by default. Checks for option ARCHIVE_ORG to be set to true. There might be more graceful ways to implement this but this is how I've done it. |
Moved the pull so it isn't part of the main loop. I did duplicate the download image function with the only part removing the server loop and replacing the image download part with the one that rewrites and attempts to download the web archive version. Previous version of the code tried to download from the archvie 5 times (once for each attempt pass) and I didn't realize this at first. This one only does it once and only after the 5 tries have been tried with the regular servers. Ideally I'd like to reduce the code duplicate (or eliminate outright) and have it abort the archive attempt at the first error. At least it is better about it now.
Moved the web archive code out of the main download loop. Fixes the following items.
Current patch duplicates the I'll try to see about cleaning up the code so it doesn't duplicate but not sure on the approach. I'd also like to have the archive attempt hard fail on the first error. No point in having it go further because the archive will be incomplete at that point. A person can try again later if the issue is due to archive.org being down but if you got another 404 chances are it won't get fixed in the future. 404 error happens when the Internet Archive has not archived a specific file or page. |
@@ -218,6 +219,44 @@ impl Hentai | |||
} | |||
if image_download_success {break;} // if all images were downloaded successfully: continue with cbz creation | |||
} | |||
if !image_download_success && webarchive == true { // Web Archive Loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to download images from web archive, only the metadata as the images should not have been purged from the nhentai media servers (see famously 177013, last section in my readme). Correct me if I'm wrong.
Then hypothetically speaking, if we wanted to download images from the web archive: Completely duplicating the outer download logic is likely unnecessary. If I see this correctly, the only thing that's different is that you call a different download image function Self::archive_image
. Why didn't you just implement fallback to web archive logic in the original download_image
function?
Ah okay, I should've read all of your comments first before making my review comments. :D It's good to see that we're on the same page here that the current duplicated code would not be ideal. But again, I don't really want to download images from the web archive because they're under enough stress as it is, I would only like to download metadata that has been purged from nhentai. |
For me the point would be to grab what I can from the IA but at the moment I've been just running it in a different set of configuration on galleries that fail to grab images from the main servers directly then bailing when it throws an error. At this point I'd already have metadata because it is stored in the DB from the original API pulls. This isn't setup to get any metadata from IA. For now this is just taking my own idea and running with it but I'm also learning Rust a bit with this so I do not expect you to commit this at all. You did say you added it to your ideas list so I expect that you may implement this in your own way at a later date which would fit much better in the code. |
e4ea1f4
to
5507710
Compare
This implements the web archive as a potential fallback for when the archive is no longer available during a batch download. This does not allow for downloading removed archives if you do not already have the image metadata in your local database unless you add that data yourself.
You may want to wait to merge (if at all) until the web archive is actually stable. It seems to go up and down. I added a warn line right before it downloads the image that indicates it is getting the copy from the internet archive. Mostly for myself since I'll be using this code and want to make sure the images are pulled properly.
This is for #39