Small DC scraper problems #633

Open
stucka opened this issue Mar 25, 2024 · 2 comments

Comments

@stucka
Contributor

stucka commented Mar 25, 2024

The scraper assumes a link for 2014 is listed but echoes a different year. For the 2024 index, at least, that is not the case: no link to 2014 is offered, so the scraper will not automatically try to download 2014.

The scraper grabs old HTML even when there's no update. For example, in 2024 we may want to update 2024 and 2023, but we don't need to grab 2019 again. A small optimization might be to hit older URLs with utils.fetch_if_not_cached and then read those files from the cache.
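
A rough sketch of that idea, not the scraper's actual code: always re-download the current and previous year, and fetch older years only when no cached copy exists. Everything here is an assumption for illustration: `needs_refresh`, `get_year_html`, the one-year window, the injected `fetch` callable, and the year-based file names. It deliberately does not call the real `utils.fetch_if_not_cached`, whose signature isn't shown in this thread.

```python
from datetime import date
from pathlib import Path
from typing import Callable


def needs_refresh(year: int, today: date, window: int = 1) -> bool:
    """True for the current year and the `window` years before it (hypothetical rule)."""
    return year >= today.year - window


def get_year_html(
    year: int,
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], str],
    today: date,
) -> str:
    """Return the HTML for one year, downloading only when necessary.

    `fetch` is any callable that takes a URL and returns the page text,
    so the network layer can be swapped out or faked in tests.
    """
    # Cache files are named by year here for readability (an assumption).
    path = Path(cache_dir) / f"{year}.html"

    # Recent years may still change, so always re-download them.
    # Older years are downloaded only if they are not cached yet.
    if needs_refresh(year, today) or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")

    # Either way, the caller reads the result back from the cache.
    return path.read_text(encoding="utf-8")
```

With `today` in 2024, this would re-fetch 2024 and 2023 on every run but hit the 2019 URL only once.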

@stucka
Contributor Author

stucka commented Mar 25, 2024

I'm sure there was a reason to use this uuid thing instead of labeling the files with the years, but I'm not seeing it at the moment. =)

@stucka
Contributor Author

stucka commented Mar 25, 2024

Also should maybe nix the http:// prefix in the use of 2014.

chriszs added a commit to chriszs/warn-scraper that referenced this issue Mar 27, 2024