Small DC scraper problems #633

stucka · 2024-03-25T14:43:22Z

The scraper assumes a link for 2014 is listed on the index page but shows a different year. For the 2024 index, at least, that is not the case: no link to 2014 is offered, so the scraper will not automatically try to download 2014.

The scraper grabs old HTML even when there's no update. E.g., in 2024 maybe we want to update 2024 and 2023, but we don't need to grab 2019 again. A small optimization might be to hit the older URLs with utils.fetch_if_not_cached and then read those files back from the cache.
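
Rough sketch of what I mean, not the actual scraper code. Everything here is made up for illustration: `fetch_if_missing`, `get_year_pages`, the `refresh_years` cutoff, and the year-based file names are all assumptions, and `fetch` stands in for whatever really does the download (in the repo that would presumably go through utils.fetch_if_not_cached, whose exact signature I'm not asserting here).

```python
from pathlib import Path
from typing import Callable, Dict


def fetch_if_missing(path: Path, url: str, fetch: Callable[[str], str]) -> str:
    """Return the cached HTML if the file exists; otherwise download and save it."""
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html


def get_year_pages(
    links: Dict[int, str],
    cache_dir: str,
    current_year: int,
    fetch: Callable[[str], str],
    refresh_years: int = 2,  # hypothetical cutoff: always refresh the newest N years
) -> Dict[int, str]:
    """Refetch recent years every run; reuse cached HTML for older years."""
    pages = {}
    for year, url in sorted(links.items()):
        path = Path(cache_dir) / f"{year}.html"  # assumption: one file per year
        if year > current_year - refresh_years:
            # Recent pages may still change, so always download a fresh copy.
            html = fetch(url)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(html, encoding="utf-8")
        else:
            # Older pages are unlikely to change, so hit the network only once.
            html = fetch_if_missing(path, url, fetch)
        pages[year] = html
    return pages
```

With `current_year=2024` and the default cutoff, 2024 and 2023 get downloaded on every run, while 2019 is downloaded only the first time and read from disk afterward.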

stucka · 2024-03-25T14:44:27Z

I'm sure there was a reason to use this uuid thing instead of labeling the files with the years, but I'm not seeing it at the moment. =)

stucka · 2024-03-25T14:55:17Z

Also should maybe nix the http:// prefix in the use of 2014.

Addresses biglocalnews#633

stucka mentioned this issue Mar 25, 2024

Replace deprecated http links #622

Closed

4 tasks

chriszs added a commit to chriszs/warn-scraper that referenced this issue Mar 27, 2024

Clean up DC by removing dead code

f36efcd

Addresses biglocalnews#633

chriszs mentioned this issue Mar 27, 2024

Clean up DC by removing dead code #641

Closed
