
more (re)scrape work neeed? #791

Open
philbudne opened this issue Sep 19, 2024 · 2 comments

Comments

@philbudne
Contributor

I noticed this in the logs of the worker container (note: current date is 2024-09-18):

2024-09-16T21:44:29.615724832Z ==== starting _scrape_source(42118, https://www.itv.com/news/topic/nigeria, itv.com)
2024-09-16T21:44:29.621624096Z add_line: Scraped source 42118 (itv.com), https://www.itv.com/news/topic/nigeria
2024-09-16T21:44:29.622858689Z Starting new HTTPS connection (1): www.itv.com:443
2024-09-16T21:44:59.622339377Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=4, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:44:59.622713877Z Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='www.itv.com', port=443): Read timed out. (read timeout=None)")': /news/topic/nigeria
2024-09-16T21:44:59.623125050Z Starting new HTTPS connection (2): www.itv.com:443
2024-09-16T21:56:42.969479886Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=3, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:56:43.170047325Z Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T21:56:43.170361434Z Starting new HTTPS connection (3): www.itv.com:443
2024-09-16T22:08:25.437442087Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2024-09-16T22:08:25.838354746Z Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T22:08:25.838934554Z Starting new HTTPS connection (4): www.itv.com:443

This suggests the rescrape is hanging?

It also looks like each connection (re)try is taking about 11 minutes?!
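The ~11-minute gaps can be checked directly from the log timestamps above (fractional seconds trimmed):

```python
from datetime import datetime

# Timestamps of the three "Retrying" / "Starting new HTTPS connection"
# events copied from the log above (UTC, fractional seconds dropped).
t1 = datetime.fromisoformat("2024-09-16T21:44:59")
t2 = datetime.fromisoformat("2024-09-16T21:56:43")
t3 = datetime.fromisoformat("2024-09-16T22:08:25")

print(t2 - t1)  # 0:11:44 — gap between retry 1 and retry 2
print(t3 - t2)  # 0:11:42 — gap between retry 2 and retry 3
```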

mcweb's call to feed_seeker is:
new_feed_generator = feed_seeker.generate_feed_urls(homepage, max_time=SCRAPE_TIMEOUT_SECONDS)
and SCRAPE_TIMEOUT_SECONDS defaults to 30.

I would also like to have feed scraping, rss-fetching, and article-fetching all use the same requests settings (see mediacloud/metadata-lib#88)

feed_seeker DOES allow supplying a fetcher function when creating a FeedSeeker object, but generate_feed_urls does not.
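A shared fetcher with explicit timeouts could look something like this sketch. The timeout values and the FeedSeeker parameter name are assumptions for illustration, not mcweb's or feed_seeker's actual settings:

```python
import requests

# Illustrative values only — not the project's actual configuration.
CONNECT_TIMEOUT = 30   # seconds to establish the TCP/TLS connection
READ_TIMEOUT = 30      # seconds to wait for each socket read

def fetch(url: str) -> str:
    """Shared fetcher usable by feed scraping, rss-fetching, and
    article-fetching alike, with bounded connect and read phases."""
    resp = requests.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
    resp.raise_for_status()
    return resp.text

# Hypothetically this could then be passed when constructing the object
# (the keyword name here is an assumption, not feed_seeker's actual API):
#   seeker = feed_seeker.FeedSeeker(homepage, fetcher=fetch)
```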

BUT it seems like feed_seeker is a Media Cloud project! (I did not know that!!)

It seems like there is a known issue with the timeout parameter: mediacloud/feed_seeker#2

@philbudne
Contributor Author

I wonder if source/collection rescrape could be made to run as a manage.py command (calling the task code) for easier debugging?

@philbudne
Contributor Author

https://github.com/mediacloud/web-search/pull/817/files passes connect and read timeouts to HTTP get operation, so hangs should be less likely.
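For reference, requests accepts a `(connect, read)` timeout tuple; `timeout=None` (visible as `read timeout=None` in the log above) blocks forever on a silent server, while a tuple bounds both phases. The 30-second values below are an assumption, echoing SCRAPE_TIMEOUT_SECONDS, not necessarily what the PR passes:

```python
import requests

def get_with_timeouts(url: str, connect: float = 30, read: float = 30):
    """GET with bounded connect and read phases, so a stalled remote
    raises requests.exceptions.Timeout instead of hanging the worker."""
    return requests.get(url, timeout=(connect, read))
```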

In a test, scrapes of Nigeria State & National and of North Carolina finished, but I didn't get an email for Nigeria, so it seems there may still be issues to chase! I ran the Nigeria scrape from the command line using manage.py (it took 8 hours!) and got an email, so it may be hard to reproduce!!
