I noticed this in the logs of the worker container (note: current date is 2024-09-18):
2024-09-16T21:44:29.615724832Z ==== starting _scrape_source(42118, https://www.itv.com/news/topic/nigeria, itv.com)
2024-09-16T21:44:29.621624096Z add_line: Scraped source 42118 (itv.com), https://www.itv.com/news/topic/nigeria
2024-09-16T21:44:29.622858689Z Starting new HTTPS connection (1): www.itv.com:443
2024-09-16T21:44:59.622339377Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=4, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:44:59.622713877Z Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='www.itv.com', port=443): Read timed out. (read timeout=None)")': /news/topic/nigeria
2024-09-16T21:44:59.623125050Z Starting new HTTPS connection (2): www.itv.com:443
2024-09-16T21:56:42.969479886Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=3, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:56:43.170047325Z Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T21:56:43.170361434Z Starting new HTTPS connection (3): www.itv.com:443
2024-09-16T22:08:25.437442087Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2024-09-16T22:08:25.838354746Z Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T22:08:25.838934554Z Starting new HTTPS connection (4): www.itv.com:443
This suggests recscrape is hanging?
It also looks like each connection (re)try is taking 11 minutes?!
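For reference, the retry spacing can be checked directly from the log timestamps above:

```python
from datetime import datetime

# Retry timestamps copied from the log above, truncated to microseconds
# so datetime.fromisoformat() accepts them.
stamps = [
    "2024-09-16T21:44:59.622713",  # Retry(total=4)
    "2024-09-16T21:56:43.170047",  # Retry(total=3)
    "2024-09-16T22:08:25.838354",  # Retry(total=2)
]
times = [datetime.fromisoformat(s) for s in stamps]
gaps_min = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
print([round(g, 1) for g in gaps_min])  # -> [11.7, 11.7]
```

So each retry cycle really is about 11.7 minutes, consistent with no read timeout being set (`read timeout=None` in the log).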
mcweb's call to feed_seeker is:

    new_feed_generator = feed_seeker.generate_feed_urls(homepage, max_time=SCRAPE_TIMEOUT_SECONDS)

and SCRAPE_TIMEOUT_SECONDS defaults to 30.
I would also like to have feed scraping, rss-fetching, and article-fetching all use the same requests settings (see mediacloud/metadata-lib#88)
feed_seeker DOES allow supplying a fetcher function when creating a FeedSeeker object, but generate_feed_urls does not.
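So one workaround (a sketch, assuming FeedSeeker accepts a fetcher callable as described above; the exact keyword and signature should be checked against the feed_seeker source) would be to supply a fetcher with explicit requests timeouts, which could then double as the shared fetcher for rss- and article-fetching:

```python
import requests

# Illustrative values, not mcweb's actual settings.
CONNECT_TIMEOUT = 5   # seconds to establish the TCP/TLS connection
READ_TIMEOUT = 30     # seconds to wait on each read before giving up

def fetch_page(url: str) -> str:
    """Fetch a page with explicit timeouts so a silent server cannot hang
    the scrape (the log's `read timeout=None` means no limit at all)."""
    resp = requests.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
    resp.raise_for_status()
    return resp.text

# Hypothetical usage, if the FeedSeeker constructor exposes the fetcher:
# from feed_seeker import FeedSeeker
# for feed_url in FeedSeeker(homepage, fetcher=fetch_page).generate_feed_urls():
#     ...
```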
BUT it seems like feed_seeker is a Media Cloud project! (I did not know that!!)
In a test, scrapes of Nigeria State & National & North Carolina finished, but I didn't get an email for Nigeria, so there may still be issues to chase! I ran the Nigeria scrape from the command line using manage.py (it took 8 hours!) and got an email, so it may be hard to reproduce!
It seems like there is a known issue with the timeout parameter: mediacloud/feed_seeker#2