
more (re)scrape work neeed? #791

Open
philbudne opened this issue Sep 19, 2024 · 2 comments

Comments

@philbudne
Contributor

I noticed this in the logs of the worker container (note: current date is 2024-09-18):

2024-09-16T21:44:29.615724832Z ==== starting _scrape_source(42118, https://www.itv.com/news/topic/nigeria, itv.com)
2024-09-16T21:44:29.621624096Z add_line: Scraped source 42118 (itv.com), https://www.itv.com/news/topic/nigeria
2024-09-16T21:44:29.622858689Z Starting new HTTPS connection (1): www.itv.com:443
2024-09-16T21:44:59.622339377Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=4, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:44:59.622713877Z Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='www.itv.com', port=443): Read timed out. (read timeout=None)")': /news/topic/nigeria
2024-09-16T21:44:59.623125050Z Starting new HTTPS connection (2): www.itv.com:443
2024-09-16T21:56:42.969479886Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=3, connect=None, read=None, redirect=None, status=None)
2024-09-16T21:56:43.170047325Z Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T21:56:43.170361434Z Starting new HTTPS connection (3): www.itv.com:443
2024-09-16T22:08:25.437442087Z Incremented Retry for (url='/news/topic/nigeria'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
2024-09-16T22:08:25.838354746Z Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'RemoteDisconnected('Remote end closed connection without response')': /news/topic/nigeria
2024-09-16T22:08:25.838934554Z Starting new HTTPS connection (4): www.itv.com:443

This suggests the rescrape is hanging?

It also looks like each connection (re)try is taking about 11 minutes?!
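The ~11-minute gaps can be checked directly from the log timestamps above (fractional seconds trimmed):

```python
from datetime import datetime

# Timestamps of the three "Retrying" / "Starting new HTTPS connection"
# events copied from the log above (UTC, fractional seconds dropped).
t1 = datetime.fromisoformat("2024-09-16T21:44:59")
t2 = datetime.fromisoformat("2024-09-16T21:56:43")
t3 = datetime.fromisoformat("2024-09-16T22:08:25")

print(t2 - t1)  # 0:11:44 — gap between retry 1 and retry 2
print(t3 - t2)  # 0:11:42 — gap between retry 2 and retry 3
```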

mcweb's call to feed_seeker is:
new_feed_generator = feed_seeker.generate_feed_urls(homepage, max_time=SCRAPE_TIMEOUT_SECONDS)
and SCRAPE_TIMEOUT_SECONDS defaults to 30.

I would also like to have feed scraping, rss-fetching, and article-fetching all use the same requests settings (see mediacloud/metadata-lib#88)

feed_seeker DOES allow supplying a fetcher function when creating a FeedSeeker object, but generate_feed_urls does not.
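A shared fetcher with explicit timeouts could look something like this sketch. The timeout values and the FeedSeeker parameter name are assumptions for illustration, not mcweb's or feed_seeker's actual settings:

```python
import requests

# Illustrative values only — not the project's actual configuration.
CONNECT_TIMEOUT = 30   # seconds to establish the TCP/TLS connection
READ_TIMEOUT = 30      # seconds to wait for each socket read

def fetch(url: str) -> str:
    """Shared fetcher usable by feed scraping, rss-fetching, and
    article-fetching alike, with bounded connect and read phases."""
    resp = requests.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
    resp.raise_for_status()
    return resp.text

# Hypothetically this could then be passed when constructing the object
# (the keyword name here is an assumption, not feed_seeker's actual API):
#   seeker = feed_seeker.FeedSeeker(homepage, fetcher=fetch)
```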

BUT it seems like feed_seeker is a Media Cloud project! (I did not know that!!)

It seems like there is a known issue with the timeout parameter: mediacloud/feed_seeker#2

@philbudne
Contributor Author

I wonder if source/collection rescrape could be made to run as a manage.py command (calling the task code) for easier debugging?

@philbudne
Contributor Author

https://github.com/mediacloud/web-search/pull/817/files passes connect and read timeouts to HTTP get operation, so hangs should be less likely.
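For reference, requests accepts a `(connect, read)` timeout tuple; `timeout=None` (visible as `read timeout=None` in the log above) blocks forever on a silent server, while a tuple bounds both phases. The 30-second values below are an assumption, echoing SCRAPE_TIMEOUT_SECONDS, not necessarily what the PR passes:

```python
import requests

def get_with_timeouts(url: str, connect: float = 30, read: float = 30):
    """GET with bounded connect and read phases, so a stalled remote
    raises requests.exceptions.Timeout instead of hanging the worker."""
    return requests.get(url, timeout=(connect, read))
```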

In a test, scrapes of Nigeria State & National and of North Carolina finished, but I didn't get an email for Nigeria, so it seems there may still be issues to chase! I ran the Nigeria scrape from the command line using manage.py (it took 8 hours!) and got an email, so it may be hard to reproduce!!
