RuntimeError("cannot join thread before it is started") #25
Comments
@dtlhlbs I can't seem to replicate this, since the scraper runs successfully on the Typesense docs site, for example. Here's the Docker tag for the previous version of the scraper that you should be able to pin to until we get to the bottom of this issue:
I can't seem to replicate this. I spun up a brand new VM (Intel CPU, Amazon Linux), installed docker on it and then ran the scraper on the Typesense docs site and it worked fine for me:
$ docker run -it --env-file=.env -e CONFIG="{\"index_name\":\"typesense_docs\",\"start_urls\":[{\"url\":\"https://typesense.org/docs/(?P<version>.*?)/\",\"variables\":{\"version\":[\"0.24.0\",\"0.23.1\",\"0.23.0\",\"0.22.2\",\"0.22.1\",\"0.22.0\",\"0.21.0\",\"0.20.0\",\"0.19.0\",\"0.18.0\",\"0.17.0\",\"0.16.1\",\"0.16.0\",\"0.15.0\",\"0.14.0\",\"0.13.0\",\"0.12.0\",\"0.11.2\"]}},{\"url\":\"https://typesense.org/docs/overview/\"},{\"url\":\"https://typesense.org/docs/guide/\"}],\"selectors\":{\"default\":{\"lvl0\":\".content__default h1\",\"lvl1\":\".content__default h2\",\"lvl2\":\".content__default h3\",\"lvl3\":\".content__default h4\",\"lvl4\":\".content__default h5\",\"text\":\".content__default p, .content__default ul li, .content__default table tbody tr\"}},\"scrape_start_urls\":false,\"strip_chars\":\" .,;:#\"}" typesense/docsearch-scraper
Unable to find image 'typesense/docsearch-scraper:latest' locally
latest: Pulling from typesense/docsearch-scraper
677076032cca: Pull complete
3026efbcce37: Pull complete
b83c999f3ae6: Pull complete
4f4fb700ef54: Pull complete
4d02e570415e: Pull complete
fe9dd39ad932: Pull complete
40bdd8cbcb60: Pull complete
330e95c637fc: Pull complete
1c4235bc81bd: Pull complete
f636e29df4a6: Pull complete
2ee46e1d6efd: Pull complete
f2a90558593e: Pull complete
f7cb19d7ba62: Pull complete
b51fd8a46836: Pull complete
72e3879aa441: Pull complete
b656e2665916: Pull complete
95462c1394e2: Pull complete
0a6c9231c464: Pull complete
02b4a1743fdf: Pull complete
fcb6abf81668: Pull complete
066a7661e7fb: Pull complete
b1349c66a67d: Pull complete
cb04953d313a: Pull complete
83cfbae1faa8: Pull complete
4aa2727acdc6: Pull complete
Digest: sha256:ffce60fae1358cfe8ba8a59a50b24dfd835610e543b5fbadba5a84541f7e8b2f
Status: Downloaded newer image for typesense/docsearch-scraper:latest
INFO:scrapy.utils.log:Scrapy 2.8.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 23.0.0 (OpenSSL 3.0.8 7 Feb 2023), cryptography 39.0.1, Platform Linux-5.10.165-143.735.amzn2.x86_64-x86_64-with-glibc2.35
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
'LOG_ENABLED': '1',
'LOG_LEVEL': 'ERROR',
'TELNETCONSOLE_ENABLED': False,
'USER_AGENT': 'Algolia DocSearch Crawler'}
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:89: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
warn(
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:53: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
self.fingerprinter = fingerprinter or RequestFingerprinter()
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.24.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.21.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.2/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.18.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.19.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.17.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.1/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.15.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.14.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.13.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.12.0/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/guide/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.11.2/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/overview/> (referer: None)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.20.0/> (referer: None)
DEBUG:scrapy.dupefilters:Filtered duplicate request: <GET https://typesense.org/docs/0.24.0/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.0/api/> (referer: https://typesense.org/docs/0.23.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.1/api/> (referer: https://typesense.org/docs/0.22.1/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.18.0/api/> (referer: https://typesense.org/docs/0.18.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.2/api/> (referer: https://typesense.org/docs/0.22.2/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.21.0/api/> (referer: https://typesense.org/docs/0.21.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.23.1/api/> (referer: https://typesense.org/docs/0.23.1/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.24.0/api/> (referer: https://typesense.org/docs/0.24.0/)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.23.0/api/ 54 records)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.12.0/api/> (referer: https://typesense.org/docs/0.12.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.19.0/api/> (referer: https://typesense.org/docs/0.19.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.13.0/api/> (referer: https://typesense.org/docs/0.13.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.14.0/api/> (referer: https://typesense.org/docs/0.14.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.15.0/api/> (referer: https://typesense.org/docs/0.15.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.0/api/> (referer: https://typesense.org/docs/0.16.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.22.0/api/> (referer: https://typesense.org/docs/0.22.0/)
DEBUG:scrapy.core.engine:Crawled (200) <GET https://typesense.org/docs/0.16.1/api/> (referer: https://typesense.org/docs/0.16.1/)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.22.1/api/ 51 records)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.18.0/api/ 6 records)
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making post /collections/typesense_docs_1677084390/documents/import
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "POST /collections/typesense_docs_1677084390/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
> DocSearch: https://typesense.org/docs/0.22.2/api/ 55 records)
.
.
.
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 77769,
'downloader/request_count': 277,
'downloader/request_method_count/GET': 277,
'downloader/response_bytes': 2140857,
'downloader/response_count': 277,
'downloader/response_status_count/200': 276,
'downloader/response_status_count/404': 1,
'dupefilter/filtered': 10089,
'elapsed_time_seconds': 453.931499,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 2, 22, 16, 54, 6, 607301),
'httpcompression/response_bytes': 11532228,
'httpcompression/response_count': 277,
'memusage/max': 120295424,
'memusage/startup': 68550656,
'request_depth_max': 3,
'response_received_count': 277,
'scheduler/dequeued': 277,
'scheduler/dequeued/memory': 277,
'scheduler/enqueued': 277,
'scheduler/enqueued/memory': 277,
'start_time': datetime.datetime(2023, 2, 22, 16, 46, 32, 675802)}
INFO:scrapy.core.engine:Spider closed (finished)
DEBUG:typesense.api_call:Making get /aliases/typesense_docs
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "GET /aliases/typesense_docs HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making put /aliases/typesense_docs
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "PUT /aliases/typesense_docs HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
DEBUG:typesense.api_call:Making delete /collections/typesense_docs_1677081767
DEBUG:typesense.api_call:Try 1 to node x3s805zrawjuod9fp.a1.typesense.net:443 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): x3s805zrawjuod9fp.a1.typesense.net:443
DEBUG:urllib3.connectionpool:https://x3s805zrawjuod9fp.a1.typesense.net:443 "DELETE /collections/typesense_docs_1677081767 HTTP/1.1" 200 None
DEBUG:typesense.api_call:x3s805zrawjuod9fp.a1.typesense.net:443 is healthy. Status code: 200
Nb hits: 9097
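The escaped CONFIG value in the command above is hard to read and easy to mistype. As a rough sketch (not part of the scraper itself), the same JSON can be built from a plain Python dictionary and serialized with `json.dumps`; the keys and values are copied from the command, while the helper name `build_config_env` is made up for illustration.

```python
import json

# Version list copied from the docker run command in this thread.
versions = [
    "0.24.0", "0.23.1", "0.23.0", "0.22.2", "0.22.1", "0.22.0",
    "0.21.0", "0.20.0", "0.19.0", "0.18.0", "0.17.0", "0.16.1",
    "0.16.0", "0.15.0", "0.14.0", "0.13.0", "0.12.0", "0.11.2",
]

config = {
    "index_name": "typesense_docs",
    "start_urls": [
        {
            "url": "https://typesense.org/docs/(?P<version>.*?)/",
            "variables": {"version": versions},
        },
        {"url": "https://typesense.org/docs/overview/"},
        {"url": "https://typesense.org/docs/guide/"},
    ],
    "selectors": {
        "default": {
            "lvl0": ".content__default h1",
            "lvl1": ".content__default h2",
            "lvl2": ".content__default h3",
            "lvl3": ".content__default h4",
            "lvl4": ".content__default h5",
            "text": ".content__default p, .content__default ul li, "
                    ".content__default table tbody tr",
        }
    },
    "scrape_start_urls": False,
    "strip_chars": " .,;:#",
}


def build_config_env(cfg):
    """Hypothetical helper: serialize the config as compact JSON."""
    return json.dumps(cfg, separators=(",", ":"))


config_json = build_config_env(config)
print(config_json)
```

The printed string is what the CONFIG environment variable would carry; how it is handed to `docker run` (shell quoting, an env file, or a CI variable) depends on the setup.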
@jasonbosco Thanks for looking at this. I have managed to build a 0.3.4 image and push it to our own registry, and that's got us up and running again. I couldn't pull the Docker tag you mentioned because of a "manifest not found/manifest unknown" error. I think it's likely something related to the CI environment. It runs Docker in Docker: it builds a Docusaurus container that includes Typesense, runs that container, then spins up the scraper to scrape the Docusaurus site before imaging the results and deploying the Docusaurus container with the updated index. I'd have to recreate this environment and the problem, maybe via Docker Compose, and send that through. Until then, maybe we wait and see if anyone else hits this error. Increasing the CPU and memory capacity of the host didn't help.
I see, thank you for that additional context. Could you check if this build works: #28 (comment)?
@jasonbosco yes,
May I know what version of Docker engine you're using?
Docker version 20.10.9, build c2ea9bc
I got the same error. After upgrading Docker to the latest version, the error went away.
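Since the last few comments come down to which Docker version is in use, here is a small sketch for turning a version banner like the one quoted above into a comparable tuple, for example to log it at the start of a CI job. The function name `parse_docker_version` is hypothetical, and no minimum version is encoded because the thread does not say which release resolved the error.

```python
import re


def parse_docker_version(banner):
    """Extract (major, minor, patch) from a 'Docker version X.Y.Z, build ...' line."""
    match = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"unrecognized version banner: {banner!r}")
    return tuple(int(part) for part in match.groups())


# Banner text as reported in this thread.
reported = parse_docker_version("Docker version 20.10.9, build c2ea9bc")
print(reported)
```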
Description
New and old CI jobs running Docker image typesense/docsearch-scraper are failing with RuntimeError("cannot join thread before it is started")
This is also failing old jobs that previously ran, so I think it's the implicit use of typesense/docsearch-scraper:latest, being 0.4.0 that was just put up. Since there's no tag - can we get a tag for typesense/docsearch-scraper:0.3.4 so I can pin to that?
Steps to reproduce
I am running the following command in GitLab CI using a container based on Alpine
Expected Behavior
The scraper indexes the site running on the same host
Actual Behavior
The scraper fails; the last error reported is
RuntimeError("cannot join thread before it is started")
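For context, this message comes from Python's standard `threading` module: `Thread.join()` raises it when called on a thread that has not been started. The sketch below reproduces the message in isolation. It is not taken from the scraper's code, and nothing in this thread establishes where in the scraper the call happens.

```python
import threading


def join_unstarted_thread():
    """Return the message raised by joining a thread that was never started."""
    worker = threading.Thread(target=lambda: None)
    try:
        worker.join()  # start() was never called on this thread
    except RuntimeError as exc:
        return str(exc)
    return None


message = join_unstarted_thread()
print(message)
```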
Metadata
Typesense Version:
'latest'
OS:
Alpine Linux v3.12
docsearch-scraper.log
DocSearch.config.json.txt