
Add ignore_http_error_status_codes and additional_http_error_status_codes arguments to PlaywrightCrawler #953

Open
Pijukatel opened this issue Feb 3, 2025 · 5 comments · May be fixed by #959
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@Pijukatel
Contributor

Pijukatel commented Feb 3, 2025

Currently, the arguments that control how different HTTP status codes are handled are available only to the static HTTP-based crawlers. These arguments can be passed to the crawler's __init__, but they are not available in PlaywrightCrawler. If someone wants to, for example, ignore the 403 status code:

crawler = ParselCrawler(..., ignore_http_error_status_codes={403})

but in PlaywrightCrawler they have to do something like this:

crawler = PlaywrightCrawler(...)
crawler._http_client._ignore_http_error_status_codes = {403}

That is very confusing, and users are unlikely to even discover it. The PlaywrightCrawler behavior should be aligned with the other crawlers, and it should be possible to set these arguments in __init__.
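The intended semantics of the two arguments can be sketched as a small self-contained helper. This is a sketch only: the `is_http_error` name and the "status >= 400 is an error" default rule are assumptions for illustration, not crawlee's actual implementation.

```python
def is_http_error(
    status_code: int,
    *,
    ignore_http_error_status_codes: frozenset[int] = frozenset(),
    additional_http_error_status_codes: frozenset[int] = frozenset(),
) -> bool:
    """Decide whether a response status code should be treated as an error."""
    # Codes explicitly ignored are never treated as errors.
    if status_code in ignore_http_error_status_codes:
        return False
    # Codes explicitly added are always treated as errors.
    if status_code in additional_http_error_status_codes:
        return True
    # Assumed default rule: 4xx/5xx responses are errors.
    return status_code >= 400
```

With a rule like this, `ignore_http_error_status_codes={403}` stops a 403 from aborting the request, while `additional_http_error_status_codes` lets sub-400 codes be escalated to errors.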

@Pijukatel Pijukatel added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Feb 3, 2025
@janbuchar
Collaborator

In PlaywrightCrawler, they could do

crawler = PlaywrightCrawler(
  ...,
  http_client=HttpxHttpClient(
    ignore_http_error_status_codes={403}
  )
)

...which is less bad than touching protected attributes.

But this still does something different from the ParselCrawler example. In ParselCrawler, the http client is used for fetching the website itself, while in PlaywrightCrawler it is only used for context.send_request. So even if you ignore the 403 status code, it has no effect on what the crawler does during regular crawling, which is also confusing.

To handle http status codes received during navigation, we'd have to implement this separately for PlaywrightCrawler.

@Pijukatel
Contributor Author

Pijukatel commented Feb 3, 2025

> In PlaywrightCrawler, they could do
>
>     crawler = PlaywrightCrawler(
>       ...,
>       http_client=HttpxHttpClient(
>         ignore_http_error_status_codes={403}
>       )
>     )
>
> ...which is less bad than touching protected attributes.
>
> But this still does something different from the ParselCrawler example. In ParselCrawler, the http client is used for fetching the website itself, while in PlaywrightCrawler it is only used for context.send_request. So even if you ignore the 403 status code, it has no effect on what the crawler does during regular crawling, which is also confusing.
>
> To handle http status codes received during navigation, we'd have to implement this separately for PlaywrightCrawler.

PlaywrightCrawler already handles status codes received during navigation, but in a somewhat non-obvious way:
https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_playwright/_playwright_crawler.py#L255

It inherits _is_session_blocked_status_code from BasicCrawler, which looks into self._http_client.additional_blocked_status_codes. That is why even navigation status codes can be handled through crawler._http_client.

(I can even imagine a use case where the "main navigation" additional_blocked_status_codes differ from crawler._http_client.additional_blocked_status_codes, which currently is not possible.)

@janbuchar
Collaborator

> PlaywrightCrawler already handles status codes received during navigation, but in a somewhat non-obvious way: https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_playwright/_playwright_crawler.py#L255
>
> It inherits _is_session_blocked_status_code from BasicCrawler, which looks into self._http_client.additional_blocked_status_codes. That is why even navigation status codes can be handled through crawler._http_client.

I see. Sorry for lying then! I think this code deserves some serious refactoring; making PlaywrightCrawler behavior depend on the internals of the http client is not optimal.

Also, it looks like this is kinda related to #830.

> (I can even imagine a use case where the "main navigation" additional_blocked_status_codes differ from crawler._http_client.additional_blocked_status_codes, which currently is not possible.)

So navigation would have a different set of "blocked status codes" than send_request?

@Pijukatel
Contributor Author

> So navigation would have a different set of "blocked status codes" than send_request?

That is up for discussion. I can construct an imaginary scenario for it in my head, but maybe it is just theoretical and there is no actual need for it. So maybe it is better to make them the same initially and separate them only if required by users.

@B4nan
Member

B4nan commented Feb 3, 2025

Let's not deal with that now.

> So maybe it is better to make them the same initially and separate them only if required by users.

👍
