
Add ignore_http_error_status_codes and additional_http_error_status_codes arguments to PlaywrightCrawler #953

Open
Pijukatel opened this issue Feb 3, 2025 · 5 comments · May be fixed by #959
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@Pijukatel
Contributor

Pijukatel commented Feb 3, 2025

Currently, the arguments that control how different HTTP status codes are handled are available only to the static HTTP-based crawlers. These arguments can be passed to the crawler's __init__, but they are not available in PlaywrightCrawler. If someone wants to, for example, ignore the 403 status code:

crawler = ParselCrawler(..., ignore_http_error_status_codes={403})

but in PlaywrightCrawler they have to do something like this:

crawler = PlaywrightCrawler(...)
crawler._http_client._ignore_http_error_status_codes = {403}

That is very confusing, and users are unlikely to even discover it. The PlaywrightCrawler behavior should be aligned with the other crawlers, and it should be possible to set these arguments in __init__.
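The intended semantics of the two arguments can be sketched as a small self-contained helper. This is a sketch only: the `is_http_error` name and the "status >= 400 is an error" default rule are assumptions for illustration, not crawlee's actual implementation.

```python
def is_http_error(
    status_code: int,
    *,
    ignore_http_error_status_codes: frozenset[int] = frozenset(),
    additional_http_error_status_codes: frozenset[int] = frozenset(),
) -> bool:
    """Decide whether a response status code should be treated as an error."""
    # Codes explicitly ignored are never treated as errors.
    if status_code in ignore_http_error_status_codes:
        return False
    # Codes explicitly added are always treated as errors.
    if status_code in additional_http_error_status_codes:
        return True
    # Assumed default rule: 4xx/5xx responses are errors.
    return status_code >= 400
```

With a rule like this, `ignore_http_error_status_codes={403}` stops a 403 from aborting the request, while `additional_http_error_status_codes` lets sub-400 codes be escalated to errors.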

@Pijukatel Pijukatel added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Feb 3, 2025
@janbuchar
Collaborator

In PlaywrightCrawler, they could do

crawler = PlaywrightCrawler(
  ...,
  http_client=HttpxHttpClient(
    ignore_http_error_status_codes={403}
  )
)

...which is less bad than touching protected attributes.

But this still does something different from the ParselCrawler example. In ParselCrawler, the http client is used for fetching the website itself, while in PlaywrightCrawler it is only used for context.send_request. So even if you ignore the 403 status code, it has no effect on what the crawler does during regular crawling, which is also confusing.

To handle http status codes received during navigation, we'd have to implement this separately for PlaywrightCrawler.

@Pijukatel
Contributor Author

Pijukatel commented Feb 3, 2025

> In PlaywrightCrawler, they could do
>
>     crawler = PlaywrightCrawler(
>       ...,
>       http_client=HttpxHttpClient(
>         ignore_http_error_status_codes={403}
>       )
>     )
>
> ...which is less bad than touching protected attributes.
>
> But this still does something different from the ParselCrawler example. In ParselCrawler, the http client is used for fetching the website itself, while in PlaywrightCrawler it is only used for context.send_request. So even if you ignore the 403 status code, it has no effect on what the crawler does during regular crawling, which is also confusing.
>
> To handle http status codes received during navigation, we'd have to implement this separately for PlaywrightCrawler.

PlaywrightCrawler already handles status codes received during navigation, but in a somewhat non-obvious way:
https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_playwright/_playwright_crawler.py#L255

It inherits _is_session_blocked_status_code from BasicCrawler, which looks into self._http_client.additional_blocked_status_codes. That is why even navigation status codes can be handled through crawler._http_client.

(I can even imagine a use case where the "main navigation" additional_blocked_status_codes differ from crawler._http_client.additional_blocked_status_codes, which currently is not possible.)

@janbuchar
Collaborator

> PlaywrightCrawler already handles status codes received during navigation, but in a somewhat non-obvious way: https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_playwright/_playwright_crawler.py#L255
>
> It inherits _is_session_blocked_status_code from BasicCrawler, which looks into self._http_client.additional_blocked_status_codes. That is why even navigation status codes can be handled through crawler._http_client.

I see. Sorry for lying then! I think this code deserves some serious refactoring; making PlaywrightCrawler behavior depend on the internals of the http client is not optimal.

Also, it looks like this is kinda related to #830.

> (I can even imagine a use case where the "main navigation" additional_blocked_status_codes differ from crawler._http_client.additional_blocked_status_codes, which currently is not possible.)

So navigation would have a different set of "blocked status codes" than send_request?

@Pijukatel
Contributor Author

> So navigation would have a different set of "blocked status codes" than send_request?

That is up for discussion. I can construct an imaginary scenario for it in my head, but maybe it is just theoretical and there is no actual need for it. So maybe it is better to make them the same initially and separate them only if required by users.

@B4nan
Member

B4nan commented Feb 3, 2025

Let's not deal with that now.

> So maybe it is better to make them the same initially and separate them only if required by users.

👍
