Root relative path infinite loop #248
Replies: 4 comments 4 replies
-
Here is another example that bypasses the depth limit. It crawls indefinitely without ever increasing the path segment depth (i.e., even if you had a depth limit of 5):
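A hypothetical server along these lines (an illustrative sketch, not the exact reproduction) keeps handing out fresh links that never get deeper than two path segments:

```python
# Illustrative sketch only: every response links to a brand-new URL at the same
# path depth, so segment-based depth never grows while the set of unique URLs
# is unbounded.
import uuid
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return '<a href="/page/start">start</a>'

@app.route("/page/<token>")
def page(token):
    # Root-relative link to a fresh token: always exactly two path segments deep.
    return f'<a href="/page/{uuid.uuid4().hex}">next</a>'
```

A crawler whose limit only counts path segments never stops here, because every URL it discovers is exactly two segments deep.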
-
Hi @phughesion, this is how we calculate the depth. We do not want to do anything with the link depth in this situation. Please convert this into a discussion. Spider keeps track of the crawl by URL. This is not a bug; it is how we want to handle web crawling.
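(Roughly speaking, segment-based depth amounts to counting path components; a simplified sketch, assuming spider's real internals differ in the details:)

```python
# Simplified sketch of segment-based depth: the depth of a URL is derived from
# the number of components in its path, not from how many links were followed.
from urllib.parse import urlparse

def segment_depth(url: str) -> int:
    return len([p for p in urlparse(url).path.split("/") if p])

print(segment_depth("https://example.com/a/b/c"))    # 3
print(segment_depth("https://example.com/page/abc")) # 2, however many such pages exist
```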
-
I have found an example of a crawler that handles both root-relative and base-relative URLs and also has the link-depth feature I was talking about: Crawlee.
This crawler crawls directory listings as I would expect, and will also stop after reaching the max link depth.
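To make the distinction concrete, here is how the two link styles resolve, using urllib.parse.urljoin as a stand-in (not Crawlee's or spider's actual code):

```python
# Root-relative vs. base-relative resolution, sketched with the standard library.
from urllib.parse import urljoin

base = "https://example.com/docs/guide/"

print(urljoin(base, "/assets/logo.png"))  # root-relative  -> https://example.com/assets/logo.png
print(urljoin(base, "intro.html"))        # base-relative  -> https://example.com/docs/guide/intro.html
print(urljoin(base, "../api/"))           # base-relative  -> https://example.com/docs/api/
```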
These two features would really add a lot of utility to spider-rs. I really appreciate how blazingly fast it is, but it currently doesn't fit my use case without them. I need to crawl deep directory structures without crawling forever. Crawlee's results for reference:
-
@phughesion thank you for the interest and the research on this. We now support relative directories.
-
I am opening a separate issue for this because, while it is related to this one, it concerns the currently implemented root-relative-only traversal mechanism.
Python web server for testing: malweb.py
Run with: python3 -m flask --app malweb run
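A rough sketch of the kind of app this refers to (illustrative only; the actual malweb.py may differ):

```python
# malweb.py: illustrative sketch, not the actual test server.
# A catch-all route that always serves one more root-relative link, so a crawler
# that keeps following it sees /loop/, /loop/loop/, /loop/loop/loop/, ... forever.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def catch_all(path):
    deeper = request.path.rstrip("/") + "/loop/"
    return f'<a href="{deeper}">deeper</a>'
```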
Spider:
The infinite recursion problem affects both root-relative and base-relative URLs. Spider should handle this by keeping track of the link depth and stopping once the configured link depth is reached. There should be detection of this behavior, rather than the current depth process that only counts path segments.
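As a sketch of what I mean by link depth (hops from the seed URL, independent of path shape), here is a generic illustration; this is not spider's or Crawlee's actual implementation:

```python
# Generic breadth-first crawl that stops by link depth (hops from the seed),
# not by the number of path segments in each URL.
from collections import deque
from urllib.parse import urljoin

def crawl(seed, fetch_links, max_link_depth=5):
    """fetch_links(url) is assumed to return the hrefs found on that page."""
    seen = {seed}
    queue = deque([(seed, 0)])              # (url, hops from the seed)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_link_depth:         # stop by hops, regardless of path shape
            continue
        for href in fetch_links(url):
            nxt = urljoin(url, href)        # resolves root- and base-relative links
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```

With this scheme, a server like the one sketched above is only followed for max_link_depth hops and the crawl then ends, even though the server could keep generating new URLs forever.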