Root relative path infinite loop #248
Replies: 4 comments 4 replies
-
Here is another example that bypasses the depth limit. It crawls indefinitely without ever increasing the path segment depth (i.e., even if you had a depth limit of 5):
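A hypothetical server along these lines (an illustrative sketch, not the exact reproduction) keeps handing out fresh links that never get deeper than two path segments:

```python
# Illustrative sketch only: every response links to a brand-new URL at the same
# path depth, so segment-based depth never grows while the set of unique URLs
# is unbounded.
import uuid
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return '<a href="/page/start">start</a>'

@app.route("/page/<token>")
def page(token):
    # Root-relative link to a fresh token: always exactly two path segments deep.
    return f'<a href="/page/{uuid.uuid4().hex}">next</a>'
```

A crawler whose limit only counts path segments never stops here, because every URL it discovers is exactly two segments deep.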
-
Hi @phughesion, this is how we calculate the depth. We do not want to do anything with the link depth in this situation. Please convert this into a discussion. Spider keeps track of the crawl by URL. This is not a bug; it is how we want to handle web crawling.
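(Roughly speaking, segment-based depth amounts to counting path components; a simplified sketch, assuming spider's real internals differ in the details:)

```python
# Simplified sketch of segment-based depth: the depth of a URL is derived from
# the number of components in its path, not from how many links were followed.
from urllib.parse import urlparse

def segment_depth(url: str) -> int:
    return len([p for p in urlparse(url).path.split("/") if p])

print(segment_depth("https://example.com/a/b/c"))    # 3
print(segment_depth("https://example.com/page/abc")) # 2, however many such pages exist
```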
-
I have found an example of a crawler that handles both root-relative and base-relative URLs and also has the link-depth feature I was talking about: Crawlee.
This crawler crawls directory listings as I would expect, and will also stop after reaching the max link depth.
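To make the distinction concrete, here is how the two link styles resolve, using urllib.parse.urljoin as a stand-in (not Crawlee's or spider's actual code):

```python
# Root-relative vs. base-relative resolution, sketched with the standard library.
from urllib.parse import urljoin

base = "https://example.com/docs/guide/"

print(urljoin(base, "/assets/logo.png"))  # root-relative  -> https://example.com/assets/logo.png
print(urljoin(base, "intro.html"))        # base-relative  -> https://example.com/docs/guide/intro.html
print(urljoin(base, "../api/"))           # base-relative  -> https://example.com/docs/api/
```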
These two features would really add a lot of utility to spider-rs. I really appreciate how blazingly fast it is, but it currently doesn't fit my use case without them. I need to crawl deep directory structures without crawling forever. Crawlee's results for reference:
-
@phughesion thank you for the interest and the research on this. We now support relative directories.
-
I am opening a separate issue for this because, while it is related to this one, it concerns the currently implemented root-relative-only traversal mechanism.
Python web server for testing: malweb.py
Run with: python3 -m flask --app malweb run
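A rough sketch of the kind of app this refers to (illustrative only; the actual malweb.py may differ):

```python
# malweb.py: illustrative sketch, not the actual test server.
# A catch-all route that always serves one more root-relative link, so a crawler
# that keeps following it sees /loop/, /loop/loop/, /loop/loop/loop/, ... forever.
from flask import Flask, request

app = Flask(__name__)

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def catch_all(path):
    deeper = request.path.rstrip("/") + "/loop/"
    return f'<a href="{deeper}">deeper</a>'
```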
Spider:
The infinite recursion problem affects both root-relative and base-relative URLs. Spider should handle this by keeping track of the link depth and stopping once the configured link depth is reached. There should be detection of this behavior, rather than the current depth process that only counts path segments.
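As a sketch of what I mean by link depth (hops from the seed URL, independent of path shape), here is a generic illustration; this is not spider's or Crawlee's actual implementation:

```python
# Generic breadth-first crawl that stops by link depth (hops from the seed),
# not by the number of path segments in each URL.
from collections import deque
from urllib.parse import urljoin

def crawl(seed, fetch_links, max_link_depth=5):
    """fetch_links(url) is assumed to return the hrefs found on that page."""
    seen = {seed}
    queue = deque([(seed, 0)])              # (url, hops from the seed)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_link_depth:         # stop by hops, regardless of path shape
            continue
        for href in fetch_links(url):
            nxt = urljoin(url, href)        # resolves root- and base-relative links
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```

With this scheme, a server like the one sketched above is only followed for max_link_depth hops and the crawl then ends, even though the server could keep generating new URLs forever.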