Sitemappy (or sitemap-py 😉) is a crawler that produces a sitemap for a given website.
Sitemappy is a command-line application, and also provides Python interfaces for use as a library.
- Print the URL for a given website when visited
- Print the links for a given webpage
- Visit the links for a given webpage
- Limit the links to follow on a webpage to the same single subdomain
- Concurrency (asyncio, multithreading, multiprocessing)
- Output crawling results to file by default (results too long for console)
- Modify number of async crawler workers
- Specify crawling depth
- Crawling politeness argument
- Follow HTTP redirect responses
- HTTP error response handling
- Add DEBUG, INFO and ERROR logging
- Adhere to a website's robots.txt
- "Spider Trap" resilience
- Introduce multiprocessing
- Distributed multiprocessing
- Publish to PyPI 🚀
- GitHub Workflows (deploy)
- GitHub Workflows (linting, unit testing, dev deployments)
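Several of the features above (printing a page's links, following only links on the same subdomain, and limiting crawl depth) can be illustrated with a minimal standard-library sketch. This is not Sitemappy's actual code or Python interface: `LinkParser`, `same_host_links`, `crawl`, the `fetch` callback, and the in-memory `PAGES` site are all hypothetical names chosen for illustration, and the depth semantics (the base URL counts as depth 0, and `max_depth=0` means unlimited) are an assumption.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(page_url, html):
    """Return the absolute links on a page that share page_url's exact hostname."""
    parser = LinkParser()
    parser.feed(html)
    host = urlsplit(page_url).hostname
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname == host:
            links.add(absolute)
    return sorted(links)


def crawl(base_url, fetch, max_depth=0):
    """Breadth-first crawl from base_url; max_depth=0 means unlimited (assumption)."""
    sitemap = {}
    seen = {base_url}  # guards against revisiting pages (and simple link loops)
    queue = deque([(base_url, 0)])
    while queue:
        url, depth = queue.popleft()
        links = same_host_links(url, fetch(url))
        sitemap[url] = links
        if max_depth and depth >= max_depth:
            continue  # record this page's links but do not follow them
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sitemap


# Hypothetical in-memory site, so the sketch runs without network access.
PAGES = {
    "https://example.com/": '<a href="/about">About</a>'
                            '<a href="https://blog.example.com/">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a><a href="/team">Team</a>',
    "https://example.com/team": "",
}


def fake_fetch(url):
    return PAGES.get(url, "")


full_sitemap = crawl("https://example.com/", fake_fetch)
shallow_sitemap = crawl("https://example.com/", fake_fetch, max_depth=1)
```

Here `full_sitemap` covers all three pages, while `shallow_sitemap` stops following links after one hop from the base URL. The link to `blog.example.com` is dropped in both cases because its hostname differs.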
Generate a sitemap (./results.json):
sitemappy-cli https://monzo.com/
$ sitemappy-cli --help
usage: sitemappy-cli [-h] BASE_URL
Sitemappy is a CLI tool to crawl a website and create a sitemap.
For more information about the tool go to https://github.com/dan-wilton/sitemappy/
Arguments:
BASE_URL a valid website URL to sitemap [required]
Options:
--workers INTEGER Number of workers to asynchronously
make web requests [default: 10]
--crawl-depth INTEGER Depth of links from base URL to follow
[default: 0 - unlimited]
--politeness-delay INTEGER Delay between each request to the website
[default: 0 - none]
--enable-cmd-out Print output to cmd
--help show this help message and exit
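To give a feel for what the `--workers` and `--politeness-delay` options control, here is a rough asyncio sketch of a fixed-size worker pool draining a URL queue. It is an illustration under stated assumptions, not Sitemappy's implementation: `crawl_concurrently`, `fake_fetch`, and the parameter names are hypothetical, and the delay here is applied per worker after each request, which may differ from how the tool paces requests.

```python
import asyncio


async def crawl_concurrently(urls, fetch, workers=10, politeness_delay=0.0):
    """Fetch every URL using a fixed pool of asyncio worker tasks."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        while True:
            url = await queue.get()
            try:
                results[url] = await fetch(url)
            except Exception as exc:
                # Keep the worker alive; record the failure instead.
                results[url] = exc
            finally:
                queue.task_done()
            if politeness_delay:
                # Assumption: each worker pauses after its own request.
                await asyncio.sleep(politeness_delay)

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


async def fake_fetch(url):
    """Stand-in for a real HTTP request, so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


pages = asyncio.run(
    crawl_concurrently(
        ["https://example.com/", "https://example.com/about"],
        fake_fetch,
        workers=2,
    )
)
```

More workers mean more requests in flight at once, while a non-zero delay trades crawl speed for a lighter load on the target site.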
Python 3.12+
To use the sitemappy CLI:
pip install --user -U sitemappy-cli
pdm install
Run the tests with:
pytest -v
Use sitemappy in your project with one of the following:
with pip:
pip install -U sitemappy-cli
with PDM:
pdm add sitemappy-cli
with Poetry >= 1.2.0:
poetry add sitemappy-cli
NOTE: This is not yet enabled 😢
via homebrew:
brew install sitemappy-cli