📖 About

Sitemappy (or sitemap-py 😉) is a crawler that produces a sitemap for a given website.

Sitemappy is a command-line application and can also be used as a Python library.

Features

  • Print the URL for a given website when visited
  • Print the links for a given webpage
  • Visit the links for a given webpage
  • Limit the links to follow on a webpage to the same single subdomain
  • Concurrency (asyncio, multithreading, multiprocessing)
  • Write crawl results to a file by default (results are too long for the console)
  • Modify number of async crawler workers
  • Specify crawling depth
  • Crawling politeness argument
  • Follow HTTP redirect responses
  • HTTP error response handling
  • Add DEBUG, INFO and ERROR logging
  • Adhere to a website's robots.txt
  • "Spider Trap" resilience
  • Introduce multiprocessing
  • Distributed multiprocessing
  • Publish to PyPI 🚀
  • GitHub Workflows (deploy)
  • GitHub Workflows (linting, unit testing, dev deployments)
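To illustrate the "print the links for a given webpage" and "same single subdomain" features, here is a minimal standard-library sketch. It is not Sitemappy's actual implementation: the names `LinkParser` and `same_subdomain_links` are hypothetical, and "same subdomain" is assumed to mean an exact hostname match.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_subdomain_links(page_url, html):
    """Return sorted, de-duplicated links that share page_url's hostname."""
    parser = LinkParser()
    parser.feed(html)
    host = urlsplit(page_url).hostname
    links = set()
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so pages are not counted twice.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        # Assumption: only http(s) links on exactly the same host are followed.
        if parts.scheme in ("http", "https") and parts.hostname == host:
            links.add(absolute)
    return sorted(links)


html = """
<a href="/about">About</a>
<a href="https://example.com/blog#top">Blog</a>
<a href="https://community.example.com/">Forum</a>
<a href="mailto:hi@example.com">Mail</a>
"""
print(same_subdomain_links("https://example.com/", html))
```

In this sketch the link to `community.example.com` and the `mailto:` link are dropped, so only pages on `example.com` itself would be queued for crawling.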

🚀 Usage

Generate a sitemap (written to ./results.json):

sitemappy-cli https://monzo.com/

Help

$ sitemappy-cli --help
usage: sitemappy-cli [-h] BASE_URL

Sitemappy is a CLI tool to crawl a website and create a sitemap.
For more information about the tool go to https://github.com/dan-wilton/sitemappy/

Arguments:
  BASE_URL              a valid website URL to sitemap [required]

Options:
  --workers           INTEGER     Number of workers to asynchronously 
                                  make web requests [default: 10]
  
  --crawl-depth       INTEGER     Depth of links from base URL to follow
                                  [default: 0 - unlimited]
  
  --politeness-delay  INTEGER     Delay between each request to the website
                                  [default: 0 - none]
  
  --enable-cmd-out                Print output to cmd
  
  --help                          show this help message and exit
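As a rough illustration of how the `--workers`, `--crawl-depth` and `--politeness-delay` options could interact, the sketch below runs a small asyncio worker pool over an in-memory site. Everything in it is an assumption made for illustration: `SITE`, `fetch_links` and `crawl` are hypothetical names, no HTTP requests are made, and the delay is applied per worker rather than globally. It is not Sitemappy's library API.

```python
import asyncio

# Hypothetical in-memory "website": page -> links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


async def fetch_links(url):
    """Stand-in for an HTTP request followed by link extraction."""
    await asyncio.sleep(0)
    return SITE.get(url, [])


async def crawl(base_url, workers=10, crawl_depth=0, politeness_delay=0):
    """Crawl from base_url; crawl_depth=0 means unlimited, as in the help text."""
    queue = asyncio.Queue()
    seen = {base_url}  # guards against revisiting pages (e.g. link cycles)
    sitemap = {}
    queue.put_nowait((base_url, 0))

    async def worker():
        while True:
            url, depth = await queue.get()
            try:
                links = await fetch_links(url)
                sitemap[url] = links
                # Only enqueue children while still within the depth limit.
                if crawl_depth == 0 or depth < crawl_depth:
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
                if politeness_delay:
                    await asyncio.sleep(politeness_delay)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # returns once every queued page has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return sitemap


full = asyncio.run(crawl("/", workers=3))
shallow = asyncio.run(crawl("/", workers=3, crawl_depth=1))
print(sorted(full))
print(sorted(shallow))
```

With a depth of 1, only the base page and the pages it links to directly are visited; with the default of 0, the crawl continues until no unseen links remain.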

🎒 Requirements

Python 3.12+

Development

PDM

💻 Installation

To use the sitemappy CLI:

pip install --user -U sitemappy-cli

Local Development / Contributing

pdm install

Run the tests with:

pytest -v

Python Library

Use sitemappy in your project with one of the following:

with pip:

pip install -U sitemappy-cli

with PDM:

pdm add sitemappy-cli

with Poetry >= 1.2.0:

poetry add sitemappy-cli

macOS

NOTE: Installation via Homebrew is not yet enabled 😢

via Homebrew:

brew install sitemappy-cli