Sitemappy (or sitemap-py 😉) is a CLI tool that crawls a given website to produce a sitemap.

dan-wilton/sitemappy

📖 About

Sitemappy (or sitemap-py 😉) is a crawler that produces a sitemap for a given website.

Sitemappy is a command-line application, and also provides Python interfaces for use as a library.

Features

  • Print the URL for a given website when visited
  • Print the links for a given webpage
  • Visit the links for a given webpage
  • Limit the links to follow on a webpage to the same single subdomain
  • Concurrency (asyncio, multithreading, multiprocessing)
  • Output crawling results to file by default (results too long for console)
  • Modify number of async crawler workers
  • Specify crawling depth
  • Crawling politeness argument
  • Follow HTTP redirect responses
  • HTTP error response handling
  • Add DEBUG, INFO and ERROR logging
  • Adhere to a website's robots.txt
  • "Spider Trap" resilience
  • Introduce multiprocessing
  • Distributed multiprocessing
  • Publish to PyPI 🚀
  • GitHub Workflows (deploy)
  • GitHub Workflows (linting, unit testing, dev deployments)
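The link-related features above (print the links on a page, keep only links on the same subdomain) can be illustrated with a minimal standard-library sketch. This is not Sitemappy's own code: `LinkCollector` and `same_host_links` are hypothetical names, and the sketch assumes that "same subdomain" means an exact hostname match.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(page_url: str, html: str) -> list[str]:
    """Return absolute, de-duplicated links that share the page's hostname."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).hostname  # assumption: exact hostname match
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links, then drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.hostname == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Matching on the full hostname keeps a crawl of `example.com` from wandering into `blog.example.com`; a real crawler may choose a looser or stricter rule.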

🚀 Usage

Generate a sitemap (written to ./results.json):

sitemappy-cli https://monzo.com/

Help

$ sitemappy-cli --help
usage: sitemappy-cli [-h] BASE_URL

Sitemappy is a CLI tool to crawl a website and create a sitemap.
For more information about the tool go to https://github.com/dan-wilton/sitemappy/

Arguments:
  BASE_URL              a valid website URL to sitemap [required]

Options:
  --workers           INTEGER     Number of workers to asynchronously 
                                  make web requests [default: 10]
  
  --crawl-depth       INTEGER     Depth of links from base URL to follow
                                  [default: 0 - unlimited]
  
  --politeness-delay  INTEGER     Delay between each request to the website
                                  [default: 0 - none]
  
  --enable-cmd-out                Print output to cmd
  
  --help                          show this help message and exit
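To make the `--workers`, `--crawl-depth` and `--politeness-delay` options more concrete, here is a rough asyncio sketch of how a worker pool with a depth limit and a per-request delay could be wired together. It is an illustration under stated assumptions, not Sitemappy's implementation: `crawl`, `fetch_links`, `SITE` and `fake_fetch` are hypothetical, the fetcher is an in-memory stand-in so the example runs without network access, and a depth of `0` is taken to mean unlimited, as in the help text.

```python
import asyncio


async def crawl(base_url, fetch_links, workers=10, crawl_depth=0, politeness_delay=0):
    """Breadth-first crawl; returns {url: [links found on that page]}."""
    sitemap: dict[str, list[str]] = {}
    seen = {base_url}  # URLs already queued, so loops are visited only once
    queue: asyncio.Queue = asyncio.Queue()
    queue.put_nowait((base_url, 0))

    async def worker():
        while True:
            url, depth = await queue.get()
            try:
                try:
                    links = await fetch_links(url)
                except Exception:
                    links = []  # a failed page is recorded with no links
                sitemap[url] = links
                # crawl_depth == 0 is treated as "unlimited" (assumption).
                if crawl_depth == 0 or depth < crawl_depth:
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
                if politeness_delay:
                    await asyncio.sleep(politeness_delay)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return sitemap


# Hypothetical in-memory "website" used in place of real HTTP requests.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


async def fake_fetch(url):
    return SITE.get(url, [])
```

Calling `asyncio.run(crawl("https://example.com/", fake_fetch, workers=3))` visits all four pages, while `crawl_depth=1` stops after the base page and its direct links. In this sketch the delay is applied per worker, so the overall request rate still grows with the number of workers.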

🎒 Requirements

Python 3.12+

Development

PDM

💻 Installation

To use the sitemappy CLI:

pip install --user -U sitemappy-cli

Local Development / Contributing

pdm install

Run the tests with:

pytest -v

Python Library

Use sitemappy in your project with one of the following:

with pip:

pip install -U sitemappy-cli

with PDM:

pdm add sitemappy-cli

with Poetry >= 1.2.0:

poetry add sitemappy-cli
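Once the package is installed, one way to use it from Python without relying on any undocumented library interface is to invoke the CLI as a subprocess. The sketch below only uses the flags listed in the help output; `build_command` and `run_sitemappy` are hypothetical helper names, and it assumes the `sitemappy-cli` executable is on your `PATH`.

```python
import subprocess


def build_command(base_url, workers=None, crawl_depth=None,
                  politeness_delay=None, enable_cmd_out=False):
    """Build the sitemappy-cli argument list from the documented options."""
    cmd = ["sitemappy-cli"]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if crawl_depth is not None:
        cmd += ["--crawl-depth", str(crawl_depth)]
    if politeness_delay is not None:
        cmd += ["--politeness-delay", str(politeness_delay)]
    if enable_cmd_out:
        cmd.append("--enable-cmd-out")
    cmd.append(base_url)
    return cmd


def run_sitemappy(base_url, **options):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(base_url, **options), check=True)
```

For example, `run_sitemappy("https://monzo.com/", workers=5, crawl_depth=2)` would run `sitemappy-cli --workers 5 --crawl-depth 2 https://monzo.com/`.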

macOS

NOTE: This is not yet enabled 😒

via homebrew:

brew install sitemappy-cli
