# Junior Guru scrapers
This repository contains the source code of all scrapers Junior Guru needs to function. The scrapers are implemented using the Scrapy framework, and although there are some customizations, most Scrapy conventions should work out of the box. Contributing a new scraper shouldn't be hard if you have some knowledge of how Scrapy works.

The scrapers are then deployed to the Apify platform as so-called actors. The code here works as a monorepo for Apify actors and diverges quite significantly from the Scrapy template Apify provides. Deploying a new scraper to Apify is a manual process, documented below.

Code in this repository is executed by Apify, on their infrastructure. The main Junior Guru codebase then gets the scraped data in the form of datasets available through the Apify API.
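For illustration, consuming such a dataset can look roughly like the following sketch, which uses the official `apify-client` package. The token and dataset ID are placeholders, and the actual code in the main Junior Guru codebase may differ:

```python
# A sketch, not the actual Junior Guru code: reading a scraper's output
# dataset through the Apify API using the official apify-client package.
from apify_client import ApifyClient

client = ApifyClient(token="...")  # placeholder token

# Placeholder dataset ID; real IDs come from the actor's runs
for item in client.dataset("datasetId").iterate_items():
    print(item)
```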
## Development

Use Scrapy's `crawl` command or its shell. Plucker has a `crawl` CLI command, which you can also use, but it's more useful for integrating with Apify than for the actual development of the scraper. After each scraper run you can check the contents of the `items.json` file to see if your scraper works correctly.
Sometimes scrapers need input data. Plucker's `crawl` CLI command can pass parameters to your scraper through the `--params` option. Use shell to pass ad-hoc data:

```
$ echo '{"links": [{"url": "https://junior.guru"}]}' | plucker crawl job-links --params
```

Use a file for more complex input:

```
$ plucker crawl job-links --params < params.json
```

You can also run it simply as `plucker crawl job-links --params` and type the JSON manually as the standard input of the command.
At Apify, whatever is set as the actor input gets passed down as params (except for the proxy settings). Don't forget to add an input schema to your `actor.json`. You can access the params inside the spider class as `self.settings.get("SPIDER_PARAMS")`.
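For example, a minimal hypothetical spider consuming the `links` params from the `job-links` example above could look like this (the real `job-links` spider in this repository likely differs):

```python
import scrapy


class Spider(scrapy.Spider):
    name = "job-links"

    def start_requests(self):
        # Whatever was passed via --params (or set as the actor input
        # on Apify) is available as the SPIDER_PARAMS setting
        params = self.settings.get("SPIDER_PARAMS") or {}
        for link in params.get("links", []):
            yield scrapy.Request(link["url"], callback=self.parse)

    def parse(self, response):
        self.logger.info("Scraped %s", response.url)
```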
## Creating a new scraper

Look at existing code and follow conventions. Creating a new scraper, e.g. `gravel-bikes`:

- Should the new scraper produce items not yet known to this codebase, such as bikes, go to `jg/plucker/items.py` and add a new Scrapy Item class, e.g. `GravelBike` (see the sketch after this list). Run `plucker schemas` to generate a schema for Apify. Should the new scraper produce items already known to this codebase, such as jobs, you can skip this step.
- Run `plucker new` and answer the questions. It is a cookiecutter. It takes the `scraper_template` directory and creates a scaffolding of a new scraper for you.
- Fill the newly created `jg/plucker/gravel_bikes/spider.py` file with the implementation of your scraper (see the sketch after this list). See the Scrapy documentation: Tutorial, Spiders. You can also learn scraping from Apify's Web scraping basics course.
- Make sure the spider produces instances of the selected Item subclass, e.g. `GravelBike`.
- Run the spider with `scrapy crawl gravel-bikes`. Learn about Scrapy's `crawl` command or its shell. Develop and debug.
- Test the spider, i.e. create a `tests/gravel_bikes` directory with `test_spider.py` inside, and optionally with some test fixtures (static HTML files etc.) around.
- Push all your code to GitHub.
- Run `plucker deploy gravel-bikes`.
- Go to the Apify Console and verify everything went well.
- Go to the Builds tab and start a build.
- Go to the Runs tab and try a first run.
- Go to the Schedules page and assign your new actor to an existing schedule or create a new one.
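To make the steps above more concrete, here is a hypothetical sketch of both pieces, the Item class and the spider. The `GravelBike` fields, the URL, and the CSS selectors are all made up for illustration:

```python
import scrapy


# jg/plucker/items.py: a new Scrapy Item class (fields are made up)
class GravelBike(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()


# jg/plucker/gravel_bikes/spider.py: a spider yielding GravelBike items
class Spider(scrapy.Spider):
    name = "gravel-bikes"
    start_urls = ["https://example.com/bikes"]  # made-up URL

    def parse(self, response):
        for bike in response.css(".bike"):  # made-up selector
            yield GravelBike(
                name=bike.css(".name::text").get(),
                price=bike.css(".price::text").get(),
                url=response.urljoin(bike.css("a::attr(href)").get()),
            )
```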
## Builds

There is a nightly GitHub Action which re-builds all actors based on the current code in the `main` branch. This is because Apify's built-in automatic builds didn't work properly, but also because it would be undesirable to waste resources when committing code often.
## Monitoring

There is a nightly GitHub Action which checks whether each actor's last run finished with success. If any of them didn't, the GitHub Action fails, which triggers an e-mail notification. Apify used to send a summary e-mail about actor runs, but they removed that feature and there was no equivalent at the time. Maybe there is now, but the monitoring is already implemented, so…
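Conceptually, the check boils down to something like the following sketch, using the official `apify-client` package. The token and the actor ID are placeholders, and the actual GitHub Action may be implemented differently:

```python
from apify_client import ApifyClient

client = ApifyClient(token="...")  # placeholder token

for actor_id in ["username/job-links"]:  # placeholder actor IDs
    # Ask the Apify API for the actor's most recent run and fail
    # loudly on anything other than success
    run = client.actor(actor_id).last_run().get()
    if run is None or run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Last run of {actor_id} didn't succeed")
```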
## Maintenance

- Use Poetry for dependency management. Run `poetry install`.
- It is preferred to pin exact versions of dependencies, without `^`, and let GitHub's Dependabot upgrade dependencies in Pull Requests. Unfortunately there is no setting in `pyproject.toml` which would force this behavior, so once new dependencies are added, one needs to go and manually remove the `^` characters (see the example after this list).
- Run `pytest` to see if your code has any issues.
- Run `ruff check --fix` and `ruff format` to fix your code.
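For illustration, the difference looks like this in `pyproject.toml` (the package name and version numbers are made up):

```toml
[tool.poetry.dependencies]
# what `poetry add` produces by default; remove the caret manually:
# scrapy = "^2.11.2"
# preferred: pinned exactly, Dependabot proposes upgrades in Pull Requests
scrapy = "2.11.2"
```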
## Glossary

- scraper - Generic name for a program which downloads data from web pages (or APIs). This repository uses the word scraper to refer to a spider & actor combo.
- spider - This is what the Scrapy framework calls the implementation of a scraper.
- actor - This is what the Apify platform calls the implementation of a scraper.
- plucker - Repository of Junior Guru scrapers. In English, a plucker is one who or that which plucks. Naming in Junior Guru is usually poultry-themed, and Honza felt that plucking is a nice analogy to web scraping.
## License

AGPL-3.0-only, copyright (c) 2024 Jan Javorek, and contributors.