Version: 0.1.0
Author: Julian Whiting ([email protected])
Goviq is an application for scraping Canadian government documents, created with downstream RAG (retrieval-augmented generation) pipelines in mind.
- Asynchronous Crawling: Uses `aiohttp` for efficient, non-blocking I/O (see the sketch after this list).
- HTML Parsing: Leverages `BeautifulSoup4` for HTML content extraction.
- Configurable: Define your own crawler subclasses to handle specific sources or data formats.
- Local Caching: Store the fetched or parsed data to JSON files for offline analysis.
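To show how these pieces fit together, here is a minimal sketch of an async fetch, parse, and cache loop using `aiohttp` and `BeautifulSoup4`. The URL, the parsing logic, and the `cache.json` filename are placeholders for illustration and do not reflect Goviq's internal crawler classes.

```python
import asyncio
import json

import aiohttp
from bs4 import BeautifulSoup

# Placeholder URL -- Goviq's real crawlers target specific government sources.
URLS = ["https://www.parl.ca/legisinfo/en/bills"]


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    """Fetch a page without blocking the event loop."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()


async def main() -> None:
    # Fetch all pages concurrently.
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in URLS))

    # Parse each page; extracting the <title> stands in for real parsing logic.
    records = []
    for url, html in zip(URLS, pages):
        soup = BeautifulSoup(html, "html.parser")
        records.append({"url": url, "title": soup.title.string if soup.title else None})

    # Cache the parsed records locally as JSON for offline analysis.
    with open("cache.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    asyncio.run(main())
```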
- Clone this repository (or download the source).
- Make sure you have a modern version of `pip` and `setuptools`:

  ```
  pip install --upgrade pip setuptools
  ```

- Install Goviq. If you have a `pyproject.toml`, you can do:

  ```
  pip install .
  ```

  or, if you prefer the "development/editable" mode:

  ```
  pip install -e .
  ```
After installing, the primary way to build the dataset is by running:
```
python goviq/crawler_poc.py --output_dir .
```

- `goviq/crawler_poc.py` is a script that orchestrates the various crawlers to fetch, parse, and save the data.
- `--output_dir .` tells the script to store the resulting data in the current directory.
- You can change the output directory path as needed.
If you want to call individual crawlers rather than the main script:
```
python -m goviq.scrapers.parl_ca
```
Or programmatically in Python:
```python
from goviq.scrapers.parl_ca import BillCrawler

crawler = BillCrawler()
crawler.crawl()
```
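Since Goviq is meant to be configured through crawler subclasses, a new source could in principle be added by subclassing an existing crawler. The sketch below is illustrative only: treating `BillCrawler` as a base class, the `start_urls` attribute, and the `parse` hook are assumptions about the interface, so check the actual classes under `goviq/scrapers/` for the real override points.

```python
# Illustrative sketch only -- the attribute and hook names below are assumptions,
# not Goviq's documented crawler interface.
from goviq.scrapers.parl_ca import BillCrawler


class OrdersInCouncilCrawler(BillCrawler):
    """Hypothetical crawler for a different government source."""

    # Assumed configuration attribute; the real crawlers may set URLs elsewhere.
    start_urls = ["https://orders-in-council.canada.ca/"]

    def parse(self, html):
        # Assumed override point for source-specific HTML extraction.
        raise NotImplementedError("fill in parsing for this source")


crawler = OrdersInCouncilCrawler()
crawler.crawl()  # assumed to reuse the same entry point as BillCrawler
```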
- Parliament sessions are hardcoded somewhere. Ought to be able to accept a date range or list of sessions to parse (a rough interface sketch follows this list).
- How to handle different versions of acts?
- I don't know if the local cache env var is still needed. I took a long break from developing this :)
- Update README.md with some info about runtime, dataset size, provenance, etc.
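As a rough illustration of the session-selection idea above: the `--sessions` flag and its value format below are hypothetical and are not part of the current CLI (only `--output_dir` is shown in the usage section).

```python
# Sketch of a possible CLI extension; --sessions is hypothetical and not
# currently accepted by goviq/crawler_poc.py.
import argparse

parser = argparse.ArgumentParser(description="Goviq dataset builder (sketch)")
parser.add_argument("--output_dir", default=".", help="Directory to write the dataset to")
parser.add_argument(
    "--sessions",
    nargs="+",
    help="Parliament sessions to crawl, e.g. 43-2 44-1 (assumed format)",
)
args = parser.parse_args()
print(f"Would crawl sessions {args.sessions} into {args.output_dir}")
```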
- Fork this repository.
- Create a feature branch (`git checkout -b feature/my-new-feature`).
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to your branch (`git push origin feature/my-new-feature`).
- Create a pull request.