Version: 0.1.0
Author: Julian Whiting ([email protected])
Goviq is an application for scraping Canadian government documents, created with downstream RAG (retrieval-augmented generation) pipelines in mind.
- Asynchronous Crawling: Uses `aiohttp` for efficient, non-blocking I/O (see the sketch after this list).
- HTML Parsing: Leverages `BeautifulSoup4` for HTML content extraction.
- Configurable: Define your own crawler subclasses to handle specific sources or data formats.
- Local Caching: Store the fetched or parsed data to JSON files for offline analysis.
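To show how these pieces fit together, here is a minimal sketch of an async fetch, parse, and cache loop using `aiohttp` and `BeautifulSoup4`. The URL, the parsing logic, and the `cache.json` filename are placeholders for illustration and do not reflect Goviq's internal crawler classes.

```python
import asyncio
import json

import aiohttp
from bs4 import BeautifulSoup

# Placeholder URL -- Goviq's real crawlers target specific government sources.
URLS = ["https://www.parl.ca/legisinfo/en/bills"]


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    """Fetch a page without blocking the event loop."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()


async def main() -> None:
    # Fetch all pages concurrently.
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in URLS))

    # Parse each page; extracting the <title> stands in for real parsing logic.
    records = []
    for url, html in zip(URLS, pages):
        soup = BeautifulSoup(html, "html.parser")
        records.append({"url": url, "title": soup.title.string if soup.title else None})

    # Cache the parsed records locally as JSON for offline analysis.
    with open("cache.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    asyncio.run(main())
```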
- Clone this repository (or download the source).
- Make sure you have a modern version of `pip` and `setuptools`:

  ```
  pip install --upgrade pip setuptools
  ```

- Install Goviq. If you have a `pyproject.toml`, you can do:

  ```
  pip install .
  ```

  or, if you prefer the "development/editable" mode:

  ```
  pip install -e .
  ```
After installing, the primary way to build the dataset is by running:
```
python goviq/crawler_poc.py --output_dir .
```

- `goviq/crawler_poc.py` is a script that orchestrates the various crawlers to fetch, parse, and save the data.
- `--output_dir .` tells the script to store the resulting data in the current directory.
- You can change the output directory path as needed.
If you want to call individual crawlers rather than the main script:
```
python -m goviq.scrapers.parl_ca
```
Or programmatically in Python:
```python
from goviq.scrapers.parl_ca import BillCrawler

crawler = BillCrawler()
crawler.crawl()
```
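Since Goviq is meant to be configured through crawler subclasses, a new source could in principle be added by subclassing an existing crawler. The sketch below is illustrative only: treating `BillCrawler` as a base class, the `start_urls` attribute, and the `parse` hook are assumptions about the interface, so check the actual classes under `goviq/scrapers/` for the real override points.

```python
# Illustrative sketch only -- the attribute and hook names below are assumptions,
# not Goviq's documented crawler interface.
from goviq.scrapers.parl_ca import BillCrawler


class OrdersInCouncilCrawler(BillCrawler):
    """Hypothetical crawler for a different government source."""

    # Assumed configuration attribute; the real crawlers may set URLs elsewhere.
    start_urls = ["https://orders-in-council.canada.ca/"]

    def parse(self, html):
        # Assumed override point for source-specific HTML extraction.
        raise NotImplementedError("fill in parsing for this source")


crawler = OrdersInCouncilCrawler()
crawler.crawl()  # assumed to reuse the same entry point as BillCrawler
```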
- Parliament sessions are hardcoded somewhere. Ought to be able to accept a date range or list of sessions to parse (a rough interface sketch follows this list).
- How to handle different versions of acts?
- I don't know if the local cache env var is still needed. I took a long break from developing this :)
- Update README.md with some info about runtime, dataset size, provenance, etc.
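As a rough illustration of the session-selection idea above: the `--sessions` flag and its value format below are hypothetical and are not part of the current CLI (only `--output_dir` is shown in the usage section).

```python
# Sketch of a possible CLI extension; --sessions is hypothetical and not
# currently accepted by goviq/crawler_poc.py.
import argparse

parser = argparse.ArgumentParser(description="Goviq dataset builder (sketch)")
parser.add_argument("--output_dir", default=".", help="Directory to write the dataset to")
parser.add_argument(
    "--sessions",
    nargs="+",
    help="Parliament sessions to crawl, e.g. 43-2 44-1 (assumed format)",
)
args = parser.parse_args()
print(f"Would crawl sessions {args.sessions} into {args.output_dir}")
```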
- Fork this repository.
- Create a feature branch (`git checkout -b feature/my-new-feature`).
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to your branch (`git push origin feature/my-new-feature`).
- Create a pull request.