Yelp Corpus Generator is a multi-container Docker web scraping application designed to build a database of restaurant reviews for sentiment analysis. Web scraping is performed with Scrapy, JavaScript rendering is handled through Splash, and the results are stored in PostgreSQL. Build status is logged at Travis CI.
Arch Linux
$ sudo pacman -S docker docker-compose
Other Distros
For other distributions, follow the instructions here to add the repository necessary to install docker.
If your preferred package manager does not have a repository for docker-compose, the appropriate binary can be added to your local binary directory with the following command. The downloaded file must then be made executable (for example with chmod +x) before it will run.
$ sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo docker-compose build
Start services and begin scraping with the following command:
$ sudo docker-compose up
TODO
- Improve documentation and docstrings.
- Look into adding Scrapy JOBDIR support.
- Look into adding a wait function.
- Tweak Scrapy RANDOMIZE_DOWNLOAD_DELAY.
- Save dates as datetime objects.
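
As a rough illustration of where some of these TODO items might lead, here is a minimal sketch in Python. JOBDIR, DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are real Scrapy settings, but the values shown are assumptions, not the project's current configuration. The parse_review_date helper and its default date format are hypothetical and would need to match however dates actually appear in the scraped pages.

```python
from datetime import datetime

# Hypothetical additions to the project's Scrapy settings module.
# All values below are assumed for illustration only.
JOBDIR = "crawls/yelp-1"         # directory where Scrapy persists crawl state for pause/resume
DOWNLOAD_DELAY = 2.0             # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # when enabled, Scrapy waits 0.5x to 1.5x DOWNLOAD_DELAY


def parse_review_date(raw: str, fmt: str = "%m/%d/%Y") -> datetime:
    """Convert a scraped date string into a datetime object.

    The default format is an assumption about how the date is written
    on the page; pass a different fmt if the site uses another layout.
    """
    return datetime.strptime(raw.strip(), fmt)
```

A call such as parse_review_date("1/15/2019") would return a datetime that can be handed to the database layer in place of the raw string.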