Generate a corpus of restaurant reviews for sentiment analysis

tybrs/yelp-corpus-generator

Yelp Corpus Generator

About

Yelp Corpus Generator is a multi-container Docker web scraping application designed to build a database of restaurant reviews for sentiment analysis. Web scraping is performed with Scrapy, with JavaScript rendering through Splash and storage in PostgreSQL. Build status is logged at Travis CI.

docker-services-arch

Installation

Dependencies

Docker and Docker Compose
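
Before installing anything, it can help to see whether both dependencies are already on your `PATH`. The following is a minimal sketch; `check_tool` is a hypothetical helper introduced here for illustration and is not part of this repository.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Check the two dependencies listed above.
for tool in docker docker-compose; do
    check_tool "$tool"
done
```

Any tool reported as missing can be installed with the distribution-specific steps that follow.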

Arch Linux

$ sudo pacman -S docker docker-compose

Other Distros

For other distributions, follow the instructions here to add the repository necessary to install Docker.

If your preferred package manager does not have a repository for docker-compose, the appropriate binary can be added to your local binary directory with the following command.

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
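
A binary downloaded this way usually still needs its executable bit set before it can be run. The sketch below assumes the file was saved to the path used in the command above; `enable_compose` is a hypothetical helper written for illustration, not something the repository provides.

```shell
# Hypothetical helper: make a downloaded docker-compose binary executable
# and print its version. The default path is the one used by the curl
# command above; pass a different path as the first argument if needed.
enable_compose() {
    compose_bin="${1:-/usr/local/bin/docker-compose}"
    # Set the executable bit (run as root if the file is root-owned).
    chmod +x "$compose_bin" || return 1
    # Confirm the binary actually runs.
    "$compose_bin" --version
}
```

Because the download above was done with `sudo`, the file is likely root-owned; in that case the plain equivalent is `sudo chmod +x /usr/local/bin/docker-compose` followed by `docker-compose --version`.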

Build docker image

$ sudo docker-compose build

Configuration

Usage

Start the services and begin scraping with the following command:

$ sudo docker-compose up

Testing

TODO

  • Improve documentation and docstrings.
  • Look into adding Scrapy JOBDIR support.
  • Look into adding a wait function.
  • Tweak Scrapy RANDOMIZE_DOWNLOAD_DELAY.
  • Save dates as datetime objects.
