DemoTrends (http://demotrends.qubole.com)
A Big Data app that displays the topics that are trending on Wikipedia.
There are two main parts:
- Webapp in Ruby on Rails.
- Data pipeline hosted in Qubole Data Service.
You can read more about DemoTrends in this blog post.
- Register for a [Trial Plan](http://www.qubole.com/try) in Qubole
- [Obtain the API key](http://www.qubole.com/qds-api-reference/authentication/)
- Run the commands in the commands directory
Code required to set up the DemoTrends website (http://demotrends.qubole.com)
- Create the database:
./webapp/script/init-mysql.sh
- Run the migrations:
rake db:migrate
- Using Sample Data:
rake db:seed
This inserts one row into each of the tables.
- Using SQL Dump: You can also populate your database from a SQL dump file. The file contains processed data from 30th June 2013 to 13th August 2013.
sudo mysql trend < webapp/db/sqldump/mysqldump_13AUG13.sql
- Start (or restart) the server:
./webapp/script/restart_server.sh
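
The setup steps above can be chained so that a failure stops the run early. The following is only a sketch, not part of the repository: the `run_setup` helper and its `dry_run` flag are hypothetical, and it assumes every command is run from a directory where it resolves (the steps above do not say where `rake` should be invoked, so adjust the working directory as needed). It uses the sample-data option; swap in the SQL dump command if you prefer that route.

```python
import subprocess

# Commands copied from the setup steps, in order (sample-data variant).
SETUP_STEPS = [
    ["./webapp/script/init-mysql.sh"],
    ["rake", "db:migrate"],
    ["rake", "db:seed"],
    ["./webapp/script/restart_server.sh"],
]


def run_setup(dry_run=True):
    """Print each step and, unless dry_run is set, execute it.

    Execution stops at the first failing step because check=True makes
    subprocess.run raise CalledProcessError on a non-zero exit code.
    Returns the list of command lines that were processed.
    """
    processed = []
    for step in SETUP_STEPS:
        line = " ".join(step)
        print("$", line)
        if not dry_run:
            subprocess.run(step, check=True)
        processed.append(line)
    return processed


# Dry run: only prints the commands, nothing is executed.
run_setup(dry_run=True)
```

Call `run_setup(dry_run=False)` to actually execute the steps.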
Directory contains two UDFs required by the data pipeline:
- collect_all - A JAR UDF
- hive_trend_mapper - A Python UDF
Directory contains scripts that are run in a Shell Command.
- pagecount_dump.py - A script to download ONE day's pagecounts data from the Wikimedia website.
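
To give a sense of what "one day's pagecounts data" means, here is a rough sketch that builds the 24 hourly file URLs for a given date. It is not the code in pagecount_dump.py. The base URL and the `pagecounts-YYYYMMDD-HH0000.gz` naming scheme are assumptions about the Wikimedia dump layout; real filenames may differ in the seconds field, so a robust downloader would read the directory index instead. The `pagecount_urls` function is hypothetical.

```python
from datetime import date

# Assumed location of the hourly pagecounts dumps.
BASE_URL = "https://dumps.wikimedia.org/other/pagecounts-raw"


def pagecount_urls(day):
    """Return one URL per hour of `day`, under the assumed naming scheme."""
    return [
        f"{BASE_URL}/{day:%Y}/{day:%Y-%m}/pagecounts-{day:%Y%m%d}-{hour:02d}0000.gz"
        for hour in range(24)
    ]


# Example: list the assumed URLs for the first day covered by the SQL dump.
for url in pagecount_urls(date(2013, 6, 30)):
    print(url)
```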
The commands directory contains all the commands needed to process one day's worth of data. The order of the commands matters: each filename starts with a number that gives its position in the sequence. Run the scripts using the [Qubole Python SDK](https://github.com/qubole/qds-sdk-py).
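
Because the order is encoded as a numeric prefix, a plain alphabetical sort can go wrong once there are ten or more files (`10_...` sorts before `2_...`). The sketch below shows one way to order the files numerically before submitting them. The `ordered_commands` helper and the example filenames are hypothetical, and the actual submission through the SDK is left out.

```python
import re
from pathlib import Path


def ordered_commands(names):
    """Return the names that start with a number, sorted by that number.

    Names without a leading number are skipped.
    """
    numbered = []
    for name in names:
        match = re.match(r"(\d+)", Path(name).name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


# Hypothetical filenames, used only to illustrate the ordering.
example = ["10_export.sql", "2_filter.sql", "1_create.sql", "README"]
print(ordered_commands(example))
```

In practice you could pass `os.listdir("commands")` to the helper and run each file in the returned order.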
If you want to use Apache Airflow to manage the pipeline, see the airflow folder.