SampleClean Retreat Summer 2014 Demo

Sanjay Krishnan edited this page May 15, 2014 · 8 revisions

Create the table

   sampleclean> CREATE TABLE wikipedia (title string, text string)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' LINES TERMINATED BY '\n';
   OK
   Time taken: 0.035 seconds

Load the data into the table

   sampleclean> LOAD DATA LOCAL INPATH 'data/files/wikipedia_abstracts.csv' 
   OVERWRITE INTO TABLE wikipedia;
   Copying data from file:/home/sanjayk/sampleclean_dev/blinkdb/data/files/wikipedia_abstracts.csv
   Copying file: file:/home/sanjayk/sampleclean_dev/blinkdb/data/files/wikipedia_abstracts.csv
   Loading data to table default.wikipedia
   Deleted file:/user/hive/warehouse/wikipedia
   OK
   Time taken: 14.109 seconds

Run some queries on the dataset

   sampleclean> SELECT COUNT(1) FROM wikipedia;
   OK
   4004479
   Time taken: 8.535 seconds

Number of articles referring to "Apple"

   sampleclean> SELECT COUNT(1) FROM wikipedia where lower(text) like '%apple%';
   OK
   11261
   Time taken: 36.853 seconds

Number of articles referring to "Google"

   sampleclean> SELECT COUNT(1) FROM wikipedia where lower(text) like '%google%';
   OK
   7400
   Time taken: 33.109 seconds
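Each of the two keyword queries scans the whole table once. As a rough sketch of an alternative that was not part of the demo, both counts can be computed in one pass with conditional aggregation. The example below runs the statement against an in-memory SQLite table through Python so it can be tried without a cluster. The rows are made up for illustration, and the assumption is that the same `SUM(CASE WHEN ...)` statement carries over to the `sampleclean>` prompt, since it uses only `CASE`, `SUM`, `lower` and `LIKE`.

```python
import sqlite3

# Hypothetical miniature stand-in for the wikipedia table; these rows are
# invented for illustration and are not taken from wikipedia_abstracts.csv.
rows = [
    ("Apple Inc.", "Apple is a technology company."),
    ("Pineapple", "The pineapple is a tropical plant."),
    ("Google", "Google is a search company."),
    ("Android", "Android is developed by Google; Apple develops iOS."),
    ("Banana", "The banana is an edible fruit."),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikipedia (title TEXT, text TEXT)")
conn.executemany("INSERT INTO wikipedia VALUES (?, ?)", rows)

# One scan, two counters: each CASE contributes 1 when its pattern matches.
QUERY = """
SELECT
  SUM(CASE WHEN lower(text) LIKE '%apple%'  THEN 1 ELSE 0 END) AS apple_articles,
  SUM(CASE WHEN lower(text) LIKE '%google%' THEN 1 ELSE 0 END) AS google_articles
FROM wikipedia
"""
apple_articles, google_articles = conn.execute(QUERY).fetchone()
print(apple_articles, google_articles)
```

Note that `'%apple%'` is a substring match, so the made-up "Pineapple" row is counted as well. The counts reported by the queries above include such matches in the same way.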