Skip to content

Latest commit

 

History

History
63 lines (51 loc) · 3.13 KB

02.md

File metadata and controls

63 lines (51 loc) · 3.13 KB

HBase Assignment

This assignment is intended to give you a little more hands on experience with HBase. You will also get practice designing a data model and arguing why the model appropriate. You will have the opportunity to propose an interesting question, express why that question is interesting, implement a solution, and communicate how to use your solution. You may work with others, use the same data set, or even ask the same question. However, you are expected to do your own write-ups. Note that it is fine (even encouraged) to have others read your writeup and provide feedback.

Part 1

We will create a database of nutritional facts. Follow the steps in Seven DB in seven weeks for HBase Day 2 (e.g. notes on Column store), "Do" question 1 and 2. Note that the question "What should we use for the row key?" is not that useful if we do not know the questions that we will ask. Propose a query and a schema for efficiently answering the query. Explain why the schema is a good choice. Next you will write three jruby scripts. First, create.rb to create your database. Second, load.rb to load data into the database. Third query.rb that queries your database.

It looks like the link in the book for the myfoodapediadata.zip file is out of date. You can find it on data.gov

To submit For part 1, create a folder called part1 containing three ruby files create.rb, load.rb, query.rb, and one pdf file called writeup.pdf stating your question and explaining your design choices.

Part 2

You will have to come up with a question that uses the nutritional facts and an additional dataset. A good starting point for finding an additional data set is data catalog. Select a data set. Using the nutritional facts and your data set, create a table that answers your question without needing to scan the entire data set (but maybe uses ranges). You may need to write some additional code to pre-process your data before data entry. (If you need to write some pre-processing code, it is ok if it is not the most efficient code in the world. But it should work over your data set on your computer.)

To submit

For part 2, create folder called "part2" containing three ruby files files create.rb, load.rb, query.rb, for creating, loading and querying, respectively. It should also contain a pdf file called writeup.pdf. If you created any additional codes for preprocessing the data, please include them in the folder part2/preprocess. The writeup should state your question and explain your design choices. It should provide a minimum of 2 sentences explaining WHY the question you are answering is interesting. In addition, the writeup should provide instructions for how to use what you have prepared. Do NOT include the data files, however, your writeup should provide a link or explanation for how one can acquire the data.

How to submit

In a git repository on Github, create a tag called hw02. Push your code, writeups, and tag, and add me to your repository (my username is dlm).

On d2l, in the textbox provide a link to the repository.