Due: Mon Oct 11
In this assignment, we will get a little hands on experience with building some of the parts for querying a relational database. You will implement the three join algorithms discussed in class (as well as other additional tools for supporting those algorithms), develop a cost model for a very limited set of queries, and assess that model.
The deliverables will be your code and a brief writeup (about 5 pages) describing the cost model that you developed and an assessment of how well the cost model does in estimating the cost of evaluating queries. Make sure to discuss the implementation decisions that you made and how those effect the cost model (as well as any other assumption that you have with respect to your computers hardware architecture).
While you are welcome to build up your system however you wish, I personally, would take the following steps:
-
Decide on a very simple schema for a few tables that allows you to make some interesting natural join queries. Generate some data sets of various sizes that allows you to develop quickly, but also run some tests on larger datasets.
-
start working on some tools for estimating the cost of query. While you don't need to have developed the final model, getting started early will allow you to build up the model as you develop other pieces of the system
-
start working on some tools for assessing the accuracy of your model. While you don't need to have your experiments developed, getting started early will allow you to see how well your model is doing as you develop other pieces of the system.
-
write a set of tools for reading and writing blocks to and from disk. It would be helpful if you think about this code being general enough to allow you to read blocks from an indexing data structure and a relation.
-
implement a linear scans that does not use an index
-
add the linear scan to your model and assess how accurate the model is.
-
write this up (in rough draft form)
-
implement an indexing data structure that supports reading (but doesn't have to support updates)
-
implement a scan that does uses the index (add it to your model, and update your assessment, and write up in rough draft form)
-
implement three join algorithms (after each, add it to your model and update your assessment, and write up in rough draft form).
-
you now have a huge set of tools, do something super cool! A few ways to help you get started.... How often can your system take a new query or be run over a new dataset, propose an evaluation strategy and be right? When it is wrong, why is it wrong? If you had to build more, what would you pick and why? Add your findings to your report.
-
polish writeup.
You can use whatever language you choose. If you would like to use external libraries, please check with me. In many cases, it will be fine provided that the library doesn't replace one of the major steps above. So, for example, feel free to use a logging library but it would not be appropriate to use a B+ tree library.
You can work in groups of up to 3 people (I would suggest working in groups of 2 though. Coordinating such a small project with 3 people will likely just create a lot of blockers.)
You must supply some automated testing for each of the major graded components of the system:
- 20 misc tooling
- 10 index
- 15 scans
- 15 joins (10 per join algo)
- 15 cost model
- 25 write up
Deliverable 1 In a git repository on Github, create a tag called hw01
.
Push your code (and tag), and add me to your repository (my username is dlm
).
Deliverable 2 On D2L, submit a PDF of your writeup (make sure in the writeup to include the url of your repo