Skip to content

Latest commit

 

History

History
95 lines (66 loc) · 3.64 KB

02.md

File metadata and controls

95 lines (66 loc) · 3.64 KB

Homework 2 - Unsupervised Learning

You may work individually or in teams of up to four on this homework. You will provide ONE submission for your team.

Intro

In this homework, you will get hands on experience in building an unsupervised learning system.

In this assignment, you will explore the UC Irvine Machine Learning Repository, select a data set from that repository, generate synthetic data, and implement k-means and DBSCAN, and assess the quality of the two algorithms on the two datasets. (Note that if you have your own dataset that you would like to use instead of the UCI repo, please, speak with me.)

Feel free to use whatever language or programming environment that you like. You may use external libraries for data structures, linear algebra, point-in-polygon, and other primitives, but you must implement an algorithm for generating synthetic data, k-means, DBSCAN, and the assessment metrics.

Part 1 - Get Data

Find a dataset

Go to the UC Irvine Machine Learning Repository and select a data set. Feel free to pick which ever you like. Note that the Iris dataset is a classic. While some of the data sets have labels, you will not need them. We will assess our clustering on this dataset using an intrinsic method.

Generate a synthetic dataset

In the next part, we will generate a synthetic dataset. We will use the synthetic dataset to evaluate clustering methods using an extrinsic method.

Let n > 3. Create a small program to sample points from the interior of n non-overlapping polygons, Let's label these polygons {1, 2, ..., n}. They can have holes but do not need to.

For each point, make sure to keep track to the x- and y-coordinates, as well as the label of the polygon from which it was sampled.

Next, add a some random noise to the data. Make sure that each noise point is not on the interior of any of your polygons. Label the noise points with 0.

Part 2 - Implement Algos

Implement k-means and DBSCAN. For k-means, you may want to run your algo multiple times to avoid getting stuck in a local min. Alternatively, you can implement an initialization procedure like k-means++ that has some theoretical grantees.

Part 3 - Implement Assessments

Implement two clustering measures:

  • Extrinsic: Implement an external measure of a clustering given a partitioning. (Recommended: Purity or Conditional Entropy)
  • Intrinsic: Implement an internal measure of a clustering. (Recommended: Silhouette Coeff)

Part 4 - Putting it together

Use your algorithms and datasets to explore your data. Try out the two clustering algorithms and use the clustering measures to pick parameters and determine which algorithms work best on which data sets. Explore how changing the amount of noise and number of samples in your synthetic data affects the results.

Part 5 - Write up

Provide a medium length (3-5 page) writeup of your project. Specifically, your writeup must:

  1. describe how one can use your system
  2. provide an example input and output
  3. give the ssh uri of the git repo (e.g. the one of the form [email protected]:....git)
  4. discuss your findings in Part 4.

Submission

The submission will have two deliverables. Both deliverables should be submitted before the deadline.

Deliverable 1 On D2L, submit a PDF of your writeup.

Deliverable 2 In a git repository on Github, create a tag called hw02. Push your code (and tag), and add me to your repository (my username is dlm). If you are using the same repo from a previous assignment, no need to readd me.