slides/2014-03-06-Clustering.html

<!DOCTYPE html>
<html>
  <head>
    <title>Data Mining</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <style type="text/css">
      @import url(http://fonts.googleapis.com/css?family=Droid+Serif);
      @import url(http://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);

      body {
        font-family: 'Droid Serif';
        font-size: 25px;
      }
      .remark-slide-content {
        padding: 1em 2em 1em 2em;
      }
      h1, h2, h3 {
        font-family: 'Yanone Kaffeesatz';
        font-weight: 400;
        margin-top: 0;
        margin-bottom: 0;
      }
      h1 { font-size: 3em; }
      h2 { font-size: 1.8em; }
      h3 { font-size: 1.4em; }
      .footnote {
        position: absolute;
        bottom: 3em;
      }
      ul { margin: 8px;}
      li p { line-height: 1.25em; }
      .red { color: #fa0000; }
      .large { font-size: 2em; }
      a, a > code {
        color: rgb(249, 38, 114);
        text-decoration: none;
      }
      code {
        -moz-border-radius: 3px;
        -webkit-border-radius: 3px;
        background: #e7e8e2;
        color: black;
        border-radius: 3px;
      }
      .tight-code {
        font-size: 20px;
      }
      .white-background {
        background-color: white;
        padding: 10px;
        display: block;
        margin-left: auto;
        margin-right: auto;
      }
      .limit-size img {
        height: auto;
        width: auto;
        max-width: 1000px;
        max-height: 500px;
       }
      em { color: #80cafa; }
      .pull-left {
        float: left;
        width: 47%;
      }
      .pull-right {
        float: right;
        width: 47%;
      }
      .pull-right ~ p {
        clear: both;
      }
      #slideshow .slide .content code {
        font-size: 1.6em;
      }
      #slideshow .slide .content pre code {
        font-size: 1.6em;
        padding: 15px;
      }
      .inverse {
        background: #272822;
        color: #e3e3e3;
        text-shadow: 0 0 20px #333;
      }
      .inverse h1, .inverse h2 {
        color: #f3f3f3;
        line-height: 1.6em;
      }

      /* Slide-specific styling */
      #slide-inverse .footnote {
        bottom: 12px;
        left: 20px;
      }
      #slide-how .slides {
        font-size: 1.6em;
        position: absolute;
        top:  151px;
        right: 140px;
      }
      #slide-how .slides h3 {
        margin-top: 0.2em;
      }
      #slide-how .slides .first, #slide-how .slides .second {
        padding: 1px 20px;
        height: 90px;
        width: 120px;
        -moz-box-shadow: 0 0 10px #777;
        -webkit-box-shadow: 0 0 10px #777;
        box-shadow: 0 0 10px #777;
      }
      #slide-how .slides .first {
        background: #fff;
        position: absolute;
        top: 20%;
        left: 20%;
        z-index: 1;
      }
      #slide-how .slides .second {
        position: relative;
        background: #fff;
        z-index: 0;
      }

      .center {
        float: center;
      }

      /* Two-column layout */
      .left-column {
        width: 48%;
        float: left;
      }
      .right-column {
        width: 48%;
        float: right;
      }
      .right-column img {
        max-width: 120%;
        max-height: 120%;
      }

      /* Tables */
      table {
        border-collapse: collapse;
        margin: 0px;
      }
      table, th, td {
        border: 1px solid white;
      }
      th, td {
        padding: 7px;
      }

    </style>
  </head>
  <body>
    <textarea id="source">


name: inverse
layout: true
class: left, top, inverse

---

# Clustering

---

## Types of Models

  + Classifiers
  + Regressions
  + *Clustering*
  + Outlier

???

## Details

  + Classifiers
    + Describe and distinguish cases. Yelp may want to find a
    category for a business based on its reviews and business description
  + Regressions
    + Predict a continuous value. E.g., predict a home's selling
    price given square footage and number of bedrooms
  + Clustering
    + find "natural" groups of data *without labels*
  + Outlier
    + find anomalous transactions, e.g., finding fraud for credit cards

---

## Clustering

  + Group together similar items
  + Separate dissimilar items
  + Automatically discover groups without providing labels

???

## Perspectives

  + Similar items: again, metrics of similarity critical in defining these
    groups
  + Marking boundaries between different classes
  + Type of groups unknown beforehand. Out of many attributes, which tend to be
    shared?

---

## Machine Learning

  + Supervised
  + Unsupervised
  + Semi-supervised
  + Active

???

## Definitions

  + Supervised
    + Given data with a label, predict data without a
    label
  + Unsupervised
    + Given data without labels, group "similar" items
    together
  + Semi-supervised
    + Mix of the above: e.g., unsupervised to find groups,
    supervised to label and distinguish borderline cases
  + Active
    + Starting with unlabeled data, select the most helpful cases for a
    human to label

---

## Clustering Applications

  + Gain insight into how data is distributed
  + Discover outliers
  + Preprocessing step to bootstrap labeling

???

## Apps

  + Closest we have to "magic box": put structured data in, see what groups may
    exist
  + You want labeled data, but where to start?  How many classes? What to name
    them?
    + Cluster data, investigate examples.
    + Hand label exemplary cases
    + Choose names that distinguish groups
    + Run classifier on labeled data, compare with clustering, examine errors,
      repeat

---

## Yelp Examples

  + User groups based on usage, reviewing habits, feature adoption
  + Businesses: when should a new category be created, what should it be called?
  + Reviews: for a particular business, are there common themes? Can we show
    better variety?

???

## Examples

  + User groups may be trend spotters, "lurkers", travelers, early adopters
  + Do we need both a New American and an American category? How similar are
    these categories?
  + Does a reviewer need to read 10 reviews about great food, so-so service?
    Maybe providing different view points helps give a better picture

---

## Intuition

  + Intuition => Mathematical Expression => Solution => Evaluation
  + High intra-class similarity
  + Low inter-class similarity
  + Interpretable

???

## Good Clusters

  + Just like all data mining, clustering needs to be used to take action
  + Can't take action if you don't understand the results
  + Trade-offs: testing shows it works, but you don't understand it

---

## Methods

  + Partitioning
    + Construct ```k``` groups, evaluate fitness, improve groups
  + Hierarchical
    + Agglomerate items into groups, creating "bottom-up" clusters; or divide set into ever smaller groups, creating "top-down" clusters
  + Density
    + Find groups by examining continuous density within a potential
    group
  + Grid
    + Chunk space into units, cluster units instead of individual records

???

## Algorithms

  + Partitioning
    + Method similar to gradient descent: find some grouping,
    evaluate it, improve it somehow, repeat. k-means.
  + Hierarchical
    + Build groups one "join" at a time: examine the distance between
    two things that could be joined and, if they are close, combine them.
    The reverse approach is divisive.
  + Density
    + Many of the above methods just look for distance.  This method
    tries to find groups that might be strung out, but maintain a density.  Think
    about an asteroid belt.  It is one group, but not clustered together in a way
    you typically think.
  + Grid
    + Can speed up clustering and provide similar results
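
The "one join at a time" idea for hierarchical clustering can be sketched roughly as below. This is only an illustration, not a reference implementation: the single-link group distance (closest pair of members) and the `max_gap` stopping threshold are assumptions chosen for brevity, and `math.dist` needs Python 3.8 or later.

```python
import math


def single_link(a, b):
    """Distance between two groups: the closest pair of members (an assumed choice)."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, max_gap):
    # Bottom-up: every point starts as its own group.
    groups = [[p] for p in points]
    while len(groups) > 1:
        # Find the closest pair of groups.
        pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        i, j = min(pairs, key=lambda ij: single_link(groups[ij[0]], groups[ij[1]]))
        if single_link(groups[i], groups[j]) > max_gap:
            break  # nothing left that is close enough to join
        # Join the two groups; j is always larger than i, so deleting j is safe.
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# Hypothetical data: two visibly separate groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
groups = agglomerate(points, max_gap=2.0)
```

Rescanning every pair on each join keeps the sketch short but is slow on large data; a divisive method would run the same idea in reverse, starting from one group and splitting.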

---

## k-means

  + Start: Randomly pick ```k``` centers for clusters
  + Repeat:
    + Assign all other points to their closest cluster center
    + Recalculate the center of each cluster

???

## Iterative

  + Start at a random point, find step in right direction, take step,
    re-evaluate
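
A minimal sketch of the loop on this slide, assuming points are tuples of numbers and distance is Euclidean. The names `kmeans`, `squared_distance`, and `mean_point` are illustrative, and keeping the old center when a cluster ends up empty is an assumption, not part of the slide.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid: the coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Start: randomly pick k of the data points as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iters):
        # Assign every point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Recalculate each center; keep the old one if a cluster is empty.
        new_centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
```

The loop stops as soon as no center moves, which is the "re-evaluate" step finding nothing left to improve.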

---

## Example

<img src="img/kmeansclustering.jpg" width=110% />

???

## Process

  + We pick some points at random, mark with a cross
  + Find other points that are closest to the crosses
  + Find new *centroid* based on the average of all points
  + Start again
  + img: http://apandre.wordpress.com/visible-data/cluster-analysis/

---

## Distance

  + *Centroid* is the average of all points in a cluster; the center
  + Different distance metrics for real numbers
  + But how to find "average" of binary or nominal data?

???

## You Can't

  + k-means is used for numerical data

---

## Normalization

  + Cluster cities by average temperature and population attributes
  + ```<x,y> = <temperature, population>```
  + Using Euclidean distance, which attribute will affect similarity more?

???

## Un-normalized

  + Population: it is a much bigger number, will contribute much more to
    distance
  + Artificially inflating importance just because units are different
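
A small illustration of the effect, using made-up cities (the temperatures and populations below are hypothetical, not real data). With raw units, the population difference swamps even a large temperature difference.

```python
import math

# Hypothetical cities as (average temperature in degrees F, population).
city_a = (50.0, 800_000)
city_b = (90.0, 810_000)  # very different climate, similar size
city_c = (51.0, 900_000)  # nearly the same climate, larger population


def euclidean(p, q):
    """Plain Euclidean distance on the raw, un-normalized attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


d_ab = euclidean(city_a, city_b)  # dominated by the 10,000 population gap
d_ac = euclidean(city_a, city_c)  # dominated by the 100,000 population gap
```

Here `city_a` comes out closer to `city_b` than to `city_c`, even though `city_b` is 40 degrees warmer: the temperature term barely registers next to population.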

---

## Normalization Techniques

  + Z-score
    + ```(v - mean) / stddev```
  + Min-max
    + ```(v - min) / (max - min)```
  + Decimal
    + ```* 10^n``` or ```/ 10^n```
  + Square
    + ```x**2```
  + Log
    + ```log(x)```

???

## Useful for?

  + Z-score
    + 1-pass normalization, retaining information about stddev
  + Min-max
    + keep within expected range, usually [0, 1]
  + Decimal
    + easy to apply
  + Square
    + keep inputs non-negative
  + Log
    + de-emphasize differences between large numbers
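
The z-score, min-max, and log formulas from the slide, sketched with the standard library only. The function names and the population figures are illustrative; using the population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say which one is meant.

```python
import math
import statistics


def z_score(values):
    """(v - mean) / stddev; assumes the values are not all identical."""
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / stddev for v in values]


def min_max(values):
    """(v - min) / (max - min); assumes max is greater than min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical population figures.
populations = [100_000, 200_000, 300_000, 400_000, 500_000]

z = z_score(populations)        # centered on 0, in units of stddev
scaled = min_max(populations)   # squeezed into the range 0 to 1
logged = [math.log10(v) for v in populations]  # large gaps shrink
```

After either z-score or min-max scaling, population and temperature would sit on comparable ranges, so neither dominates the distance just because of its units.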

---

## Local Optima

<img src="img/k-means-local.png" width=100% />

???

## No Guarantee

  + Since there are many possible stable centers, we may not end up at the best
    one
  + How can we improve our odds of finding a good separation?
    + Why did we end up here? The starting points
    + Choose different starting points
    + Compare results
  + Other problems? Mouse
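
One way to "choose different starting points and compare results" is sketched below. It assumes the comparison is made on total squared distance from each point to its closest center, which is a common choice but not stated on the slide; `run_kmeans` and `best_of_restarts` are illustrative names.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def run_kmeans(points, k, rng, max_iters=100):
    """One k-means run from random starting centers; returns (cost, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center is the average of the cluster; keep the old one if empty.
        updated = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    # Cost: total squared distance from each point to its closest center.
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return cost, centers


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means from several random starts and keep the lowest-cost result."""
    rng = random.Random(seed)
    runs = [run_kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])


# Hypothetical data: three groups of four points along a line.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 0), (5, 1), (6, 0), (6, 1),
          (10, 0), (10, 1), (11, 0), (11, 1)]
cost, centers = best_of_restarts(points, k=3)
```

More restarts improve the odds of a good separation but still give no guarantee of finding the best one.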

---

## Uneven Groups

<img src="img/k-means-mouse.png" width=80% />

???

## k-means

  + k-means is good for similarly sized groups, or at least groups whose
    members are a similar distance from one another
  + Other problems that would pull the centroid away from the real groups?
    + Outliers
  + img: http://en.wikipedia.org/wiki/K-means_clustering

---

## Medoids

  + Instead of finding a *centroid* find a *medoid*
  + Medoid: actual data point that best represents the cluster, like a median (smallest total distance to the other members)
  + PAM: Partitioning Around Medoids

???

## Trade-offs

  + PAM more expensive to evaluate
  + Scales poorly, since we need to evaluate many more medoids with many more
    points

---

## Example

<img src="img/k-medoids.png" width=100% />

???

## Stability

  + No stability between real clusters
  + Outliers can't pull the medoid far out of the actual cluster
  + img: http://en.wikipedia.org/wiki/K-medoids
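
A small comparison of centroid and medoid on made-up points, assuming the medoid is the member with the smallest total Euclidean distance to the rest. The helper names are illustrative, and `math.dist` needs Python 3.8 or later.

```python
import math


def centroid(points):
    """Coordinate-wise average; need not be an actual data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """The data point with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Hypothetical cluster, then the same cluster with one far-away outlier.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5)]
with_outlier = cluster + [(20.0, 20.0)]

center_before = centroid(cluster)
center_after = centroid(with_outlier)   # dragged toward the outlier
medoid_before = medoid(cluster)
medoid_after = medoid(with_outlier)     # still a point inside the cluster
```

The medoid search compares every point against every other point, which is where the extra cost of PAM comes from.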

---

# *Break*

  + Do not confuse Medoid with Metroid

<img src="img/screenshot_metroid2.jpg" width=70% />


???

## Note

  + img: http://stealthboy.com/~msherman/metroid.html

    </textarea>
    <script src="production/remark-0.5.9.min.js" type="text/javascript">
    </script>
    <script type="text/javascript">
      var slideshow = remark.create();
    </script>
  </body>
</html>