notes.txt

Edit distance
=============

Levenshtein distance
--------------------

  * The Levenshtein distance between two string is the number of insertions,
    deletions or substitution required to change one string into the other.


Hierarchical clustering
=======================

  * Agglomerative / bottom-up approach: each string starts in its own cluster
    and pairs of clusters are merged as one one moves up the hierarchy.

    Complexity: O(n^3), special case: O(n^2)

  * Divisive / top-down: all strings start in one cluster and splits are
    performed recursively as one moves down the hierarchy.

    Complexity: O(2^n)


Sequence alignment
==================

Aligning 2 sequences (pairwise alignment), or 3 or more sequences (multiple
sequence alignment), by inserting gaps to align similar characters in
successive columns.

A different goal/method than clustering, but might be useful for discovering
similarities in strings, that e.g. Levenshtein can't discover.


k-means clustering, k-medoids, Expectation-maximization algorithm
=================================================================

  * Requires the number of clusters, k, be specified as a parameter.

  * Optimal solution: k = n clusters for n elements.
      "Intuitively then, the optimal choice of k will strike a balance between
       maximum compression of the data using a single cluster, and maximum
       accuracy by assigning each data point to its own cluster"
      (Wikipedia: Determining the number of clusters in a data set)

  * Several methods for deciding the number of clusters, k, a few given in the
    wikipedia article mentioned previously.


Other concepts to look into
===========================

  * Clustal W (multiple sequencd alignment)

  * k-means clustering

  * mixture of gaussians

  * gauss mixture model

  * Density-based spatial clustering of applications with noise (DBSCAN)
      - maybe only suitable for Euclidean distanced metric
      - doesn't require k to be specified a priori
      - fast for two dimensional clustering
    Similar: Ordering points to identify the clustering structure (OPTICS)