-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnotes.txt
67 lines (42 loc) · 2.08 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
Edit distance
=============
Levenshtein distance
--------------------
* The Levenshtein distance between two string is the number of insertions,
deletions or substitution required to change one string into the other.
Hierarchical clustering
=======================
* Agglomerative / bottom-up approach: each string starts in its own cluster
and pairs of clusters are merged as one one moves up the hierarchy.
Complexity: O(n^3), special case: O(n^2)
* Divisive / top-down: all strings start in one cluster and splits are
performed recursively as one moves down the hierarchy.
Complexity: O(2^n)
Sequence alignment
==================
Aligning 2 sequences (pairwise alignment), or 3 or more sequences (multiple
sequence alignment), by inserting gaps to align similar characters in
successive columns.
A different goal/method than clustering, but might be useful for discovering
similarities in strings, that e.g. Levenshtein can't discover.
k-means clustering, k-medoids, Expectation-maximization algorithm
=================================================================
* Requires the number of clusters, k, be specified as a parameter.
* Optimal solution: k = n clusters for n elements.
"Intuitively then, the optimal choice of k will strike a balance between
maximum compression of the data using a single cluster, and maximum
accuracy by assigning each data point to its own cluster"
(Wikipedia: Determining the number of clusters in a data set)
* Several methods for deciding the number of clusters, k, a few given in the
wikipedia article mentioned previously.
Other concepts to look into
===========================
* Clustal W (multiple sequencd alignment)
* k-means clustering
* mixture of gaussians
* gauss mixture model
* Density-based spatial clustering of applications with noise (DBSCAN)
- maybe only suitable for Euclidean distanced metric
- doesn't require k to be specified a priori
- fast for two dimensional clustering
Similar: Ordering points to identify the clustering structure (OPTICS)