lec17

Clustering

Distance: subjective criterion (depends on the application)
Cluster: the definition also depends on the application
Distance between points within a cluster should be small
Distance between clusters should be large
Clustering is a partition of the dataset D into k clusters
Determining the number of clusters k is hard
Break the dataset into k parts $$D_1, \dots, D_k$$ such that $$\bigcup_{i=1}^{k} D_i = D$$ with no overlaps: $$D_i \cap D_j = \emptyset$$ for $$i \neq j$$
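
The partition condition can be checked mechanically. A minimal sketch, assuming the points are distinct and sortable; the function name `is_partition` and the rule that empty parts are rejected are assumptions for illustration, not from the lecture:

```python
def is_partition(parts, dataset):
    """Return True if `parts` covers `dataset` with no overlaps and no empty part."""
    seen = []
    for part in parts:
        if not part:  # assumption: an empty cluster does not count as a part
            return False
        seen.extend(part)
    # Union equals D, and no point appears in two parts.
    return sorted(seen) == sorted(dataset) and len(set(seen)) == len(seen)


print(is_partition([[1, 2], [3]], [1, 2, 3]))     # True
print(is_partition([[1, 2], [2, 3]], [1, 2, 3]))  # False (2 appears twice)
```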

Distance

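Euclidean distance: $$d(x, y) = \sqrt{\sum_i \left(x_i - y_i\right)^2}$$
Manhattan distance: $$d(x, y) = \sum_i \left|x_i - y_i\right|$$ (in a right triangle with legs x and y, the distance along the diagonal counts as x + y)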
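
The two distances can be compared on a small example. A minimal sketch in plain Python; the function names are chosen here for illustration:

```python
import math


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Grid distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0 (hypotenuse of the 3-4-5 triangle)
print(manhattan(p, q))  # 7   (the two legs added: 3 + 4)
```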

Score for clustering

Mean/centroid: the mean of all the points in a cluster
Median/medoid: the data point closest to the mean (more generally, the data point with the smallest total distance to the other points in the cluster)
Finding the optimal partition into k clusters is NP-hard
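
A sketch of these quantities in plain Python. The notes do not say which score is used; the sum of squared distances from each point to its cluster centroid is assumed here because it is a common choice (lower is better). The medoid follows the "closest to the mean" definition from the notes.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))


def medoid(cluster):
    """Actual data point closest to the centroid."""
    c = centroid(cluster)
    return min(cluster, key=lambda p: sq_dist(p, c))


def score(clusters):
    """Assumed score: total squared distance of points to their centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sq_dist(p, c) for p in cluster)
    return total


cluster = [(0, 0), (2, 0), (10, 0)]
print(centroid(cluster))  # (4.0, 0.0), not a data point
print(medoid(cluster))    # (2, 0), a data point
print(score([[(0, 0), (2, 0)], [(10, 0)]]))  # 2.0
```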

Naive algorithm for partitioning

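Enumeration problem
Enumerate all k-partitions
For each k-partition: compute the score and keep the best so far; output the best at the end
Each score costs O(|D|)
Total complexity is \(O(k^n \cdot |D|)\), where n = |D| is the number of points (each point can go to any of the k clusters)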
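
The enumeration can be sketched by trying every assignment of n points to k labels, which gives exactly the k^n candidates in the bound above. This is an illustration only and is usable just for tiny inputs; the score (sum of squared distances to the centroid) and the function names are assumptions, and the same partition is visited once per relabeling.

```python
from itertools import product


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sse(points, labels, k):
    """Assumed score: squared distance to centroids; None if a cluster is empty."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            return None  # not a valid k-partition
        dim = len(members[0])
        c = [sum(p[i] for p in members) / len(members) for i in range(dim)]
        total += sum(sq_dist(p, c) for p in members)
    return total


def brute_force_clustering(points, k):
    """Try all k**n label assignments and keep the one with the lowest score."""
    best_labels, best_score = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        s = sse(points, labels, k)
        if s is not None and s < best_score:
            best_labels, best_score = labels, s
    return best_labels, best_score


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, best = brute_force_clustering(points, 2)
print(labels, best)  # (0, 0, 1, 1) 1.0
```

With 4 points and k = 2 this loop runs 16 times; with 100 points it would run 2^100 times, which is why a heuristic is needed.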

k-means clustering

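Uses a heuristic; the result is not guaranteed to be optimal
Finds locally optimal clusters
Make a random initial partition: choose k random points as the k means and assign each point to its nearest mean
Recompute the mean of each cluster
Reassign the points and iterate until convergence (usually a small number of iterations, rarely more than 10)
O(t|D|kd), where t = number of iterations, |D| = number of points, k = number of clusters, d = number of dimensions; t can be treated as a constant, so the algorithm is linear in |D|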
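
The loop can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the `seed` and `max_iter` parameters and the choice to keep the old mean when a cluster becomes empty are assumptions made here.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # k distinct data points as the initial means
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean, O(|D| k d).
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, means[j])) for p in points
        ]
        if new_labels == labels:  # nothing moved: converged
            break
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: keep the old mean if a cluster is empty
                means[j] = mean(members)
    return labels, means


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, means = k_means(points, 2)
print(labels, means)
```

Which cluster gets label 0 depends on the random start, so only the grouping is meaningful. On less well separated data, different seeds can end in different local optima.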

EM algorithm

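What is \(P(x_i \mid C_j)\), the probability of point \(x_i\) given cluster \(C_j\)?
Assume some parametric model for each cluster
The model usually assumed is the normal distribution
Posterior probability = likelihood * prior / normalization: \(P(C_j \mid x_i) = P(x_i \mid C_j) P(C_j) / P(x_i)\)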
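
The posterior computation can be sketched for one-dimensional normal clusters. The cluster parameters (mean, standard deviation) and priors below are made-up illustrative values; in EM this calculation is what is usually called the E-step, with the parameters re-estimated afterwards.

```python
import math


def normal_pdf(x, mu, sigma):
    """Likelihood of x under a 1-D normal distribution with mean mu, std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def posteriors(x, params, priors):
    """Posterior probability of each cluster given x: likelihood * prior / normalization."""
    joint = [
        normal_pdf(x, mu, sigma) * prior
        for (mu, sigma), prior in zip(params, priors)
    ]
    evidence = sum(joint)  # normalization term P(x)
    return [j / evidence for j in joint]


params = [(0.0, 1.0), (4.0, 1.0)]  # hypothetical (mean, std) per cluster
priors = [0.5, 0.5]                # hypothetical cluster priors

print(posteriors(2.0, params, priors))  # point halfway between: [0.5, 0.5]
print(posteriors(0.0, params, priors))  # point at the first mean: mostly cluster 0
```

Unlike k-means, each point gets a soft membership in every cluster rather than a single label.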