Clustering
-
Distance: a subjective criterion (dependent on application)
-
cluster: also dependent on application
-
points within a cluster should be a small distance from one another
-
distance between clusters should be large
-
clustering is a partition of the dataset into k clusters
-
Determining the number of clusters is hard
-
break dataset D into k parts $$D_1, \dots, D_k$$ such that $$\bigcup_i D_i = D$$ without overlaps, i.e. $$D_i \cap D_j = \emptyset$$ for $$i \neq j$$
Distance
-
Euclidean distance: $$d(x, y) = \sqrt{\sum_i \left(x_i - y_i\right)^2}$$
-
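Manhattan distance: $$d(x, y) = \sum_i \left|x_i - y_i\right|$$ (in a right triangle with legs x and y, the distance along the diagonal counts as x + y)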
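
A minimal sketch of the two distances, assuming points are given as equal-length tuples of numbers. The function names `euclidean` and `manhattan` are illustrative and do not come from the notes.

```python
import math

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """Grid distance: sum of the absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# A 3-4-5 right triangle: the diagonal has length 5 under Euclidean distance,
# but 3 + 4 = 7 under Manhattan distance.
print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

Which distance is appropriate depends on the application, as noted above.
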
Score for clustering
-
mean/centroid is the mean of all the points in a cluster
-
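median/medoid is the data point in the cluster with the smallest total distance to the other points (typically the point closest to the mean, but always an actual data point)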
-
NP-hard to find the optimal partition of the dataset into k clusters
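
A small sketch of these quantities in plain Python. The notes do not say which score is used, so the `score` function below is an assumption: the sum of squared Euclidean distances from each point to its cluster's centroid (lower is better). The medoid is taken here to be the data point with the smallest total distance to the rest of the cluster. All function names are illustrative.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(cluster):
    """Coordinate-wise mean of the points; need not be a data point itself."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def medoid(cluster):
    """The data point with the smallest total distance to all points in the cluster."""
    return min(cluster, key=lambda c: sum(math.sqrt(sq_dist(c, p)) for p in cluster))

def score(clusters):
    """Assumed score: within-cluster sum of squared distances to the centroid."""
    total = 0.0
    for cluster in clusters:
        m = centroid(cluster)
        total += sum(sq_dist(p, m) for p in cluster)
    return total

cluster = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(centroid(cluster))  # (4.0, 0.0)
print(medoid(cluster))    # (2.0, 0.0)
print(score([cluster]))   # 56.0
```

In this example the medoid (2, 0) is also the data point closest to the centroid (4, 0), but unlike the centroid it is always a member of the cluster.
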
naive algorithm for partitioning
-
enumeration problem
-
enumerate all k-partitions
-
for every k-partition: compute its score and keep the best one seen so far; output the best at the end
-
O(|D|) for each score
-
\(O(k^n \cdot |D|)\) is the total complexity, where \(n = |D|\) (there are at most \(k^n\) ways to assign n points to k clusters)
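
The enumeration can be sketched as follows. This is a brute-force illustration only: it assumes the score is the within-cluster sum of squared distances to the centroid (the notes leave the score unspecified), and it enumerates labelled assignments with `itertools.product`, so each partition is visited k! times. The names `naive_clustering` and `score` are hypothetical.

```python
from itertools import product

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def score(clusters):
    # Assumed score: sum of squared distances to each cluster's centroid.
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum(sq_dist(p, m) for p in c)
    return total

def naive_clustering(points, k):
    """Try every assignment of points to k labels and keep the best-scoring one."""
    best, best_score = None, float("inf")
    for labels in product(range(k), repeat=len(points)):  # k**n assignments
        if len(set(labels)) < k:
            continue  # skip assignments that leave a cluster empty
        clusters = [[p for p, l in zip(points, labels) if l == j] for j in range(k)]
        s = score(clusters)  # O(|D|) work per candidate
        if s < best_score:
            best, best_score = clusters, s
    return best, best_score

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
best, best_score = naive_clustering(points, 2)
print(best, best_score)  # [[(0.0,), (1.0,)], [(10.0,), (11.0,)]] 1.0
```

Even at this size the loop visits 2^4 = 16 assignments; the exponential growth in the number of points is what makes the approach impractical.
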
k-means clustering
-
uses a heuristic, not optimal
-
finds locally optimal clusters
-
make a random partition: choose k random points as the initial k means and assign each point to its nearest mean
-
recompute the mean of each cluster, then reassign each point to its nearest mean
-
iterate until convergence; the number of iterations is small, rarely more than 10
-
\(O(t \cdot |D| \cdot k \cdot d)\) for t iterations, k clusters, and d dimensions; t can be treated as constant, so the algorithm is linear in \(|D|\)
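
A minimal sketch of the k-means loop in plain Python, assuming squared Euclidean distance and initial means drawn from the data with `random.sample`. The parameters `max_iter` and `seed` and the handling of empty clusters (keep the old mean) are choices made for this illustration, not part of the notes.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # k random data points as the initial means
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, means[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: recompute each mean from the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old mean if a cluster ends up empty
                means[j] = mean(members)
    return means, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, assignment = kmeans(points, k=2)
print(means, assignment)
```

Each iteration compares every point with every mean in d dimensions, which is where the per-iteration cost proportional to |D|, k, and d comes from. The result is a local optimum and can depend on the random initial means.
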
EM algorithm
-
what is \(P(x_i \mid C_j)\), the probability of point \(x_i\) given cluster \(C_j\)?
-
assume some parametric model for each cluster
-
the model normally assumed is the normal (Gaussian) distribution
-
posterior probability = likelihood * prior / normalization, i.e. \(P(C_j \mid x_i) = P(x_i \mid C_j)\,P(C_j) / P(x_i)\)
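
A sketch of the posterior computation for one-dimensional normal clusters. This covers only the step that turns likelihoods and priors into cluster membership probabilities, not a full EM implementation. The cluster parameters (prior, mean, standard deviation) are made-up values for illustration, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posteriors(x, clusters):
    """clusters: list of (prior, mu, sigma). Returns P(C_j | x) for each cluster j."""
    # likelihood * prior for each cluster
    joint = [prior * normal_pdf(x, mu, sigma) for prior, mu, sigma in clusters]
    normalization = sum(joint)  # P(x), summed over all clusters
    return [j / normalization for j in joint]

# Two hypothetical clusters with equal priors, centred at 0 and 4.
clusters = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
print(posteriors(2.0, clusters))  # [0.5, 0.5]: x = 2 is equidistant from both means
print(posteriors(0.0, clusters))  # the first cluster is far more probable
```

Unlike k-means, which assigns each point to exactly one cluster, these posteriors give every point a soft membership in each cluster.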