lec17

Clustering

  • Distance: Subjective criteria (dependent on application)
  • cluster: also dependent on application
  • Points within a cluster, distance should be small
  • distance between clusters should be large
  • clustering is a partition of the dataset into k clusters
  • Determining the number of clusters is hard
  • break dataset into k parts such that $$\cup D_i = D$$ without overlaps

Distance

  • $$d(x, y) = \sqrt{\sum \left (x_i-y_i\right )^2}$$ euclidean distance
  • Manhattan distance (diagonal is x + y) in triangle

Score for clustering

  • mean/centroid is the mean of all the points in a cluster
  • median/medioid is the point closest to the mean
  • np hard to find the optimal partition of clustering in k groups

naive algorithm for partitioning

  • enumeration problem
  • enumerate all k-partitions
  • ∀ k partitions, compute score, keep best, output best
  • O(|D|) for each score
  • \(O(K^n * |D|)\) is the total complexity

k-means clustering

  • uses a heuristic, not optimal
  • finds locally optimal clusters
  • make a ruandom parition, choose k random points as the k means
  • recompute the mean
  • iterate until convergence (small number of iterations) rarely more than 10.
  • O(tDKd) t can be treated as constant, linear time algorithm

Em algorithm

  • What is \(P(x_i|C_i)\)
  • assume some model for the parametric cluster
  • model that is normally assumed is the normal distribution
  • posterior probabliility = likelyhood * prior/normalization