lec23

Subspace clustering

  • good clusters may exist only in a subspace of the full feature space
  • clustering methods may be ineffective for some datasets in the original full-dimensional space
  • a subspace is either a subset of the original dimensions or a linear combination of dimensions
  • when the data is not axis-aligned, use a linear combination of dimensions

Axis-aligned clusters

  • restrict the search to subsets of the original d dimensions
  • \(O(2^d)\) choices of subspace
  • build a dimension enumeration tree (all possible combinations of dimensions)
  • project the data onto each candidate subspace and evaluate the clustering (a brute-force sketch follows this list)
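
A brute-force sketch of this enumeration, assuming a small d (the \(2^d\) blow-up makes anything else infeasible). The use of scikit-learn's KMeans and the per-dimension inertia score are illustrative choices, not fixed by the notes:

    import itertools
    import numpy as np
    from sklearn.cluster import KMeans

    def enumerate_subspaces(X, k=3):
        """Score a k-means clustering on every non-empty axis-aligned subspace."""
        n, d = X.shape
        best_dims, best_score = None, np.inf
        for r in range(1, d + 1):
            for dims in itertools.combinations(range(d), r):
                Xp = X[:, dims]                         # project onto the subspace
                km = KMeans(n_clusters=k, n_init=10).fit(Xp)
                score = km.inertia_ / r                 # crude per-dimension score (illustrative)
                if score < best_score:
                    best_dims, best_score = dims, score
        return best_dims, best_score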

CLIQUE method

  • lay a grid over the data
  • every dimension is split into b bins/divisions
  • \(O(db)\) one-dimensional cells at the first level (the full grid has \(b^d\) cells)
  • density of a cell = number of points it contains (many definitions are possible)
  • keep track of the dense cells in each dimension
  • tries to find axis-aligned dense cells
  • CLIQUE joins dense cells that share a face into a cluster
  • size of a 2-D cell: \(\frac{x_{\text{range}}}{b} \cdot \frac{y_{\text{range}}}{b}\)
  • bad in high dimensions
  • sensitive to bin boundaries
  • fast (can reuse itemset-mining machinery for the bottom-up cell search), but misses a lot of data points (a grid-density sketch follows this list)
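
A minimal sketch of the grid/density step, assuming b equal-width bins per dimension and a density threshold tau (both parameter names are hypothetical). Candidate 2-D cells are built only from dense 1-D cells, in the Apriori-style bottom-up spirit the notes allude to:

    import itertools
    import numpy as np

    def dense_cells(X, b=10, tau=5):
        """Find dense 1-D cells, then dense 2-D cells built from them."""
        n, d = X.shape
        bins = np.empty((n, d), dtype=int)          # bin index of each point per dimension
        for j in range(d):
            lo, hi = X[:, j].min(), X[:, j].max()
            bins[:, j] = np.minimum(((X[:, j] - lo) / (hi - lo + 1e-12) * b).astype(int), b - 1)

        counts = {(j, c): int(np.sum(bins[:, j] == c)) for j in range(d) for c in range(b)}
        dense1 = {cell: cnt for cell, cnt in counts.items() if cnt >= tau}

        dense2 = {}                                 # monotonicity: only combine dense 1-D cells
        for (j1, c1), (j2, c2) in itertools.combinations(dense1, 2):
            if j1 == j2:
                continue
            cnt = int(np.sum((bins[:, j1] == c1) & (bins[:, j2] == c2)))
            if cnt >= tau:
                dense2[(j1, c1, j2, c2)] = cnt
        return dense1, dense2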

Random projections

  • randomized search to select a "good" subspace (a sketch follows the pseudocode)

For i = 1 to max_iters:
   pick a random seed point x_p
   for j = 1 to max_samples:
      pick a random sample of points
      score = size of the cluster * dimensionality of the subspace (higher dimensionality is better)
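
A sketch of this randomized loop. The notes do not say how the sample determines the subspace; the rule used here (keep the dimensions in which the whole sample lies within a width w of the seed, in the style of the DOC algorithm) is an assumption, as are all parameter names:

    import numpy as np

    def random_subspace_search(X, max_iters=100, max_samples=10,
                               sample_size=8, w=0.5, seed=None):
        """Randomized search for a large, high-dimensional axis-aligned cluster."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        best = (0, None, None)                      # (score, dims, member indices)
        for _ in range(max_iters):
            xp = X[rng.integers(n)]                 # random seed point
            for _ in range(max_samples):
                S = X[rng.choice(n, size=sample_size, replace=False)]
                # assumed rule: keep dimensions where the whole sample stays within w of the seed
                dims = np.where(np.all(np.abs(S - xp) <= w, axis=0))[0]
                if dims.size == 0:
                    continue
                members = np.all(np.abs(X[:, dims] - xp[dims]) <= w, axis=1)
                score = int(members.sum()) * dims.size   # size * dimensionality
                if score > best[0]:
                    best = (score, dims, np.where(members)[0])
        return best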

General subspaces

  • projective k-means / subspace PCA
  • a generalization of PCA to multiple subspaces
  • find the directions of most variance / find k subspaces of varying dimensionalities
  • assumes the true clusters lie in orthogonal subspaces

Algorithm

  • initialization (e.g., a random partition of the data)
  • iterative refinement (expectation-maximization style)
    • repeat until convergence
    • for each partition \(D_i\), run PCA(\(D_i\)) and determine its dimensionality
    • the subspace dimensionality of \(D_i\) is set to \(\dim(\mathrm{PCA}(D_i))\)
    • map each point onto each cluster's subspace, compute the distance between the point and its projection, and assign it to the subspace with the smallest distance (sketch below)
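
A minimal sketch of this refinement loop. The notes choose each cluster's dimensionality from its PCA spectrum; here a fixed dimensionality q is assumed to keep the sketch short, and all names are illustrative:

    import numpy as np

    def projective_kmeans(X, k=3, q=1, n_iters=20, seed=None):
        """EM-style projective k-means with a fixed subspace dimensionality q."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        labels = rng.integers(k, size=n)            # random initialization
        for _ in range(n_iters):
            residuals = np.full((n, k), np.inf)
            for i in range(k):
                Di = X[labels == i]
                if len(Di) <= q:
                    continue                        # degenerate cluster, skip this round
                mu = Di.mean(axis=0)
                # top-q principal directions of the partition (PCA via SVD)
                _, _, Vt = np.linalg.svd(Di - mu, full_matrices=False)
                B = Vt[:q].T                        # (d, q) orthonormal basis
                R = X - mu
                proj = R @ B @ B.T                  # projection onto the subspace
                residuals[:, i] = np.linalg.norm(R - proj, axis=1)
            new_labels = residuals.argmin(axis=1)   # smallest distance wins
            if np.array_equal(new_labels, labels):
                break                               # converged
            labels = new_labels
        return labels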