Subspace clustering
- good clusters may exist in a subspace
- clustering methods may be ineffective for some datasets in the full-dimensional space
- a subspace can be a linear combination of the original dimensions or a subset of them
- when the data clusters are not axis-aligned, use linear combinations (general subspaces)
Axis-aligned clusters
- subset of the original d dimensions
- \(O(2^d)\) choices
- build a dimension enumeration tree (all possible combinations of dimensions)
- project onto each candidate subspace and evaluate the clustering (sketch below)
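A minimal sketch of this exhaustive search, assuming k-means with a fixed k and the silhouette score as the clustering-quality measure (both choices the notes leave open):

```python
# Exhaustive axis-aligned subspace search: try every non-empty subset of
# the d dimensions, cluster the projection, keep the best-scoring subset.
# k=3 and the silhouette measure are assumptions, not fixed by the notes.
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_axis_aligned_subspace(X, k=3):
    n, d = X.shape
    best_dims, best_score = None, -1.0
    for r in range(1, d + 1):
        for dims in combinations(range(d), r):   # O(2^d) subsets in total
            proj = X[:, list(dims)]              # project onto the subset
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(proj)
            score = silhouette_score(proj, labels)
            if score > best_score:
                best_dims, best_score = dims, score
    return best_dims, best_score
```

The \(O(2^d)\) enumeration is only feasible for small d, which is why methods like CLIQUE prune the search bottom-up instead.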
CLIQUE method
- grid over data
- Every dimension is split into b bins/divisions
- \(O(db)\) one-dimensional cells to check
- density of a cell = # of points in it (many definitions are possible)
- Keep track of dense cells in each dimension
- tries to find axis-aligned dense cells
- CLIQUE joins dense cells that share a face
- in 2-d, each cell covers \(\frac{x_{range}}{b} \cdot \frac{y_{range}}{b}\) (b bins per dimension)
- bad in high dimensions (the number of grid cells grows as \(b^d\))
- sensitive to bin boundaries
- fast (the bottom-up search over dense cells can reuse frequent-itemset mining), but misses a lot of data points; a minimal sketch follows
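A minimal CLIQUE-style sketch, assuming b equal-width bins per dimension and a simple count threshold tau as the density definition; real CLIQUE continues to higher dimensionalities and then merges face-adjacent dense cells into clusters:

```python
# Bottom-up dense-cell search: find dense 1-d cells, then only test 2-d
# cells whose 1-d projections are dense (the same monotonicity that lets
# CLIQUE reuse frequent-itemset mining). b and tau are assumed parameters.
from collections import Counter
from itertools import combinations

import numpy as np

def dense_cells(X, b=10, tau=20):
    n, d = X.shape
    # assign every point a bin index in each dimension (b equal-width bins)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    bins = np.clip(((X - mins) / (maxs - mins + 1e-12) * b).astype(int),
                   0, b - 1)

    dense = {}
    # level 1: dense cells in each single dimension (O(db) cells to check)
    for j in range(d):
        counts = Counter(bins[:, j].tolist())
        dense[(j,)] = {c for c, cnt in counts.items() if cnt >= tau}

    # level 2: a 2-d cell can only be dense if both 1-d projections are
    for j1, j2 in combinations(range(d), 2):
        counts = Counter(zip(bins[:, j1].tolist(), bins[:, j2].tolist()))
        dense[(j1, j2)] = {cell for cell, cnt in counts.items()
                           if cnt >= tau
                           and cell[0] in dense[(j1,)]
                           and cell[1] in dense[(j2,)]}
    return dense
```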
Random projections
- randomized search to select a "good" subspace
for i = 1 to maxiters:
    pick a random seed point \(x_p\)
    for j = 1 to maxsamples:
        pick a random sample of points
        keep the dimensions in which the sample stays close to \(x_p\)
        score = size of the cluster * reward for dimensionality (higher dimensionality is better)
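A sketch of this randomized search, in the spirit of DOC-style projective clustering; the width w (how close sampled points must sit to the seed in a dimension for it to count) and the score \(|C| \cdot (1/\beta)^{|D|}\) are assumed details the pseudocode leaves open:

```python
# Randomized subspace search: repeatedly pick a seed and a small sample,
# keep the dimensions where the sample hugs the seed, and score the
# resulting cluster. w, beta, and the score are assumptions (DOC-style).
import numpy as np

def random_subspace_search(X, maxiters=30, maxsamples=10,
                           sample_size=5, w=0.5, beta=0.25, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    best = (None, None, -np.inf)               # (dims, members, score)
    for _ in range(maxiters):
        xp = X[rng.integers(n)]                # random seed point x_p
        for _ in range(maxsamples):
            S = X[rng.choice(n, sample_size, replace=False)]
            # dimensions in which every sampled point is within w of x_p
            dims = np.where(np.all(np.abs(S - xp) <= w, axis=0))[0]
            if dims.size == 0:
                continue
            # cluster = all points within w of x_p in those dimensions
            members = np.all(np.abs(X[:, dims] - xp[dims]) <= w, axis=1)
            # reward both cluster size and dimensionality
            score = members.sum() * (1.0 / beta) ** dims.size
            if score > best[2]:
                best = (dims, np.where(members)[0], score)
    return best
```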
General subspaces
- projective k-means / subspace PCA
- generalization of PCA to multiple subspaces
- find the directions of most variance / find k subspaces of varying dimensionalities
- assumes the true clusters lie in orthogonal subspaces
Algorithm
- initialization (e.g., a random partition of the points into k groups \(D_1, \dots, D_k\))
iterative refinement (Expectation Maximization)
- repeat until convergence
- for each partition \(D_i\), run \(\mathrm{PCA}(D_i)\) and determine its dimensionality
- subspace of \(D_i \leftarrow\) the top principal components from \(\mathrm{PCA}(D_i)\)
- map each point onto every subspace, measure the distance between the point and its projection, and assign it to the closest subspace (smallest distance wins); see the sketch below
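A sketch of the full refinement loop, assuming random-partition initialization and a 90%-variance rule for picking each cluster's dimensionality (both left open above):

```python
# Projective k-means / subspace PCA: alternate between fitting a PCA
# subspace per partition and reassigning points to the nearest subspace.
# var_frac=0.9 and the random initial partition are assumptions.
import numpy as np

def projective_kmeans(X, k=3, var_frac=0.9, maxiters=50, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    labels = rng.integers(k, size=n)          # random initial partition
    for _ in range(maxiters):
        mus, bases = [], []
        for i in range(k):
            Di = X[labels == i]
            if len(Di) < 2:                   # reseed a (near-)empty cluster
                Di = X[rng.choice(n, 2, replace=False)]
            mu = Di.mean(axis=0)
            # PCA of D_i via SVD of the centered cluster
            _, s, Vt = np.linalg.svd(Di - mu, full_matrices=False)
            var = s**2 / max((s**2).sum(), 1e-12)
            # smallest r capturing >= var_frac of the cluster's variance
            r = int(np.searchsorted(np.cumsum(var), var_frac)) + 1
            mus.append(mu)
            bases.append(Vt[:r])              # top-r principal directions
        # assignment: project each point onto every subspace; the smallest
        # point-to-projection distance wins
        dists = np.empty((n, k))
        for i in range(k):
            Xc = X - mus[i]
            proj = Xc @ bases[i].T @ bases[i]
            dists[:, i] = np.linalg.norm(Xc - proj, axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                             # converged
        labels = new_labels
    return labels
```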