exam2

formulas and tidbits

  • possible clusters (brute force) = \(k^n\)
  • Bayes: \(P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)}\)
  • Lance-Williams: \(d_{(ij)k} = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma|d_{ik} - d_{jk}|\), where each \(d\) term is the distance between the corresponding pair of clusters (used to update distances after merging clusters \(i\) and \(j\))
  • average distance between clusters \(C_i\) and \(C_j\) = $$\frac{\sum_{x\in C_i}\sum_{y\in C_j}d(x,y)}{n_i n_j}$$ i.e., the average over all pairs of points with one point taken from each cluster
  • empirical cumulative distribution: at a value x, the fraction of values in the list that are less than or equal to x (can be used as a step kernel).
  • Kernel density \(\hat f_k(x, h) = \frac{K(x+h/2) - K(x-h/2)}{h}\) where \(K\) is a cumulative kernel such as the empirical cumulative distribution above (see the sketch after this list)
  • affinity matrix = a matrix of pairwise similarities between data points (or clusters); entries closer to one mean "more similar"
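
A quick sketch (helper names are my own, not from class) of the two bullets above: the empirical CDF used as a step kernel, and the density estimate as the CDF difference across a window of width h.

```python
import numpy as np

def ecdf(data, x):
    """Empirical cumulative distribution: fraction of values <= x."""
    return np.mean(np.asarray(data) <= x)

def density_estimate(data, x, h):
    """Step-kernel density estimate: ECDF difference across a window of width h."""
    return (ecdf(data, x + h / 2) - ecdf(data, x - h / 2)) / h

data = [1.0, 1.2, 1.3, 2.8, 3.0, 3.1]
print(density_estimate(data, 1.2, h=0.5))  # high density inside the first group
print(density_estimate(data, 2.0, h=0.5))  # ~0 in the gap between groups
```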

Algorithms

K-means

  • given number of clusters
  • randomly initialize each centroid/cluster mean
  • assign each data point to its closest centroid
  • reassign any centroid with no assigned points to a random data point
  • recalculate each mean as the average of the points assigned to its cluster
  • stop when means don't change
  • Time complexity: \(O(n^{dk+1})\); a sketch of the iterative algorithm is below
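
A minimal numpy sketch of the steps above (initializing centroids at randomly chosen data points is my own choice; random values work too):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(max_iter):
        # assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                new_centroids[j] = X[rng.integers(len(X))]  # empty cluster -> random point
            else:
                new_centroids[j] = members.mean(axis=0)     # recompute the mean
        if np.allclose(new_centroids, centroids):            # stop when means don't change
            return labels, centroids
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = kmeans(X, k=2)
```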

EM algorithm

  • split into two steps (expectation, maximization)
  • optimize the parameters that maximize the log likelihood function
  • in class this was framed with Bayes' rule: the E-step computes the posterior probability of each cluster given a point, \(P(C_i \mid x_j)\), assuming the current cluster parameters, and the M-step re-estimates those parameters from the posteriors
  • Time complexity depends on the likelihood function (sketch for a Gaussian mixture below)
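
A minimal sketch of EM for a two-component 1-D Gaussian mixture (all names are mine; the E-step is just Bayes' rule under the current parameters):

```python
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_iter=50):
    mu = np.array([x.min(), x.max()])     # crude initial means
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])             # mixing weights
    for _ in range(n_iter):
        # E-step: posterior P(cluster | point) via Bayes' rule
        lik = np.stack([pi[j] * gaussian(x, mu[j], var[j]) for j in range(2)], axis=1)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters to increase the expected log likelihood
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

x = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(6, 1, 100)])
print(em_gmm(x))
```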

DBSCAN

  • \(\epsilon\) is max distance a "neighbor" is allowed to be (reachable)
  • core points have at least min_points neighbors within \(\epsilon\)
  • border points are "reachable" from core points, i.e. there is a core point that has the border point as a neighbor
  • noise points are neither core points nor reachable from any core point
  • the algorithm finds all core points, then the border points reachable from them, and discards everything else as noise (sketch below)
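
A minimal sketch that just labels each point core / border / noise from the definitions above (a full DBSCAN would then grow clusters by connecting core points that are within \(\epsilon\) of each other):

```python
import numpy as np

def classify_points(X, eps, min_points):
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps                       # which points are within eps of which
    is_core = neighbors.sum(axis=1) >= min_points  # core: enough neighbors
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i][is_core].any():          # border: some core point has it as a neighbor
            labels.append("border")
        else:
            labels.append("noise")                 # neither core nor reachable
    return labels

X = np.vstack([np.random.randn(30, 2), np.array([[10.0, 10.0]])])
print(classify_points(X, eps=1.0, min_points=4))
```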

Hierarchical clustering

  • single link = min distance between clusters
  • complete link = max distance between clusters
  • the closest pair of clusters is merged into a new cluster, and merged clusters are treated as single points in later merges
  • visualized using a dendrogram (see the SciPy sketch below)
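
For reference, a minimal sketch using SciPy's agglomerative clustering (assuming scipy and matplotlib are installed; `linkage` and `dendrogram` are the actual SciPy functions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 4])
Z_single = linkage(X, method="single")      # merge by minimum inter-cluster distance
Z_complete = linkage(X, method="complete")  # merge by maximum inter-cluster distance
dendrogram(Z_complete)                      # draw the merge tree
plt.show()
```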

DENCLUE

  • Gradient function: $$\nabla f(x) = \frac{1}{nh^{d+2}}\sum \limits^n_{i=1}K\left(\frac{x-x_i}{h}\right)(x_i-x)$$
  • hill climbs each point to a local maximum of the density estimate (a "density attractor"), then assigns points that climb to the same attractor to the same cluster (sketch below)
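
A minimal sketch of the gradient formula and the hill-climbing step (the Gaussian kernel and fixed step size are my simplifications; the real algorithm uses a smarter update and a density threshold for noise):

```python
import numpy as np

def gaussian_kernel(z):
    """K(z) = exp(-||z||^2 / 2) / (2*pi)^(d/2), applied row-wise."""
    d = z.shape[-1]
    return np.exp(-0.5 * np.sum(z ** 2, axis=-1)) / (2 * np.pi) ** (d / 2)

def density_gradient(x, X, h):
    """The gradient formula above: (1 / (n h^(d+2))) * sum_i K((x - x_i)/h) (x_i - x)."""
    n, d = X.shape
    k = gaussian_kernel((x - X) / h)               # kernel weight of each data point
    return (k[:, None] * (X - x)).sum(axis=0) / (n * h ** (d + 2))

def hill_climb(x, X, h, step=1.0, n_steps=200):
    """Follow the density gradient uphill toward a local maximum (attractor)."""
    for _ in range(n_steps):
        x = x + step * density_gradient(x, X, h)
    return x

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(hill_climb(np.array([1.0, 1.0]), X, h=1.0))  # should drift toward the nearby mode
```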

Spectral clustering

  • compute affinity matrix A
  • Compute Degree matrix
    • diagonal matrix
    • each entry is the sum of weights of all edges incident with node
  • Compute L (differs with method)
    • Average weight: \(L = A\)
    • Ratio cut: \(L = D - A\) (this Laplacian is also what the normalized cuts below are built from)
    • Normalized asymmetric cut: \(L^a = D^{-1}L\)
    • Normalized symmetric cut: \(L^s = D^{-1/2}LD^{-1/2}\)
  • compute the k (number of clusters) smallest eigenvalues/eigenvectors of L
  • compute matrix Y from the eigenvector matrix U using \(y_i = \frac{1}{\sqrt{\sum^k_{j=1}u^2_{n-j+1,\,i}}}\vec{u_i}^T\), i.e. normalize each row of U to unit length
  • find clusters by running k-means on Y (end-to-end sketch below)
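
A minimal end-to-end sketch of the ratio-cut version (the Gaussian affinity with a sigma parameter is my own choice of similarity; swap in \(D^{-1}L\) or \(D^{-1/2}LD^{-1/2}\) for the normalized variants):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-d2 / (2 * sigma ** 2))                # affinity matrix
    np.fill_diagonal(A, 0)
    D = np.diag(A.sum(axis=1))                        # degree matrix
    L = D - A                                         # ratio-cut Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    U = eigvecs[:, :k]                                # k smallest eigenvectors
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)  # normalize each row
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 6])
print(spectral_clustering(X, k=2))
```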

CLIQUE

  • cuts data space into a grid of cells
  • density threshold determines if cell is "dense"
  • adjoining dense cells are considered the same cluster
  • best for axis aligned data
  • sensitive to bin boundaries (grid/density sketch below)
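
A minimal sketch of just the grid/density step (n_bins and density_threshold are illustrative parameters; full CLIQUE then merges adjoining dense cells into clusters and searches subspaces):

```python
import numpy as np
from collections import Counter

def dense_cells(X, n_bins=10, density_threshold=5):
    """Cut each axis into n_bins intervals and return cells holding enough points."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # integer grid coordinates of every point (one cell index per dimension)
    cells = np.floor((X - mins) / (maxs - mins + 1e-12) * n_bins).astype(int)
    counts = Counter(map(tuple, cells))
    return {cell for cell, count in counts.items() if count >= density_threshold}

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
print(dense_cells(X, n_bins=8, density_threshold=10))
```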

Projective k-means (PCA)

  • good for data that is not axis aligned
  • PCA is the hard part (a minimal PCA-via-SVD sketch is below)
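
A minimal sketch of PCA via the SVD of the centered data (my own helper; projective k-means would run something like this per cluster inside the k-means loop):

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # top principal directions
    projected = Xc @ components.T                 # coordinates in the PCA subspace
    return components, projected

X = np.random.randn(100, 3) @ np.array([[2, 0, 0], [0, 1, 0], [0, 0, 0.1]])
components, projected = pca(X, n_components=2)
```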

Random-projections

  • take a random sample of points
  • test every dimension with the following criterion:
  • \(A = \{d_k \mid \|x_p^k - x^k\| \le w \;\; \forall x \in s_j\}\)
  • a dimension \(d_k\) is "good" if no sampled point is farther than \(w\) from the pivot point \(x_p\) along that dimension (sketch below)
  • good for data that is axis aligned
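
A minimal sketch of that dimension test (pivot, sample, and w play the roles of \(x_p\), \(s_j\), and \(w\) in the set definition above):

```python
import numpy as np

def good_dimensions(pivot, sample, w):
    """Indices of dimensions where every sampled point stays within w of the pivot."""
    diffs = np.abs(sample - pivot)                # per-dimension distance to the pivot
    return [k for k in range(sample.shape[1]) if (diffs[:, k] <= w).all()]

sample = np.random.randn(20, 3) * np.array([0.1, 5.0, 0.1])  # dims 0 and 2 are tight
pivot = sample[0]
print(good_dimensions(pivot, sample, w=1.0))      # likely [0, 2]
```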

Regression

  • Linear regression: \(w = (D^TD)^{-1}D^Ty\)
  • Ridge regression: \(w = (X^TX + \alpha I)^{-1}X^Ty\)
  • SSE: \(\sum(Y-\hat Y)^2\) where \(\hat Y\) is the prediction, i.e. the data matrix multiplied by the \(w\) found in regression (plus the bias \(b\) if one was fit)
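
A minimal sketch of the closed-form solutions and SSE (variable names are mine; the bias is handled by appending a column of ones to the data matrix):

```python
import numpy as np

def linear_regression(D, y):
    """Ordinary least squares: w = (D^T D)^{-1} D^T y."""
    return np.linalg.solve(D.T @ D, D.T @ y)

def ridge_regression(X, y, alpha):
    """Ridge: w = (X^T X + alpha * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def sse(y, y_hat):
    """Sum of squared errors between targets and predictions."""
    return np.sum((y - y_hat) ** 2)

# tiny usage example; the last column of ones gives the bias term b
X = np.c_[np.linspace(0, 1, 20), np.ones(20)]
y = 3 * X[:, 0] + 1 + np.random.normal(0, 0.1, 20)
w = ridge_regression(X, y, alpha=0.1)
print(w, sse(y, X @ w))
```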