Lance-Williams = \(d_{(ij)k} = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma|d_{ik} - d_{jk}|\)
the d terms are the pairwise distances between clusters; the formula updates the distance from the newly merged cluster \((ij)\) to any other cluster \(k\)
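A minimal sketch of the update in Python, assuming single-linkage coefficients (\(\alpha_i=\alpha_j=1/2\), \(\beta=0\), \(\gamma=-1/2\)); the function name and example distances are made up for illustration:

```python
def lance_williams(d_ik, d_jk, d_ij, alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=-0.5):
    """Distance from the merged cluster (ij) to cluster k.

    The default coefficients give single linkage (min distance);
    gamma = +0.5 instead gives complete linkage (max distance).
    """
    return alpha_i * d_ik + alpha_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)

print(lance_williams(d_ik=3.0, d_jk=5.0, d_ij=2.0))  # 3.0 == min(3.0, 5.0)
```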
average distance between clusters = $$\frac{\sum_{x\in C_i}\sum_{y\in C_j}d(x,y)}{n_i n_j}$$ is the average distance over all pairs of points, one from each cluster.
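A quick numpy sketch of this average-linkage distance; the cluster contents here are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

def average_link(C_i, C_j):
    """Mean of all pairwise distances between points of C_i and points of C_j."""
    return cdist(C_i, C_j).mean()  # (n_i x n_j) distance matrix, then its average

C_i = np.array([[0.0, 0.0], [1.0, 0.0]])
C_j = np.array([[0.0, 3.0], [0.0, 4.0]])
print(average_link(C_i, C_j))
```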
empirical cumulative distribution at a value x is the fraction of values in the list that are less than or equal to x (can be used as a step kernel).
Kernel density \(\hat f_k(x, h) = \frac{k(x+h/2) - k(x-h/2)}{h}\) where k is a cumulative kernel function, e.g. the empirical CDF above used as a step kernel
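A small sketch tying the two definitions together: the empirical CDF of a sample, and a density estimate obtained by differencing it over a window of width h (the sample and names are illustrative):

```python
import numpy as np

def ecdf(data, x):
    """Fraction of observations <= x (a step function in x)."""
    return np.mean(np.asarray(data) <= x)

def density_estimate(data, x, h):
    """Difference quotient of the ECDF over a window of width h around x."""
    return (ecdf(data, x + h / 2) - ecdf(data, x - h / 2)) / h

sample = np.random.normal(size=1000)
print(ecdf(sample, 0.0))                      # ~0.5 for a standard normal sample
print(density_estimate(sample, 0.0, h=0.5))   # ~0.4 (true density at 0 is ~0.399)
```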
affinity matrix = a matrix of pairwise similarities where entries closer to one mean "more similar"; used to quantify relationships between clusters or data points
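One common way to build such a matrix (an assumption here, not stated in the notes) is a Gaussian/RBF kernel on pairwise distances, so identical points get affinity 1 and distant points approach 0:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def affinity_matrix(X, sigma=1.0):
    """Gaussian (RBF) affinities: exp(-d^2 / (2 sigma^2)), with 1 on the diagonal."""
    D = squareform(pdist(X))            # pairwise Euclidean distances
    return np.exp(-D**2 / (2 * sigma**2))

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
print(np.round(affinity_matrix(X), 3))  # first two points ~1 to each other, ~0 to the third
```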
Algorithms
K-means
given number of clusters
randomly assign value to each centroid/cluster/mean
assign each data point to the cluster with the closest centroid
if a centroid ends up with no points assigned, reassign it to a random data point
recalculate mean by taking the average of all points assigned to cluster
stop when means don't change
Time complexity: \(O(n^{dk+1})\)
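A minimal numpy sketch of these steps (Lloyd's algorithm); the random-point initialization and empty-cluster handling follow the notes, the rest is an illustrative implementation, not code from class:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # randomly initialize the centroids by picking k data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each point to its closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                # centroid with no points: reassign it to a random data point
                new_centroids[j] = X[rng.integers(len(X))]
            else:
                # recalculate the mean of the points assigned to this cluster
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # stop when the means don't change
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```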
EM algorithm
split into two steps (expectation, maximization)
iteratively optimize the parameters to maximize the log-likelihood function
not sure this matches the class presentation exactly; there the E-step was framed with Bayes' rule, computing the posterior probability of each latent cluster assignment c given the current parameters
Time complexity depends on the likelihood function
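A minimal sketch of EM for a two-component 1-D Gaussian mixture, which is the usual setting where the "Bayes, assuming c" step appears; the model choice and variable names are assumptions, not taken from class:

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # initial guesses for mixing weights, means, and variances
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: posterior P(c = j | x_i) for each point via Bayes' rule
        dens = np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities
        Nk = resp.sum(axis=0)
        w = Nk / n
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu)**2).sum(axis=0) / Nk
    return w, mu, var

x = np.concatenate([np.random.normal(-3, 1, 300), np.random.normal(3, 1, 300)])
print(em_gmm_1d(x))
```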
DBSCAN
\(\epsilon\) is the max distance at which a point counts as a "neighbor" (directly reachable)
core points have at least min_points neighbors within \(\epsilon\)
border points are "reachable" from core points, i.e. within \(\epsilon\) of some core point without having enough neighbors to be core points themselves
noise points are not reachable from any core point
the algorithm finds all core points, then the border points reachable from them, and labels everything else as noise (see the sketch below)
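A compact sketch of that core/border/noise labeling, growing clusters out from core points; eps and min_points mirror the notes, the rest is illustrative:

```python
import numpy as np

def dbscan(X, eps, min_points):
    """Return a cluster id per point, or -1 for noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None], axis=2)            # all pairwise distances
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)] # eps-neighborhoods
    core = np.array([len(nb) >= min_points for nb in neighbors])
    labels = np.full(n, -1)                                        # start as noise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # grow a new cluster from this unvisited core point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster             # reachable point: core or border
                if core[j]:
                    queue.extend(neighbors[j])  # only core points expand the cluster
        cluster += 1
    return labels

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 10, [[100.0, 100.0]]])
print(dbscan(X, eps=1.5, min_points=4))  # roughly two clusters, with the far-away point as noise
```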
Single link/complete link
single link = min distance between clusters
complete link = max distance between clusters
at each step the two closest clusters are merged into a new cluster, which is then treated as a single unit in later merges
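A short sketch comparing the two linkages with scipy's hierarchical clustering; the data and the cut into two flat clusters are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 6])

# single linkage merges on the minimum inter-cluster distance,
# complete linkage merges on the maximum inter-cluster distance
Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')

# cut each merge tree into 2 flat clusters
print(fcluster(Z_single, t=2, criterion='maxclust'))
print(fcluster(Z_complete, t=2, criterion='maxclust'))
```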