Data mining
-
non-trivial extraction of implicit, previously unknown and potentially useful information.
-
exploration and analysis
Tasks
-
Regression
-
outlier detection
-
change detection
-
forecasting
-
classification
Exploring data analysis
-
n points with d attributes
-
n by d matrix
-
\(x_{ij}\) value of ith object jth attribute
-
values can be blank
-
Superscript is dimension j
-
Subscript is row i
Geometric view 1d
-
each dimension is \(\mathbb{R}\)
-
\(x_i = (x_{i1}, x_{i2}, ..., x_{id}\))
-
\(x_i \in \mathbb{R}^d\)
Probabilistic 1d
-
Random variable is a value that each member of population shares (age, hair color, etc)
-
\(\frac{\text{universe}}{population}\) is all possible values
-
What can you learn about the population from the sample
-
populations have parameters (mean, variance)
-
variance = \(\sigma^2\)
-
\(M_x = E[x] = \int^\infty_{-\infty}x p(x) dx\)
-
\(\hat{M_x} = \frac{\sum x_i}{n}\) estimated value of mean (hat is estimate)
Geometric 2d
-
\(p_i \in \mathbb{R}^2\)
-
\(p_i^T = (x_i, y_i) = 1e_1 + 2e_2\) e is standard unit vector in respective dimension
-
\(p_i = \begin{bmatrix}x_i \\ y_i \end{bmatrix} \)
-
Plot-able points
-
Point == vector
-
Magnitude is norm is length = \(||v|| = \sqrt{x^2 + y^2 + ... \)
-
dot product = \(A\cdot B = AB^T = \begin{bmatrix}a_1 & a_2 \end{bmatrix}\begin{bmatrix} b_1 \\ b_2
\end{bmatrix} = a_1b_1 + a_2b_2\)
-
orthogonal vectors have a dot product of 0
-
90 degree angle in 2 dimensions
-
\(\cos \theta = \frac{x \cdot y}{||x||||y||}\)
-
Distance = \(\sqrt{(x_1-x_2)^2 (y_1 - y_2)^2}\)
Correlation
-
\(\sigma_{xy} = \frac{\sum(x_i-\mu_x)(y_i-\mu_y)}{n}\) is the covariance
-
correlation is the standardized/normalized covariance
-
\(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y} = \frac{\sigma_{xy}}{\sqrt{\sigma^2_x \sigma^2_y}} =
\frac{\frac{\sum(x_i-\mu_x)(y_i-\mu_y)}{n}}{\sqrt{\frac{\sum(x-\mu_x)^2}{n} +
\frac{\sum(y-\mu_y)^2}{n}}} = \frac{x' \cdot y'}{||x'||||y'||} = \cos \theta\)
-
Center the data on the mean (subtraction) then find \(\cos \theta\) that is the correlation