Lec2

Data mining

  • non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
  • exploration and analysis

Tasks

  • Regression
  • outlier detection
  • change detection
  • forecasting
  • classification

Exploratory data analysis

  • n points with d attributes
  • n by d matrix
  • \(x_{ij}\) is the value of the ith object's jth attribute
  • values can be blank (missing)
  • superscript denotes the dimension (attribute) j
  • subscript denotes the row (object) i; see the sketch after this list
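A minimal sketch of the n-by-d data matrix in Python (NumPy assumed; the numbers are made up for illustration):

```python
import numpy as np

# n = 4 objects (rows), d = 3 attributes (columns); np.nan marks a blank/missing value
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, np.nan, 1.3],   # missing value for this object's 2nd attribute
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])

n, d = X.shape    # number of objects, number of attributes
x_23 = X[1, 2]    # x_ij for i = 2, j = 3 (0-indexed in NumPy)
print(n, d, x_23)
```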

Geometric view 1d

  • each dimension is \(\mathbb{R}\)
  • \(x_i = (x_{i1}, x_{i2}, \ldots, x_{id})\)
  • \(x_i \in \mathbb{R}^d\); see the sketch below
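A tiny illustrative example of the same idea (NumPy assumed): each object \(x_i\) is one row of the data matrix, a vector in \(\mathbb{R}^d\).

```python
import numpy as np

X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.1, 1.3]])   # made-up 2-by-3 data matrix

x_1 = X[0]             # x_1 = (x_11, x_12, ..., x_1d), here d = 3
print(x_1, x_1.shape)  # shape (d,): an element of R^d
```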

Probabilistic 1d

  • A random variable is a quantity that takes a value for each member of the population (age, hair color, etc.)
  • universe/population is the set of all possible values
  • What can you learn about the population from the sample
  • populations have parameters (mean, variance)
  • variance = \(\sigma^2\)
  • \(\mu_x = E[x] = \int_{-\infty}^{\infty} x\, p(x)\, dx\) is the population mean
  • \(\hat{\mu}_x = \frac{\sum x_i}{n}\) is the estimated mean (the hat denotes an estimate); see the sketch below
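A minimal sketch of estimating the mean (and variance) from a sample (NumPy assumed; the synthetic "ages" are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=30.0, scale=5.0, size=1000)  # pretend these are sampled ages

mu_hat = sample.sum() / len(sample)        # mu_hat = (sum of x_i) / n
var_hat = ((sample - mu_hat) ** 2).mean()  # sigma^2 estimate with the same 1/n convention
print(mu_hat, var_hat)                     # close to the population parameters 30 and 25
```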

Geometric 2d

  • \(p_i \in \mathbb{R}^2\)
  • \(p_i^T = (x_i, y_i)\), i.e. \(p_i = x_i e_1 + y_i e_2\), where \(e_j\) is the standard unit vector in the respective dimension
  • \(p_i = \begin{bmatrix}x_i \\ y_i \end{bmatrix} \)
  • Plot-able points
  • Point == vector
  • the magnitude (norm, length) of a vector is \(\|v\| = \sqrt{x^2 + y^2 + \cdots}\)
  • dot product = \(A\cdot B = A^T B = \begin{bmatrix}a_1 & a_2 \end{bmatrix}\begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = a_1b_1 + a_2b_2\)
  • orthogonal vectors have a dot product of 0 (a 90 degree angle in 2 dimensions)
  • \(\cos \theta = \frac{x \cdot y}{||x||||y||}\)
  • Distance = \(\sqrt{(x_1-x_2)^2 + (y_1 - y_2)^2}\); see the sketch below
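A minimal sketch of the norm, dot product, angle, and distance formulas above (NumPy assumed; the two points are arbitrary and happen to be orthogonal):

```python
import numpy as np

p = np.array([3.0, 4.0])           # p^T = (x_p, y_p)
q = np.array([4.0, -3.0])

norm_p = np.sqrt(np.sum(p ** 2))   # ||p|| = sqrt(x^2 + y^2); same as np.linalg.norm(p)
dot_pq = p @ q                     # p . q = x_p*x_q + y_p*y_q = 0 here, so p and q are orthogonal
cos_theta = dot_pq / (np.linalg.norm(p) * np.linalg.norm(q))
dist_pq = np.linalg.norm(p - q)    # sqrt((x_p - x_q)^2 + (y_p - y_q)^2)

print(norm_p, dot_pq, cos_theta, dist_pq)
```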

Correlation

  • \(\sigma_{xy} = \frac{\sum(x_i-\mu_x)(y_i-\mu_y)}{n}\) is the covariance
  • correlation is the standardized/normalized covariance
  • \(\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y} = \frac{\sigma_{xy}}{\sqrt{\sigma^2_x \sigma^2_y}} = \frac{\frac{\sum(x_i-\mu_x)(y_i-\mu_y)}{n}}{\sqrt{\frac{\sum(x_i-\mu_x)^2}{n} \cdot \frac{\sum(y_i-\mu_y)^2}{n}}} = \frac{x' \cdot y'}{\|x'\|\,\|y'\|} = \cos \theta\), where \(x' = x - \mu_x\) and \(y' = y - \mu_y\) are the mean-centered vectors
  • Center the data on the mean (subtract \(\mu_x\), \(\mu_y\)); then \(\cos \theta\) between the centered vectors is the correlation (see the sketch below)
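A minimal sketch checking that the correlation equals \(\cos\theta\) of the mean-centered vectors (NumPy assumed; the data are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])
n = len(x)

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / n  # sigma_xy (1/n convention)
rho_xy = cov_xy / np.sqrt(x.var() * y.var())          # sigma_xy / (sigma_x * sigma_y)

xc, yc = x - x.mean(), y - y.mean()                   # center the data on the mean
cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(rho_xy, cos_theta)   # identical: correlation is cos(theta) of the centered data
```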