Data matrix
Data can be represented by an \(n \times d\) matrix
Attributes
Numerical Attributes
Categorical Attributes
Distance and Angle
Distance
- \(||a-b||\)
Angle
- \(\cos \theta = \frac{a^Tb}{||a||\,||b||}\)
Orthogonality
- If the dot product \(a^Tb\) is 0, then \(\theta\) is \(90^\circ\) and the vectors are orthogonal
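The distance, angle, and orthogonality definitions above can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`dot`, `norm`, `distance`, `cos_angle`) and the example vectors are illustrative assumptions, not part of the notes.

```python
import math

def dot(a, b):
    # Dot product a^T b of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean norm ||a||.
    return math.sqrt(dot(a, a))

def distance(a, b):
    # Euclidean distance ||a - b||.
    return norm([x - y for x, y in zip(a, b)])

def cos_angle(a, b):
    # Cosine of the angle between a and b.
    return dot(a, b) / (norm(a) * norm(b))

# Example vectors chosen (as an assumption) so that a^T b = 0.
a = [3.0, 4.0]
b = [-4.0, 3.0]

dist = distance(a, b)
cos_theta = cos_angle(a, b)
theta_deg = math.degrees(math.acos(cos_theta))
print(dist, cos_theta, theta_deg)
```

Because the dot product of these two vectors is zero, the cosine is zero and the angle comes out as a right angle.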
Orthogonal projection
The projection of b on a is:
- \(p = b_{||} = ca = (\frac{a^Tb}{a^Ta})a\), where the scalar \(c = \frac{a^Tb}{a^Ta}\)
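As a rough illustration of the projection formula, the sketch below computes \(p\) for two small example vectors and checks that the residual \(b - p\) is orthogonal to \(a\). The function name `project` and the vectors are assumptions made for the example.

```python
def dot(a, b):
    # Dot product a^T b.
    return sum(x * y for x, y in zip(a, b))

def project(b, a):
    # Orthogonal projection of b onto a: p = (a^T b / a^T a) a.
    c = dot(a, b) / dot(a, a)
    return [c * x for x in a]

a = [2.0, 0.0]
b = [3.0, 4.0]

p = project(b, a)
# The part of b that is perpendicular to a.
residual = [bi - pi for bi, pi in zip(b, p)]
print(p, residual, dot(residual, a))
```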
Mean and total Variance
D is the \(n \times d\) data matrix; each row \(x_i\) is one point
Mean
- \(mean(D) = \mu = \frac{1}{n}\sum\limits^n_{i=1}x_i\)
Total variance
- \(var(D) = \frac{1}{n}\sum\limits^n_{i=1}\delta(x_i, \mu)^2 = \frac{1}{n} \sum\limits^n_{i=1}||x_i - \mu||^2\)
- Equivalently, \(var(D) = \frac{1}{n}(\sum\limits^n_{i=1}||x_i||^2) - ||\mu||^2\)
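The two expressions for the total variance should agree on any data matrix. Here is a small sketch that computes the mean and both forms of the total variance for a made-up \(3 \times 2\) matrix; the data values and variable names are assumptions for illustration only.

```python
# Hypothetical data matrix: n = 3 points (rows), d = 2 attributes (columns).
D = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
n = len(D)
d = len(D[0])

# Mean vector: average of the rows, one entry per column.
mu = [sum(row[j] for row in D) / n for j in range(d)]

# Total variance from the definition: average squared distance to the mean.
var_def = sum(sum((x - m) ** 2 for x, m in zip(row, mu)) for row in D) / n

# Total variance from the shortcut: average squared norm minus ||mu||^2.
var_alt = sum(sum(x * x for x in row) for row in D) / n - sum(m * m for m in mu)

print(mu, var_def, var_alt)
```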
Centered Data matrix
- The centered data matrix is created by subtracting the mean vector \(\mu\) from every point (row) \(x_i\) of \(D\); the mean of the centered data is the zero vector
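Centering can be sketched in a few lines. The example below reuses a made-up \(3 \times 2\) matrix (an assumption, not data from the notes), subtracts the mean vector from each row, and confirms that the centered data has zero mean.

```python
# Hypothetical data matrix: rows are points, columns are attributes.
D = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
n = len(D)
d = len(D[0])

# Mean vector of the original data.
mu = [sum(row[j] for row in D) / n for j in range(d)]

# Centered data matrix: subtract mu from every row.
Z = [[x - m for x, m in zip(row, mu)] for row in D]

# Mean of the centered data, which should be the zero vector.
centered_mean = [sum(row[j] for row in Z) / n for j in range(d)]
print(Z, centered_mean)
```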
Linear independence and Dimensionality
Row and Column Space
- Column space is the set of all linear combinations of the columns of \(D\)
- Row space is the set of all linear combinations of the rows of \(D\), which is the column space of \(D^T\)
Linear Independence
Vectors are linearly dependent if at least one of them can be written as a linear combination of the others; otherwise they are linearly independent.
Dimension and Rank
- Dimensionality = number of columns, \(d\)
- Rank is the number of linearly independent columns (equivalently, rows). It is determined by converting the matrix to echelon form and counting the non-zero rows. This is also referred to as the dimensionality of the column space.
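The echelon-form procedure for the rank can be sketched as follows. This is one possible implementation, assuming exact arithmetic via `fractions.Fraction` to avoid rounding issues; the function name `rank` and the example matrix are illustrative.

```python
from fractions import Fraction

def rank(matrix):
    # Copy into exact fractions so rounding cannot create spurious pivots.
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        # Look for a non-zero entry in this column at or below pivot_row.
        pivot = next((r for r in range(pivot_row, n_rows)
                      if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        # Eliminate the entries below the pivot.
        for r in range(pivot_row + 1, n_rows):
            factor = rows[r][col] / rows[pivot_row][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == n_rows:
            break
    # Number of pivots equals the number of non-zero rows in echelon form.
    return pivot_row

# The second row is twice the first, so only two rows are independent.
D = [[1, 2, 3],
     [2, 4, 6],
     [1, 0, 1]]
print(rank(D))
```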
Probabilistic View
- Each numeric attribute \(X\) is a random variable: a function that assigns a real number to each outcome of an experiment
- X is a function, \(X:\mathcal{O} \rightarrow \mathbb{R}\)
- \(\mathcal{O}\) is the domain of \(X\)
Probability mass function
- If \(X\) is discrete:
- \(f(x) = P(X=x) \text{ for all } x \in \mathbb{R}\)
- \(f(x)\) is the probability that \(X = x\)
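To make the random-variable and probability-mass-function definitions concrete, here is a small sketch. It assumes a hypothetical experiment of tossing two fair coins, with \(X\) defined as the number of heads; none of this comes from the notes themselves.

```python
from fractions import Fraction
from itertools import product

# Outcome space O: all sequences of two coin tosses, assumed equally likely.
outcomes = list(product("HT", repeat=2))

def X(outcome):
    # Random variable: maps an outcome to a real number (number of heads).
    return outcome.count("H")

# Build the PMF by accumulating the probability of each outcome.
pmf = {}
for o in outcomes:
    pmf[X(o)] = pmf.get(X(o), Fraction(0)) + Fraction(1, len(outcomes))

def f(x):
    # f(x) = P(X = x); zero for values X never takes.
    return pmf.get(x, Fraction(0))

print(f(0), f(1), f(2), f(3))
```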
Probability density function
- If \(X\) is continuous, then \(P(X=x) = 0\) for all \(x \in \mathbb{R}\)
- Probability is spread so thinly over the range that it can only be measured over intervals \([a,b]\subset\mathbb{R}\) rather than at specific points.
- \(P(X\in[a,b]) = \displaystyle\int\limits^b_af(x)dx\)
- \(f(x) \ge 0\) for all \(x \in \mathbb{R}\)
- \(\displaystyle\int\limits^\infty_{-\infty}f(x)dx = 1\)
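- \(f(x)\) can be read as the ratio of the probability mass in a small interval around \(x\) to the width of that interval: \(f(x) \approx \frac{P(X \in [x-\epsilon, x+\epsilon])}{2\epsilon}\)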
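The density properties above can be explored numerically. The sketch below assumes, purely as an example, the exponential density \(f(x) = e^{-x}\) for \(x \ge 0\), and uses a simple midpoint rule (the helper `integrate` is hypothetical) to approximate an interval probability and the mass-to-width ratio around a point.

```python
import math

def f(x):
    # Example density (an assumption): exponential with rate 1.
    return math.exp(-x) if x >= 0 else 0.0

def integrate(g, a, b, steps=100_000):
    # Midpoint-rule approximation of the integral of g over [a, b].
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) for i in range(steps)) * h

# P(X in [1, 2]), to be compared with the closed form e^-1 - e^-2.
p_interval = integrate(f, 1.0, 2.0)

# Probability mass in [x - eps, x + eps] divided by the width 2 * eps.
eps = 1e-3
density_ratio = integrate(f, 1.0 - eps, 1.0 + eps, steps=1000) / (2 * eps)

print(p_interval, math.exp(-1) - math.exp(-2))
print(density_ratio, f(1.0))
```

As the interval shrinks, the ratio approaches the density value at the point, even though the probability of any single point is zero.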
Cumulative Distribution Function
- CDF \(F:\mathbb{R} \rightarrow [0,1]\)
- \(F(x) = P(X\le x)\) for all \(-\infty < x < \infty\)
- Discrete CDF: \(F(x) = P(X \le x) = \displaystyle{\sum\limits_{u\le x}}f(u)\)
- Continuous CDF: \(F(x) = P(X \le x) = \displaystyle\int\limits^x_{-\infty}f(u)du\)
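For the discrete case, the CDF is a running sum of the probability mass function. A minimal sketch, assuming a small hand-written PMF (number of heads in two fair coin tosses) as the example:

```python
from fractions import Fraction

# Hypothetical PMF: number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def F(x):
    # Discrete CDF: sum of f(u) over all u <= x.
    return sum(p for u, p in pmf.items() if u <= x)

print(F(-1), F(0), F(1), F(1.5), F(2))
```

The CDF is defined for every real \(x\), not only for values that \(X\) takes, which is why `F(1.5)` equals `F(1)`.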
Bivariate Random Variables
- Analyze two attributes together as a bivariate random variable
- \(\boldsymbol{X} = \begin{pmatrix}X_1 \\ X_2\end{pmatrix}\)
- \(X:\mathcal{O}\rightarrow \mathbb{R}^2\)
- Assigns each outcome a pair of real numbers, a 2 dimensional vector \(\begin{pmatrix}x_1 \\ x_2 \end{pmatrix} \in \mathbb{R}^2\)
Joint Probability Mass Function
- \(\displaystyle{\sum\limits_x}f(x) = \displaystyle{\sum\limits_{x_1}}\displaystyle{\sum\limits_{x_2}} f(x_1, x_2) = 1\)
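The normalization condition for a discrete bivariate random variable can be checked on a toy example. The sketch below assumes two fair coin tosses, with \(X_1\) indicating whether the first toss is heads and \(X_2\) counting the total number of heads; this setup is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Outcome space: two coin tosses, assumed equally likely.
outcomes = list(product("HT", repeat=2))

def X(outcome):
    # Bivariate random variable: maps an outcome to a pair (x1, x2).
    x1 = 1 if outcome[0] == "H" else 0   # first toss is heads
    x2 = outcome.count("H")              # total number of heads
    return (x1, x2)

# Joint PMF f(x1, x2) = P(X1 = x1, X2 = x2).
joint = {}
for o in outcomes:
    joint[X(o)] = joint.get(X(o), Fraction(0)) + Fraction(1, len(outcomes))

# Double sum over x1 and x2, which should equal 1.
total = sum(joint.values())
print(joint, total)
```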
Joint Probability Density Function
- \(P(\boldsymbol{X} \in W) = \displaystyle\int\displaystyle\int\limits_{x \in W}f(x)dx = \displaystyle\int\displaystyle\int\limits_{(x_1,x_2)^T\in W}f(x_1,x_2)dx_1dx_2\)
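The double integral over a region \(W\) can be approximated on a grid. The sketch below assumes, as an example only, the joint density \(f(x_1, x_2) = x_1 + x_2\) on the unit square (zero elsewhere) and a rectangular region \(W = [0, 0.5] \times [0, 0.5]\); the helper `integrate_2d` is hypothetical.

```python
def f(x1, x2):
    # Example joint density (an assumption): x1 + x2 on the unit square.
    inside = 0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0
    return x1 + x2 if inside else 0.0

def integrate_2d(g, a1, b1, a2, b2, steps=200):
    # Midpoint-rule approximation of the double integral of g
    # over the rectangle [a1, b1] x [a2, b2].
    h1 = (b1 - a1) / steps
    h2 = (b2 - a2) / steps
    total = 0.0
    for i in range(steps):
        x1 = a1 + (i + 0.5) * h1
        for j in range(steps):
            x2 = a2 + (j + 0.5) * h2
            total += g(x1, x2)
    return total * h1 * h2

# The density should integrate to 1 over its whole support.
total_mass = integrate_2d(f, 0.0, 1.0, 0.0, 1.0)

# P(X in W) for W = [0, 0.5] x [0, 0.5]; the closed form is 0.125.
p_W = integrate_2d(f, 0.0, 0.5, 0.0, 0.5)
print(total_mass, p_W)
```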