Little bit of Machine learning

Already MIGRATED to Confluence


Review:
n features
m inputs (each input has n values, one per feature)
m outputs
Mathematical representation:
a matrix of size m × n.
X(i,j) is the value of input i for feature j.
For each input Xi there is an output Yi, i.e. Yi = H(Xi) = H([X(i,1), X(i,2), ..., X(i,n)])

The unknown element here is "the value of H" for inputs other than those in the training set (m inputs / outputs).
Now the question: how to compute H?
- First, H is supposed to preserve the values of our training set: (H(Xi) - Yi) ~ 0.
  One way to model this constraint is to minimize the Mean Squared Error:

  J = (1/2m) · Σ_{i=1..m} (H(Xi) - Yi)²

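As a minimal sketch of this setup (the data values below are made up for illustration), the m × n training matrix and the mean-squared-error criterion can be written with NumPy:

```python
import numpy as np

# Toy training set: m = 4 inputs, n = 2 features (values are illustrative).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])        # shape (m, n); X[i, j] = value of input i for feature j
Y = np.array([3.0, 2.5, 4.0, 7.0])  # one output Yi per input Xi

def mse(predictions, targets):
    """Mean squared error between H(Xi) and Yi over the training set."""
    return np.mean((predictions - targets) ** 2)

# A hypothesis that perfectly preserves the training set has error 0:
print(mse(Y, Y))  # 0.0
```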
- Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. - Tom Mitchell -

- It seems that any learning problem can be seen as an estimation (or prediction) of function values. That is, if h is a function of an input x, the learning problem seeks to predict what the output h(x) will be.

- Learning problems are classified into two categories:
  • Supervised learning: when we know in advance the output domain of h(X). It can be either:
    • Regression problem: continuous output, meaning that h is a continuous function.
    • Classification problem: discrete output, meaning that h is a discrete function.


  • Unsupervised learning: when we don't have any idea about the output domain. It's used to derive structure from the inputs by:
    • Clustering them with respect to some criteria (variables): "clustering learning"
    • Recognising the useful information (here we assume the input is a mix of useful data and noise): "recognising learning"

Regression problem
- Linear regression: here the function h is a line defined by h(x) = Ax + B, where A and B are the optimal values minimizing the following cost function: the "Mean squared error" of h(xi) - yi ([xi, yi] belongs to the training set).
"Gradient descent" is an algorithm giving an estimate of A and B.
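A minimal sketch of gradient descent for this one-variable case, using the usual MSE partial derivatives (the step size, iteration count, and data points below are illustrative choices):

```python
import numpy as np

def gradient_descent_line(x, y, alpha=0.05, iters=5000):
    """Estimate A and B in h(x) = A*x + B by minimizing the mean squared error."""
    A, B = 0.0, 0.0
    for _ in range(iters):
        err = A * x + B - y            # h(x_i) - y_i for every training point
        A -= alpha * (err * x).mean()  # partial derivative of the cost w.r.t. A
        B -= alpha * err.mean()        # partial derivative of the cost w.r.t. B
    return A, B

# Points drawn from the line y = 2x + 1 should recover A ≈ 2, B ≈ 1:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
A, B = gradient_descent_line(x, y)
```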

- Multivariate linear regression: thetas depict the contribution of each feature (x1, x2, ...) to estimate the output:

h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + ⋯ + θ_n·x_n

General formula of gradient descent for multiple variables:

θ_j := θ_j − α·(1/m)·Σ_{i=1..m} (h_θ(x^(i)) − y^(i))·x_j^(i)    for j := 0...n
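One such update step can be sketched as follows; the leading column of 1s stands for x_0 = 1 (the θ_0 term), and the toy data is an illustrative assumption:

```python
import numpy as np

def gradient_step(theta, X, y, alpha):
    """One gradient-descent update, applied simultaneously to every theta_j.

    X has shape (m, n+1): its first column is all 1s, standing for x_0 = 1.
    """
    m = X.shape[0]
    errors = X @ theta - y  # h_theta(x^(i)) - y^(i) for every training example i
    new_theta = theta.copy()
    for j in range(len(theta)):
        # (1/m) * sum over i of (h_theta(x^(i)) - y^(i)) * x_j^(i)
        new_theta[j] = theta[j] - alpha * (errors * X[:, j]).sum() / m
    return new_theta

# One step on a tiny example generated by y = 1 + 2*x_1:
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # x_0 = 1, one real feature
y = np.array([1.0, 3.0, 5.0])
theta = gradient_step(np.zeros(2), X, y, alpha=0.1)
```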

- Speed up gradient descent:
  • Feature scaling and Mean normalization = get input values in roughly the same "small" range:
    x_i := (x_i − μ_i) / s_i
    Where μi is the average of all the values for feature (i) and si is the range of values (max - min), or si is the standard deviation.
  • Learning rate: if too small ==> slow convergence. If too large ==> may fail to converge (it can even diverge).
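Feature scaling with mean normalization can be sketched like this (the house-size/bedrooms numbers are illustrative assumptions):

```python
import numpy as np

def mean_normalize(X):
    """Scale each feature column to roughly [-1, 1]: subtract the mean, divide by the range."""
    mu = X.mean(axis=0)                # mu_i: average value of feature i
    s = X.max(axis=0) - X.min(axis=0)  # s_i: range max - min (std dev would also work)
    return (X - mu) / s

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])   # e.g. house size in sq ft, number of bedrooms
X_scaled = mean_normalize(X)    # every column now lies within [-1, 1], centered at 0
```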
- Intuition behind polynomial regression: all the features are multiplicatively derived from a single one (x, x², x³, ...).
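That intuition can be sketched as follows; `polynomial_features` is a hypothetical helper name, and linear regression run on its output columns fits a polynomial in x:

```python
import numpy as np

def polynomial_features(x, degree):
    """Derive extra features x, x^2, ..., x^degree from a single feature x."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0])
P = polynomial_features(x, 3)
# columns are x, x^2, x^3 -> [[1, 1, 1], [2, 4, 8], [3, 9, 27]]
```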
- Gradient descent gives an algorithmic way to estimate h. There is a second way to do so analytically, by solving the Normal Equation (yet very slow if the number of features is large):
θ = (X^T X)^(-1) X^T y
If X^T X is noninvertible, the common causes might be:
  • Redundant features, where two features are very closely related (i.e. they are linearly dependent).
  • Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".
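A sketch of the normal equation with NumPy; `pinv` (the pseudo-inverse) is used so the formula still behaves when X^T X is noninvertible (the toy data is illustrative):

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^(-1) X^T y directly: no learning rate, no iterations."""
    # pinv handles the noninvertible cases (redundant features, m <= n) gracefully.
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Exact fit for data generated by y = 1 + 2*x_1:
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # leading 1s column for theta_0
y = np.array([1.0, 3.0, 5.0])
theta = normal_equation(X, y)  # ≈ [1.0, 2.0]
```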
Classification problem
- The hypothesis function h takes a new form called the "Sigmoid Function" or "Logistic Function":

  h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(−z))
- Indeed, this new form of h can be seen as a probability distribution (i.e. given an input x, what is the probability of the output y).
- "Mean squared error" is not convex in the case of logistic regression (hence not convenient for gradient descent). Instead, the new cost function to use is:

  J(θ) = −(1/m)·Σ_{i=1..m} [ y^(i)·log(h_θ(x^(i))) + (1 − y^(i))·log(1 − h_θ(x^(i))) ]
- A vectorized implementation of gradient descent is:
θ := θ − (α/m)·X^T·(g(Xθ) − y⃗)
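Putting the sigmoid, the cost function, and this vectorized update together in a short sketch (the toy data and step size are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Cross-entropy cost: -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = sigmoid(X @ theta)
    m = len(y)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def logistic_gradient_step(theta, X, y, alpha):
    """Vectorized update: theta := theta - (alpha/m) * X^T (g(X theta) - y)."""
    m = len(y)
    return theta - (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))

# A few steps should decrease the cost on a tiny separable set:
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])  # leading 1s column
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.zeros(2)
before = logistic_cost(theta, X, y)
for _ in range(100):
    theta = logistic_gradient_step(theta, X, y, alpha=0.5)
after = logistic_cost(theta, X, y)  # after < before
```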
