Machine Learning
Categorization
Supervised vs unsupervised
Supervised machine learning learns from a labeled dataset, whereas unsupervised learning works with unlabeled data.
Gradient Descent
Example: linear regression
When fitting a line $y = mx + b$ to a set of points, we are trying to minimize the sum of squared residuals:
$$L(m, b) = \sum_i \left( y_i - (m x_i + b) \right)^2$$
We can adjust the values of $m$ and $b$ as
$$m \leftarrow m - \alpha \frac{\partial L}{\partial m}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$
where $\alpha$ is the learning rate (or step size).
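The update rule above can be sketched in code. This is a minimal illustration, not a production fitter; the data, learning rate, and step count are made-up choices:

```python
# Gradient descent for fitting y = m*x + b by minimizing the
# sum of squared residuals. Constants here are illustrative.

def fit_line(xs, ys, alpha=0.01, steps=5000):
    """Minimize L(m, b) = sum_i (y_i - (m*x_i + b))^2 by gradient descent."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of L with respect to m and b
        grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))
        grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys))
        m -= alpha * grad_m / n   # divide by n so alpha is insensitive to dataset size
        b -= alpha * grad_b / n
    return m, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1
m, b = fit_line(xs, ys)     # converges to m ~ 2, b ~ 1
```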
Stochastic Gradient Descent
Similar to Q-learning, stochastic gradient descent updates the parameters after each individual sample (or small batch) rather than after a full pass through the dataset, which often leads to faster convergence.
Classification
For some data point, we say $x$ is a vector with some number of dimensions which describe the data, with a class label $y \in \{-1, +1\}$.
Trying to separate the classes with a line in standard form $w \cdot x = 0$, for a vector $w$, the line it describes (assuming $w \neq 0$) will be perpendicular to $w$. Any data point above the line will give a positive dot product with $w$.
In this case, the loss function would be the number of misclassified points:
$$L(w) = \sum_i \mathbf{1}\left[ y_i \,(w \cdot x_i) \le 0 \right]$$
The goal is to minimize the number of points misclassified and maximize the margin around the line that divides the data.
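The sign test above can be shown concretely; the weight vector and points here are made-up examples:

```python
# Classify points by the sign of the dot product with w, a minimal sketch.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, x):
    """+1 if x lies on the positive side of the line w . x = 0, else -1."""
    return 1 if dot(w, x) > 0 else -1

w = [0.0, 1.0]                   # boundary is the horizontal axis; w points "up"
print(classify(w, [3.0, 2.0]))   # point above the line -> 1
print(classify(w, [3.0, -2.0]))  # point below the line -> -1
```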
Introducing an intercept term
Since the decision boundary will not always pass through the origin, we add an intercept term $b$, called the bias term. Now we have $w$ to define the slope and $b$ to define the intercept, and the boundary becomes $w \cdot x + b = 0$.
Now, our loss function can be represented as:
$$L(w, b) = \sum_i \max\left(0,\ 1 - y_i \,(w \cdot x_i + b)\right)$$
The 1 comes from the margin around the decision boundary, which incentivizes space between points and the line.
The width of the margin depends on the two closest points on either side, and can be expressed as $\frac{2}{\|w\|}$.
Now the goals (maximizing the margin and minimizing the classification error) can be expressed as maximizing $\frac{2}{\|w\|}$ (equivalently, minimizing $\|w\|^2$) while keeping the loss above small.
Therefore, the overall objective is to minimize the value of
$$\frac{1}{2}\|w\|^2 + \sum_i \max\left(0,\ 1 - y_i \,(w \cdot x_i + b)\right)$$
There is an inherent trade-off between a wide margin and minimal classification errors, since there could be outliers. To control this relation between the two objectives, we add a constant $C$ to the classification term:
$$\frac{1}{2}\|w\|^2 + C \sum_i \max\left(0,\ 1 - y_i \,(w \cdot x_i + b)\right)$$
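This objective can be minimized directly with (sub)gradient descent. A minimal 2-D sketch, assuming made-up data and illustrative values for $C$, the learning rate, and the step count:

```python
# Subgradient descent on the soft-margin objective
#   0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b))
# All constants and data below are illustrative.

def svm_fit(points, labels, C=1.0, alpha=0.01, steps=5000):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(steps):
        gw = list(w)              # gradient of 0.5*||w||^2 is w itself
        gb = 0.0
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:        # hinge term is active for this point
                gw[0] -= C * y * x1
                gw[1] -= C * y * x2
                gb -= C * y
        w = [wi - alpha * gi for wi, gi in zip(w, gw)]
        b -= alpha * gb
    return w, b

points = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (2.0, 3.0)]
labels = [-1, -1, 1, 1]
w, b = svm_fit(points, labels)
# after training, every point should satisfy y * (w . x + b) > 0
```

When a point's margin is below 1, it contributes to the hinge gradient; otherwise only the $\frac{1}{2}\|w\|^2$ term (which shrinks $w$ and widens the margin) acts.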
Feature Transforms
For data that cannot be separated by a single linear separator, applying a transformation allows linear classification.
An example would be data in a 1D space: mapping each point $x$ onto a parabola via $x \mapsto (x, x^2)$ allows a linear separator to classify a class that is originally surrounded on both sides by the other class.
One downside to this approach is choosing the correct transform. For scattered data in a high-dimensional space, it can be impractical to find a transform that makes linear classification possible.
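The parabola example can be checked directly; the data below is illustrative:

```python
# Feature transform sketch: 1-D points that no single threshold can
# separate become linearly separable after mapping x -> (x, x^2).

def lift(x):
    return (x, x * x)   # map each point onto a parabola

inner = [-0.5, 0.0, 0.5]        # one class, surrounded on both sides
outer = [-2.0, -1.5, 1.5, 2.0]  # the other class

# In 1-D, the inner class sits between the two halves of the outer class.
# After lifting, the horizontal line x2 = 1 separates them:
separable = (all(lift(x)[1] > 1 for x in outer) and
             all(lift(x)[1] < 1 for x in inner))
print(separable)  # True
```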
Quantifying Confidence using the Sigmoid Function
To determine the confidence of a given classification, we can imagine mapping the distance to the separator onto a sigmoid function. The (signed, unnormalized) distance would be $w \cdot x + b$ and a sigmoid function is given by:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
We can also use the sigmoid function and gradient descent to determine the correct boundary to classify data (called logistic regression). The loss function in this case (called binary cross-entropy) would be, for labels $y_i \in \{0, 1\}$,
$$L(w, b) = -\sum_i \left[ y_i \log \sigma(w \cdot x_i + b) + (1 - y_i) \log\left(1 - \sigma(w \cdot x_i + b)\right) \right]$$
For each data point, this collapses into quantifying either how correct or how incorrect the classification is. The log ensures the loss goes to 0 as the confidence goes to 1.
Finding the gradient of $L$ with respect to $w$, we find
$$\nabla_w L = \sum_i \left( \sigma(w \cdot x_i + b) - \mathbf{1}[y_i = 1] \right) x_i$$
where $\mathbf{1}[\cdot]$ is the indicator function, defined as
$$\mathbf{1}[P] = \begin{cases} 1 & \text{if } P \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
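Putting the sigmoid, the cross-entropy gradient, and gradient descent together gives a minimal 1-D logistic regression sketch; the data, learning rate, and step count are illustrative:

```python
import math

# Logistic regression by gradient descent, a minimal 1-D sketch.
# Labels are 0/1; constants and data below are illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_fit(xs, ys, alpha=0.05, steps=5000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradient of binary cross-entropy: sum_i (sigma(w*x_i + b) - y_i) * x_i
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys))
        gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys))
        w -= alpha * gw
        b -= alpha * gb
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = logistic_fit(xs, ys)
# the predicted probability of class 1 crosses 0.5 between x = 2 and x = 3
print(sigmoid(w * 2 + b) < 0.5 < sigmoid(w * 3 + b))
```

Since the $\mathbf{1}[y_i = 1]$ term equals $y_i$ for 0/1 labels, the gradient in the code uses $y$ directly.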