AI Week 3
Examples
spam detection in emails (classification problem with two possible values)
stock price prediction (regression problem with a real-valued output)
distinguishing addresses in mail (classification problem with many possible values - the task may be local, in which case the values are the actual addresses, or more general, in which case the values could be local sorting depots based on postcode area)
Linear Regression
Regression: learning a function that captures the trend between input and output values, which we can then use to predict outputs for new input values
Univariate Linear Regression:
we have one input attribute, so we can see the trend when plotting the points and can fit a straight line through them
Multivariate Linear Regression:
we have multiple input attributes - the weight vector w has one dimension per attribute (model forms sketched below)
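As a minimal sketch of the two model forms (standard notation with weights $w$ and intercept $w_0$, assumed rather than taken from the notes):

$$\text{univariate: } f_w(x) = w_1 x + w_0 \qquad \text{multivariate: } f_w(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_d x_d$$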
Cost function: tells us how badly a line fits the data in its entirety; the loss function is the difference between the line's prediction and the data at a single point
An example of a cost function is the mean squared error (see the equation below), which is useful because it's always non-negative
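As a sketch, the standard mean squared error over n training pairs (this is the usual textbook form; the exact notation in the notes may differ):

$$J(w) = \frac{1}{n}\sum_{i=1}^{n}\big(f_w(x^{(i)}) - y^{(i)}\big)^2$$

Each squared term is the loss at a single point; averaging them over the whole data set gives the cost.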
For our data set, we want to find a function f such that f(x) gives an appropriate output value y for each input x. We use input-output pairs as training data - each pair may have multiple input values
We have a training phase (where training data is fed into our suitable machine learning algorithm in order to produce a suitable function) and a test phase (where we use the function on previously unseen test data to test the suitability of the function)
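As a rough illustration of the two phases, a minimal Python sketch (the synthetic data and the use of np.polyfit as the "suitable machine learning algorithm" are illustrative assumptions, not from the notes):

```python
import numpy as np

# Synthetic input-output pairs: y = 3x + 2 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 0.5, 200)

# Hold out 20% of the pairs as previously unseen test data
split = int(0.8 * len(x))
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]

# Training phase: produce a suitable function from the training data
w1, w0 = np.polyfit(x_train, y_train, deg=1)  # returns [slope, intercept]

# Test phase: check the function's suitability on unseen data
pred = w1 * x_test + w0
print(f"test mean squared error: {np.mean((pred - y_test) ** 2):.3f}")
```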
Machine Learning is prevalent because...
We can apply generality in order to solve problems of the same type rather than having to start from scratch each time
The algorithm is adaptable, so it can change the function based on input-output values
It's applicable in many areas, since it's often difficult to know how to hardcode a solution or to see how data can be split
Gradient Descent
A strategy to minimise the cost function
We repeatedly step in the direction of steepest descent until no more change occurs / the change is below a threshold (convergence)
The direction of steepest descent is calculated by taking the derivative at the current location to get the slope of the tangent and moving in the negative direction (update rule and sketch below, after this list)
Steps get smaller because the gradient is shallower near a minimum, even though alpha (the learning rate, typically a small number like 0.01) is a constant
STOCHASTIC GRADIENT DESCENT: a randomly selected subset of the data is used at each step, which is useful when there are many data points
if the learning rate is too small, convergence will take a long time; if it is too large, we may overshoot the minimum and start to oscillate, or the error may increase with each iteration
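The per-weight update rule, in its standard form (assuming the mean squared error cost J and learning rate alpha from above):

$$w_j \leftarrow w_j - \alpha \frac{\partial J(w)}{\partial w_j}$$

And a minimal Python sketch of batch gradient descent for univariate linear regression (the names gradient_descent, alpha, tol, and max_iters are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, tol=1e-6, max_iters=10_000):
    """Fit f(x) = w1*x + w0 by minimising the mean squared error."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(max_iters):
        pred = w1 * x + w0
        # Partial derivatives of J(w) = (1/n) * sum((pred - y)^2)
        grad_w0 = (2.0 / n) * np.sum(pred - y)
        grad_w1 = (2.0 / n) * np.sum((pred - y) * x)
        # Step in the direction of steepest descent (negative gradient)
        new_w0 = w0 - alpha * grad_w0
        new_w1 = w1 - alpha * grad_w1
        # Convergence: stop once the change falls below the threshold
        if abs(new_w0 - w0) < tol and abs(new_w1 - w1) < tol:
            return new_w0, new_w1
        w0, w1 = new_w0, new_w1
    return w0, w1

# Usage: recover y = 3x + 2 from noisy samples
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3 * x + 2 + rng.normal(0, 0.5, 100)
w0, w1 = gradient_descent(x, y)
print(f"intercept ~ {w0:.2f}, slope ~ {w1:.2f}")
```

Stochastic gradient descent would replace the full sums with sums over a randomly selected mini-batch of points at each step.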
SUPERVISED LEARNING: the training data is a set of n input-output pairs, each input having d features:

$$\big((x_1^{(1)}, x_2^{(1)}, \dots, x_d^{(1)}),\, y^{(1)}\big), \dots, \big((x_1^{(n)}, x_2^{(n)}, \dots, x_d^{(n)}),\, y^{(n)}\big)$$
If features can be autonomously learned for various domains, the machine learning process becomes general insofar as the feature learning's performance is comparable or superior to that achieved using hand-crafted knowledge.
Overfitting is an example of generality not being accounted for: the learned function fits the training data too closely and performs poorly on unseen data
UNIVARIATE NON-LINEAR REGRESSION: an m-th order polynomial regression model (sketched below) - we get curves rather than straight lines
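A sketch of the standard m-th order polynomial model (conventional notation, assumed rather than copied from the notes):

$$f_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m$$

Note this is still linear in the weights w, so the same gradient descent machinery applies.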
The application requires that the software customise itself to its operational environment after it is fielded