Week 1
Introduction
Supervised Learning
Unsupervised Learning
Model and Cost Function
Model Representation
Cost Function
Keep 2 parameters
== Squared Error Function
Eliminate 1 parameter
Our objective is to get the best possible line. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ0,θ1) will be 0
A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line.
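As a minimal sketch (the toy data below is assumed, not taken from the course), the squared error cost can be evaluated directly in Octave:

  % Assumed toy training set: three points that lie exactly on the line y = x.
  x = [1; 2; 3];
  y = [1; 2; 3];
  m = length(y);

  % Hypothesis h(x) = theta0 + theta1*x and its squared error cost J(theta0, theta1).
  J = @(theta0, theta1) (1 / (2 * m)) * sum((theta0 + theta1 * x - y) .^ 2);

  J(0, 1)     % the line passes through every training point, so J = 0
  J(0, 0.5)   % a worse fit, so J > 0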
Parameter Learning
Gradient Descent
So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.
At each iteration j, one should simultaneously update the parameters θ1,θ2,...,θn. Updating a specific parameter prior to calculating another one on the j(th) iteration would yield a wrong implementation.
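A minimal Octave sketch of the correct simultaneous update, with the parameter values, alpha and the derivative terms d0, d1 made up for illustration:

  % Illustration only: parameter values, alpha and the derivatives are made up.
  theta0 = 1;  theta1 = 2;  alpha = 0.1;
  d0 = 0.5;    d1 = -0.3;   % stand-ins for dJ/dtheta0 and dJ/dtheta1

  % Correct: compute both new values from the OLD parameters, then assign.
  temp0 = theta0 - alpha * d0;
  temp1 = theta1 - alpha * d1;
  theta0 = temp0;
  theta1 = temp1;

  % Wrong: assigning theta0 first and only then evaluating the derivative for
  % theta1 would use the new theta0 -- the incorrect implementation described above.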
Simplify with 1 parameter
Gradient Descent For Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:
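The resulting update rules, repeated until convergence, are:

θ0 := θ0 − α · (1/m) · Σi=1..m (hθ(x(i)) − y(i))
θ1 := θ1 − α · (1/m) · Σi=1..m ((hθ(x(i)) − y(i)) · x(i))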
==> batch gradient descent (each step of the descent uses all m training examples)
Why 1/2?
The '1/2' portion is a calculus trick, so that it will cancel with the '2' which appears in the numerator when we compute the partial derivatives. This saves us a computation in the cost function.
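Concretely, differentiating gives ∂J/∂θj = (1/m) · Σ (hθ(x(i)) − y(i)) · xj(i), with the 2 already cancelled. A minimal vectorized Octave sketch of batch gradient descent for linear regression (toy data and a hand-picked alpha, both assumed):

  % Assumed toy data: one feature plus a column of ones for the intercept.
  X = [ones(3, 1), [1; 2; 3]];    % m x 2 design matrix
  y = [2; 4; 6];                  % targets that satisfy y = 2x exactly
  m = length(y);

  theta = zeros(2, 1);            % [theta0; theta1]
  alpha = 0.1;                    % learning rate, hand-picked for this toy set

  for iter = 1:1000
    h = X * theta;                                   % hypothesis for all m examples
    theta = theta - (alpha / m) * (X' * (h - y));    % simultaneous batch update
  end

  theta   % approaches [0; 2] for this data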
Linear Algebra Review
Matrix Vector Multiplication
Matrix Matrix Multiplication
Matrix Matrix Multiplication TRICK
Matrix Vector Multiplication TRICK
Matrix Multiplication Properties
matrix matrix multiplication is not commutative
matrix matrix multiplication is associative
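A quick Octave check of both properties, with small matrices made up for illustration:

  A = [1 2; 3 4];
  B = [0 1; 1 0];
  C = [2 0; 0 3];

  A * B            % [2 1; 4 3] ...
  B * A            % [3 4; 1 2] ... A*B != B*A, so not commutative

  (A * B) * C      % grouping does not matter,
  A * (B * C)      % so multiplication is associative

  A * eye(2)       % the identity matrix is the exception: A*I = I*A = A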
Transpose
Inverse
Week 2
Multivariate Linear Regression
Multiple Features
Gradient Descent for Multiple Variables
Feature Scaling
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Mean normalization: xi := (xi − μi) / si, where μi is the average of all the values for feature (i) and si is the range of values (max − min), or si is the standard deviation.
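A minimal Octave sketch of mean normalization on an assumed toy feature matrix (each column is one feature):

  % Assumed toy features: column 1 = size in square feet, column 2 = number of bedrooms.
  X = [2104 5; 1416 3; 1534 3; 852 2];

  mu    = mean(X);               % mu_i: per-feature average
  sigma = std(X);                % s_i: here the standard deviation (max - min also works)

  X_norm = (X - mu) ./ sigma;    % every feature now has mean 0 and a comparable spread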
Learning Rate
Features and Polynomial Regression
We can improve our features and the form of our hypothesis function in a couple different ways.
We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1⋅x2.
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
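For instance (feature names and values assumed), both ideas just add columns to the design matrix, which then usually need feature scaling because the ranges grow quickly:

  % Combining two assumed features, e.g. frontage .* depth = area:
  x1 = [10; 20; 30];
  x2 = [ 3;  4;  5];
  x3 = x1 .* x2;

  % Polynomial features of a single assumed input x:
  x = [1; 2; 3; 4];
  X_cubic = [ones(size(x)), x, x .^ 2, x .^ 3];   % h = theta0 + theta1*x + theta2*x^2 + theta3*x^3
  X_sqrt  = [ones(size(x)), x, sqrt(x)];          % square-root shaped hypothesis instead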
Computing Parameters Analytically
Normal Equation
Normal Equation Noninvertibility
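The closed-form solution is θ = (XᵀX)⁻¹ · Xᵀ · y. In Octave it is usually written with pinv, which still returns a sensible answer when XᵀX is non-invertible (e.g. redundant features or m ≤ n). A minimal sketch on assumed toy data:

  % Assumed toy data: X is the m x (n+1) design matrix (first column all ones).
  X = [ones(3, 1), [1; 2; 3]];
  y = [2; 4; 6];

  theta = pinv(X' * X) * X' * y;   % gives [0; 2] here; no iterations, no alpha to choose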
Week 3 - Logistic Regression
Classification and Representation
Hypothesis Representation
"Sigmoid Function", also called the "Logistic Function"
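Here hθ(x) = g(θᵀx) with g(z) = 1 / (1 + e^(−z)); a one-line vectorized Octave version:

  g = @(z) 1 ./ (1 + exp(-z));   % sigmoid / logistic function, works element-wise

  g(0)       % 0.5
  g(100)     % close to 1
  g(-100)    % close to 0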
Decision Boundary
Logistic Regression Model
Cost function
Simplified Cost Function and Gradient Descent
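A vectorized Octave sketch of the simplified cost and its gradient, with a tiny assumed dataset (X includes the intercept column, y holds 0/1 labels):

  X = [1 1; 1 2; 1 3; 1 4];
  y = [0; 0; 1; 1];
  theta = zeros(2, 1);
  m = length(y);

  h = 1 ./ (1 + exp(-(X * theta)));                         % sigmoid hypothesis
  J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));     % simplified logistic cost
  grad = (1 / m) * (X' * (h - y));                          % gradient, same form as linear regression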
Multiclass Classification
One-vs-all
Solving the Problem of Overfitting
The Problem of Overfitting
There are two main options to address the issue of overfitting:
1) Reduce the number of features:
Manually select which features to keep.
Use a model selection algorithm (studied later in the course).
2) Regularization
Keep all the features, but reduce the magnitude of parameters θj.
Regularization works well when we have a lot of slightly useful features.
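For option 2, the penalty is added to the cost function itself; for regularized linear regression it becomes:

J(θ) = (1 / (2m)) · [ Σi=1..m (hθ(x(i)) − y(i))² + λ · Σj=1..n θj² ]

By convention θ0 is not penalized; a larger λ shrinks the remaining θj more strongly (and a λ that is too large causes underfitting).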
Regularized Linear Regression
Gradient Descent
Normal Equation
Regularized Logistic Regression
Week 4 - Neural Networks
Neural Networks
Model Representation I
a_i^(j) = "activation" of unit i in layer j
Θ^(j) = matrix of weights controlling function mapping from layer j to layer j+1
Model Representation II
Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".
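A minimal Octave sketch of forward propagation for one example through such a 3-layer network (the weights here are random placeholders, not trained values):

  x = [0.5; 0.2];                  % one example with two input features

  Theta1 = rand(3, 3) - 0.5;       % layer 1 (2 units + bias) -> layer 2 (3 units)
  Theta2 = rand(1, 4) - 0.5;       % layer 2 (3 units + bias) -> output layer (1 unit)

  g = @(z) 1 ./ (1 + exp(-z));     % sigmoid activation

  a1 = [1; x];                     % input layer plus bias unit
  a2 = [1; g(Theta1 * a1)];        % hidden layer activations plus bias unit
  h  = g(Theta2 * a2);             % output a3 = h_Theta(x)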
Application
Examples and Intuitions
Multiclass Classification
Week 5 - Neural Networks: Learning ⭐
Cost Function and Backpropagation
Backpropagation in Practice
Application of Neural Networks
❓
Backpropagation Intuition [BEST]
Implementation Note: Unrolling Parameters
Gradient Checking
Random Initialization
Putting It Together ⭐
Week 6 - Advice for Applying Machine Learning
Evaluating a Learning Algorithm
Bias vs. Variance
Evaluating a Hypothesis
Model Selection and Train/Validation/Test Sets
Diagnosing Bias vs. Variance
Learning Curves
Deciding What to Do Next Revisited
Getting more training examples: Fixes high variance
Trying smaller sets of features: Fixes high variance
Adding features: Fixes high bias
Adding polynomial features: Fixes high bias
Decreasing λ: Fixes high bias
Increasing λ: Fixes high variance.
Building a Spam Classifier
Handling Skewed Data
Using Large Data Sets
Week 7 - Support Vector Machines
Large Margin Classification
Kernels
SVMs in Practice
Week 8 - Unsupervised Learning
Clustering
Dimensionality Reduction
Principal Component Analysis
Applying PCA
Week 9
Anomaly Detection
Recommender Systems
Density Estimation
Building an Anomaly Detection System
Multivariate Gaussian Distribution (Optional)
Predicting Movie Ratings
Collaborative Filtering
Low Rank Matrix Factorization
Week 10 - Large Scale Machine Learning
Gradient Descent with Large Datasets
Advanced Topics
Stochastic Gradient Descent Convergence
Week 11 - Application Example: Photo OCR
1. Normalize the data by subtracting the mean value of each feature from the dataset, and scaling each dimension so that they are in the same range.
2. Compute the covariance matrix of the data.
3. Run SVD on it to compute the principal components:
[U, S, V] = svd(Sigma), where U will contain the principal components and S will contain the singular values on its diagonal.
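The three steps above as an Octave sketch (toy data assumed; k is the number of components kept):

  X = [2 1; 4 3; 6 5; 8 9];            % assumed toy data: m examples x n features
  [m, n] = size(X);

  % 1. Normalize: subtract the mean and scale each feature to a comparable range.
  mu = mean(X);
  sigma = std(X);
  X_norm = (X - mu) ./ sigma;

  % 2. Covariance matrix of the normalized data.
  Sigma = (1 / m) * (X_norm' * X_norm);

  % 3. SVD: the columns of U are the principal components.
  [U, S, V] = svd(Sigma);

  k = 1;                               % number of principal components to keep
  Z = X_norm * U(:, 1:k);              % project the data down to k dimensions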
Learning With Large Datasets
Stochastic Gradient Descent
Mini-Batch Gradient Descent