Week 1
Introduction
Supervised Learning
Unsupervised Learning
Model and Cost Function
Model Representation
Cost Function
Keep 2 parameters
== Squared Error Function
Eliminate 1 parameter
Our objective is to get the best possible line. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ0,θ1) will be 0
A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line.
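As a minimal sketch (the toy data below is assumed, not taken from the course), the squared error cost can be evaluated directly in Octave:

  % Assumed toy training set: three points that lie exactly on the line y = x.
  x = [1; 2; 3];
  y = [1; 2; 3];
  m = length(y);

  % Hypothesis h(x) = theta0 + theta1*x and its squared error cost J(theta0, theta1).
  J = @(theta0, theta1) (1 / (2 * m)) * sum((theta0 + theta1 * x - y) .^ 2);

  J(0, 1)     % the line passes through every training point, so J = 0
  J(0, 0.5)   % a worse fit, so J > 0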
Parameter Learning
Gradient Descent
So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.
At each iteration j, one should simultaneously update the parameters θ1,θ2,...,θn. Updating a specific parameter prior to calculating another one on the j(th) iteration would yield a wrong implementation.
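A minimal Octave sketch of the correct simultaneous update, with the parameter values, alpha and the derivative terms d0, d1 made up for illustration:

  % Illustration only: parameter values, alpha and the derivatives are made up.
  theta0 = 1;  theta1 = 2;  alpha = 0.1;
  d0 = 0.5;    d1 = -0.3;   % stand-ins for dJ/dtheta0 and dJ/dtheta1

  % Correct: compute both new values from the OLD parameters, then assign.
  temp0 = theta0 - alpha * d0;
  temp1 = theta1 - alpha * d1;
  theta0 = temp0;
  theta1 = temp1;

  % Wrong: assigning theta0 first and only then evaluating the derivative for
  % theta1 would use the new theta0 -- the incorrect implementation described above.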
Simplify with 1 parameter
Gradient Descent For Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:
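The resulting update rules, repeated until convergence, are:

θ0 := θ0 − α · (1/m) · Σi=1..m (hθ(x(i)) − y(i))
θ1 := θ1 − α · (1/m) · Σi=1..m ((hθ(x(i)) − y(i)) · x(i))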
==> batch gradient descent (each step of the descent uses all m training examples)
Why 1/2?
The '1/2' portion is a calculus trick, so that it will cancel with the '2' which appears in the numerator when we compute the partial derivatives. This saves us a computation in the cost function.
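Concretely, differentiating gives ∂J/∂θj = (1/m) · Σ (hθ(x(i)) − y(i)) · xj(i), with the 2 already cancelled. A minimal vectorized Octave sketch of batch gradient descent for linear regression (toy data and a hand-picked alpha, both assumed):

  % Assumed toy data: one feature plus a column of ones for the intercept.
  X = [ones(3, 1), [1; 2; 3]];    % m x 2 design matrix
  y = [2; 4; 6];                  % targets that satisfy y = 2x exactly
  m = length(y);

  theta = zeros(2, 1);            % [theta0; theta1]
  alpha = 0.1;                    % learning rate, hand-picked for this toy set

  for iter = 1:1000
    h = X * theta;                                   % hypothesis for all m examples
    theta = theta - (alpha / m) * (X' * (h - y));    % simultaneous batch update
  end

  theta   % approaches [0; 2] for this data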
Linear Algebra Review
Matrix Vector Multiplication
Matrix Matrix Multiplication
Matrix Matrix Multiplication TRICK
Matrix Vector Multiplication TRICK
Matrix Multiplication Properties
matrix matrix multiplication is not commutative
matrix matrix multiplication is associative
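A quick Octave check of both properties, with small matrices made up for illustration:

  A = [1 2; 3 4];
  B = [0 1; 1 0];
  C = [2 0; 0 3];

  A * B            % [2 1; 4 3] ...
  B * A            % [3 4; 1 2] ... A*B != B*A, so not commutative

  (A * B) * C      % grouping does not matter,
  A * (B * C)      % so multiplication is associative

  A * eye(2)       % the identity matrix is the exception: A*I = I*A = A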
Transpose
Inverse
Week 2
Multivariate Linear Regression
Multiple Features
Gradient Descent for Multiple Variables
Feature Scaling
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Mean normalization: xi := (xi − μi) / si, where μi is the average of all the values for feature (i) and si is the range of values (max − min), or si is the standard deviation.
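A minimal Octave sketch of mean normalization on an assumed toy feature matrix (each column is one feature):

  % Assumed toy features: column 1 = size in square feet, column 2 = number of bedrooms.
  X = [2104 5; 1416 3; 1534 3; 852 2];

  mu    = mean(X);               % mu_i: per-feature average
  sigma = std(X);                % s_i: here the standard deviation (max - min also works)

  X_norm = (X - mu) ./ sigma;    % every feature now has mean 0 and a comparable spread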
Learning Rate
Features and Polynomial Regression
We can improve our features and the form of our hypothesis function in a couple different ways.
We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1⋅x2.
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
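For instance (feature names and values assumed), both ideas just add columns to the design matrix, which then usually need feature scaling because the ranges grow quickly:

  % Combining two assumed features, e.g. frontage .* depth = area:
  x1 = [10; 20; 30];
  x2 = [ 3;  4;  5];
  x3 = x1 .* x2;

  % Polynomial features of a single assumed input x:
  x = [1; 2; 3; 4];
  X_cubic = [ones(size(x)), x, x .^ 2, x .^ 3];   % h = theta0 + theta1*x + theta2*x^2 + theta3*x^3
  X_sqrt  = [ones(size(x)), x, sqrt(x)];          % square-root shaped hypothesis instead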
Computing Parameters Analytically
Normal Equation
Normal Equation Noninvertibility
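The closed-form solution is θ = (XᵀX)⁻¹ · Xᵀ · y. In Octave it is usually written with pinv, which still returns a sensible answer when XᵀX is non-invertible (e.g. redundant features or m ≤ n). A minimal sketch on assumed toy data:

  % Assumed toy data: X is the m x (n+1) design matrix (first column all ones).
  X = [ones(3, 1), [1; 2; 3]];
  y = [2; 4; 6];

  theta = pinv(X' * X) * X' * y;   % gives [0; 2] here; no iterations, no alpha to choose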
Week 3 - Logistic Regression
Classification and Representation
Hypothesis Representation
"Sigmoid Function", also called the "Logistic Function"
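Here hθ(x) = g(θᵀx) with g(z) = 1 / (1 + e^(−z)); a one-line vectorized Octave version:

  g = @(z) 1 ./ (1 + exp(-z));   % sigmoid / logistic function, works element-wise

  g(0)       % 0.5
  g(100)     % close to 1
  g(-100)    % close to 0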
Decision Boundary
Logistic Regression Model
Cost function
Simplified Cost Function and Gradient Descent
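A vectorized Octave sketch of the simplified cost and its gradient, with a tiny assumed dataset (X includes the intercept column, y holds 0/1 labels):

  X = [1 1; 1 2; 1 3; 1 4];
  y = [0; 0; 1; 1];
  theta = zeros(2, 1);
  m = length(y);

  h = 1 ./ (1 + exp(-(X * theta)));                         % sigmoid hypothesis
  J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));     % simplified logistic cost
  grad = (1 / m) * (X' * (h - y));                          % gradient, same form as linear regression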
Multiclass Classification
One-vs-all
Solving the Problem of Overfitting
The Problem of Overfitting
There are two main options to address the issue of overfitting:
1) Reduce the number of features:
Manually select which features to keep.
Use a model selection algorithm (studied later in the course).
2) Regularization
Keep all the features, but reduce the magnitude of parameters θj.
Regularization works well when we have a lot of slightly useful features.
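For option 2, the penalty is added to the cost function itself; for regularized linear regression it becomes:

J(θ) = (1 / (2m)) · [ Σi=1..m (hθ(x(i)) − y(i))² + λ · Σj=1..n θj² ]

By convention θ0 is not penalized; a larger λ shrinks the remaining θj more strongly (and a λ that is too large causes underfitting).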
Regularized Linear Regression
Gradient Descent
Normal Equation
Regularized Logistic Regression
Week 4 - Neural Networks
Neural Networks
Model Representation I
a_i^(j) = "activation" of unit i in layer j
Θ^(j) = matrix of weights controlling function mapping from layer j to layer j+1
Model Representation II
Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".
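A minimal Octave sketch of forward propagation for one example through such a 3-layer network (the weights here are random placeholders, not trained values):

  x = [0.5; 0.2];                  % one example with two input features

  Theta1 = rand(3, 3) - 0.5;       % layer 1 (2 units + bias) -> layer 2 (3 units)
  Theta2 = rand(1, 4) - 0.5;       % layer 2 (3 units + bias) -> output layer (1 unit)

  g = @(z) 1 ./ (1 + exp(-z));     % sigmoid activation

  a1 = [1; x];                     % input layer plus bias unit
  a2 = [1; g(Theta1 * a1)];        % hidden layer activations plus bias unit
  h  = g(Theta2 * a2);             % output a3 = h_Theta(x)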
Application
Examples and Intuitions
Multiclass Classification
Week 5 - Neural Networks: Learning ⭐
Cost Function and Backpropagation
Backpropagation in Practice
Application of Neural Networks
❓
Backpropagation Intuition [BEST]
Implementation Note: Unrolling Parameters
Gradient Checking
Random Initialization
Putting It Together ⭐
Week 6 - Advice for Applying Machine Learning
Evaluating a Learning Algorithm
Bias vs. Variance
Evaluating a Hypothesis
Model Selection and Train/Validation/Test Sets
Diagnosing Bias vs. Variance
Learning Curves
Deciding What to Do Next Revisited
Getting more training examples: Fixes high variance
Trying smaller sets of features: Fixes high variance
Adding features: Fixes high bias
Adding polynomial features: Fixes high bias
Decreasing λ: Fixes high bias
Increasing λ: Fixes high variance.
Building a Spam Classifier
Handling Skewed Data
Using Large Data Sets
Week 7 - Support Vector Machines
Large Margin Classification
Kernels
SVMs in Practice
Week 8 - Unsupervised Learning
Clustering
Dimensionality Reduction
Principal Component Analysis
Applying PCA
Week 9
Anomaly Detection
Recommender Systems
Density Estimation
Building an Anomaly Detection System
Multivariate Gaussian Distribution (Optional)
Predicting Movie Ratings
Collaborative Filtering
Low Rank Matrix Factorization
Week 10 - Large Scale Machine Learning
Gradient Descent with Large Datasets
Advanced Topics
Stochastic Gradient Descent Convergence
Week 11 - Application Example: Photo OCR
1. Normalize the data by subtracting the mean value of each feature from the dataset, and scaling each dimension so that they are in the same range.
2. Compute the covariance matrix of the data.
3. Run SVD on it to compute the principal components:
[U, S, V] = svd(Sigma), where U will contain the principal components and S will contain the singular values on its diagonal.
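The three steps above as an Octave sketch (toy data assumed; k is the number of components kept):

  X = [2 1; 4 3; 6 5; 8 9];            % assumed toy data: m examples x n features
  [m, n] = size(X);

  % 1. Normalize: subtract the mean and scale each feature to a comparable range.
  mu = mean(X);
  sigma = std(X);
  X_norm = (X - mu) ./ sigma;

  % 2. Covariance matrix of the normalized data.
  Sigma = (1 / m) * (X_norm' * X_norm);

  % 3. SVD: the columns of U are the principal components.
  [U, S, V] = svd(Sigma);

  k = 1;                               % number of principal components to keep
  Z = X_norm * U(:, 1:k);              % project the data down to k dimensions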
Learning With Large Datasets
Stochastic Gradient Descent
Mini-Batch Gradient Descent