MACHINE LEARNING COURSERA
Week 1
Introduction
Supervised Learning
Unsupervised Learning
Model and Cost Function
Model Representation
Cost Function
Keep 2 parameters
A contour plot is a graph that contains many contour lines. A contour line of a two-variable function connects the points at which the function takes the same constant value.
== Squared Error Function
Why 1/2 ?
The '1/2' portion is a calculus trick: it cancels the '2' that comes down from the exponent when we compute the partial derivatives, which saves a computation, and scaling the cost function by a constant does not change where its minimum is.
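Written out (reconstructed from the lecture), the cost and the cancellation that the 1/2 buys when differentiating:

$$
J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2,
\qquad
\frac{\partial}{\partial\theta_j}\,\frac{1}{2}\left(h_\theta(x)-y\right)^2 = \left(h_\theta(x)-y\right)\frac{\partial h_\theta(x)}{\partial\theta_j}
$$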
Eliminate 1 parameter
Our objective is to get the best possible line. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ0,θ1) will be 0
Parameter Learning
Gradient Descent
So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.
At each iteration j, one should simultaneously update the parameters θ1, θ2, ..., θn. Updating a specific parameter before calculating the others within the same iteration would yield a wrong implementation.
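A minimal Octave sketch of one simultaneous-update step for the two-parameter case (toy data and variable names are mine, not from the notes):

```octave
% One gradient-descent step with a simultaneous update (theta0, theta1).
x = [1; 2; 3];  y = [2; 4; 6];          % toy training set
m = length(y);
X = [ones(m, 1), x];                    % design matrix with intercept column
theta = [0; 0];  alpha = 0.1;
h     = X * theta;                      % predictions using the *current* theta
temp0 = theta(1) - alpha * (1/m) * sum((h - y) .* X(:, 1));
temp1 = theta(2) - alpha * (1/m) * sum((h - y) .* X(:, 2));
theta = [temp0; temp1];                 % only now are both parameters updated
```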
Simplify with 1 parameter
Gradient Descent For Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:
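Reconstructed from the lecture (repeat until convergence, updating both parameters simultaneously):

$$
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)\\
\theta_1 &:= \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}
\end{aligned}
$$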
==> batch gradient descent
Linear Algebra Review
Matrix Vector Multiplication
Matrix Vector Multiplication
TRICK
Matrix Matrix Multiplication
Matrix Matrix Multiplication
TRICK
Matrix Multiplication Properties
Matrix-matrix multiplication is not commutative
Matrix-matrix multiplication is associative
Transpose
Inverse
Week 2
Multivariate Linear Regression
Multiple Features
Gradient Descent for Multiple Variables
Feature Scaling
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
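The mean-normalization formula from the lecture (reconstructed):

$$
x_i := \frac{x_i - \mu_i}{s_i}
$$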
Where μi is the average of all the values for feature (i) and si is the range of values (max - min), or si is the standard deviation.
Learning Rate
Polynomial Regression
We can improve our features and the form of our hypothesis function in a couple different ways.
We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1⋅x2.
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
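For example, with a single feature x1 (illustrative forms from the lecture):

$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3
\qquad\text{or}\qquad
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}
$$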
Computing Parameters Analytically
Normal Equation
Normal Equation Noninvertibility
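A one-line Octave sketch (toy data added so it runs); using pinv rather than inv also covers the noninvertible X'X case (redundant features, or m ≤ n):

```octave
% Toy data generated from y = 1 + 2x, so theta should come out near [1; 2].
x = [1; 2; 3; 4];  y = [3; 5; 7; 9];
X = [ones(length(x), 1), x];
% Normal equation; pinv still works when X'*X is singular.
theta = pinv(X' * X) * X' * y
```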
Week 3 - Logistic Regression
Classification and Representation
Hypothesis Representation
Sigmoid Function," also called the "Logistic Function"
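Written out:

$$
h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad 0 \le h_\theta(x) \le 1
$$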
Decision Boundary
Logistic Regression Model
Cost function
Simplified Cost Function and Gradient Descent
Simplified Cost Function
Link Title
Same as Linear Regression; only h(x) is different
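The simplified cost and the update rule, reconstructed from the lecture; the update has the same form as linear regression's, but with h_θ(x) = g(θ^T x):

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right],
\qquad
\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$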
Multiclass Classification
One-vs-all
Solving the Problem of Overfitting
The Problem of Overfitting
There are two main options to address the issue of overfitting:
1) Reduce the number of features:
Manually select which features to keep.
Use a model selection algorithm (studied later in the course).
2) Regularization
Keep all the features, but reduce the magnitude of parameters θj.
Regularization works well when we have a lot of slightly useful features (the regularized cost is written out below this list).
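For regularized linear regression, for example, the cost that gets minimized (reconstructed from the lecture) adds a penalty on large θj:

$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]
$$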
Regularized Linear Regression
Gradient Descent
Normal Equation
Regularized Logistic Regression
Week 4 - Neural Networks
Neural Networks
Model Representation I
a(j)i="activation" of unit i in layer j
Θ(j)=matrix of weights controlling function mapping from layer j to layer j+1
#Link Title
Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), which finally outputs the hypothesis function, known as the "output layer".
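With that notation, forward propagation for a network with one hidden layer looks like (reconstructed; a bias unit a_0^(l) = 1 is added to each layer before multiplying):

$$
a^{(1)} = x,\qquad
z^{(2)} = \Theta^{(1)} a^{(1)},\qquad
a^{(2)} = g\!\left(z^{(2)}\right),\qquad
h_\Theta(x) = a^{(3)} = g\!\left(\Theta^{(2)} a^{(2)}\right)
$$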
Model Representation II
Link to Coursera
Application
Examples and Intuitions
Part 1
Where g(z) is the sigmoid function.
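A concrete instance from this part of the lecture (weights as I recall them, so treat as illustrative): a single unit with Θ = [−30, 20, 20] computes x1 AND x2.

```octave
% Single sigmoid unit computing logical AND of two binary inputs.
g  = @(z) 1 ./ (1 + exp(-z));      % sigmoid
x1 = [0; 0; 1; 1];
x2 = [0; 1; 0; 1];
h  = g(-30 + 20*x1 + 20*x2);       % approx [0; 0; 0; 1], i.e. x1 AND x2
round(h)
```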
Multiclass Classification
Week 5 - Neural Networks: Learning :star:
Cost Function and Backpropagation
Cost Function
Backpropagation Algorithm
:question:
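For reference, the δ recurrence from the lecture (reconstructed; ∘ is element-wise multiplication, and there is no δ^(1)):

$$
\delta^{(L)} = a^{(L)} - y,\qquad
\delta^{(l)} = \left(\left(\Theta^{(l)}\right)^T \delta^{(l+1)}\right) \circ a^{(l)} \circ \left(1 - a^{(l)}\right),\qquad
\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}\left(a^{(l)}\right)^T
$$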
Backpropagation Intuition [BEST]
Link
Backpropagation in Practice
Implementation Note: Unrolling Parameters
Link
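The Octave idiom from the lecture for unrolling and reshaping (the 10×11 and 1×11 sizes here are just example dimensions):

```octave
% Example weight matrices (sizes are illustrative).
Theta1 = ones(10, 11);
Theta2 = 2 * ones(1, 11);
% Unroll into one long vector for fminunc/fmincg...
thetaVec = [Theta1(:); Theta2(:)];
% ...and reshape back inside the cost function.
Theta1 = reshape(thetaVec(1:110),   10, 11);
Theta2 = reshape(thetaVec(111:121),  1, 11);
```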
Gradient Checking
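A two-sided numerical approximation, roughly as in the lecture (EPSILON ≈ 1e-4); the toy cost J here is mine, only there so the snippet runs. gradApprox is then compared element-wise against the backprop gradient:

```octave
% Toy cost so the snippet is runnable; the gradient of sum(t.^2) is 2*t.
J = @(t) sum(t .^ 2);
theta = [1; 2; 3];
EPSILON = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:numel(theta)
  thetaPlus  = theta;  thetaPlus(i)  = thetaPlus(i)  + EPSILON;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
gradApprox        % should be close to 2*theta = [2; 4; 6]
```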
Random Initialization
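Octave sketch of symmetry breaking (INIT_EPSILON is a small constant such as 0.12; the matrix sizes are illustrative):

```octave
% Each weight starts as a random value in [-INIT_EPSILON, INIT_EPSILON],
% so hidden units do not begin identical (symmetry breaking).
INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;
```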
Putting It Together :star:
Link
Application of Neural Networks
Week 6 - Advice for Applying Machine Learning
Evaluating a Learning Algorithm
Evaluating a Hypothesis
Link
Model Selection and Train/Validation/Test Sets
Bias vs. Variance
Diagnosing Bias vs. Variance
Learning Curves
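How the curves diagnose each case (summary of the lecture):

$$
\text{High bias: } J_{train}(\theta)\ \text{high and}\ J_{cv}(\theta) \approx J_{train}(\theta);
\qquad
\text{High variance: } J_{train}(\theta)\ \text{low and}\ J_{cv}(\theta) \gg J_{train}(\theta)
$$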
Deciding What to Do Next Revisited
Getting more training examples: Fixes high variance
Trying smaller sets of features: Fixes high variance
Adding features: Fixes high bias
Adding polynomial features: Fixes high bias
Decreasing λ: Fixes high bias
Increasing λ: Fixes high variance.
Building a Spam Classifier
Handling Skewed Data
Using Large Data Sets
Week 7 - Support Vector Machines
Large Margin Classification
Kernels
SVMs in Practice
Week 8 - Unsupervised Learning
Clustering
Dimensionality Reduction
Principal Component Analysis
1. Normalize the data by subtracting the mean value of each feature from the dataset, and scaling each dimension so that they are in the same range
2. Compute the covariance matrix of the data
3. Run SVD on it to compute the principal components
[U, S, V] = svd(Sigma), where U will contain the principal components and S will contain a diagonal matrix of the corresponding singular values (see the Octave sketch below).
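The whole pipeline as a short Octave sketch (the toy data and the choice k = 1 are mine):

```octave
% X is m x n, one example per row (toy data so the snippet runs).
X = [2 3; 4 7; 6 11; 8 15];
% 1. Mean normalization / feature scaling.
mu    = mean(X);
sigma = std(X);
Xnorm = (X - mu) ./ sigma;
% 2. Covariance matrix.
Sigma = (1 / size(Xnorm, 1)) * (Xnorm' * Xnorm);
% 3. SVD; the columns of U are the principal components.
[U, S, V] = svd(Sigma);
k = 1;                       % number of components to keep
Z = Xnorm * U(:, 1:k);       % project the data onto the top-k components
```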
Applying PCA
Week 9
Anomaly Detection
Density Estimation
Building an Anomaly Detection System
Multivariate Gaussian Distribution (Optional)
Recommender Systems
Predicting Movie Ratings
Collaborative Filtering
Low Rank Matrix Factorization
Week 10 - Large Scale Machine Learning
Gradient Descent with Large Datasets
Stochastic Gradient Descent Convergence
Learning With Large Datasets
Stochastic Gradient Descent
Mini-Batch Gradient Descent
Advanced Topics
Week 11 - Application Example: Photo OCR