Statistical Learning
Learning models
Probabilities and statistics
Supervised learning
Decision trees
Neural networks
Linear Regression
Bayesian statistics
De Morgan's laws
Introduction to Bayesian statistics
Introduction to Statistical learning
Support Vector Machines
Support vector machines split the data into two regions by means of a hyperplane.
Effective in high-dimensional spaces.
The decision function depends only on a small subset of the training points (the support vectors); this is what makes the model memory efficient.
Multiclass classification is possible thanks to the one-vs-one or one-vs-rest strategies (see the sketch below).
One-vs-one: fits a binary classifier for every possible pair of classes.
One-vs-rest: fits one classifier per class, distinguishing that class from all the others.
Complexity: in scikit-learn, SVC training scales between n_features x n_samples^2 and n_features x n_samples^3, depending on the data. So the more samples, the longer it takes to train.
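A minimal sketch of a multiclass SVC in scikit-learn; the iris dataset and the parameter values (kernel, C, decision_function_shape) are illustrative choices, not prescriptions.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Iris has 3 classes and 4 features; SVC handles the multiclass case internally.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVC trains one-vs-one binary classifiers between all pairs of classes;
# decision_function_shape='ovr' only reshapes the decision function to one-vs-rest form.
# C is the soft-margin penalty: a smaller C tolerates more misclassified points.
clf = SVC(kernel='rbf', C=1.0, decision_function_shape='ovr')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.support_vectors_.shape)  # the decision function only depends on these points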
Resources
Scikit-learn user guide: http://scikit-learn.org/stable/user_guide.html
Scikit-learn tutorials: http://scikit-learn.org/stable/tutorial/index.html
Especially unsupervised learning: http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html
Scikit-learn guide chart to classifiers: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
The problem with separating hyperplanes is that sometimes the data set cannot be perfectly split into two regions. In this case, we can approximate the hyperplane, for instance by using a soft margin and allowing some points to be misclassified. Ultimately, we want a model that is robust for most points: it is fine if some points are misclassified as long as the boundary holds well for most of them.
Nearest neighbors
The basic principle behind nearest neighbors is to look at the points closest to the data point we are interested in and use them to determine the outcome for that point. The number of neighbors, k, is a user-chosen parameter of the k-nearest-neighbors algorithm.
Parametric models assume a specific functional form for f(X); non-parametric models do not.
Non-parametric models are more flexible because they do not commit to a specific form for f(X), so they can fit a wider range of data sets; the price is that they typically need more data to estimate f well.
To predict the value at a new point, KNN takes its k nearest neighbors and averages their values.
In higher dimensions the sample becomes effectively sparse, so when the number of features is high the performance of KNN drops dramatically.
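A minimal KNN regression sketch in scikit-learn; the synthetic data and the value of n_neighbors are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data with few features, since KNN degrades in high dimensions.
X, y = make_regression(n_samples=500, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_neighbors is the user-chosen k: the prediction is the average of the
# target values of the k closest training points.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))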
Decision trees are non-parametric, but they are not stable against new data: new data can lead to a very different tree.
This is mitigated by ensemble methods, like random forests, where several trees are combined so that the effect of new data is minimal.
Decision trees are great because they are easy to read; they are a white-box model: the results can easily be explained by Boolean logic.
DTs are bad because they can easily overfit if we allow too many leaves or too great a depth.
It is possible to visualize a tree with Graphviz, using scikit-learn's export_graphviz!
Decision trees work by splitting the data into yes/no categories and making a prediction by averaging the values in the category the point falls into. We can keep splitting the data until we have reached the maximum tree depth or number of leaves.
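A minimal decision tree sketch with Graphviz export; the depth and leaf limits and the output file name are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)

# max_depth and max_leaf_nodes limit how far the data keeps being split,
# which guards against overfitting.
tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8, random_state=0)
tree.fit(X, y)

# Export the fitted tree in Graphviz .dot format; render it with the graphviz
# package or the `dot` command-line tool.
export_graphviz(tree, out_file="tree.dot", filled=True)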
Ensemble learning
Combine simple estimators into ensembles to reduce variance or bias; in short, to make better estimators.
Bagging
Create a set of estimators trained on random subsets of the training set.
The advantage is reduced variance.
It goes by different names (pasting, bagging, random subspaces, random patches) depending on whether the random subsets are drawn from the samples, the features, or both.
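A minimal bagging sketch in scikit-learn; the subset fractions and number of estimators are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The default base estimator is a decision tree; each of the 50 copies is trained on
# a random subset of the samples (max_samples) and of the features (max_features).
# Averaging their predictions reduces the variance of a single tree.
bag = BaggingClassifier(n_estimators=50, max_samples=0.8, max_features=0.8, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())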
Ruder's blog on optimizing gradient descent: http://ruder.io/optimizing-gradient-descent/
Text classification with TensorFlow
Visualizing a network: http://scs.ryerson.ca/~aharley/vis/conv/flat.html
Boosting
In boosting, each data sample is given a weight and a sequence of classifiers is trained on the weighted data. Initially all the weights are equal; at each step, the weights of the samples that were predicted correctly are decreased, while the weights of the samples that were predicted incorrectly are increased. This forces the later estimators to pay more attention to the misclassified data, which improves performance.
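A minimal boosting sketch using AdaBoost in scikit-learn, one boosting algorithm that reweights samples exactly as described above; the dataset and number of estimators are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost starts with uniform sample weights; at every round it increases the weights
# of the misclassified samples and decreases the others, so later weak learners
# concentrate on the hard cases.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())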