Statistical Learning
Learning models
Probabilities and statistics
Supervised learning
Decision trees
Neural networks
Linear Regression
Bayesian statistics
De Morgan's laws
Introduction to Bayesian statistics
Introduction to Statistical learning
Support Vector Machines
Support vector machines split the data into two regions by means of a hyperplane.
Effective in high-dimensional spaces.
The decision function depends only on a small subset of the training points (the support vectors); this is what makes the model memory efficient.
Multiclass classification is possible thanks to the one-vs-one or one-vs-rest strategies (see the sketch below).
One-vs-one: fits a binary classifier for every possible pair of classes.
One-vs-rest: fits one classifier per class, distinguishing that class from all the others.
Complexity: in scikit-learn, SVC training scales between n_features x n_samples^2 and n_features x n_samples^3, depending on the data. So the more samples, the longer it takes to train.
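A minimal sketch of a multiclass SVC in scikit-learn; the iris dataset and the parameter values (kernel, C, decision_function_shape) are illustrative choices, not prescriptions.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Iris has 3 classes and 4 features; SVC handles the multiclass case internally.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVC trains one-vs-one binary classifiers between all pairs of classes;
# decision_function_shape='ovr' only reshapes the decision function to one-vs-rest form.
# C is the soft-margin penalty: a smaller C tolerates more misclassified points.
clf = SVC(kernel='rbf', C=1.0, decision_function_shape='ovr')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
print(clf.support_vectors_.shape)  # the decision function only depends on these points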
Resources
Scikit-learn user guide: http://scikit-learn.org/stable/user_guide.html
Scikit-learn tutorials: http://scikit-learn.org/stable/tutorial/index.html
Especially unsupervised learning: http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html
Scikit-learn guide chart to classifiers: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
The problem with separating hyperplanes is that sometimes the data set cannot be perfectly split into two regions. In this case, we can approximate the hyperplane, for instance by using a soft margin and allowing some points to be misclassified. Ultimately, we want a model that is robust for most points: it is fine if some points are misclassified as long as the boundary holds well for most of them.
Nearest neighbors
The basic principle behind nearest neighbors is to look at the points closest to the data point we are interested in and use them to determine the outcome for that point. The number of neighbors, k, is a user-chosen parameter of the k-nearest-neighbors algorithm.
Parametric models assume a specific functional form for f(X); non-parametric models do not.
Non-parametric models are more flexible because they do not commit to a specific form for f(X), so they can fit a wider range of data sets; the price is that they typically need more data to estimate f well.
To predict the value at a new point, KNN takes its k nearest neighbors and averages their values.
In higher dimensions the sample becomes effectively sparse, so when the number of features is high the performance of KNN drops dramatically.
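A minimal KNN regression sketch in scikit-learn; the synthetic data and the value of n_neighbors are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data with few features, since KNN degrades in high dimensions.
X, y = make_regression(n_samples=500, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_neighbors is the user-chosen k: the prediction is the average of the
# target values of the k closest training points.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))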
Decision trees are non-parametric, but they are not stable against new data: new data can lead to a very different tree.
This is mitigated by ensemble methods, like random forests, where several trees are combined so that the effect of new data is minimal.
Decision trees are great because they are easy to read; they are a white-box model: the results can easily be explained by Boolean logic.
DTs are bad because they can easily overfit if we allow too many leaves or too great a depth.
It is possible to visualize a tree with Graphviz, using scikit-learn's export_graphviz!
Decision trees work by splitting the data into yes/no categories and making a prediction by averaging the values in the category the point falls into. We can keep splitting the data until we have reached the maximum tree depth or number of leaves.
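A minimal decision tree sketch with Graphviz export; the depth and leaf limits and the output file name are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)

# max_depth and max_leaf_nodes limit how far the data keeps being split,
# which guards against overfitting.
tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8, random_state=0)
tree.fit(X, y)

# Export the fitted tree in Graphviz .dot format; render it with the graphviz
# package or the `dot` command-line tool.
export_graphviz(tree, out_file="tree.dot", filled=True)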
Ensemble learning
Combine simple estimators into ensembles to reduce variance or bias; in short, to make better estimators.
Bagging
Create a set of estimators trained on random subsets of the training set.
The advantage is reduced variance.
It goes by different names (pasting, bagging, random subspaces, random patches) depending on whether the random subsets are drawn from the samples, the features, or both.
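A minimal bagging sketch in scikit-learn; the subset fractions and number of estimators are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The default base estimator is a decision tree; each of the 50 copies is trained on
# a random subset of the samples (max_samples) and of the features (max_features).
# Averaging their predictions reduces the variance of a single tree.
bag = BaggingClassifier(n_estimators=50, max_samples=0.8, max_features=0.8, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())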
Ruder's blog on optimizing gradient descent: http://ruder.io/optimizing-gradient-descent/
Text classification with TensorFlow
Visualizing a network: http://scs.ryerson.ca/~aharley/vis/conv/flat.html
Boosting
In boosting, each data sample is given a weight and a sequence of classifiers is trained on the weighted data. Initially all the weights are equal; at each step, the weights of the samples that were predicted correctly are decreased, while the weights of the samples that were predicted incorrectly are increased. This forces the later estimators to pay more attention to the misclassified data, which improves performance.
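A minimal boosting sketch using AdaBoost in scikit-learn, one boosting algorithm that reweights samples exactly as described above; the dataset and number of estimators are illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# AdaBoost starts with uniform sample weights; at every round it increases the weights
# of the misclassified samples and decreases the others, so later weak learners
# concentrate on the hard cases.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())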