Machine Learning Introduction
Concepts
Machine learning is the process of discovering patterns in existing data to make a prediction.
In machine learning, a feature is an individual measurable characteristic.
When predicting a continuous value, the main similarity metric that's used is Euclidean distance.
K-nearest neighbors computes the Euclidean distance to find the most similar observations and averages their values to predict an unseen value.
Let $q_1, q_2, \ldots, q_n$ represent the feature values for one observation and $p_1, p_2, \ldots, p_n$ represent the feature values for the other observation; then the formula for Euclidean distance is:

$$d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}$$

In the case of one feature (the univariate case), the Euclidean distance formula simplifies to:

$$d = \sqrt{(q_1 - p_1)^2} = |q_1 - p_1|$$
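As a quick illustration of the formula, here is a minimal sketch (the feature values are made up):

```python
import math

def euclidean_distance(q, p):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((q_i - p_i) ** 2 for q_i, p_i in zip(q, p)))

# Multivariate case: two observations described by three features each
print(euclidean_distance([3, 1, 7], [5, 2, 4]))  # sqrt(4 + 1 + 9) ~ 3.742

# Univariate case: the distance reduces to the absolute difference
print(euclidean_distance([3], [5]))  # 2.0
```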
Evaluating Model Performance
A machine learning model outputs a prediction based on the input to the model.
When you're beginning to implement a machine learning model, you'll want to have some kind of validation to ensure your machine learning model can make accurate predictions on new data.
To quantify how good the predictions are for the test set, you would use an error metric. The error metric quantifies the difference between each prediction and the actual value and then averages those differences.
Averaging the raw differences is known as the mean error, but it isn't effective in most cases because positive and negative differences cancel each other out.
The MAE (mean absolute error) computes the absolute value of each error before averaging all the errors.
The MSE (mean squared error) makes the gap between the predicted and actual values clearer by squaring the difference between the two values.
The RMSE (root mean squared error) is an error metric whose units are the base units of the target column, since it takes the square root of the MSE.
In general, the MAE value is expected to be much less than the RMSE value, because the differences are squared before averaging, which penalizes larger individual errors more heavily.
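A small sketch of how the three metrics relate, using NumPy and made-up actual/predicted values:

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([200, 150, 310, 250])
predicted = np.array([180, 170, 300, 290])

errors = predicted - actual

mae = np.mean(np.abs(errors))   # mean absolute error -> 22.5
mse = np.mean(errors ** 2)      # mean squared error -> 625.0
rmse = np.sqrt(mse)             # root mean squared error -> 25.0, back in the base units

print(mae, mse, rmse)           # MAE is never larger than RMSE
```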
Multivariate K-Nearest Neighbors
To reduce the RMSE value during validation and improve accuracy, we can:
Select the relevant attributes a model uses. When selecting attributes, make sure you only work with columns that have continuous values. The process of selecting features to use in a model is known as feature selection.
Increase the value of k in our algorithm.
Normalize the columns to prevent any single column from having too much of an impact on the distance. Normalizing the values to a standard normal distribution preserves each column's distribution while aligning the scales.
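A minimal sketch of normalizing columns with pandas (the DataFrame and column names are hypothetical):

```python
import pandas as pd

# Hypothetical listings data with features on very different scales
df = pd.DataFrame({
    "accommodates": [2, 4, 1, 6],
    "price": [120.0, 250.0, 80.0, 400.0],
})

# Subtract each column's mean and divide by its standard deviation
normalized = (df - df.mean()) / df.std()
print(normalized)
```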
The distance.euclidean() function from scipy.spatial expects:
Both of the vectors to be represented using a list-like object (Python list, numpy array, or pandas series).
Both of the vectors must be 1-dimensional and have the same number of elements.
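For example (the vectors here are made up):

```python
import numpy as np
from scipy.spatial import distance

# Both vectors are 1-dimensional, list-like, and the same length
first = [3, 1, 7]
second = np.array([5, 2, 4])

print(distance.euclidean(first, second))  # ~3.742
```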
The scikit-learn library is the most popular machine learning library in Python. Scikit-learn contains functions for all of the major machine learning algorithms, each implemented as a separate class. The workflow consists of four main steps:
Instantiate the specific machine learning model that we want to use.
Fit the model to the training data.
Use the model to make predictions.
Evaluate the accuracy of the predictions.
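A minimal sketch of the four-step workflow with KNeighborsRegressor (the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical training and test data
train_df = pd.DataFrame({
    "accommodates": [2, 4, 1, 6, 3],
    "bathrooms": [1.0, 2.0, 1.0, 3.0, 1.5],
    "price": [120.0, 250.0, 80.0, 400.0, 150.0],
})
test_df = pd.DataFrame({
    "accommodates": [2, 5],
    "bathrooms": [1.0, 2.5],
    "price": [110.0, 320.0],
})
features = ["accommodates", "bathrooms"]

# 1. Instantiate the specific model we want to use
knn = KNeighborsRegressor(n_neighbors=3, algorithm="brute")

# 2. Fit the model to the training data
knn.fit(train_df[features], train_df["price"])

# 3. Use the model to make predictions
predictions = knn.predict(test_df[features])

# 4. Evaluate the accuracy of the predictions
rmse = np.sqrt(mean_squared_error(test_df["price"], predictions))
print(rmse)
```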
One main class of machine learning models is known as regression, which predicts a numerical value. The other main class is called classification, which is used when we're trying to predict a label from a fixed set of labels.
Hyperparameter Optimization
Hyperparameters are values that affect the behavior and performance of a model that are unrelated to the data. Hyperparameter optimization is the process of finding the optimal hyperparameter value.
Grid search is a simple but common hyperparameter optimization technique, which involves evaluating the model's performance at different k values and selecting the k value that results in the lowest error. Grid search involves:
Selecting a subset of the possible hyperparameter values.
Training a model using each of these hyperparameter values.
Evaluating each model's performance.
Selecting the hyperparameter value that resulted in the lowest error value.
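A minimal sketch of grid search over k for a k-nearest neighbors regressor (the data values are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical feature matrix and target values
X_train = np.array([[2, 1.0], [4, 2.0], [1, 1.0], [6, 3.0], [3, 1.5]])
y_train = np.array([120.0, 250.0, 80.0, 400.0, 150.0])
X_test = np.array([[2, 1.0], [5, 2.5]])
y_test = np.array([110.0, 320.0])

# Select a subset of the possible hyperparameter values
hyper_params = [1, 2, 3, 4, 5]
rmse_values = {}

for k in hyper_params:
    # Train a model using each hyperparameter value
    knn = KNeighborsRegressor(n_neighbors=k, algorithm="brute")
    knn.fit(X_train, y_train)
    # Evaluate each model's performance
    predictions = knn.predict(X_test)
    rmse_values[k] = np.sqrt(mean_squared_error(y_test, predictions))

# Select the hyperparameter value that resulted in the lowest error
best_k = min(rmse_values, key=rmse_values.get)
print(best_k, rmse_values[best_k])
```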
The general workflow for finding the best model is:
Selecting relevant features to use for predicting the target column.
Using grid search to find the optimal hyperparameter value for the selected features.
Evaluating the model's accuracy and repeating the process.
Cross Validation
Holdout validation is a more robust technique for testing a machine learning model's accuracy on new data the model wasn't trained on. Holdout validation involves:
Splitting the full data set into two partitions:
A training set.
A test set.
Training the model on the training set.
Using the trained model to predict labels on the test set.
Computing an error to understand the model's effectiveness.
Switching the training and test sets and repeat.
Averaging the errors.
In holdout validation, we use a 50/50 split instead of the 75/25 split from train/test validation to eliminate any sort of bias towards a specific subset of the data.
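A minimal sketch of holdout validation with a 50/50 split (the data set and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical data set
df = pd.DataFrame({
    "accommodates": [2, 4, 1, 6, 3, 5, 2, 4],
    "price": [120.0, 250.0, 80.0, 400.0, 150.0, 310.0, 100.0, 275.0],
})

# Shuffle, then split the full data set into two equal partitions
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
split_one = shuffled.iloc[:4]
split_two = shuffled.iloc[4:]

def train_and_test(train, test):
    """Train on one partition, predict on the other, and return the RMSE."""
    knn = KNeighborsRegressor(n_neighbors=3)
    knn.fit(train[["accommodates"]], train["price"])
    predictions = knn.predict(test[["accommodates"]])
    return np.sqrt(mean_squared_error(test["price"], predictions))

# Switch the training and test sets, then average the two errors
avg_rmse = np.mean([train_and_test(split_one, split_two),
                    train_and_test(split_two, split_one)])
print(avg_rmse)
```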
Holdout validation is a specific example of k-fold cross-validation with k set to two. K-fold cross-validation takes advantage of a larger proportion of the data during training while still rotating through different subsets of the data. It involves:
Splitting the full data set into k equal length partitions:
Selecting k-1 partitions as the training set.
Selecting the remaining partition as the test set.
Training the model on the training set.
Using the trained model to predict labels on the test fold.
Computing the test fold's error metric.
Repeating all of the above steps k-1 times, until each partition has been used as the test set for an iteration.
Calculating the mean of the k error values.
The parameters for the KFold class are:
n_splits: The number of folds you want to use.
shuffle: Toggle shuffling of the ordering of the observations in the data set.
random_state: Specify the random seed value if shuffle is set to True.
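A minimal sketch of instantiating KFold and inspecting the folds it produces (the feature values are made up):

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical feature matrix: 10 observations, 1 feature
X = np.arange(10).reshape(-1, 1)

# 5 folds, shuffled with a fixed seed so the splits are reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=1)

for train_index, test_index in kf.split(X):
    # Each iteration yields the row indices of the training and test folds
    print("train:", train_index, "test:", test_index)
```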
The parameters for using cross_val_score are:
estimator: A scikit-learn model that implements the fit method (e.g. an instance of KNeighborsRegressor).
X: The list or 2D array containing the features you want to train on.
y: A list containing the values you want to predict (target column).
scoring: A string describing the scoring criteria.
cv: The number of folds (or a KFold instance).
The workflow for k-fold cross-validation with scikit-learn includes:
Instantiating the scikit-learn model class you want to fit.
Instantiating the KFold class and using the parameters to specify the k-fold cross-validation attributes you want.
Using the cross_val_score() function to return the scoring metric you're interested in.
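A minimal sketch of that workflow (the features and target values are made up; scikit-learn's MSE scoring string is negative, so we flip the sign before taking the square root):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical features and target values
X = np.array([[2], [4], [1], [6], [3], [5], [2], [4], [3], [6]])
y = np.array([120.0, 250.0, 80.0, 400.0, 150.0, 310.0, 100.0, 275.0, 160.0, 390.0])

# 1. Instantiate the scikit-learn model class we want to fit
knn = KNeighborsRegressor(n_neighbors=3)

# 2. Instantiate KFold with the cross-validation attributes we want
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# 3. cross_val_score() returns one score per fold
mses = cross_val_score(knn, X, y, scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.abs(mses))
print(rmses, rmses.mean())
```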
Bias describes error that results from bad assumptions made by the learning algorithm.
Variance describes error that occurs because of the variability of a model's predicted values. In an ideal world, we want low bias and low variance when creating machine learning models.