Linear Regression For Machine Learning
Linear Regression Model
An instance-based learning algorithm, such as k-nearest neighbors, relies completely on previous instances to make predictions. K-nearest neighbors doesn't try to understand or capture the relationship between the feature columns and the target column.
Parametric machine learning, like linear regression and logistic regression, results in a mathematical function that best approximates the patterns in the training set. In machine learning, this function is often referred to as a model. Parametric machine learning approaches work by making assumptions about the relationship between the feature columns and the target column.
The following equation is the general form of the simple linear regression model.
$\hat{y} = a_1 x_1 + a_0$
Where $\hat{y}$ represents the target column, $x_1$ represents the feature column we chose to use in our model, and $a_0$ and $a_1$ represent the parameter values that are specific to the dataset.
The goal of a simple linear regression is to find the optimal parameter values that best describe the relationship between the feature column and the target column.
We minimize the model's residual sum of squares (RSS) to find the optimal parameters for a linear regression model. The equation for the model's residual sum of squares is as follows:
$RSS = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_n - \hat{y}_n)^2$
where $y_i$ are the true target values and $\hat{y}_i$ are the model's predictions.
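As an illustration, here is a minimal NumPy sketch that computes the RSS for a hand-picked pair of parameter values; the data and parameter values are made up for the example.

import numpy as np

# Hypothetical feature and target values, purely for illustration.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Candidate parameter values: a0 (intercept) and a1 (slope).
a0, a1 = 0.2, 1.95

# Predictions from the simple linear regression model: y_hat = a1*x1 + a0.
y_hat = a1 * x1 + a0

# Residual sum of squares: sum of squared differences between truth and prediction.
rss = np.sum((y - y_hat) ** 2)
print(rss)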
A multiple linear regression model allows us to capture the relationship between multiple feature columns and the target column.
In linear regression, it is a good idea to select features that are good predictors of the target column.
Feature Selection
Once we select the model we want to use, selecting the appropriate features for that model is the next important step. When selecting features, you'll want to consider:
the correlation between each feature and the target column, the correlation with other features, and the variance of the features.
Along with correlation with other features, we also need to look for potential collinearity between some of the feature columns.
Collinearity is when two features are highly correlated and carry the risk of duplicating information.
We can generate a correlation matrix heatmap using Seaborn to visually compare the correlations and look for problematic pairwise feature correlations.
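For example, a minimal sketch using pandas and Seaborn, assuming df is a DataFrame that holds the candidate feature columns and the target column:

import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame of numeric feature columns plus the target.
corr_matrix = df.corr()

# Visualize pairwise correlations; strong off-diagonal values flag potential collinearity.
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.show()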
Feature scaling helps ensure that some columns aren't weighted more than others when helping the model make predictions. We can rescale all of the columns to vary between 0 and 1. This is known as min-max scaling or rescaling. The formula for rescaling is as follows:
$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
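A minimal pandas sketch of this rescaling, again assuming df is a DataFrame of numeric feature columns:

# Rescale every numeric column to the 0-1 range using min-max scaling.
unit_df = (df - df.min()) / (df.max() - df.min())

scikit-learn's MinMaxScaler does the same job and remembers the training-set minimums and maximums for reuse on new data.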
Gradient Descent
The process of finding the optimal parameter values that form a unique linear regression model is known as model fitting. The goal of model fitting is to minimize the mean squared error (MSE).
$MSE = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$
Gradient descent is an iterative technique for minimizing the squared error. Gradient descent works by trying different parameter values until the model with the lowest mean squared error is found. Gradient descent is a commonly used optimization technique for other models as well.
Select an initial value for the parameter $a_1$.
Repeat until convergence (usually implemented with a max number of iterations):
Calculate the error (MSE) of the model that uses the current parameter value.
Calculate the derivative of the error (MSE) at the current parameter value.
Update the parameter value by subtracting the derivative multiplied by a constant ($\alpha$, called the learning rate).
Univariate case of gradient descent
The function that we optimize through minimization is known as a cost function, or loss function. In our case, the loss function is:
$MSE(a_1) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$
Applying calculus properties to simplify the derivative of the loss function
Applying the linearity of differentiation property, we can bring the constant term outside the summation
Using the power rule and the chain rule to simplify
Because we're differentiating with respect to $a_1$, we treat $y^{(i)}$ and $x_1^{(i)}$ as constants
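Putting these steps together, and assuming the single-parameter model $\hat{y}^{(i)} = a_1 x_1^{(i)}$ (no intercept term), the derivative works out to:
$\frac{d}{da_1} MSE(a_1) = \frac{d}{da_1} \frac{1}{n} \sum_{i=1}^{n} \left(a_1 x_1^{(i)} - y^{(i)}\right)^2 = \frac{2}{n} \sum_{i=1}^{n} x_1^{(i)} \left(a_1 x_1^{(i)} - y^{(i)}\right)$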
For every iteration of gradient descent
The derivative is computed using the current a1 value.
The derivative is multiplied by the learning rate. The result is subtracted from the current parameter value and assigned as the new parameter value
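A minimal NumPy sketch of these iterations for the single-parameter case; the data, learning rate, and iteration count are illustrative choices, not values from the source:

import numpy as np

def gradient_descent_a1(x1, y, alpha=0.01, max_iterations=1000):
    # Fit the single-parameter model y_hat = a1 * x1 by gradient descent.
    a1 = 0.0  # initial parameter value
    n = len(x1)
    for _ in range(max_iterations):
        y_hat = a1 * x1
        # Derivative of MSE with respect to a1 (see the derivation above).
        deriv = (2 / n) * np.sum(x1 * (y_hat - y))
        # Update rule: step against the gradient, scaled by the learning rate.
        a1 = a1 - alpha * deriv
    return a1

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])
print(gradient_descent_a1(x1, y))  # converges near the best-fit slope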
Multivariate case of gradient descent
When we have two parameter values (a0 and a1), the cost function is now a function of two variables instead of one.
We also need two update rules
Computed derivative for the multivariate case
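Assuming the two-parameter model $\hat{y}^{(i)} = a_1 x_1^{(i)} + a_0$, the derivatives and update rules take the standard form:
$\frac{\partial}{\partial a_0} MSE(a_0, a_1) = \frac{2}{n} \sum_{i=1}^{n} \left(a_1 x_1^{(i)} + a_0 - y^{(i)}\right)$
$\frac{\partial}{\partial a_1} MSE(a_0, a_1) = \frac{2}{n} \sum_{i=1}^{n} x_1^{(i)} \left(a_1 x_1^{(i)} + a_0 - y^{(i)}\right)$
$a_0 := a_0 - \alpha \frac{\partial}{\partial a_0} MSE(a_0, a_1) \qquad a_1 := a_1 - \alpha \frac{\partial}{\partial a_1} MSE(a_0, a_1)$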
Gradient descent scales to as many variables as you want. Keep in mind that each parameter value will need its own update rule, and each closely matches the update rule for $a_1$; the derivatives for the other parameters have the same form, each using its own feature column.
Choosing good initial parameter values and choosing a good learning rate are the main challenges with gradient descent.
Ordinary Least Squares (OLS) Estimation
The ordinary least squares estimation provides a clear formula for directly calculating the optimal values that minimize the cost function.
The OLS estimation formula results directly in the optimal parameter vector $a$.
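With $X$ the matrix of feature columns (including a column of ones for the intercept term) and $y$ the vector of target values, the standard closed-form expression is:
$a = (X^T X)^{-1} X^T y$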
OLS estimation provides a closed-form solution to the problem of finding the optimal parameter values. A closed-form solution is one that can be computed arithmetically in a predictable number of mathematical operations.
The error for OLS estimation is often represented using the Greek letter epsilon ($\epsilon$). Since the error is the difference between the predictions made using the model and the actual labels, it's represented as a vector:
$\epsilon = \hat{y} - y$
The biggest limitation of OLS estimation is that it's computationally expensive when the data is large. Computing the inverse of a matrix has a computational complexity of approximately O(n^3)
Because OLS is computationally expensive, it's commonly used only when the number of elements in the dataset is less than a few million.
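A minimal NumPy sketch of the closed-form computation; the small design matrix here is made up for the example and already includes a column of ones for the intercept:

import numpy as np

# X: design matrix with a leading column of ones; y: target vector (toy values).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed-form OLS solution: a = (X^T X)^{-1} X^T y.
a = np.linalg.inv(X.T @ X) @ X.T @ y
print(a)  # [intercept, slope]

In practice, np.linalg.lstsq (or scikit-learn's LinearRegression) is preferred for numerical stability, but the inverse-based form mirrors the formula above.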
Processing And Transforming Features
Feature engineering is the process of transforming existing features and creating new ones. Feature engineering is a bit of an art, and having knowledge of the specific domain can help create better features.
Categorical features are features that can take on one of a limited number of possible values.
A drawback to converting a column to the categorical data type is that one of the assumptions of linear regression is violated: linear regression operates under the assumption that the features are linearly correlated with the target column.
Instead of converting to the categorical data type, it's common to use a technique called dummy coding. In dummy coding, a dummy variable is used: a variable that takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
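A minimal pandas sketch of dummy coding; the column name and values are hypothetical:

import pandas as pd

# Hypothetical categorical column, purely for illustration.
df = pd.DataFrame({"house_style": ["1Story", "2Story", "1Story", "SLvl"]})

# pd.get_dummies creates one 0/1 indicator column per category.
dummies = pd.get_dummies(df["house_style"], prefix="house_style")
df = pd.concat([df, dummies], axis=1).drop("house_style", axis=1)
print(df)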
When values are missing in a column, there are two main approaches we can take (a short pandas sketch of both follows this list):
Removing rows that contain missing values for specific columns:
Pro: Rows containing missing values are removed, leaving only clean data for modeling.
Con: Entire observations from the training set are removed, which can reduce overall prediction accuracy.
Imputing (or replacing) missing values using a descriptive statistic from the column:
Pro: Missing values are replaced with potentially similar estimates, preserving the rest of the observation in the model.
Con: Depending on the approach, we may be adding noisy data for the model to learn from.
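A minimal pandas sketch of both approaches, assuming a DataFrame df with a numeric column named lot_frontage (the column name is hypothetical):

# Option 1: remove rows that have missing values in specific columns.
clean_df = df.dropna(subset=["lot_frontage"])

# Option 2: impute missing values with a descriptive statistic (here, the column mean).
df["lot_frontage"] = df["lot_frontage"].fillna(df["lot_frontage"].mean())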