Chapter 13 - Linear Regression
1. Fitting a line
Assumes that the relationship between the features and the target vector is approximately linear
That is, the effect (also called a coefficient, weight, or parameter) of each feature on the target vector is constant.
yhat = Bhat0 + Bhat1*x1 + Bhat2*x2 + epsilon
A major advantage is its interpretability, because the model's coefficients are the effect of a one-unit change in each feature on the target vector.
For example, a coefficient of -349 means that a one-unit increase in the respective feature (e.g., per-capita crime rate) decreases the target (house price) by 349.
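A minimal sketch of fitting a line and reading the coefficients (the synthetic data from make_regression is an illustrative assumption; the chapter's own example uses a housing dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing example (illustrative assumption)
features, target = make_regression(n_samples=100, n_features=2,
                                   noise=10, random_state=0)

model = LinearRegression()
model.fit(features, target)

print(model.intercept_)  # Bhat0
print(model.coef_)       # Bhat1, Bhat2: effect of a one-unit change in each feature
```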
2. Handling Interactive Effects
You have a feature whose effect on the target variable depends on another feature.
Create an interaction term, namely features_interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True).fit_transform(features), using scikit-learn.
yhat = Bhat0 + Bhat1*x1 + Bhat2*x2 + Bhat3*x1*x2 + epsilon
Example: features (have_sugar = 0/1 and is_stirred = 0/1), target is_sweet = 0/1.
If have_sugar=0 and is_stirred=1, the coffee will not be sweet,
but if have_sugar=1 and is_stirred=1, the coffee will be sweet; hence the feature interaction!
Three important parameters:
interaction_only=True, only return interaction terms, without polynomial features (13.3).
include_bias=False, do not return the bias column of 1s.
degree=3, check for interactions of features up to triples of features (x1*x2*x3).
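The interaction-term recipe above can be sketched as follows (the toy feature matrix is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy feature matrix (illustrative assumption)
features = np.array([[2.0, 3.0],
                     [4.0, 5.0]])

interaction = PolynomialFeatures(degree=3, interaction_only=True,
                                 include_bias=False)
features_interaction = interaction.fit_transform(features)

# With two features the columns are x1, x2, and the interaction x1*x2
print(features_interaction)  # first row: [2., 3., 6.]
```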
3. Fitting a Nonlinear Relationship
I want to model a nonlinear relationship.
PolynomialFeatures(degree=3, include_bias=False) returns x^2 and x^3 (alongside x).
Linear example: the number of stories a building has and the building's height; a 20-story building is roughly twice as tall as a 10-story building, and a 10-story building twice as tall as a 5-story building.
Nonlinear example: hours studied by a student and the final score; studying 99 hours versus 100 hours makes little difference on the test, while going from 0 to 10 hours has a far bigger impact than going from 90 to 100 hours.
yhat = Bhat0 + Bhat1*x1 + Bhat2*x1^2 + ... + Bhatd*x1^d + epsilon
d is the degree of the polynomial. Can we fit nonlinear features to a linear model? Yes, because we do not change how the model fits the features; x^2 is just another feature, and the model does not know it is the square of x.
Hence it is still a linear model; our line simply becomes more "flexible" when it fits the nonlinear features x1, x1^2, x1^3, etc.
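A sketch of fitting a nonlinear relationship this way (the quadratic toy data is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a quadratic relationship y = x^2 (illustrative assumption)
x = np.arange(10, dtype=float).reshape(-1, 1)
y = (x ** 2).ravel()

# Add x^2 and x^3 as extra columns; the model still fits linearly in them
polynomial = PolynomialFeatures(degree=3, include_bias=False)
x_poly = polynomial.fit_transform(x)  # columns: x, x^2, x^3

model = LinearRegression().fit(x_poly, y)
prediction = model.predict(polynomial.transform([[11.0]]))
print(prediction)  # close to 121 = 11^2
```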
4. Reducing Variance with Regularization
You want to reduce the variance of your linear regression model.
Use a learning algorithm that includes a shrinkage penalty (also called Regularization) like
Ridge Regression and Lasso Regression
Ridge regression: regression = Ridge(alpha=0.5)
In standard linear regression, the model trains to minimize the sum of squared errors (RSS) between the true values (yi) and the predictions (yhat,i); regularized learners are similar, except they attempt to minimize the RSS plus some penalty.
Two common types of regularized learners for linear regression:
Ridge regression: RSS + alpha * SUM(Bhat,j^2) for j=1 --> p.
Bhat,j is the coefficient of the jth of p features, and alpha is a tunable hyperparameter (discussed next).
Lasso regression: (1/(2n))*RSS + alpha * SUM(abs(Bhat,j)) for j=1 --> p.
n is the number of observations; Bhat,j and alpha have the same definitions as above.
Which one should we use?
Rule of thumb:
Ridge regression often makes better predictions than lasso.
Lasso produces more interpretable models (see 13.5 for the reason).
We can also balance between ridge's and lasso's penalty functions using elastic net, which is simply a regression model with both penalties included.
Regardless of which one we choose, both can penalize complex models by including coefficient values in the loss function we are trying to minimize.
alpha tunes the penalty strength: the higher alpha, the simpler the model (higher bias), because larger coefficients are penalized more.
Use cross-validation (e.g., scikit-learn's RidgeCV) to get the best alpha.
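A sketch of picking alpha with scikit-learn's RidgeCV (the synthetic data and candidate alphas are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

features, target = make_regression(n_samples=100, n_features=3,
                                   noise=5, random_state=0)

# The penalty acts on coefficient size, so features should share a scale
features_std = StandardScaler().fit_transform(features)

# RidgeCV tries each candidate alpha via cross-validation
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])
model = regr_cv.fit(features_std, target)
print(model.alpha_)  # the alpha that scored best
```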
5. Reducing Features with Lasso Regression
You want to simplify your linear regression model by reducing the number of features.
One interesting characteristic of Lasso regression's penalty is it can shrink the coefficients of a model to zero, which reduces the number of features in the model.
Here, after fitting the model, we see that some coefficients are 0; if alpha is large (e.g., 10), all coefficients become 0.
Fewer features means easier interpretability
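Lasso's zeroing behavior can be sketched on toy data (the data-generating weights and the alpha values here are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only the first two of five features drive the target (assumption)
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 5))
target = 3 * features[:, 0] + 5 * features[:, 1]

# Moderate alpha: the three irrelevant coefficients shrink to exactly zero
lasso = Lasso(alpha=1.0).fit(features, target)
print(lasso.coef_)

# Very large alpha: every coefficient is zeroed out
lasso_big = Lasso(alpha=10.0).fit(features, target)
print(np.all(lasso_big.coef_ == 0))  # True
```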