Chapter 35 Regularization

Regularization is a technique that discourages learning overly complex models, thus helping to avoid overfitting. The idea is to shrink the coefficient estimates towards zero.

Three major types of regularization:

  • L1 regularization (lasso regression)
  • L2 regularization (ridge regression)
  • Elastic net (combines lasso and ridge regression)

Ridge regression adds a shrinkage term to the RSS (residual sum of squares) objective function.

\[RSS + \lambda ||\overrightarrow{\beta}||_2^2 = RSS + \lambda \sum_j \beta_j^2\]

  • Shrinkage term uses the L2 norm of the coefficient vector.
  • \(\lambda\) is the regularization parameter (how much should model complexity be penalized?); \(\lambda = 0\) recovers the original RSS objective
  • Need to standardize regressors before applying ridge regression
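As a minimal sketch, ridge regression with prior standardization can be fit using scikit-learn (assumed available); its `alpha` parameter plays the role of \(\lambda\), and the data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

# Standardize the regressors first, then fit ridge regression
# (scikit-learn calls the lambda penalty "alpha")
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```

Increasing `alpha` shrinks the coefficient vector further towards zero, which is easy to verify by comparing the L2 norms of the fitted coefficients at different penalty strengths.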

Lasso regression (LASSO = Least Absolute Shrinkage and Selection Operator)

\[RSS + \lambda ||\overrightarrow{\beta}||_1 = RSS + \lambda \sum_j |\beta_j|\]

  • Shrinkage term uses the L1 norm of the coefficient vector.
  • Compared with the L2 penalty, small coefficients are penalized relatively more severely.
  • More coefficients are therefore driven exactly to zero.
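The sparsity effect can be illustrated with scikit-learn's `Lasso` on synthetic data (again, `alpha` stands in for \(\lambda\)); the names and values below are illustrative, not from the text:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 8)))
# Only the first two regressors actually influence y
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# alpha plays the role of lambda; a larger value zeros out more coefficients
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)
```

Inspecting `lasso.coef_` shows the irrelevant regressors receiving coefficients of exactly zero, while the two informative ones are kept but shrunk.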

Elastic net

\[RSS + \lambda_1 ||\overrightarrow{\beta}||_1 + \lambda_2 ||\overrightarrow{\beta}||_2^2\]

Special cases:
- \(\lambda_1 = \lambda, \lambda_2 = 0\) - lasso regression
- \(\lambda_1 = 0, \lambda_2 = \lambda\) - ridge regression
- \(\lambda_1 = \lambda_2 = 0\) - ordinary least squares (OLS)
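A sketch of the elastic net with scikit-learn: note that its `ElasticNet` is parameterized by an overall strength `alpha` and a mixing weight `l1_ratio` rather than separate \(\lambda_1, \lambda_2\) (`l1_ratio=1` recovers lasso, `l1_ratio=0` ridge); the data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# True model uses only the first two regressors
y = X @ np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

# alpha sets the overall penalty strength; l1_ratio mixes the L1 and L2 terms
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```

The L1 component still pushes the irrelevant coefficients towards zero, while the L2 component stabilizes the fit when regressors are correlated.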

Both lasso and ridge regression:

  • Both methods improve generalization by penalizing model complexity.
  • Their computational complexity is quite similar.
  • Penalization hyperparameter \(\lambda\) must be carefully set.

Differences between lasso and ridge regressions:

  • Ridge regression shrinks large coefficients but does not perform feature selection.
  • Lasso regression performs both shrinkage and selection.
  • The L1 penalty drives some coefficients exactly to zero.
  • Lasso therefore yields a sparser, more interpretable model.