Chapter 35 Regularization
Regularization is a technique that discourages learning overly complex models and thus helps to avoid overfitting. The idea is to shrink the coefficient estimates towards zero.
Three major types of regularization:
- L1 regularization (lasso regression)
- L2 regularization (ridge regression)
- Elastic net (combines lasso and ridge regression)
Ridge regression adds a shrinkage term to the RSS (residual sum of squares) objective function.
\[RSS + \lambda ||\overrightarrow{\beta}||_2^2 = RSS + \lambda \sum_j \beta_j^2\]
- Shrinkage term uses the squared L2 norm of the coefficient vector.
- \(\lambda\) is the regularization parameter (how much should model complexity be penalized?).
- \(\lambda = 0\) - original RSS function
- Regressors need to be standardized before applying ridge regression, because the penalty depends on the scale of each coefficient.
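The effect of \(\lambda\) is easiest to see with a single standardized regressor and no intercept. In that case, minimizing \(\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2\) gives \(\beta = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)\). The sketch below is a minimal illustration under that one-regressor assumption; the helper names `standardize` and `ridge_1d` and the toy data are made up for this example.

```python
from statistics import mean, pstdev

def standardize(values):
    """Center to mean 0 and scale to population standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def ridge_1d(x, y, lam):
    """Ridge coefficient for one centered regressor and no intercept.

    Minimizes sum((y_i - b * x_i)^2) + lam * b^2, which has the
    closed-form solution b = sum(x * y) / (sum(x^2) + lam).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Toy data (hypothetical values, for illustration only).
x = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
y_raw = [2.1, 3.9, 6.2, 7.8, 10.1]
y = [v - mean(y_raw) for v in y_raw]  # center the response

# lam = 0 reproduces the least-squares slope; larger lam shrinks it towards zero.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_1d(x, y, lam), 4))
```

The coefficient gets smaller as \(\lambda\) grows, but it does not reach exactly zero for any finite \(\lambda\).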
Lasso regression
LASSO = Least Absolute Shrinkage and Selection Operator
\[RSS + \lambda ||\overrightarrow{\beta}||_1 = RSS + \lambda \sum_j |\beta_j|\]
- Shrinkage term uses the L1 norm of the coefficient vector.
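- The penalty grows linearly with coefficient size, so small coefficients are penalized relatively more than under the squared L2 penalty.
- Coefficients are therefore more likely to become exactly zero.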
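Why the L1 penalty produces exact zeros can be seen in the one-regressor case without an intercept. Minimizing \(\sum_i (y_i - \beta x_i)^2 + \lambda |\beta|\) gives a soft-thresholded solution: \(\sum_i x_i y_i\) is moved towards zero by \(\lambda / 2\) and set to zero if it does not exceed that threshold. The following is a minimal sketch under that assumption; `soft_threshold`, `lasso_1d` and the toy data are hypothetical names and values chosen for illustration.

```python
def soft_threshold(z, t):
    """Move z towards zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_1d(x, y, lam):
    """Lasso coefficient for one centered regressor and no intercept.

    Minimizes sum((y_i - b * x_i)^2) + lam * |b|.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2.0) / sxx

# Toy data (hypothetical values, for illustration only).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.2, 0.1, 0.8, 2.2]

# Once lam / 2 exceeds |sum(x * y)|, the coefficient is exactly zero.
for lam in (0.0, 4.0, 30.0):
    print(lam, lasso_1d(x, y, lam))
```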
Elastic net
\[RSS + \lambda _1 ||\overrightarrow{\beta}||_1 + \lambda _2 ||\overrightarrow{\beta}||_2^2\]
Special cases:
- \(\lambda _1 = \lambda, \lambda _2 = 0\) - lasso regression
- \(\lambda _1 = 0, \lambda _2 = \lambda\) - ridge regression
- \(\lambda _1 = \lambda _2 = 0\) - ordinary least squares (OLS)
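The special cases can be checked numerically by evaluating the elastic net objective for a fixed coefficient vector. This is a small sketch, not a fitting routine: `rss`, `elastic_net_objective`, and the data and coefficient values are assumptions made up for illustration, and the model has no intercept.

```python
def rss(beta, X, y):
    """Residual sum of squares for a linear model without intercept."""
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(b * v for b, v in zip(beta, row))
        total += (target - prediction) ** 2
    return total

def elastic_net_objective(beta, X, y, lam1, lam2):
    """RSS + lam1 * (L1 norm of beta) + lam2 * (squared L2 norm of beta)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return rss(beta, X, y) + lam1 * l1 + lam2 * l2_squared

# Toy data and coefficients (hypothetical values, for illustration only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta = [0.5, -1.5]
lam = 2.0

ols_value = elastic_net_objective(beta, X, y, 0.0, 0.0)    # plain RSS
lasso_value = elastic_net_objective(beta, X, y, lam, 0.0)  # RSS + lam * sum |b_j|
ridge_value = elastic_net_objective(beta, X, y, 0.0, lam)  # RSS + lam * sum b_j^2
print(ols_value, lasso_value, ridge_value)
```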
Both lasso and ridge regression:
- Both methods improve generalization by penalizing model complexity.
- Their computational complexity is quite similar.
- Penalization hyperparameter \(\lambda\) must be carefully set.
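One common way to set \(\lambda\) is to try a grid of candidate values and keep the one with the lowest error on held-out data. The sketch below assumes a single hold-out split and the one-regressor ridge solution \(\beta = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)\); in practice k-fold cross-validation is often used instead. The function names, the grid, and the data are hypothetical.

```python
def fit_ridge_1d(x, y, lam):
    """Ridge coefficient for one centered regressor and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def mse(x, y, b):
    """Mean squared error of the prediction b * x."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x)

def select_lambda(x_train, y_train, x_val, y_val, grid):
    """Return the grid value with the lowest validation MSE, plus all scores."""
    scores = {}
    for lam in grid:
        b = fit_ridge_1d(x_train, y_train, lam)
        scores[lam] = mse(x_val, y_val, b)
    best = min(scores, key=scores.get)
    return best, scores

# Toy training and validation sets (hypothetical values, for illustration only).
x_train, y_train = [-1.0, 0.0, 1.0], [-1.6, 0.1, 1.5]
x_val, y_val = [-2.0, 2.0], [-2.0, 2.2]
grid = [0.0, 0.1, 1.0, 10.0]

best_lam, scores = select_lambda(x_train, y_train, x_val, y_val, grid)
print(best_lam, scores)
```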
Differences between lasso and ridge regression:
- Ridge regression shrinks large coefficients but does not perform feature selection.
- Lasso regression performs both shrinkage and selection.
- The L1 penalty sets some coefficients exactly to zero.
- This produces a sparser, more interpretable model.
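The contrast between shrinkage and selection has a simple closed form when the regressors are orthonormal (an assumption made here only to keep the example short). Ridge then divides every OLS coefficient by \(1 + \lambda\), while lasso soft-thresholds each OLS coefficient at \(\lambda / 2\). The helper names and the coefficient values below are hypothetical.

```python
def soft_threshold(z, t):
    """Move z towards zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ridge_shrink(beta_ols, lam):
    """Ridge solution under an orthonormal design: proportional shrinkage."""
    return [b / (1.0 + lam) for b in beta_ols]

def lasso_shrink(beta_ols, lam):
    """Lasso solution under an orthonormal design: soft-thresholding at lam / 2."""
    return [soft_threshold(b, lam / 2.0) for b in beta_ols]

# Hypothetical OLS coefficients, for illustration only.
beta_ols = [3.0, -0.4, 0.2, -1.5]
lam = 1.0

ridge_coefs = ridge_shrink(beta_ols, lam)
lasso_coefs = lasso_shrink(beta_ols, lam)
print("ridge:", ridge_coefs)  # every coefficient is smaller, none is zero
print("lasso:", lasso_coefs)  # coefficients with |b| <= lam / 2 are dropped
```

Under these assumptions ridge keeps all four regressors in the model, while lasso removes the two with the smallest OLS coefficients.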