Some takeaways:

- If $p \ge n$, the least squares fit can have very high variance, so we use shrinkage methods, which reduce that variance.

## 1. Problems with linear regression by least squares (LS)

- Prediction accuracy: linear regression has **low bias** but can suffer from **high variance**, especially when $n \approx p$. It cannot handle $n < p$ at all.
- Model interpretability: it is often the case that some or many of the variables used in a multiple regression model are **in fact not associated with the response**. Including such irrelevant variables leads to unnecessary complexity in the resulting model.

## 2. Selected alternatives to LS

**Subset selection**. This approach involves identifying **a subset of the $p$ predictors** that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

**Shrinkage**. This approach involves fitting a model involving all $p$ predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of **reducing variance**. Depending on what type of shrinkage is performed, **some of the coefficients may be estimated to be exactly zero**; hence, shrinkage methods can also perform variable selection.

- Ridge regression ($\ell_2$ norm)
- Lasso ($\ell_1$ norm), which can estimate coefficients to be exactly zero

**Dimension reduction**. This approach involves projecting the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations, or projections, of the variables. Then these $M$ projections are used as predictors to fit a linear regression model by least squares.

**Best subset selection** (a code sketch follows the list below)

- Let $\mathcal{M}_{0}$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

- For $k = 1, 2, \ldots, p$:
  (a) Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
  (b) Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_{k}$. Here best is defined as having the smallest RSS, or equivalently the largest $R^{2}$.
- Select a single best model from among $\mathcal{M}_{0}, \ldots, \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^{2}$.
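A minimal sketch of this procedure, assuming NumPy arrays `X` (shape $n \times p$) and `y`; the helper names `rss` and `best_subset` are illustrative, not from any particular library:

```python
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least squares fit of y on X (with an intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Return, for each size k = 0..p, the best model M_k (smallest RSS)."""
    n, p = X.shape
    models = {0: ((), float(np.sum((y - y.mean()) ** 2)))}  # M_0: predict the sample mean
    for k in range(1, p + 1):
        models[k] = min(
            ((subset, rss(X[:, list(subset)], y))
             for subset in itertools.combinations(range(p), k)),
            key=lambda t: t[1],
        )
    return models  # then pick among M_0..M_p with CV, Cp/AIC, BIC, or adjusted R^2
```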

**Forward stepwise selection** and **backward stepwise selection**

- Forward stepwise selection (can be used even when $n \le p$; sketched in code below):
  - Let $\mathcal{M}_{0}$ denote the null model, which contains no predictors.
  - For $k = 0, \ldots, p-1$:
    (a) Consider all $p-k$ models that augment the predictors in $\mathcal{M}_{k}$ with one additional predictor.
    (b) Choose the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having the smallest RSS or the highest $R^{2}$.
  - Select a single best model from among $\mathcal{M}_{0}, \ldots, \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^{2}$.
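A minimal sketch of forward stepwise selection, reusing the illustrative `rss` helper from the best subset sketch above:

```python
def forward_stepwise(X, y):
    """Greedily grow M_0 -> M_p by adding the predictor that most reduces RSS at each step."""
    n, p = X.shape
    selected = []
    path = {0: []}                 # M_0: no predictors
    for k in range(p):
        remaining = [j for j in range(p) if j not in selected]
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected = selected + [best_j]
        path[k + 1] = list(selected)
    return path                    # compare M_0..M_p with CV, Cp/AIC, BIC, or adjusted R^2
```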

- Backward stepwise selection (requires $n > p$, so that the full model can be fit; sketched in code below):
  - Let $\mathcal{M}_{p}$ denote the full model, which contains all $p$ predictors.
  - For $k = p, p-1, \ldots, 1$:
    (a) Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_{k}$, for a total of $k-1$ predictors.
    (b) Choose the best among these $k$ models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having the smallest RSS or the highest $R^{2}$.
  - Select a single best model from among $\mathcal{M}_{0}, \ldots, \mathcal{M}_{p}$ using cross-validated prediction error, $C_{p}$ (AIC), BIC, or adjusted $R^{2}$.
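And a matching sketch of backward stepwise selection, again reusing the illustrative `rss` helper:

```python
def backward_stepwise(X, y):
    """Greedily shrink M_p -> M_0 by dropping the predictor whose removal hurts RSS the least."""
    n, p = X.shape
    selected = list(range(p))
    path = {p: list(selected)}     # M_p: the full model
    for k in range(p, 0, -1):
        drop = min(selected, key=lambda j: rss(X[:, [i for i in selected if i != j]], y))
        selected.remove(drop)
        path[k - 1] = list(selected)
    return path                    # compare M_0..M_p with CV, Cp/AIC, BIC, or adjusted R^2
```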

- $$\begin{aligned} C_{p} &= \frac{1}{n}\left(\mathrm{RSS}+2 d \hat{\sigma}^{2}\right) \\ \mathrm{AIC} &= \frac{1}{n \hat{\sigma}^{2}}\left(\mathrm{RSS}+2 d \hat{\sigma}^{2}\right) \\ \mathrm{BIC} &= \frac{1}{n}\left(\mathrm{RSS}+\log (n)\, d \hat{\sigma}^{2}\right) \\ \text{Adjusted } R^{2} &= 1-\frac{\mathrm{RSS} /(n-d-1)}{\mathrm{TSS} /(n-1)} \end{aligned}$$
  where $d$ is the number of predictors in the model and $\hat{\sigma}^{2}$ is an estimate of the variance of the error term.
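A small illustrative helper that evaluates these four criteria for a model with $d$ predictors (`rss_value`, `tss_value`, and the error-variance estimate `sigma2_hat` are assumed to be computed beforehand, e.g. from the residuals of the full least squares fit):

```python
import numpy as np

def selection_criteria(rss_value, tss_value, n, d, sigma2_hat):
    """Cp, AIC, BIC, and adjusted R^2 for a model with d predictors."""
    cp = (rss_value + 2 * d * sigma2_hat) / n
    aic = (rss_value + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic = (rss_value + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss_value / (n - d - 1)) / (tss_value / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "adjusted_R2": adj_r2}
```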


## 3. Shrinkage in details

### Ridge regression

Ridge regression is very similar to least squares, except that the **coefficients are estimated by minimizing a slightly different quantity**. In particular, the ridge regression coefficient estimates $\beta^{R}$ are the values that minimize
$$
\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} x_{i j}\right)^{2}+\lambda \sum_{j=1}^{p} \beta_{j}^{2}=\mathrm{RSS}+\lambda \sum_{j=1}^{p} \beta_{j}^{2}
$$
where $\lambda \geq 0$ is a **tuning parameter**, to be determined separately. The above equation trades off two different criteria. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, $\lambda \sum_{j} \beta_{j}^{2}$, called a **shrinkage penalty**, is small when $\beta_{1}, \ldots, \beta_{p}$ are close to zero, and so it has the effect of shrinking the estimates of $\beta_{j}$ towards zero.
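A minimal sketch of fitting this penalized criterion with scikit-learn on synthetic, purely illustrative data (note that sklearn's `Ridge` calls the tuning parameter `alpha` rather than $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 observations, 10 predictors, only the first two truly matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)

X_std = StandardScaler().fit_transform(X)   # standardize predictors (see the note on scaling below)
ridge = Ridge(alpha=10.0).fit(X_std, y)     # alpha plays the role of lambda
print(ridge.intercept_, ridge.coef_)        # the intercept is left unpenalized
```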

Unlike least squares, which generates **only one set** of coefficient estimates, ridge regression will produce a different set of coefficient estimates, $\hat{\beta}_{\lambda}^{R}$, for each value of $\lambda$. **Selecting a good value for $\lambda$ is critical.**
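Continuing the sketch above, one common way to pick $\lambda$ is cross-validation over a grid of candidate values:

```python
from sklearn.linear_model import RidgeCV

lambdas = np.logspace(-3, 3, 50)                       # candidate lambda values
ridge_cv = RidgeCV(alphas=lambdas, cv=5).fit(X_std, y)
print(ridge_cv.alpha_)                                 # lambda chosen by 5-fold CV
```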

We want to shrink the estimated association of each variable with the response; however, **we do not want to shrink the intercept**, which is simply a measure of the mean value of the response when $x_{i 1}=x_{i 2}=\ldots=x_{i p}=0$. If we assume that the variables-that is, the columns of the data matrix $X$-have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form $\hat{\beta}_{0}=\bar{y}$.
The ridge coefficient estimates are not **scale equivariant**: multiplying a predictor by a constant can change all of the coefficient estimates substantially. Therefore, it is best to **apply ridge regression after standardizing the predictors**.

**As $\lambda$ increases, the flexibility of the ridge fit decreases, leading to lower variance but higher bias.**

In general, in situations where the relationship between the response and the predictors is **close to linear**, the least squares estimates will have **low bias but may have high variance**. **This means that a small change in the training data can cause a large change in the least squares coefficient estimates.** In particular, when the number of variables $p$ is almost as large as the number of observations $n$, the least squares estimates will be extremely variable. **And if $p>n$, then the least squares estimates do not even have a unique solution,** whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. **Hence, ridge regression works best in situations where the least squares estimates have high variance.**

### Ridge regression compared to subset selection

Ridge regression also has **substantial computational advantages** over best subset selection, which requires searching through $2^{p}$ models. As we discussed previously, even for moderate values of $p$, such a search can be computationally infeasible. In contrast, for any fixed value of $\lambda$, ridge regression fits only a single model, and the model-fitting procedure can be performed quite quickly. In fact, one can show that the computations required to solve the penalized least squares problem, simultaneously for all values of $\lambda$, are almost identical to those for fitting a model by least squares.