Machine Learning
Supervised learning & Unsupervised learning
Starting Point
Outcome measurement $Y$ (also called dependent variable, response, target)
 In the regression problem, $Y$ is quantitative.
 In the classification problem, $Y$ takes values in a finite, unordered set.
Vector of $p$ predictor measurements $X$ (also called inputs, regressors, covariates, features, independent variables)
Unsupervised Learning
Starting Point
 No outcome variable, just a set of predictors (features) measured on a set of samples.
 Objective is more fuzzy: find groups of samples that behave similarly, find features that behave similarly, or find linear combinations of features with the most variation.
 Difficult to know how well you are doing.
 Different from supervised learning, but can be useful as a preprocessing step for supervised learning or as an exploratory analysis tool
Our objectives
 Accurately predict unseen test cases.
 Understand which inputs affect the outcome, and how.
 Assess the quality of our predictions and inferences.
The goal of ML is to generalize knowledge beyond the training examples.
Philosophy

It is important to understand the ideas behind the various techniques, in order to know how and when to use them.

One has to understand the simpler methods first, in order to grasp the more sophisticated ones.

It is important to accurately assess the performance of a method, to know how well or how badly it is working [simple methods often perform as well as fancier ones!]
Supervised learning : regression problems
Find feature and response
$X$(Independent variable, feature, covariate, input): TV, Radio, Newspaper
$Y$(Dependent variable, target, response, output): Sales
We try to build a model: $$ \text{Sales} \approx f(\text{TV, Radio, Newspaper}) $$ We can refer to the input vector collectively as: $$ X=\left(\begin{array}{l} X_{1} \\ X_{2} \\ X_{3} \end{array}\right) $$ Now we can write our model as: $$ Y=f(X) + \varepsilon $$ where $\varepsilon$ captures measurement errors and other discrepancies.
What is regression function?
The ideal $f(x)=E(Y \mid X=x)$ is called the regression function.
What is our goal?
$f(x)$ is the optimal predictor of $Y$ with regard to mean-squared prediction error: it minimizes $$ E\left[(Y-g(X))^{2} \mid X=x\right] $$ over all functions $g$.
How to estimate $f$ ?
Typically we have few if any data points with $X=4$ exactly, so we cannot compute $E(Y \mid X=x)$!
What we do is to relax the definition and let:
$$ \hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x)) $$ where $\mathcal{N}(x)$ is some neighborhood of $x$.
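The neighborhood-averaging estimate can be sketched in a few lines of Python. Everything here (the simulated data, the window `radius`, and the helper name `f_hat`) is illustrative, not part of the original material:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data; the true f (a sine curve) and noise level are illustrative.
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)

def f_hat(x, radius=0.5):
    """Estimate E(Y | X = x) by averaging the y_i whose x_i fall in
    the neighborhood N(x) = {x_i : |x_i - x| <= radius}."""
    mask = np.abs(x_train - x) <= radius
    return y_train[mask].mean() if mask.any() else float("nan")

estimate = f_hat(4.0)   # local average of Y near x = 4
```

Widening the radius averages over more points (lower variance, more bias); shrinking it does the reverse, which previews the bias-variance tradeoff discussed later.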
Build linear model
$$ f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p} $$

A linear model is specified in terms of $p+1$ parameters $\beta_0,\beta_1,…,\beta_p$

We estimate the parameters by fitting the model to training data

Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$
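A minimal sketch of fitting such a linear model by least squares, using simulated data with an assumed (illustrative) true relationship $Y = 2 + 3X_1 - X_2 + \varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data from a known linear truth (coefficients
# chosen purely for illustration): Y = 2 + 3*X1 - 1*X2 + noise.
n, p = 500, 2
X = rng.normal(size=(n, p))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Least-squares fit of f_L(X) = beta0 + beta1*X1 + beta2*X2:
# prepend a column of ones so beta[0] plays the role of the intercept,
# giving p + 1 = 3 estimated parameters in total.
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```

With enough training data the estimated `beta` lands close to the generating coefficients, illustrating how the $p+1$ parameters are learned from data.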
Interpretability and Flexibility
 Why underfitting is bad?
 If a model is underfitting, it cannot even fit the training data, let alone generalize to test data or real-world cases.
 Why overfitting is bad?
 The model fits the training data well, but too well: it captures noise specific to the training set, so it fails to generalize to new cases.
 How do we know when the fit is just right?
 Parsimony vs. black box
 We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
Assessing Model Accuracy
Suppose we fit a model $\hat{f}(x)$ to some training data $\operatorname{Tr}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$, and we wish to see how it performs.
We could compute the average squared prediction error over $\operatorname{Tr}$: $$ MSE_{\operatorname{Tr}}=\operatorname{Ave}_{i \in \operatorname{Tr}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$ And then we compute it using fresh test data $\operatorname{Te}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$: $$ MSE_{\operatorname{Te}}=\operatorname{Ave}_{i \in \operatorname{Te}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$
Black curve is truth. Red curve on right is $MSE_{\operatorname{Te}}$. Grey curve is $MSE_{\operatorname{Tr}}$.
Orange, blue and green curves/squares correspond to fits of different flexibility.
Choosing the flexibility based on average test error amounts to a bias-variance tradeoff.
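The train-versus-test behavior behind these curves can be reproduced with a small simulation. The nonlinear truth $\sin(2x)$, the noise level, and the polynomial degrees below are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Simulated data from a nonlinear truth: Y = sin(2X) + noise."""
    x = rng.uniform(-2, 2, size=n)
    y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_tr, y_tr = make_data(100)   # training set Tr
x_te, y_te = make_data(100)   # fresh test set Te

def mse(coeffs, x, y):
    """Average squared prediction error of a fitted polynomial."""
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

# Fit polynomials of increasing flexibility (degree) and record
# both training and test error for each fit.
results = {}
for degree in (1, 4, 10):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    results[degree] = (mse(coeffs, x_tr, y_tr), mse(coeffs, x_te, y_te))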
Supervised learning : classification problems
Here the response variable $Y$ is qualitative, e.g. email is one of $\mathcal{C} = \{\text{spam}, \text{ham}\}$ (ham is good email); digit class is one of $\mathcal{C} = \{0, 1, \ldots, 9\}$.
Our goals
 Build a classifier $C(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$
 Assess the uncertainty in each classification
 Understand the roles of the different predictors among $X=(X_1, X_2,\ldots,X_p)$
Bayes optimal classifier
Suppose the $K$ elements in $\mathcal{C}$ are numbered $1,2,\ldots,K$. Let: $$ p_k(x)=\operatorname{Pr}(Y=k \mid X=x),\quad k=1,2,\ldots,K $$ These are the conditional/posterior class probabilities at $x$. Suppose those class probabilities are known; then the Bayes optimal classifier at $x$ is: $$ C(x)=j \text { if } p_{j}(x)=\max \left\{p_{1}(x), p_{2}(x), \ldots, p_{K}(x)\right\} $$
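When the class probabilities $p_k(x)$ are taken as known, the Bayes classifier is simply an argmax over them. The Gaussian-shaped toy posteriors below are purely hypothetical, chosen so that classes 1, 2, 3 dominate near $x=0$, $x=2$, $x=-2$ respectively:

```python
import numpy as np

# Toy posterior class probabilities for K = 3 classes; these
# Gaussian-shaped scores are illustrative, not from real data.
def class_probs(x):
    """Hypothetical Pr(Y = k | X = x) for k = 1, 2, 3 (sums to 1)."""
    scores = np.array([np.exp(-(x - 0.0) ** 2),
                       np.exp(-(x - 2.0) ** 2),
                       np.exp(-(x + 2.0) ** 2)])
    return scores / scores.sum()

def bayes_classifier(x):
    """C(x) = j where p_j(x) is the largest posterior probability."""
    return int(np.argmax(class_probs(x))) + 1   # classes numbered 1..K

bayes_classifier(1.9)   # closest to the mode of class 2 → 2
```

In practice the $p_k(x)$ are unknown and must be estimated, but no classifier can beat this rule in expected error when they are known, which is why it serves as the theoretical benchmark.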