Machine Learning#

Supervised learning & Unsupervised learning#

Starting Point#

Outcome measurement $Y$ (also called dependent variable, response, target)

• In the regression problem, $Y$ is quantitative.
• In the classification problem, $Y$ takes values in a finite, unordered set.

Vector of $p$ predictor measurements $X$ (also called inputs, regressors, covariates, features, independent variables)

Unsupervised Learning#

Starting Point#

• No outcome variable, just a set of predictors (features) measured on a set of samples.
• The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, or find linear combinations of features with the most variation.
• It is difficult to know how well you are doing.
• Different from supervised learning, but can be useful as a pre-processing step for supervised learning or as an exploratory analysis tool.

Our objectives#

• Accurately predict unseen test cases.
• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.

The goal of ML is to generalize knowledge beyond the training examples.

Philosophy#

• It is important to understand the ideas behind the various techniques, in order to know how and when to use them.

• One has to understand the simpler methods first, in order to grasp the more sophisticated ones.

• It is important to accurately assess the performance of a method, to know how well or how badly it is working (simple methods often perform as well as fancier ones!).

Supervised learning : regression problems#

Find feature and response#

$X$ (independent variable, feature, covariate, input): TV, Radio, Newspaper

$Y$ (dependent variable, target, response, output): Sales

We try to build a model:

$$\text{Sales} \approx f(\text{TV}, \text{Radio}, \text{Newspaper})$$

We can refer to the input vector collectively as:

$$X=\left(\begin{array}{l} X_{1} \\ X_{2} \\ X_{3} \end{array}\right)$$

Now we can write our model as:

$$Y=f(X)+\varepsilon$$

where $\varepsilon$ captures measurement errors and other discrepancies.

What is the regression function?#

The ideal $f(x)=E(Y|X=x)$ is called the regression function.

What is our goal?#

$f(x)=E(Y|X=x)$ is the optimal predictor of $Y$ with regard to mean-squared prediction error: among all functions $g$, it minimizes

$$E\left[(Y-g(X))^{2} \mid X=x\right]$$

at every point $X=x$.
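
A short worked derivation may help show why the conditional mean is the minimizer. It is a sketch that assumes $Y$ has finite variance given $X=x$ and writes $f(x)=E(Y \mid X=x)$; expanding the square around $f(x)$ gives:

$$E\left[(Y-g(X))^{2} \mid X=x\right]=E\left[(Y-f(x))^{2} \mid X=x\right]+2\,(f(x)-g(x))\,E\left[Y-f(x) \mid X=x\right]+(f(x)-g(x))^{2}$$

The middle term vanishes because $E\left[Y-f(x) \mid X=x\right]=0$, which leaves:

$$E\left[(Y-g(X))^{2} \mid X=x\right]=\operatorname{Var}(Y \mid X=x)+(f(x)-g(x))^{2}$$

The first term does not depend on $g$, and the second is smallest (zero) when $g(x)=f(x)$.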

How to estimate $f$ ?#

Typically we have few, if any, data points with $X=x$ exactly (say, $X=4$), so we cannot compute $E(Y|X=x)$ directly!

What we do is to relax the definition and let:

$$\hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x))$$ where $\mathcal{N}(x)$ is some neighborhood of $x$.
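
The neighborhood average can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a single predictor, takes $\mathcal{N}(x)$ to be a fixed-width window around $x$, and uses synthetic data; the function name `neighborhood_average` and the `width` parameter are hypothetical.

```python
import numpy as np

def neighborhood_average(x0, x, y, width=0.5):
    """Estimate f(x0) as the mean of y over training points with |x - x0| <= width."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    in_neighborhood = np.abs(x - x0) <= width
    if not in_neighborhood.any():
        return float("nan")  # no training points fall inside N(x0)
    return float(y[in_neighborhood].mean())

# Synthetic data (an assumption for illustration): true f(x) = 2x, noise sd 0.1.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=500)
y_train = 2.0 * x_train + rng.normal(0, 0.1, size=500)

# Estimate the regression function at x = 4 by averaging nearby responses.
f_hat_at_4 = neighborhood_average(4.0, x_train, y_train, width=0.5)
print(f_hat_at_4)
```

A wider window averages over more points (less noise) but blurs local structure in $f$; a narrower window does the opposite and may contain no points at all.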

Build linear model#

$$f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}$$

• A linear model is specified in terms of $p+1$ parameters $\beta_0,\beta_1,\ldots,\beta_p$.

• We estimate the parameters by fitting the model to training data.

• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$.
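
As a rough illustration of fitting the $p+1$ parameters, the sketch below estimates them by ordinary least squares with NumPy. Least squares is one common choice and is assumed here, not stated in the notes; the helper names, the synthetic data, and the coefficient values are all made up for the example.

```python
import numpy as np

def fit_linear_model(X, y):
    """Return least-squares estimates (beta_0, beta_1, ..., beta_p)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict_linear_model(beta, X):
    """Evaluate beta_0 + beta_1 X_1 + ... + beta_p X_p for each row of X."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]

# Synthetic stand-ins for three predictors (hypothetical values, not real data).
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 100, size=(200, 3))
true_beta = np.array([3.0, 0.05, 0.2, 0.0])  # made-up coefficients
y_train = true_beta[0] + X_train @ true_beta[1:] + rng.normal(0, 0.5, size=200)

beta_hat = fit_linear_model(X_train, y_train)
print(beta_hat)
```

Each estimated coefficient can be read as the change in the predicted response per unit change in one predictor with the others held fixed, which is where the interpretability of the linear model comes from.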

Interpretability and Flexibility#

• Why is under-fitting bad? An under-fit model cannot even fit the training data, let alone the test data or real-world cases.
• Why is over-fitting bad? The model fits the training data well, but too well: it also fits the noise, so it does not carry over to other cases.
• How do we know when the fit is just right?
• Parsimony vs. black box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Assessing Model Accuracy#

Suppose we fit a model $\hat{f}(x)$ to some training data $\operatorname{Tr}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$, and we wish to see how it performs.

We could compute the average squared prediction error over $\operatorname{Tr}$:

$$\mathrm{MSE}_{\mathrm{Tr}}=\operatorname{Ave}_{i \in \operatorname{Tr}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$$

Because the model was fit to these same points, this measure tends to favor more over-fit models. So we also compute it using fresh test data $\operatorname{Te}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$:

$$\mathrm{MSE}_{\mathrm{Te}}=\operatorname{Ave}_{i \in \operatorname{Te}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$$

Figure: the black curve is the truth; the red curve on the right is $\mathrm{MSE}_{\mathrm{Te}}$; the grey curve is $\mathrm{MSE}_{\mathrm{Tr}}$. The orange, blue, and green curves/squares correspond to fits of different flexibility.

Choosing the flexibility based on average test error amounts to a bias-variance trade-off.
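
The comparison of training and test error across flexibility levels can be sketched as follows. This is an illustrative setup, not the one behind the figure: it assumes a made-up true function $\sin(2\pi x)$, Gaussian noise, and polynomial degree as the measure of flexibility, and fits with NumPy's `Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

def mse(y, y_hat):
    """Average squared prediction error."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def true_f(x):
    # Made-up "truth" for the simulation (an assumption, not from the notes).
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(2)
x_tr = rng.uniform(0, 1, size=30)
y_tr = true_f(x_tr) + rng.normal(0, 0.3, size=30)
x_te = rng.uniform(0, 1, size=1000)
y_te = true_f(x_te) + rng.normal(0, 0.3, size=1000)

results = {}
for degree in (1, 3, 15):  # low, moderate, and high flexibility
    f_hat = Polynomial.fit(x_tr, y_tr, deg=degree)  # least-squares polynomial fit
    results[degree] = (mse(y_tr, f_hat(x_tr)), mse(y_te, f_hat(x_te)))

for degree, (mse_tr, mse_te) in results.items():
    print(f"degree={degree:2d}  MSE_Tr={mse_tr:.3f}  MSE_Te={mse_te:.3f}")
```

Training error can only go down as the degree increases, since each polynomial family contains the previous one. Test error typically falls at first and then rises once the fit starts chasing noise, which is why flexibility should be chosen on test error rather than training error.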

Supervised learning : classification problems#

Here the response variable $Y$ is qualitative, e.g. email is one of $\mathcal{C} = \{\text{spam}, \text{ham}\}$ (ham is good email); digit class is one of $\mathcal{C} = \{0, 1, \ldots, 9\}$.

Our goals#

• Build a classifier $C(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among $X=(X_1, X_2,\ldots,X_p)$.

Bayes optimal classifier#

Suppose the $K$ elements in $\mathcal{C}$ are numbered $1,2,\ldots,K$. Let:

$$p_k(x)=\Pr(Y=k \mid X=x), \quad k=1,2,\ldots,K$$

These are the conditional (posterior) class probabilities at $x$. If those class probabilities are known, the Bayes optimal classifier at $x$ is:

$$C(x)=j \text { if } p_{j}(x)=\max \left\{p_{1}(x), p_{2}(x), \ldots, p_{K}(x)\right\}$$
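
When the class probabilities are given, the rule reduces to picking the most probable class. The sketch below assumes the probabilities are already known and supplied as an array; the function name `bayes_classifier` and the numbers are hypothetical, and in practice $p_k(x)$ has to be estimated.

```python
import numpy as np

def bayes_classifier(class_probs):
    """Return the 1-based label j with the largest p_j(x) for each row.

    class_probs has one row per point x and one column per class 1..K.
    Ties are broken in favor of the lowest-numbered class (np.argmax behavior).
    """
    class_probs = np.asarray(class_probs, dtype=float)
    return np.argmax(class_probs, axis=-1) + 1

# Made-up posterior probabilities p_1(x), p_2(x), p_3(x) at three points x.
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
])
labels = bayes_classifier(probs)
print(labels)  # → [1 3 2]
```

The value $1-\max_k p_k(x)$ is the probability of misclassifying at $x$ under this rule, which gives one way to express the uncertainty of each classification.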