Machine Learning

Supervised learning & Unsupervised learning

Starting Point

Outcome measurement $Y$ (also called dependent variable, response, target)

  • In the regression problem, $Y$ is quantitative.
  • In the classification problem, $Y$ takes values in a finite, unordered set.

Vector of $p$ predictor measurements $X$ (also called inputs, regressors, covariates, features, independent variables)

Unsupervised Learning

Starting Point

  • No outcome variable, just a set of predictors (features) measured on a set of samples.
  • The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
  • Difficult to know how well you are doing.
  • Different from supervised learning, but can be useful as a pre-processing step for supervised learning or as an exploratory analysis tool.

Our objectives

  • Accurately predict unseen test cases.
  • Understand which inputs affect the outcome, and how.
  • Assess the quality of our predictions and inferences.

The goal of ML is to generalize knowledge beyond the training examples.


  • It is important to understand the ideas behind the various techniques, in order to know how and when to use them.

  • One has to understand the simpler methods first, in order to grasp the more sophisticated ones.

  • It is important to accurately assess the performance of a method, to know how well or how badly it is working [simple methods often perform as well as fancier ones!]

Supervised learning : regression problems

Identify the features and the response

$X$ (independent variable, feature, covariate, input): TV, Radio, Newspaper

$Y$ (dependent variable, target, response, output): Sales

We try to build a model:

$$ \text{Sales} \approx f(\text{TV, Radio, Newspaper}) $$

We can refer to the input vector collectively as:

$$ X=\left(\begin{array}{l} X_{1} \\ X_{2} \\ X_{3} \end{array}\right) $$

Now we can write our model as:

$$ Y=f(X) + \varepsilon $$

where $\varepsilon$ captures measurement errors and other discrepancies.

What is the regression function?

The ideal $f(x)=E(Y \mid X=x)$ is called the regression function.

What is our goal?

$f(x)=E(Y \mid X=x)$ is the optimal predictor of $Y$ with regard to mean-squared prediction error: over all functions $g$, it minimizes $$ E\left[(Y-g(X))^{2} \mid X=x\right] $$
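One way to see why the conditional mean is optimal is to expand the squared error around $f(x)$. The derivation below is a sketch: it assumes $Y$ has finite variance, and $g$ stands for an arbitrary candidate predictor.

$$ \begin{aligned} E\left[(Y-g(X))^{2} \mid X=x\right] &= E\left[(Y-f(x))^{2} \mid X=x\right] + 2\,(f(x)-g(x))\,E\left[Y-f(x) \mid X=x\right] + (f(x)-g(x))^{2} \\ &= \operatorname{Var}(Y \mid X=x) + (f(x)-g(x))^{2} \end{aligned} $$

The cross term vanishes because $E(Y \mid X=x)=f(x)$. The first term does not depend on $g$, so the error is smallest when $g(x)=f(x)$.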

How to estimate $f$ ?

Typically we have few, if any, data points with $X=x$ exactly (for example, at $X=4$), so we cannot compute $E(Y \mid X=x)$ directly!

Instead, we relax the definition and let:

$$ \hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x)) $$ where $\mathcal{N}(x)$ is some neighborhood of $x$.
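The neighborhood average can be sketched in a few lines of Python. This is a minimal illustration for a single predictor, assuming $\mathcal{N}(x)$ is taken to be the $k$ training points closest to $x$. The function name `neighborhood_average`, the parameter `k`, and the toy data are assumptions made for this example, not part of the notes.

```python
def neighborhood_average(x0, xs, ys, k=3):
    """Estimate f(x0) as the mean response of the k training points nearest to x0."""
    if not 1 <= k <= len(xs):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort the training pairs by distance to x0 and keep the k closest ones.
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


# Toy training data (made up for illustration): no point has x == 4 exactly.
xs = [1.0, 2.0, 3.5, 4.5, 6.0, 7.0]
ys = [1.0, 2.0, 3.0, 5.0, 6.5, 7.0]

# Average the responses of the two points closest to x = 4 (x = 3.5 and x = 4.5).
estimate = neighborhood_average(4.0, xs, ys, k=2)
print(estimate)
```

A larger `k` averages over a wider neighborhood, which gives a smoother but less local estimate.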

Build linear model

$$ f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p} $$

  • A linear model is specified in terms of $p+1$ parameters $\beta_0,\beta_1,\ldots,\beta_p$.

  • We estimate the parameters by fitting the model to training data

  • Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$
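As a rough sketch of what "fitting the model to training data" can look like, the example below estimates the $p+1$ parameters by ordinary least squares with NumPy. Least squares is one common choice and is an assumption here, since the notes do not name a fitting method. The data are synthetic, and `true_beta`, `X_design`, and `predict` are hypothetical names used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the advertising data: three columns playing the
# role of TV, Radio and Newspaper budgets.
n, p = 200, 3
X = rng.uniform(0, 100, size=(n, p))
true_beta = np.array([5.0, 0.05, 0.2, 0.0])  # hypothetical beta_0 .. beta_3
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 1.0, size=n)

# Prepend a column of ones so that beta_0 is estimated as the intercept.
X_design = np.column_stack([np.ones(n), X])

# Least-squares estimate of the p + 1 parameters.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)


def predict(X_new, beta):
    """Evaluate f_L(X) = beta_0 + beta_1 X_1 + ... + beta_p X_p for each row."""
    return beta[0] + X_new @ beta[1:]


print(beta_hat)
```

Because the synthetic data really are linear, the estimates should land close to `true_beta`. With real data the linear model is only an approximation to $f(X)$.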

Interpretability and Flexibility

  • Why is under-fitting bad?
    • An under-fitted model cannot even fit the training data, let alone test data or real-world cases.
  • Why is over-fitting bad?
    • The model fits the training data well, but too well: it also fits the noise in the training set, so it does not generalize to other cases.
  • How do we know when the fit is just right?
  • Parsimony vs. black box
    • We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Assessing Model Accuracy

Suppose we fit a model $\hat{f}(x)$ to some training data $\operatorname{Tr}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$, and we wish to see how it performs.

We could compute the average squared prediction error over $\operatorname{Tr}$:

$$ MSE_{Tr}=\operatorname{Ave}_{i \in \operatorname{Tr}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$

And then we compute it using fresh test data $\operatorname{Te}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$:

$$ MSE_{Te}=\operatorname{Ave}_{i \in \operatorname{Te}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$
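The two error measures can be compared on simulated data. The sketch below is an illustration only: the "true" function, the noise level, the sample sizes, and the use of polynomial degree as the flexibility knob are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)


def true_f(x):
    # Hypothetical "truth", chosen only for illustration.
    return np.sin(2 * x)


def make_data(n):
    """Draw n points from Y = f(X) + eps with Gaussian noise."""
    x = rng.uniform(0, 3, size=n)
    y = true_f(x) + rng.normal(0, 0.3, size=n)
    return x, y


def mse(y, y_hat):
    """Average squared prediction error."""
    return float(np.mean((y - y_hat) ** 2))


x_tr, y_tr = make_data(30)    # training set Tr
x_te, y_te = make_data(1000)  # fresh test set Te

results = {}
for degree in (1, 3, 8):
    coefs = np.polyfit(x_tr, y_tr, deg=degree)  # fit on Tr only
    results[degree] = (
        mse(y_tr, np.polyval(coefs, x_tr)),     # MSE_Tr
        mse(y_te, np.polyval(coefs, x_te)),     # MSE_Te
    )

for degree, (mse_tr, mse_te) in results.items():
    print(f"degree={degree}  MSE_Tr={mse_tr:.3f}  MSE_Te={mse_te:.3f}")
```

Since a higher-degree polynomial can reproduce any lower-degree fit, $MSE_{Tr}$ cannot increase as the degree grows. $MSE_{Te}$ carries no such guarantee, which is why it is the more honest measure of performance.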

Figure (not reproduced here): the black curve is the truth. The red curve on the right is $MSE_{Te}$ and the grey curve is $MSE_{Tr}$. The orange, blue and green curves/squares correspond to fits of different flexibility.

Choosing the flexibility based on average test error amounts to a bias-variance trade-off.
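The trade-off can be made explicit with the standard decomposition of the expected test error at a point $x_0$. It is stated here as a sketch, assuming $y_0=f(x_0)+\varepsilon$ with $E(\varepsilon)=0$, that $\varepsilon$ is independent of the training set, and that the expectation is taken over both training sets and $\varepsilon$:

$$ E\left[\left(y_{0}-\hat{f}\left(x_{0}\right)\right)^{2}\right]=\operatorname{Var}\left(\hat{f}\left(x_{0}\right)\right)+\left[\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)\right]^{2}+\operatorname{Var}(\varepsilon) $$

where $\operatorname{Bias}(\hat{f}(x_{0}))=E[\hat{f}(x_{0})]-f(x_{0})$. More flexible fits typically have lower bias but higher variance, and $\operatorname{Var}(\varepsilon)$ is an irreducible floor on the test error.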

Supervised learning : classification problems

Here the response variable $Y$ is qualitative, e.g. email is one of $\mathcal{C} = \{\text{spam}, \text{ham}\}$ (ham is good email); digit class is one of $\mathcal{C} = \{0, 1, \ldots, 9\}$.

Our goals

  • Build a classifier $C(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$.
  • Assess the uncertainty in each classification.
  • Understand the roles of the different predictors among $X=(X_1, X_2, \ldots, X_p)$.

Bayes optimal classifier

Suppose the $K$ elements in $\mathcal{C}$ are numbered $1,2,\ldots,K$. Let:

$$ p_k(x)=\Pr(Y=k \mid X=x), \quad k=1,2,\ldots,K $$

These are the conditional (posterior) class probabilities at $x$. If these class probabilities are known, the Bayes optimal classifier at $x$ is:

$$ C(x)=j \text { if } p_{j}(x)=\max \left\{p_{1}(x), p_{2}(x), \ldots, p_{K}(x)\right\} $$
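When the class probabilities at $x$ are known, the rule reduces to picking the label with the largest probability. The sketch below assumes those probabilities are supplied as a dictionary. The function name `bayes_classifier` and the example probabilities are hypothetical.

```python
def bayes_classifier(posteriors):
    """Return the class label j with the largest p_j(x).

    `posteriors` maps each class label to its (assumed known) probability p_k(x).
    """
    total = sum(posteriors.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    # Ties are broken in favor of the label that appears first in the dict.
    return max(posteriors, key=posteriors.get)


# Hypothetical posterior probabilities for one email x.
p_x = {"spam": 0.8, "ham": 0.2}
label = bayes_classifier(p_x)
print(label)  # spam

# Probability of misclassifying this particular x under the Bayes rule.
print(1 - max(p_x.values()))
```

In practice the $p_k(x)$ are unknown and have to be estimated from data, so a real classifier can only approximate this rule.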