Machine Learning

Supervised learning & Unsupervised learning

Starting Point

Outcome measurement $Y$ (also called dependent variable, response, target)

In the regression problem, $Y$ is quantitative.
In the classification problem, $Y$ takes values in a finite, unordered set.

Vector of $p$ predictor measurements $X$ (also called inputs, regressors, covariates, features, independent variables)

Unsupervised Learning

Starting Point

No outcome variable, just a set of predictors (features) measured on a set of samples.
The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, or find linear combinations of features with the most variation.
It is difficult to know how well you are doing, because there is no outcome to check the results against.
Different from supervised learning, but can be useful as a pre-processing step for supervised learning or as an exploratory analysis tool.
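As one illustration of "finding groups of samples that behave similarly", here is a minimal sketch of k-means clustering (Lloyd's algorithm) on one-dimensional data. The notes do not name a specific clustering method, so k-means is an assumption chosen for illustration; the function name `kmeans_1d`, the toy data, and the starting centers are all hypothetical.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Lloyd's algorithm in one dimension with user-supplied starting centers."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)), key=lambda c: abs(p - centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = sum(members) / len(members)
    return labels, centers


# Toy data (made up): two visibly separated groups, no outcome variable.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels, centers = kmeans_1d(points, centers=[0.0, 6.0])
print(labels, centers)
```

Note that nothing in the data says which grouping is "correct", which is why it is hard to judge how well an unsupervised method is doing.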

Our objectives

Accurately predict unseen test cases.
Understand which inputs affect the outcome, and how.
Assess the quality of our predictions and inferences.

The goal of ML is to generalize knowledge beyond the training examples.

Philosophy

It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
It is important to accurately assess the performance of a method, to know how well or how badly it is working (simpler methods often perform as well as fancier ones!).

Supervised learning: regression problems

Find feature and response

$X$ (independent variable, feature, covariate, input): TV, Radio, Newspaper

$Y$ (dependent variable, target, response, output): Sales

We try to build a model:

$$ \text{Sales} \approx f(\text{TV}, \text{Radio}, \text{Newspaper}) $$

We can refer to the input vector collectively as:

$$ X=\left(\begin{array}{l} X_{1} \\ X_{2} \\ X_{3} \end{array}\right) $$

Now we can write our model as:

$$ Y=f(X)+\varepsilon $$

where $\varepsilon$ captures measurement errors and other discrepancies.

What is the regression function?

The ideal $f(x)=E(Y \mid X=x)$ is called the regression function.

What is our goal?

$f(x)=E(Y \mid X=x)$ is the optimal predictor of $Y$ with regard to mean-squared prediction error: among all functions $g$, it is the one that minimizes $$ E\left[(Y-g(X))^{2} \mid X=x\right] $$ at every point $X=x$.
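To see why the conditional mean is the minimizer, a short derivation may help. This is a sketch, writing $f(x)=E(Y \mid X=x)$ and treating $g(x)$ as a fixed number once $X=x$ is given:

$$ E\left[(Y-g(X))^{2} \mid X=x\right]=E\left[(Y-f(x))^{2} \mid X=x\right]+\left(f(x)-g(x)\right)^{2} $$

The cross term $2\,(f(x)-g(x))\,E\left[Y-f(x) \mid X=x\right]$ vanishes because $E(Y \mid X=x)=f(x)$. The first term on the right does not depend on $g$, so the whole expression is smallest when $g(x)=f(x)$.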

How to estimate $f$ ?

Typically we have few if any data points with $X=x$ exactly (for example, $X=4$), so we cannot compute $E(Y \mid X=x)$ directly!

Instead, we relax the definition and let:

$$ \hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x)) $$ where $\mathcal{N}(x)$ is some neighborhood of $x$.
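The neighborhood average can be sketched in a few lines of Python. This is a minimal illustration, assuming a single predictor and taking $\mathcal{N}(x)$ to be the $k$ training points closest to $x$ (one common choice; the notes leave the neighborhood unspecified). The function name `knn_average` and the toy data are hypothetical.

```python
def knn_average(x0, xs, ys, k=3):
    """Estimate f(x0) as the mean of y over the k training points nearest to x0."""
    # Sort training indices by distance to x0 (one-dimensional predictor).
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    neighbors = order[:k]
    return sum(ys[i] for i in neighbors) / len(neighbors)


# Toy training data (made up for illustration): y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
ys = [2.1, 3.9, 6.2, 9.8, 12.1, 14.0]

# No training point has x == 4 exactly, but its neighborhood is not empty.
f_hat_4 = knn_average(4.0, xs, ys, k=2)
print(f_hat_4)
```

With `k=2` the estimate at $x=4$ averages the responses at $x=3$ and $x=5$. A larger `k` gives a smoother but less local estimate.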

Build linear model

$$ f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p} $$

A linear model is specified in terms of $p+1$ parameters $\beta_0,\beta_1,\ldots,\beta_p$.
We estimate the parameters by fitting the model to training data.
Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$.
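As a concrete instance of "fitting the model to training data", the sketch below fits a linear model with a single predictor ($p=1$) by ordinary least squares. The notes do not say which fitting criterion is used, so least squares is an assumption here; the function name `fit_simple_linear` and the data are hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sample covariance of x and y divided by sample variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Toy training data (made up) generated exactly by y = 1 + 2x, with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
beta0, beta1 = fit_simple_linear(xs, ys)
print(beta0, beta1)
```

Because the toy data contain no noise, the fit recovers the generating intercept and slope; with real data the estimates only approximate the unknown $f$.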

Interpretability and Flexibility

Why is under-fitting bad?
- An under-fitted model cannot even fit the training data, let alone the test data or real-world cases.
Why is over-fitting bad?
- The model fits the training data well, but too well: it also fits the noise in the training data, so it does not carry over to other cases.
How do we know when the fit is just right?
Parsimony vs. black box
- We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Assessing Model Accuracy

Suppose we fit a model $\hat{f}(x)$ to some training data $\operatorname{Tr}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$, and we wish to see how it performs.

We could compute the average squared prediction error over $\operatorname{Tr}$:

$$ \mathrm{MSE}_{\mathrm{Tr}}=\operatorname{Ave}_{i \in \operatorname{Tr}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$

This tends to be optimistic, because the model was fitted to these same points. So we also compute it using fresh test data $\operatorname{Te}=\left\{x_{i}, y_{i}\right\}_{1}^{n}$:

$$ \mathrm{MSE}_{\mathrm{Te}}=\operatorname{Ave}_{i \in \operatorname{Te}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$
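The two averages use the same formula on different data, as in this minimal sketch. The fitted model `f_hat` and all data values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mse(f_hat, xs, ys):
    """Average squared prediction error of f_hat over the pairs (x_i, y_i)."""
    return sum((y - f_hat(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def f_hat(x):
    # Hypothetical model, assumed to have been fitted to the training data already.
    return 1.0 + 2.0 * x


# Made-up training set (used for fitting) and fresh test set (not used for fitting).
train_x, train_y = [0.0, 1.0, 2.0], [1.0, 3.5, 5.0]
test_x, test_y = [3.0, 4.0], [8.0, 8.0]

mse_tr = mse(f_hat, train_x, train_y)
mse_te = mse(f_hat, test_x, test_y)
print(mse_tr, mse_te)
```

In this toy case the test error is larger than the training error, which is the typical pattern once a model has been tuned to its training set.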

Figure: the black curve is the truth. The red curve on the right is $\mathrm{MSE}_{\mathrm{Te}}$ and the grey curve is $\mathrm{MSE}_{\mathrm{Tr}}$.

The orange, blue and green curves/squares correspond to fits of different flexibility.

Choosing the flexibility based on average test error amounts to a bias-variance trade-off.
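One standard way to make the trade-off concrete is to decompose the expected test error at a point $x_0$. This is a sketch, assuming $Y=f(X)+\varepsilon$ with $E(\varepsilon)=0$ and $\varepsilon$ independent of the training data, and taking the expectation over training sets and over the noise in $y_0$:

$$ E\left[\left(y_{0}-\hat{f}(x_{0})\right)^{2}\right]=\operatorname{Var}\left(\hat{f}(x_{0})\right)+\left[\operatorname{Bias}\left(\hat{f}(x_{0})\right)\right]^{2}+\operatorname{Var}(\varepsilon) $$

Here $\operatorname{Bias}(\hat{f}(x_{0}))=E[\hat{f}(x_{0})]-f(x_{0})$. More flexible fits typically have higher variance and lower bias, so the test error is smallest at an intermediate flexibility. The last term, $\operatorname{Var}(\varepsilon)$, cannot be reduced by any choice of $\hat{f}$.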

Supervised learning: classification problems

Here the response variable $Y$ is qualitative, e.g. email is one of $\mathcal{C}=\{\text{spam}, \text{ham}\}$ (ham is good email); digit class is one of $\mathcal{C}=\{0, 1, \ldots, 9\}$.

Our goals

Build a classifier $C(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$.
Assess the uncertainty in each classification.
Understand the roles of the different predictors among $X=(X_1, X_2,\ldots,X_p)$.

Bayes optimal classifier

Suppose the $K$ elements in $\mathcal{C}$ are numbered $1,2,\ldots,K$. Let:

$$ p_k(x)=\Pr(Y=k \mid X=x), \quad k=1,2,\ldots,K $$

These are the conditional/posterior class probabilities at $x$. If those class probabilities are known, the Bayes optimal classifier at $x$ is:

$$ C(x)=j \text { if } p_{j}(x)=\max \left\{p_{1}(x), p_{2}(x), \ldots, p_{K}(x)\right\} $$
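In code, the rule is simply an argmax over the class probabilities. The sketch below assumes the probabilities at a given $x$ are already known and passed in as a list; the function name `bayes_classify` and the numbers are hypothetical.

```python
def bayes_classify(probs):
    """Return the 1-based class label j with the largest probability p_j(x).

    probs[k - 1] is assumed to hold p_k(x) for k = 1, ..., K.
    Ties are broken in favor of the smallest label.
    """
    best_index = max(range(len(probs)), key=lambda k: probs[k])
    return best_index + 1


# Made-up posterior probabilities at some x for K = 3 classes (they sum to 1).
posteriors = [0.2, 0.7, 0.1]
label = bayes_classify(posteriors)
print(label)
```

In practice the probabilities are not known and have to be estimated from data, so a real classifier applies the same argmax to estimated probabilities.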
