When dealing with statistical models, a common issue is that we don’t have enough data. To tackle this problem, we can use the bootstrap, which generates “new” data from the data we already have.

Takeaway

The principle behind the bootstrap method is simple:

The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.

Here are a few important things to note:

  1. The bootstrap is what we use to estimate statistics by sampling a dataset with replacement. This means the bootstrap goes beyond merely generating new data from the original data.
  2. The bootstrap can be used to quantify the uncertainty of a given estimator, and that is what “estimate statistics” means.

How the bootstrap works

Generally, it works as follows (see the code sketch after the list):

  • Choose the number of bootstrap samples to perform
  • Choose a sample size
  • For each bootstrap sample:
    • Draw a sample with replacement of the chosen size
    • Calculate the statistic on the sample
  • Calculate the mean of the calculated sample statistics.
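
As a minimal sketch of the general procedure above, the following Python snippet draws bootstrap samples and averages the resulting statistics. The toy data, the choice of 1000 bootstrap samples, and the mean as the statistic are all assumptions made purely for illustration:

    import random

    # Toy data, chosen for illustration; in practice this is your observed sample.
    data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

    n_bootstrap = 1000        # number of bootstrap samples to perform
    sample_size = len(data)   # sample size (often the size of the original dataset)

    statistics = []
    for _ in range(n_bootstrap):
        # Draw a sample with replacement of the chosen size
        sample = [random.choice(data) for _ in range(sample_size)]
        # Calculate the statistic on the sample (here, the mean)
        statistics.append(sum(sample) / len(sample))

    # Calculate the mean of the calculated sample statistics
    print(sum(statistics) / len(statistics))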

More specifically, in statistical inference, or “machine learning” as the trendy term goes nowadays, it works like this (see the sketch after the list):

  • Choose the number of bootstrap samples to perform
  • Choose a sample size
  • For each bootstrap sample:
    • Draw a sample with replacement of the chosen size
    • Fit a model on the data sample (aka the train dataset)
    • Estimate the skill of the model on the out-of-bag sample (aka the test dataset)
  • Calculate the mean of the model skill estimates.
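
Here is a minimal sketch of that model-evaluation loop. The synthetic regression data, the LinearRegression model, and the mean-squared-error skill metric are all assumptions chosen only for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Synthetic toy data (an assumption for illustration only)
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    n_bootstrap = 100
    sample_size = len(X)

    scores = []
    for _ in range(n_bootstrap):
        # Draw indices with replacement of the chosen size (the "train" set)
        idx = rng.choice(len(X), size=sample_size, replace=True)
        # Out-of-bag indices: observations never drawn into this bootstrap sample
        oob = np.setdiff1d(np.arange(len(X)), idx)
        if len(oob) == 0:
            continue  # rare case: every observation was drawn at least once

        # Fit a model on the bootstrap sample
        model = LinearRegression().fit(X[idx], y[idx])
        # Estimate the model's skill on the out-of-bag sample
        scores.append(mean_squared_error(y[oob], model.predict(X[oob])))

    # Calculate the mean of the model skill estimates
    print(np.mean(scores))

The spread of the scores across bootstrap samples is what lets us quantify the uncertainty of the skill estimate, not just its average.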

A simple example

Now suppose we have a dataset with 6 data points (observations):

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

We now apply the bootstrap method (a code sketch of these steps follows the list):

  1. Choose the number of bootstrap samples; here we choose 1, which means we generate 1 bootstrap sample dataset. For the sample size, we choose 3, which means each bootstrap sample dataset contains 3 observations.
  2. Randomly select a data point (observation) from the original dataset data; say we randomly select 0.4.
  3. Repeat this procedure 3 times, since we chose a sample size of 3 above.
  4. Now we have a bootstrap sample dataset:
    • bootstrap_dataset = [0.4, 0.5, 0.4]
  5. We can call this bootstrap_dataset by another name: train_data, while the data points not in this bootstrap_dataset (train_data), namely 0.1, 0.2, 0.3, 0.6, we call test_data (some call this the OOB, or out-of-bag, sample).
  6. We use train_data to build and fit a model, and use test_data to test our model’s accuracy. (This is how the bootstrap can be used to quantify the uncertainty of a given estimator, and that is what “estimate statistics” means.)
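
The same steps in code, as a minimal sketch. The random seed is an assumption added only for reproducibility, so the drawn values will generally differ from the [0.4, 0.5, 0.4] shown above:

    import random

    data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    random.seed(0)  # assumed seed, only to make the illustration reproducible

    # Steps 1-4: draw one bootstrap sample of size 3, with replacement
    bootstrap_dataset = [random.choice(data) for _ in range(3)]

    # Step 5: the bootstrap sample is the train set;
    # observations never drawn form the OOB / test set
    train_data = bootstrap_dataset
    test_data = [x for x in data if x not in bootstrap_dataset]

    print("train_data:", train_data)
    print("test_data:", test_data)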

Reference: A Gentle Introduction to the Bootstrap Method