Resampling

In order to assess the performance of a learning algorithm, resampling strategies are usually used. The entire data set is split into (multiple) training and test sets. You train a learner on each training set, predict on the corresponding test set (sometimes on the training set as well) and calculate some performance measure. Then the individual performance values are aggregated, typically by calculating the mean. There exist various resampling strategies, for example cross-validation and bootstrap, to mention just two popular approaches.

(Figure: schematic illustration of the resampling process)

If you want to read up on further details, the paper Resampling Strategies for Model Assessment and Selection by Simon is probably not a bad choice. Bernd has also published a paper, Resampling methods for meta-model validation with recommendations for evolutionary computation, which contains detailed descriptions and lots of statistical background information on resampling methods.

In mlr the resampling strategy can be chosen via the function makeResampleDesc. The supported resampling strategies are:

- Cross-validation ("CV"),
- Leave-one-out cross-validation ("LOO"),
- Repeated cross-validation ("RepCV"),
- Out-of-bag bootstrap and other variants ("Bootstrap"),
- Subsampling, also called Monte-Carlo cross-validation ("Subsample"),
- Holdout (training/test) ("Holdout").
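As a quick sketch (the resulting objects are not used further in this section), one resample description per strategy could be created as follows; the parameter names iters, folds, reps and split are the ones understood by makeResampleDesc.

library(mlr)

## Holdout: a single training/test split (default 2/3 training)
rdesc.holdout = makeResampleDesc("Holdout")
## 10-fold cross-validation
rdesc.cv = makeResampleDesc("CV", iters = 10)
## Leave-one-out cross-validation
rdesc.loo = makeResampleDesc("LOO")
## 10-fold cross-validation repeated 5 times
rdesc.repcv = makeResampleDesc("RepCV", folds = 10, reps = 5)
## Subsampling with 5 iterations and 4/5 of the data for training
rdesc.subsample = makeResampleDesc("Subsample", iters = 5, split = 4/5)
## Out-of-bag bootstrap with 10 iterations
rdesc.bootstrap = makeResampleDesc("Bootstrap", iters = 10)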

The resample function evaluates the performance of a Learner using the specified resampling strategy for a given machine learning Task.

In the following example the performance of the Cox proportional hazards model on the lung data set is calculated using 3-fold cross-validation. Generally, in k-fold cross-validation the data set D is partitioned into k subsets of (approximately) equal size. In the i-th of the k iterations, the i-th subset is used for testing, while the union of the remaining k - 1 subsets forms the training set. The default performance measure in survival analysis is the concordance index (cindex).

## Specify the resampling strategy (3-fold cross-validation)
rdesc = makeResampleDesc("CV", iters = 3)

## Calculate the performance
r = resample("surv.coxph", lung.task, rdesc)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: cindex.test.mean=0.627
r
#> Resample Result
#> Task: lung-example
#> Learner: surv.coxph
#> cindex.aggr: 0.63
#> cindex.mean: 0.63
#> cindex.sd: 0.05
#> Runtime: 0.25112
## peek a little bit into r
names(r)
#>  [1] "learner.id"     "task.id"        "measures.train" "measures.test" 
#>  [5] "aggr"           "pred"           "models"         "err.msgs"      
#>  [9] "extract"        "runtime"
r$aggr
#> cindex.test.mean 
#>        0.6271182
r$measures.test
#>   iter    cindex
#> 1    1 0.5783027
#> 2    2 0.6324074
#> 3    3 0.6706444
r$measures.train
#>   iter cindex
#> 1    1     NA
#> 2    2     NA
#> 3    3     NA

r$measures.test gives the values of the performance measure on the 3 individual test data sets. r$aggr shows the aggregated performance value. Its name, "cindex.test.mean", indicates the performance measure, cindex, and the method used to aggregate the 3 individual performances. test.mean is the default method and, as the name implies, takes the mean over the performances on the 3 test data sets. No predictions on the training data sets were made and thus r$measures.train contains missing values.

If predictions for the training set are required, too, set predict = "train" or predict = "both" in makeResampleDesc. This is necessary for some bootstrap methods (b632 and b632+) and we will see some examples later on.
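As a minimal sketch (the object rdesc.both is only for illustration and is not used below), such a description could be created like this:

## 3-fold cross-validation with predictions on both training and test sets
rdesc.both = makeResampleDesc("CV", iters = 3, predict = "both")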

r$pred is an object of class ResamplePrediction. Just as a Prediction object (see the section on making predictions), r$pred has an element called "data", which is a data.frame that contains the predictions and, in case of a supervised learning problem, the true values of the target variable.

head(r$pred$data)
#>   id truth.time truth.event   response iter  set
#> 1  1        455        TRUE -0.4951788    1 test
#> 2  2        210        TRUE  0.9573824    1 test
#> 3  4        310        TRUE  0.8069059    1 test
#> 4 10        613        TRUE  0.1918188    1 test
#> 5 12         61        TRUE  0.6638736    1 test
#> 6 14         81        TRUE -0.1873917    1 test

The columns iter and set indicate the resampling iteration and whether an individual prediction was made on the test or the training data set.

In the above example the performance measure is the concordance index (cindex). Of course, it is possible to compute multiple performance measures at once by passing a list of measures (see also the previous section on evaluating learner performance).

In the following we estimate the Dunn index (dunn), the Davies-Bouldin cluster separation measure (db), and the time for training the learner (timetrain) by subsampling with 5 iterations. In each iteration the data set is randomly partitioned into a training and a test set according to a given percentage, e.g., 2/3 training and 1/3 test set. If there is just one iteration, the strategy is commonly called holdout or test sample estimation.

## cluster iris feature data
task = makeClusterTask(data = iris[,-5])
## Subsampling with 5 iterations and default split 2/3
rdesc = makeResampleDesc("Subsample", iters = 5)
## Subsampling with 5 iterations and 4/5 training data
rdesc = makeResampleDesc("Subsample", iters = 5, split = 4/5)

## Calculate the three performance measures
r = resample("cluster.kmeans", task, rdesc, measures = list(dunn, db, timetrain))
#> [Resample] subsampling iter: 1
#> [Resample] subsampling iter: 2
#> [Resample] subsampling iter: 3
#> [Resample] subsampling iter: 4
#> [Resample] subsampling iter: 5
#> [Resample] Result: dunn.test.mean=0.274,db.test.mean=0.51,timetrain.test.mean=0.003
r$aggr
#>      dunn.test.mean        db.test.mean timetrain.test.mean 
#>           0.2738893           0.5103655           0.0030000

Stratified resampling

For classification, it is usually desirable to have the same proportion of the classes in all of the partitions of the original data set. Stratified resampling ensures this. This is particularly useful in case of imbalanced classes and small data sets. Otherwise it may happen, for example, that observations of less frequent classes are missing in some of the training sets, which can decrease the performance of the learner or lead to model crashes. In order to conduct stratified resampling, set stratify = TRUE when calling makeResampleDesc.

## 3-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 3, stratify = TRUE)

r = resample("classif.lda", iris.task, rdesc)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.02

Stratification is also available for survival tasks. Here the stratification balances the censoring rate.
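A minimal sketch, analogous to the lung example above (here stratify = TRUE is assumed to balance the censoring rate as just described; the output is omitted):

## 3-fold cross-validation, stratified on the censoring rate
rdesc = makeResampleDesc("CV", iters = 3, stratify = TRUE)
r = resample("surv.coxph", lung.task, rdesc)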

Sometimes it is required to also stratify on the input data, e.g., to ensure that all subgroups are represented in all training and test sets. To stratify on the input columns, specify factor columns of your task data via stratify.cols.

rdesc = makeResampleDesc("CV", iters = 3, stratify.cols = "chas")
r = resample("regr.rpart", bh.task, rdesc)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mse.test.mean=23.2

Accessing individual learner models

In each resampling iteration a Learner is fitted on the respective training set. By default, the resulting WrappedModels are not returned by resample. If you want to keep them, set models = TRUE when calling resample.

## 3-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 3)

r = resample("classif.lda", iris.task, rdesc, models = TRUE)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.02
r$models
#> [[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = iris-example; obs = 100; features = 4
#> Hyperparameters: 
#> 
#> [[2]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = iris-example; obs = 100; features = 4
#> Hyperparameters: 
#> 
#> [[3]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = iris-example; obs = 100; features = 4
#> Hyperparameters:

Keeping only certain information instead of entire models, for example the variable importance in a regression tree, can be achieved using the extract argument. The function passed to extract is applied to each model fitted on one of the 3 training sets.

## 3-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 3)

## Extract the variable importance in a regression tree
r = resample("regr.rpart", bh.task, rdesc,
    extract = function(x) x$learner.model$variable.importance)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mse.test.mean=30.3
r$extract
#> [[1]]
#>         rm      lstat       crim      indus        age    ptratio 
#> 15228.2872 10742.2277  3893.2744  3651.6232  2601.5262  2551.8492 
#>        dis        nox        rad        tax         zn 
#>  2498.2748  2419.5269  1014.2609   743.3742   308.8209 
#> 
#> [[2]]
#>       lstat         nox         age       indus        crim          rm 
#> 15725.19021  9323.20270  8474.23077  8358.67000  8251.74446  7332.59637 
#>          zn         dis         tax         rad     ptratio           b 
#>  6151.29577  2741.12074  2055.67537  1216.01398   634.78381    71.00088 
#> 
#> [[3]]
#>         rm      lstat        age    ptratio        nox        dis 
#> 15890.9279 13262.3672  4296.4175  3678.6651  3668.4944  3512.2753 
#>       crim        tax      indus         zn          b        rad 
#>  3474.5883  2844.9918  1437.7900  1284.4714   578.6932   496.2382

Resample descriptions and resample instances

As shown above, the function makeResampleDesc is used to specify the resampling strategy.

rdesc = makeResampleDesc("CV", iters = 3)
str(rdesc)
#> List of 4
#>  $ id      : chr "cross-validation"
#>  $ iters   : int 3
#>  $ predict : chr "test"
#>  $ stratify: logi FALSE
#>  - attr(*, "class")= chr [1:2] "CVDesc" "ResampleDesc"

The result rdesc is an object of class ResampleDesc and contains, as the name implies, a description of the resampling strategy. In principle, this is an instruction for drawing training and test sets, including the necessary parameters like the number of iterations, the sizes of the training and test sets, etc.

Based on this description, the data set is randomly partitioned into multiple training and test sets. For each iteration, we get a set of index vectors indicating the training and test examples. These are stored in a ResampleInstance.

If a ResampleDesc is passed to resample, it is instantiated internally. Naturally, it is also possible to pass a ResampleInstance directly.

A ResampleInstance can be created through the function makeResampleInstance given a ResampleDesc and either the size of the data set at hand or the Task. It basically performs the random drawing of indices to separate the data into training and test sets according to the description.

## Create a resample instance based on a task
rin = makeResampleInstance(rdesc, task = iris.task)
rin
#> Resample instance for 150 cases.
#> Resample description: cross-validation with 3 iterations.
#> Predict: test
#> Stratification: FALSE

## Create a resample instance given the size of the data set
rin = makeResampleInstance(rdesc, size = nrow(iris))
str(rin)
#> List of 5
#>  $ desc      :List of 4
#>   ..$ id      : chr "cross-validation"
#>   ..$ iters   : int 3
#>   ..$ predict : chr "test"
#>   ..$ stratify: logi FALSE
#>   ..- attr(*, "class")= chr [1:2] "CVDesc" "ResampleDesc"
#>  $ size      : int 150
#>  $ train.inds:List of 3
#>   ..$ : int [1:100] 36 81 6 82 120 110 118 132 105 61 ...
#>   ..$ : int [1:100] 6 119 120 110 121 118 99 100 29 127 ...
#>   ..$ : int [1:100] 36 81 82 119 121 99 132 105 61 115 ...
#>  $ test.inds :List of 3
#>   ..$ : int [1:50] 2 3 4 5 7 9 11 16 22 24 ...
#>   ..$ : int [1:50] 8 12 17 19 20 23 25 27 32 33 ...
#>   ..$ : int [1:50] 1 6 10 13 14 15 18 21 29 31 ...
#>  $ group     : Factor w/ 0 levels: 
#>  - attr(*, "class")= chr "ResampleInstance"

## Access the indices of the training observations in iteration 3
rin$train.inds[[3]]
#>   [1]  36  81  82 119 121  99 132 105  61 115  17  42   4  71   5  79  30
#>  [18] 113 138  19 150  77  58  92 114 133   8 109  33 145  22 111  97  24
#>  [35]   7  44   3  20 134  96  16  43 149   9  46  32 139  87   2  11  52
#>  [52]  86  40 141 142  72  54  48  83  64  90 112 148 129 137 116 143  69
#>  [69]  84  25  80  37  38  75 130 126 135 107 146  26  12  98  55 124  60
#>  [86]  63 117  23  67  73  28 106  76  50 144  59  47 102  56  27

While having two separate objects, resample descriptions and resample instances, as well as the resample function may seem overly complicated, it has several advantages. Most importantly, a resample instance allows for paired experiments: several learners can be compared on exactly the same training and test sets, as in the example below.

rdesc = makeResampleDesc("CV", iters = 3)
rin = makeResampleInstance(rdesc, task = iris.task)

## Calculate the performance of two learners based on the same resample instance
r.lda = resample("classif.lda", iris.task, rin, show.info = FALSE)
r.rpart = resample("classif.rpart", iris.task, rin, show.info = FALSE)
r.lda$aggr
#> mmce.test.mean 
#>     0.02666667
r.rpart$aggr
#> mmce.test.mean 
#>           0.06

As mentioned above, when calling makeResampleInstance the index sets are drawn randomly. Mainly for holdout (test sample) estimation you might want full control over the training and test sets and specify them manually. This can be done using the function makeFixedHoldoutInstance.

rin = makeFixedHoldoutInstance(train.inds = 1:100, test.inds = 101:150, size = 150)
rin
#> Resample instance for 150 cases.
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE

Aggregating performance values

In resampling we get (for each measure we wish to calculate) one performance value (on the test set, training set, or both) for each iteration. Subsequently, these are aggregated. As mentioned above, mainly the mean over the performance values on the test data sets (test.mean) is calculated.

For example, a 10-fold cross validation computes 10 values for the chosen performance measure. The aggregated value is the mean of these 10 numbers. mlr knows how to handle it because each Measure knows how it is aggregated:

## Mean misclassification error
mmce$aggr
#> Aggregation function: test.mean

## Root mean square error
rmse$aggr
#> Aggregation function: test.rmse

The aggregation method of a Measure can be changed via the function setAggregation. See the documentation of aggregations for available methods.
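As a small sketch (assuming test.sd, which aggregates by the standard deviation over the test set performances, as the alternative aggregation), the misclassification error could be reported together with its spread across folds:

## Report mmce aggregated by the mean and by the standard deviation over the test sets
mmce.sd = setAggregation(mmce, test.sd)
r = resample("classif.rpart", iris.task, makeResampleDesc("CV", iters = 3),
    measures = list(mmce, mmce.sd), show.info = FALSE)
r$aggr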

Example: Different measures and aggregations

test.median computes the median of the performance values on the test sets.

## We use the mean error rate and the median of the true positive rates
m1 = mmce
m2 = setAggregation(tpr, test.median)
rdesc = makeResampleDesc("CV", iters = 3)
r = resample("classif.rpart", sonar.task, rdesc, measures = list(m1, m2))
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.293,tpr.test.median=0.735
r$aggr
#>  mmce.test.mean tpr.test.median 
#>       0.2930987       0.7352941

Example: Calculating the training error

Here we calculate the mean misclassification error (mmce) on the training and the test data sets. Note that we have to set predict = "both" when calling makeResampleDesc in order to get predictions on both data sets, training and test.

mmce.train.mean = setAggregation(mmce, train.mean)
rdesc = makeResampleDesc("CV", iters = 3, predict = "both")
r = resample("classif.rpart", iris.task, rdesc, measures = list(mmce, mmce.train.mean))
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.0467,mmce.train.mean=0.0367
r$measures.train
#>   iter mmce mmce
#> 1    1 0.04 0.04
#> 2    2 0.03 0.03
#> 3    3 0.04 0.04
r$aggr
#>  mmce.test.mean mmce.train.mean 
#>      0.04666667      0.03666667

Example: Bootstrap

In out-of-bag bootstrap estimation B new data sets D_1, ..., D_B are drawn from the data set D with replacement, each of the same size as D. In the i-th iteration, D_i forms the training set, while the remaining elements from D, i.e., the observations not contained in D_i, form the test set.

The variants b632 and b632+ calculate a convex combination of the training performance and the out-of-bag bootstrap performance and thus require predictions on the training sets and an appropriate aggregation strategy.

rdesc = makeResampleDesc("Bootstrap", predict = "both", iters = 10)
b632.mmce = setAggregation(mmce, b632)
b632plus.mmce = setAggregation(mmce, b632plus)
b632.mmce
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: b632
#> Note:

r = resample("classif.rpart", iris.task, rdesc,
    measures = list(mmce, b632.mmce, b632plus.mmce), show.info = FALSE)
head(r$measures.train)
#>   iter        mmce        mmce        mmce
#> 1    1 0.026666667 0.026666667 0.026666667
#> 2    2 0.026666667 0.026666667 0.026666667
#> 3    3 0.006666667 0.006666667 0.006666667
#> 4    4 0.026666667 0.026666667 0.026666667
#> 5    5 0.033333333 0.033333333 0.033333333
#> 6    6 0.013333333 0.013333333 0.013333333
r$aggr
#> mmce.test.mean      mmce.b632  mmce.b632plus 
#>     0.07051905     0.05389071     0.05496489

Convenience functions

When quickly trying out some learners, it can get tedious to write the R code for generating a resample instance, setting the aggregation strategy and so on. For this reason mlr provides some convenience functions for the frequently used resampling strategies, for example holdout, crossval or bootstrapB632. But note that you do not have as much control and flexibility as when using resample with a resample description or instance.

holdout("regr.lm", bh.task, measures = list(mse, mae))
crossval("classif.lda", iris.task, iters = 3, measures = list(mmce, ber))
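Analogously, the b632 bootstrap mentioned above could be run in one call (a sketch; the iters argument is assumed to set the number of bootstrap iterations, as in the crossval example):

## b632 bootstrap of rpart on iris with 10 bootstrap iterations
bootstrapB632("classif.rpart", iris.task, iters = 10)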