Resampling
In order to assess the performance of a learning algorithm, resampling strategies are usually used. The entire data set is split into (multiple) training and test sets. You train a learner on each training set, predict on the corresponding test set (sometimes on the training set as well) and calculate some performance measure. Then the individual performance values are aggregated, typically by calculating the mean. There are various resampling strategies, for example cross-validation and bootstrap, to mention just two popular approaches.
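To make this concrete, the loop that resample automates below can be sketched by hand for a single train/test split. This is only a minimal illustration (it assumes the iris.task example task and an arbitrary 2/3 split) using train, predict and performance:
## Manually evaluate one train/test split
n = getTaskSize(iris.task)
train.inds = sample(n, size = round(2/3 * n))
test.inds = setdiff(seq_len(n), train.inds)
mod = train("classif.lda", iris.task, subset = train.inds)
pred = predict(mod, task = iris.task, subset = test.inds)
performance(pred, measures = mmce)
resample repeats these steps for every train/test split defined by the chosen resampling strategy and aggregates the resulting performance values.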
If you want to read up on further details, the paper Resampling Strategies for Model Assessment and Selection by Simon is probably not a bad choice. Bernd has also published a paper, Resampling methods for meta-model validation with recommendations for evolutionary computation, which contains detailed descriptions and lots of statistical background information on resampling methods.
In mlr the resampling strategy can be chosen via the function makeResampleDesc. The supported resampling strategies are:
- Cross-validation ("CV"),
- Leave-one-out cross-validation ("LOO"),
- Repeated cross-validation ("RepCV"),
- Out-of-bag bootstrap and other variants ("Bootstrap"),
- Subsampling, also called Monte-Carlo cross-validation ("Subsample"),
- Holdout (training/test) ("Holdout").
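For illustration, resample descriptions for some of these strategies might be created as follows. This is just a sketch; the parameter names folds and reps for "RepCV" are assumptions, while iters and split appear in the examples further below.
## Some example resample descriptions
makeResampleDesc("RepCV", folds = 10, reps = 5)
makeResampleDesc("Subsample", iters = 10, split = 2/3)
makeResampleDesc("Bootstrap", iters = 30)
makeResampleDesc("Holdout")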
The resample function evaluates the performance of a Learner using the specified resampling strategy for a given machine learning Task.
In the following example the performance of the Cox proportional hazards model on the lung data set is calculated using 3-fold cross-validation. Generally, in k-fold cross-validation the data set D is partitioned into k subsets of (approximately) equal size. In the i-th of the k iterations, the i-th subset is used for testing, while the union of the remaining parts forms the training set. The default performance measure in survival analysis is the concordance index (cindex).
## Specify the resampling strategy (3-fold cross-validation)
rdesc = makeResampleDesc("CV", iters = 3)
## Calculate the performance
r = resample("surv.coxph", lung.task, rdesc)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: cindex.test.mean=0.627
r
#> Resample Result
#> Task: lung-example
#> Learner: surv.coxph
#> cindex.aggr: 0.63
#> cindex.mean: 0.63
#> cindex.sd: 0.05
#> Runtime: 0.25112
## Peek a little bit into r
names(r)
#> [1] "learner.id" "task.id" "measures.train" "measures.test"
#> [5] "aggr" "pred" "models" "err.msgs"
#> [9] "extract" "runtime"
r$aggr
#> cindex.test.mean
#> 0.6271182
r$measures.test
#> iter cindex
#> 1 1 0.5783027
#> 2 2 0.6324074
#> 3 3 0.6706444
r$measures.train
#> iter cindex
#> 1 1 NA
#> 2 2 NA
#> 3 3 NA
r$measures.test gives the value of the performance measure on the 3 individual test data sets. r$aggr shows the aggregated performance value. Its name, "cindex.test.mean", indicates the performance measure, cindex, and the method used to aggregate the 3 individual performances. test.mean is the default method and, as the name implies, takes the mean over the performances on the 3 test data sets.
No predictions on the training data sets were made and thus r$measures.train contains missing values. If predictions for the training set are required, too, set predict = "train" or predict = "both" in makeResampleDesc. This is necessary for some bootstrap methods (b632 and b632+) and we will see some examples later on.
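For instance, a resample description that additionally requests training set predictions could look like this (the same setting is used in the training error example further below):
## 3-fold cross-validation with predictions on training and test sets
rdesc = makeResampleDesc("CV", iters = 3, predict = "both")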
r$pred is an object of class ResamplePrediction. Just as a Prediction object (see the section on making predictions), r$pred has an element called "data" which is a data.frame that contains the predictions and, in case of a supervised learning problem, the true values of the target variable.
head(r$pred$data)
#> id truth.time truth.event response iter set
#> 1 1 455 TRUE -0.4951788 1 test
#> 2 2 210 TRUE 0.9573824 1 test
#> 3 4 310 TRUE 0.8069059 1 test
#> 4 10 613 TRUE 0.1918188 1 test
#> 5 12 61 TRUE 0.6638736 1 test
#> 6 14 81 TRUE -0.1873917 1 test
The columns iter and set indicate the resampling iteration and whether an individual prediction was made on the test or the training data set.
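The data.frame can easily be subset by these columns, for example to inspect the test set predictions of a single resampling iteration (a small sketch using base R's subset):
## Test set predictions made in the second iteration
head(subset(r$pred$data, iter == 2 & set == "test"))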
In the above example the performance measure is the concordance index (cindex). Of course, it is possible to compute multiple performance measures at once by passing a list of measures (see also the previous section on evaluating learner performance).
In the following we estimate the Dunn index (dunn), the Davies-Bouldin cluster separation measure (db), and the time for training the learner (timetrain) by subsampling with 5 iterations. In each iteration the data set is randomly partitioned into a training and a test set according to a given percentage, e.g., 2/3 training and 1/3 test set. If there is just one iteration, the strategy is commonly called holdout or test sample estimation.
## cluster iris feature data
task = makeClusterTask(data = iris[,-5])
## Subsampling with 5 iterations and default split 2/3
rdesc = makeResampleDesc("Subsample", iters = 5)
## Subsampling with 5 iterations and 4/5 training data
rdesc = makeResampleDesc("Subsample", iters = 5, split = 4/5)
## Calculate the three performance measures
r = resample("cluster.kmeans", task, rdesc, measures = list(dunn, db, timetrain))
#> [Resample] subsampling iter: 1
#> [Resample] subsampling iter: 2
#> [Resample] subsampling iter: 3
#> [Resample] subsampling iter: 4
#> [Resample] subsampling iter: 5
#> [Resample] Result: dunn.test.mean=0.274,db.test.mean=0.51,timetrain.test.mean=0.003
r$aggr
#> dunn.test.mean db.test.mean timetrain.test.mean
#> 0.2738893 0.5103655 0.0030000
Stratified resampling
For classification, it is usually desirable to have the same proportion of the classes in all of the partitions of the original data set. Stratified resampling ensures this.
This is particularly useful in the case of imbalanced classes and small data sets. Otherwise, it may happen, for example, that observations of less frequent classes are missing in some of the training sets, which can decrease the performance of the learner or even lead to model crashes.
In order to conduct stratified resampling, set stratify = TRUE when calling makeResampleDesc.
## 3-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 3, stratify = TRUE)
r = resample("classif.lda", iris.task, rdesc)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.02
Stratification is also available for survival tasks. Here the stratification balances the censoring rate.
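Below is a short sketch for the lung data used at the beginning of this section; the output is omitted here.
## Stratified 3-fold cross-validation for a survival task
rdesc = makeResampleDesc("CV", iters = 3, stratify = TRUE)
r = resample("surv.coxph", lung.task, rdesc)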
Sometimes it is required to also stratify on the input data, e.g., to ensure that all subgroups are represented in all training and test sets. To stratify on the input columns, specify factor columns of your task data via stratify.cols:
rdesc = makeResampleDesc("CV", iters = 3, stratify.cols = "chas")
r = resample("regr.rpart", bh.task, rdesc)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mse.test.mean=23.2
Accessing individual learner models
In each resampling iteration a Learner is fitted on the respective training set. By default, the resulting WrappedModels are not returned by resample. If you want to keep them, set models = TRUE when calling resample.
## 3-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 3)
r = resample("classif.lda", iris.task, rdesc, models = TRUE)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.02
r$models
#> [[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = iris-example; obs = 100; features = 4
#> Hyperparameters:
#>
#> [[2]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = iris-example; obs = 100; features = 4
#> Hyperparameters:
#>
#> [[3]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = iris-example; obs = 100; features = 4
#> Hyperparameters:
Keeping only certain information instead of entire models, for example the variable importance in a regression tree, can be achieved using the extract argument. The function passed to extract is applied to each model fitted on one of the 3 training sets.
## 3-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 3)
## Extract the variable importance in a regression tree
r = resample("regr.rpart", bh.task, rdesc,
extract = function(x) x$learner.model$variable.importance)
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mse.test.mean=30.3
r$extract
#> [[1]]
#> rm lstat crim indus age ptratio
#> 15228.2872 10742.2277 3893.2744 3651.6232 2601.5262 2551.8492
#> dis nox rad tax zn
#> 2498.2748 2419.5269 1014.2609 743.3742 308.8209
#>
#> [[2]]
#> lstat nox age indus crim rm
#> 15725.19021 9323.20270 8474.23077 8358.67000 8251.74446 7332.59637
#> zn dis tax rad ptratio b
#> 6151.29577 2741.12074 2055.67537 1216.01398 634.78381 71.00088
#>
#> [[3]]
#> rm lstat age ptratio nox dis
#> 15890.9279 13262.3672 4296.4175 3678.6651 3668.4944 3512.2753
#> crim tax indus zn b rad
#> 3474.5883 2844.9918 1437.7900 1284.4714 578.6932 496.2382
Resample descriptions and resample instances
As shown above, the function makeResampleDesc is used to specify the resampling strategy.
rdesc = makeResampleDesc("CV", iters = 3)
str(rdesc)
#> List of 4
#> $ id : chr "cross-validation"
#> $ iters : int 3
#> $ predict : chr "test"
#> $ stratify: logi FALSE
#> - attr(*, "class")= chr [1:2] "CVDesc" "ResampleDesc"
The result rdesc is an object of class ResampleDesc and contains, as the name implies, a description of the resampling strategy. In principle, this is an instruction for drawing training and test sets, including the necessary parameters like the number of iterations, the sizes of the training and test sets, etc.
Based on this description, the data set is randomly partitioned into multiple training and test sets. For each iteration, we get a set of index vectors indicating the training and test examples. These are stored in a ResampleInstance.
If a ResampleDesc is passed to resample, it is instantiated internally. Naturally, it is also possible to pass a ResampleInstance directly.
A ResampleInstance can be created through the function makeResampleInstance given a ResampleDesc and either the size of the data set at hand or the Task. It basically performs the random drawing of indices to separate the data into training and test sets according to the description.
## Create a resample instance based on a task
rin = makeResampleInstance(rdesc, task = iris.task)
rin
#> Resample instance for 150 cases.
#> Resample description: cross-validation with 3 iterations.
#> Predict: test
#> Stratification: FALSE
## Create a resample instance given the size of the data set
rin = makeResampleInstance(rdesc, size = nrow(iris))
str(rin)
#> List of 5
#> $ desc :List of 4
#> ..$ id : chr "cross-validation"
#> ..$ iters : int 3
#> ..$ predict : chr "test"
#> ..$ stratify: logi FALSE
#> ..- attr(*, "class")= chr [1:2] "CVDesc" "ResampleDesc"
#> $ size : int 150
#> $ train.inds:List of 3
#> ..$ : int [1:100] 36 81 6 82 120 110 118 132 105 61 ...
#> ..$ : int [1:100] 6 119 120 110 121 118 99 100 29 127 ...
#> ..$ : int [1:100] 36 81 82 119 121 99 132 105 61 115 ...
#> $ test.inds :List of 3
#> ..$ : int [1:50] 2 3 4 5 7 9 11 16 22 24 ...
#> ..$ : int [1:50] 8 12 17 19 20 23 25 27 32 33 ...
#> ..$ : int [1:50] 1 6 10 13 14 15 18 21 29 31 ...
#> $ group : Factor w/ 0 levels:
#> - attr(*, "class")= chr "ResampleInstance"
## Access the indices of the training observations in iteration 3
rin$train.inds[[3]]
#> [1] 36 81 82 119 121 99 132 105 61 115 17 42 4 71 5 79 30
#> [18] 113 138 19 150 77 58 92 114 133 8 109 33 145 22 111 97 24
#> [35] 7 44 3 20 134 96 16 43 149 9 46 32 139 87 2 11 52
#> [52] 86 40 141 142 72 54 48 83 64 90 112 148 129 137 116 143 69
#> [69] 84 25 80 37 38 75 130 126 135 107 146 26 12 98 55 124 60
#> [86] 63 117 23 67 73 28 106 76 50 144 59 47 102 56 27
While having two separate objects, resample descriptions and instances, as well as the resample function may seem overly complicated, it has several advantages:
- Resample instances allow for paired experiments, that is, comparing the performance of several learners on exactly the same training and test sets. This is particularly useful if you want to add another method to a comparison experiment you already did.
rdesc = makeResampleDesc("CV", iters = 3)
rin = makeResampleInstance(rdesc, task = iris.task)
## Calculate the performance of two learners based on the same resample instance
r.lda = resample("classif.lda", iris.task, rin, show.info = FALSE)
r.rpart = resample("classif.rpart", iris.task, rin, show.info = FALSE)
r.lda$aggr
#> mmce.test.mean
#> 0.02666667
r.rpart$aggr
#> mmce.test.mean
#> 0.06
- It is easy to add other resampling methods later on. You can simply derive from the ResampleInstance class, but you do not have to touch any methods that use the resampling strategy.
As mentioned above, when calling makeResampleInstance the index sets are drawn randomly. Mainly for holdout (test sample) estimation you might want full control over the training and test sets and specify them manually. This can be done using the function makeFixedHoldoutInstance.
rin = makeFixedHoldoutInstance(train.inds = 1:100, test.inds = 101:150, size = 150)
rin
#> Resample instance for 150 cases.
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
Aggregating performance values
In resampling we get (for each measure we wish to calculate) one performance value (on the test set, training set, or both) for each iteration. Subsequently, these are aggregated. As mentioned above, by default the mean over the performance values on the test data sets (test.mean) is calculated.
For example, a 10-fold cross validation computes 10 values for the chosen performance measure. The aggregated value is the mean of these 10 numbers. mlr knows how to handle it because each Measure knows how it is aggregated:
## Mean misclassification error
mmce$aggr
#> Aggregation function: test.mean
## Root mean square error
rmse$aggr
#> Aggregation function: test.rmse
The aggregation method of a Measure can be changed via the function setAggregation. See the documentation of aggregations for available methods.
Example: Different measures and aggregations
test.median computes the median of the performance values on the test sets.
## We use the mean error rate and the median of the true positive rates
m1 = mmce
m2 = setAggregation(tpr, test.median)
rdesc = makeResampleDesc("CV", iters = 3)
r = resample("classif.rpart", sonar.task, rdesc, measures = list(m1, m2))
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.293,tpr.test.median=0.735
r$aggr
#> mmce.test.mean tpr.test.median
#> 0.2930987 0.7352941
Example: Calculating the training error
Here we calculate the mean misclassification error (mmce) on the training and the test data sets. Note that we have to set predict = "both" when calling makeResampleDesc in order to get predictions on both data sets, training and test.
mmce.train.mean = setAggregation(mmce, train.mean)
rdesc = makeResampleDesc("CV", iters = 3, predict = "both")
r = resample("classif.rpart", iris.task, rdesc, measures = list(mmce, mmce.train.mean))
#> [Resample] cross-validation iter: 1
#> [Resample] cross-validation iter: 2
#> [Resample] cross-validation iter: 3
#> [Resample] Result: mmce.test.mean=0.0467,mmce.train.mean=0.0367
r$measures.train
#> iter mmce mmce
#> 1 1 0.04 0.04
#> 2 2 0.03 0.03
#> 3 3 0.04 0.04
r$aggr
#> mmce.test.mean mmce.train.mean
#> 0.04666667 0.03666667
Example: Bootstrap
In out-of-bag bootstrap estimation B new data sets D_1 to D_B are drawn from the data set D with replacement, each of the same size as D. In the b-th iteration, D_b forms the training set, while the remaining elements from D, i.e., elements not occurring in the training set, form the test set.
The variants b632 and b632+ calculate a convex combination of the training performance and the out-of-bag bootstrap performance and thus require predictions on the training sets and an appropriate aggregation strategy.
rdesc = makeResampleDesc("Bootstrap", predict = "both", iters = 10)
b632.mmce = setAggregation(mmce, b632)
b632plus.mmce = setAggregation(mmce, b632plus)
b632.mmce
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: b632
#> Note:
r = resample("classif.rpart", iris.task, rdesc,
measures = list(mmce, b632.mmce, b632plus.mmce), show.info = FALSE)
head(r$measures.train)
#> iter mmce mmce mmce
#> 1 1 0.026666667 0.026666667 0.026666667
#> 2 2 0.026666667 0.026666667 0.026666667
#> 3 3 0.006666667 0.006666667 0.006666667
#> 4 4 0.026666667 0.026666667 0.026666667
#> 5 5 0.033333333 0.033333333 0.033333333
#> 6 6 0.013333333 0.013333333 0.013333333
r$aggr
#> mmce.test.mean mmce.b632 mmce.b632plus
#> 0.07051905 0.05389071 0.05496489
Convenience functions
When quickly trying out some learners, it can get tedious to write the R code for generating a resample instance, setting the aggregation strategy and so on. For this reason mlr provides some convenience functions for the frequently used resampling strategies, for example holdout, crossval or bootstrapB632. But note that you do not have as much control and flexibility as when using resample with a resample description or instance.
holdout("regr.lm", bh.task, measures = list(mse, mae))
crossval("classif.lda", iris.task, iters = 3, measures = list(mmce, ber))