Tuning Hyperparameters
Many machine learning algorithms have hyperparameters that need to be set. If you want to set them manually, you can do so as explained in the section about Learners -- simply pass the values to makeLearner. Often, however, suitable parameter values are not obvious and it is preferable to tune the hyperparameters, that is, to automatically identify the values that lead to the best performance.
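For example, a minimal sketch of setting hyperparameters manually (the values below are chosen purely for illustration):
library(mlr)
## set C and sigma of the SVM directly when constructing the learner
lrn = makeLearner("classif.ksvm", C = 1, sigma = 0.5)
## equivalently via a list: makeLearner("classif.ksvm", par.vals = list(C = 1, sigma = 0.5))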
Basics
For tuning you have to specify
- the search space,
- the optimization algorithm,
- an evaluation method, i.e., a resampling strategy and a performance measure.
The last point is already covered in this tutorial in the parts about the evaluation of learning methods and resampling.
Below we show how to specify the search space and optimization algorithm, how to do the tuning and how to access the tuning result, using the example of a grid search.
Throughout this section we consider classification examples. For the other types of learning problems tuning works analogously.
Grid search with manual discretization
A grid search is one of the standard -- albeit slow -- ways to choose an appropriate set of parameters from a given range of values.
We use the iris classification task for illustration and tune the hyperparameters of an SVM (function ksvm from the kernlab package) with a radial basis kernel.
First, we create a ParamSet object, which describes the parameter space we wish to search. This is done via the function makeParamSet. We wish to tune the cost parameter C and the RBF kernel parameter sigma of the ksvm function. Since we will use a grid search strategy, we add discrete parameters to the parameter set. The specified values have to be vectors of feasible settings, and the complete grid is simply their cross-product. Every entry in the parameter set has to be named according to the corresponding parameter of the underlying R function.
Please note that whenever parameters of the underlying R function should be passed in a list structure, mlr tries to give you direct access to each parameter and gets rid of the list structure. This is the case with the kpar argument of ksvm, which is a list of kernel parameters like sigma.
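For comparison, a minimal sketch (values chosen only for illustration) of how the list structure looks when calling kernlab directly, versus the flattened access that mlr provides:
library(kernlab)
## kernlab expects the kernel parameters wrapped in the kpar list
fit = ksvm(Species ~ ., data = iris, kernel = "rbfdot", kpar = list(sigma = 0.5), C = 1)
## with mlr, sigma is a plain hyperparameter: makeLearner("classif.ksvm", C = 1, sigma = 0.5)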
ps = makeParamSet(
makeDiscreteParam("C", values = 2^(-2:2)),
makeDiscreteParam("sigma", values = 2^(-2:2))
)
In addition to the parameter set, we need an instance of a TuneControl object. These objects describe the optimization strategy to be used and its settings. Here we choose a grid search:
ctrl = makeTuneControlGrid()
We will use 3-fold cross-validation to assess the quality of a specific parameter setting. For this we need to create a resampling description just like in the resampling part of the tutorial.
rdesc = makeResampleDesc("CV", iters = 3L)
Finally, by combining all the previous pieces, we can tune the SVM parameters by calling tuneParams.
res = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc, par.set = ps,
control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#> Type len Def Constr Req Tunable Trafo
#> C discrete - - 0.25,0.5,1,2,4 - TRUE -
#> sigma discrete - - 0.25,0.5,1,2,4 - TRUE -
#> With control class: TuneControlGrid
#> Imputation value: 1
#> [Tune-x] 1: C=0.25; sigma=0.25
#> [Tune-y] 1: mmce.test.mean=0.0467; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 2: C=0.5; sigma=0.25
#> [Tune-y] 2: mmce.test.mean=0.0467; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 3: C=1; sigma=0.25
#> [Tune-y] 3: mmce.test.mean=0.04; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 4: C=2; sigma=0.25
#> [Tune-y] 4: mmce.test.mean=0.0467; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 5: C=4; sigma=0.25
#> [Tune-y] 5: mmce.test.mean=0.0467; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 6: C=0.25; sigma=0.5
#> [Tune-y] 6: mmce.test.mean=0.06; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 7: C=0.5; sigma=0.5
#> [Tune-y] 7: mmce.test.mean=0.04; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 8: C=1; sigma=0.5
#> [Tune-y] 8: mmce.test.mean=0.04; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 9: C=2; sigma=0.5
#> [Tune-y] 9: mmce.test.mean=0.0467; time: 0.0 min; memory: 167Mb use, 476Mb max
#> [Tune-x] 10: C=4; sigma=0.5
#> [Tune-y] 10: mmce.test.mean=0.0467; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 11: C=0.25; sigma=1
#> [Tune-y] 11: mmce.test.mean=0.0533; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 12: C=0.5; sigma=1
#> [Tune-y] 12: mmce.test.mean=0.04; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 13: C=1; sigma=1
#> [Tune-y] 13: mmce.test.mean=0.0467; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 14: C=2; sigma=1
#> [Tune-y] 14: mmce.test.mean=0.0467; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 15: C=4; sigma=1
#> [Tune-y] 15: mmce.test.mean=0.0533; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 16: C=0.25; sigma=2
#> [Tune-y] 16: mmce.test.mean=0.0533; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 17: C=0.5; sigma=2
#> [Tune-y] 17: mmce.test.mean=0.04; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 18: C=1; sigma=2
#> [Tune-y] 18: mmce.test.mean=0.0333; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 19: C=2; sigma=2
#> [Tune-y] 19: mmce.test.mean=0.04; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 20: C=4; sigma=2
#> [Tune-y] 20: mmce.test.mean=0.0467; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 21: C=0.25; sigma=4
#> [Tune-y] 21: mmce.test.mean=0.113; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 22: C=0.5; sigma=4
#> [Tune-y] 22: mmce.test.mean=0.0667; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 23: C=1; sigma=4
#> [Tune-y] 23: mmce.test.mean=0.0533; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 24: C=2; sigma=4
#> [Tune-y] 24: mmce.test.mean=0.06; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 25: C=4; sigma=4
#> [Tune-y] 25: mmce.test.mean=0.0667; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune] Result: C=1; sigma=2 : mmce.test.mean=0.0333
res
#> Tune result:
#> Op. pars: C=1; sigma=2
#> mmce.test.mean=0.0333
tuneParams simply performs the cross-validation for every element of the cross-product and selects the parameter setting with the best mean performance. As no performance measure was specified, by default the error rate (mmce) is used.
Note that each measure "knows" if it is minimized or maximized during tuning.
## error rate
mmce$minimize
#> [1] TRUE
## accuracy
acc$minimize
#> [1] FALSE
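If you are unsure which measures are applicable to your task, listMeasures gives an overview (a brief sketch; output omitted):
## list all performance measures suitable for the iris classification task
listMeasures(iris.task)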
Of course, you can pass other measures and also a list of measures to tuneParams. In the latter case the first measure is optimized during tuning, the others are simply evaluated. If you are interested in optimizing several measures simultaneously have a look at the paragraph about multi-criteria tuning below.
In the example below we calculate the accuracy (acc) instead of the error rate. We use function setAggregation, as described in the section on resampling, to additionally obtain the standard deviation of the accuracy.
res = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc, par.set = ps,
control = ctrl, measures = list(acc, setAggregation(acc, test.sd)), show.info = FALSE)
res
#> Tune result:
#> Op. pars: C=0.25; sigma=0.25
#> acc.test.mean=0.953,acc.test.sd=0.0306
Accessing the tuning result
The result object TuneResult allows you to access the best found settings via $x and their estimated performance via $y.
res$x
#> $C
#> [1] 0.25
#>
#> $sigma
#> [1] 0.25
res$y
#> acc.test.mean acc.test.sd
#> 0.9533333 0.0305505
Moreover, we can inspect all points evaluated during the search by accessing the $opt.path (see also the documentation of OptPath).
res$opt.path
#> Optimization path
#> Dimensions: x = 2/2, y = 2
#> Length: 25
#> Add x values transformed: FALSE
#> Error messages: TRUE. Errors: 0 / 25.
#> Exec times: TRUE. Range: 0.088 - 0.255. 0 NAs.
opt.grid = as.data.frame(res$opt.path)
head(opt.grid)
#> C sigma acc.test.mean acc.test.sd dob eol error.message exec.time
#> 1 0.25 0.25 0.9533333 0.03055050 1 NA <NA> 0.094
#> 2 0.5 0.25 0.9466667 0.02309401 2 NA <NA> 0.129
#> 3 1 0.25 0.9533333 0.01154701 3 NA <NA> 0.093
#> 4 2 0.25 0.9533333 0.01154701 4 NA <NA> 0.088
#> 5 4 0.25 0.9533333 0.01154701 5 NA <NA> 0.096
#> 6 0.25 0.5 0.9333333 0.01154701 6 NA <NA> 0.125
A quick visualization of the performance values on the search grid can be accomplished as follows:
library(ggplot2)
g = ggplot(opt.grid, aes(x = C, y = sigma, fill = acc.test.mean, label = round(acc.test.sd, 3)))
g + geom_tile() + geom_text(color = "white")
The colors of the tiles display the achieved accuracy, the tile labels show the standard deviation.
Using the optimal parameter values
After tuning you can generate a Learner with optimal hyperparameter settings as follows:
lrn = setHyperPars(makeLearner("classif.ksvm"), par.vals = res$x)
lrn
#> Learner classif.ksvm from package kernlab
#> Type: classif
#> Name: Support Vector Machines; Short name: ksvm
#> Class: classif.ksvm
#> Properties: twoclass,multiclass,numerics,factors,prob,class.weights
#> Predict-Type: response
#> Hyperparameters: fit=FALSE,C=0.25,sigma=0.25
Then you can proceed as usual. Here we refit the learner on the complete iris data set and make predictions.
m = train(lrn, iris.task)
predict(m, task = iris.task)
#> Prediction: 150 observations
#> predict.type: response
#> threshold:
#> time: 0.01
#> id truth response
#> 1 1 setosa setosa
#> 2 2 setosa setosa
#> 3 3 setosa setosa
#> 4 4 setosa setosa
#> 5 5 setosa setosa
#> 6 6 setosa setosa
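If desired, you can also evaluate these predictions with performance. Keep in mind that performance values computed on the training data are optimistically biased (a minimal sketch; output omitted):
pred = predict(m, task = iris.task)
## evaluate the accuracy of the refitted model on the full data set
performance(pred, measures = acc)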
Grid search without manual discretization
We can also specify the true numeric parameter types of C and sigma when creating the parameter set and use the resolution option of makeTuneControlGrid to automatically discretize them. Note how we also make use of the trafo option when creating the parameter set to easily optimize on a log scale.
Trafos work like this: all optimizers basically see the parameters on their original scale (from -12 to 12 in this case) and produce values on this scale during the search. Right before they are passed to the learning algorithm, the transformation function is applied. For example, a grid point of 0 on the original scale corresponds to C = 2^0 = 1 after the transformation.
ps = makeParamSet(
makeNumericParam("C", lower = -12, upper = 12, trafo = function(x) 2^x),
makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x)
)
ctrl = makeTuneControlGrid(resolution = 3L)
rdesc = makeResampleDesc("CV", iters = 2L)
res = tuneParams("classif.ksvm", iris.task, rdesc, par.set = ps, control = ctrl)
#> [Tune] Started tuning learner classif.ksvm for parameter set:
#> Type len Def Constr Req Tunable Trafo
#> C numeric - - -12 to 12 - TRUE Y
#> sigma numeric - - -12 to 12 - TRUE Y
#> With control class: TuneControlGrid
#> Imputation value: 1
#> [Tune-x] 1: C=0.000244; sigma=0.000244
#> [Tune-y] 1: mmce.test.mean=0.527; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 2: C=1; sigma=0.000244
#> [Tune-y] 2: mmce.test.mean=0.527; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 3: C=4.1e+03; sigma=0.000244
#> [Tune-y] 3: mmce.test.mean=0.04; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 4: C=0.000244; sigma=1
#> [Tune-y] 4: mmce.test.mean=0.527; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 5: C=1; sigma=1
#> [Tune-y] 5: mmce.test.mean=0.04; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 6: C=4.1e+03; sigma=1
#> [Tune-y] 6: mmce.test.mean=0.0667; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 7: C=0.000244; sigma=4.1e+03
#> [Tune-y] 7: mmce.test.mean=0.567; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 8: C=1; sigma=4.1e+03
#> [Tune-y] 8: mmce.test.mean=0.687; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune-x] 9: C=4.1e+03; sigma=4.1e+03
#> [Tune-y] 9: mmce.test.mean=0.687; time: 0.0 min; memory: 168Mb use, 476Mb max
#> [Tune] Result: C=1; sigma=1 : mmce.test.mean=0.04
res
#> Tune result:
#> Op. pars: C=1; sigma=1
#> mmce.test.mean=0.04
Note that res$opt.path contains the parameter values on the original scale.
as.data.frame(res$opt.path)
#> C sigma mmce.test.mean dob eol error.message exec.time
#> 1 -12 -12 0.52666667 1 NA <NA> 0.078
#> 2 0 -12 0.52666667 2 NA <NA> 0.124
#> 3 12 -12 0.04000000 3 NA <NA> 0.115
#> 4 -12 0 0.52666667 4 NA <NA> 0.290
#> 5 0 0 0.04000000 5 NA <NA> 0.064
#> 6 12 0 0.06666667 6 NA <NA> 0.091
#> 7 -12 12 0.56666667 7 NA <NA> 0.066
#> 8 0 12 0.68666667 8 NA <NA> 0.114
#> 9 12 12 0.68666667 9 NA <NA> 0.114
In order to get the transformed parameter values instead, use function trafoOptPath.
as.data.frame(trafoOptPath(res$opt.path))
#> C sigma mmce.test.mean dob eol error.message exec.time
#> 1 2.441406e-04 2.441406e-04 0.52666667 1 NA <NA> 0.078
#> 2 1.000000e+00 2.441406e-04 0.52666667 2 NA <NA> 0.124
#> 3 4.096000e+03 2.441406e-04 0.04000000 3 NA <NA> 0.115
#> 4 2.441406e-04 1.000000e+00 0.52666667 4 NA <NA> 0.290
#> 5 1.000000e+00 1.000000e+00 0.04000000 5 NA <NA> 0.064
#> 6 4.096000e+03 1.000000e+00 0.06666667 6 NA <NA> 0.091
#> 7 2.441406e-04 4.096000e+03 0.56666667 7 NA <NA> 0.066
#> 8 1.000000e+00 4.096000e+03 0.68666667 8 NA <NA> 0.114
#> 9 4.096000e+03 4.096000e+03 0.68666667 9 NA <NA> 0.114
Iterated F-Racing for mixed spaces and dependencies
The package supports a large number of further tuning algorithms, which can all be looked up and selected via TuneControl. One of the cooler algorithms is iterated F-racing from the irace package (technical description here). This not only works for arbitrary parameter types (numeric, integer, discrete, logical), but also for so-called dependent / hierarchical parameters:
ps = makeParamSet(
makeNumericParam("C", lower = -12, upper = 12, trafo = function(x) 2^x),
makeDiscreteParam("kernel", values = c("vanilladot", "polydot", "rbfdot")),
makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x,
requires = quote(kernel == "rbfdot")),
makeIntegerParam("degree", lower = 2L, upper = 5L,
requires = quote(kernel == "polydot"))
)
ctrl = makeTuneControlIrace(maxExperiments = 200L)
rdesc = makeResampleDesc("Holdout")
res = tuneParams("classif.ksvm", iris.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)
print(head(as.data.frame(res$opt.path)))
#> C kernel sigma degree mmce.test.mean dob eol
#> 1 3.148837 polydot NA 5 0.08 1 NA
#> 2 3.266305 vanilladot NA NA 0.02 2 NA
#> 3 -3.808213 vanilladot NA NA 0.04 3 NA
#> 4 1.694097 rbfdot 6.580514 NA 0.48 4 NA
#> 5 11.995501 polydot NA 2 0.08 5 NA
#> 6 -5.731782 vanilladot NA NA 0.14 6 NA
#> error.message exec.time
#> 1 <NA> 0.076
#> 2 <NA> 0.039
#> 3 <NA> 0.098
#> 4 <NA> 0.097
#> 5 <NA> 0.073
#> 6 <NA> 0.038
See how we made the kernel parameters sigma and degree dependent on the kernel selection parameter? This approach allows you to tune the parameters of multiple kernels at once, efficiently concentrating on the ones that work best for your given data set.
Tuning across whole model spaces with ModelMultiplexer
We can now take this example one step further. If we use the ModelMultiplexer we can tune over different model classes at once, just as we did with the SVM kernels above.
base.learners = list(
makeLearner("classif.ksvm"),
makeLearner("classif.randomForest")
)
lrn = makeModelMultiplexer(base.learners)
Function makeModelMultiplexerParamSet offers a simple way to construct a parameter set for tuning: the parameter names are prefixed automatically and the requires element is set, too, so that all parameters are subordinate to selected.learner.
ps = makeModelMultiplexerParamSet(lrn,
makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x),
makeIntegerParam("ntree", lower = 1L, upper = 500L)
)
print(ps)
#> Type len Def
#> selected.learner discrete - -
#> classif.ksvm.sigma numeric - -
#> classif.randomForest.ntree integer - -
#> Constr Req Tunable
#> selected.learner classif.ksvm,classif.randomForest - TRUE
#> classif.ksvm.sigma -12 to 12 Y TRUE
#> classif.randomForest.ntree 1 to 500 Y TRUE
#> Trafo
#> selected.learner -
#> classif.ksvm.sigma Y
#> classif.randomForest.ntree -
rdesc = makeResampleDesc("CV", iters = 2L)
ctrl = makeTuneControlIrace(maxExperiments = 200L)
res = tuneParams(lrn, iris.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)
print(head(as.data.frame(res$opt.path)))
#> selected.learner classif.ksvm.sigma classif.randomForest.ntree
#> 1 classif.ksvm -3.673815 NA
#> 2 classif.ksvm 6.361006 NA
#> 3 classif.randomForest NA 487
#> 4 classif.ksvm 3.165340 NA
#> 5 classif.randomForest NA 125
#> 6 classif.randomForest NA 383
#> mmce.test.mean dob eol error.message exec.time
#> 1 0.04666667 1 NA <NA> 0.168
#> 2 0.75333333 2 NA <NA> 0.153
#> 3 0.03333333 3 NA <NA> 0.406
#> 4 0.24000000 4 NA <NA> 0.095
#> 5 0.04000000 5 NA <NA> 0.253
#> 6 0.04000000 6 NA <NA> 0.263
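As before, a tuned learner can then be constructed from the result. Here is a hedged sketch, assuming that res$x contains selected.learner together with the prefixed hyperparameters of the chosen model class:
## set the winning model class and its hyperparameters on the multiplexer
lrn.tuned = setHyperPars(lrn, par.vals = res$x)
m = train(lrn.tuned, iris.task)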
Multi-criteria evaluation and optimization
During tuning you might want to optimize multiple, potentially conflicting, performance measures simultaneously.
In the following example we aim to minimize both the false positive and the false negative rate (fpr and fnr). We again tune the hyperparameters of an SVM (function ksvm) with a radial basis kernel and use the sonar classification task for illustration. As the search strategy we choose a random search.
For all available multi-criteria tuning algorithms see TuneMultiCritControl.
ps = makeParamSet(
makeNumericParam("C", lower = -12, upper = 12, trafo = function(x) 2^x),
makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x)
)
ctrl = makeTuneMultiCritControlRandom(maxit = 30L)
rdesc = makeResampleDesc("Holdout")
res = tuneParamsMultiCrit("classif.ksvm", task = sonar.task, resampling = rdesc, par.set = ps,
measures = list(fpr, fnr), control = ctrl, show.info = FALSE)
res
#> Tune multicrit result:
#> Points on front: 5
head(as.data.frame(trafoOptPath(res$opt.path)))
#> C sigma fpr.test.mean fnr.test.mean dob eol
#> 1 1.052637e-01 0.003374481 0.0000000 1.00000000 1 NA
#> 2 1.612578e+02 14.303163917 0.0000000 1.00000000 2 NA
#> 3 3.697931e+03 0.026982462 0.1851852 0.06976744 3 NA
#> 4 2.331471e+02 11.791412207 0.0000000 1.00000000 4 NA
#> 5 2.078857e-02 0.010218565 0.0000000 1.00000000 5 NA
#> 6 3.382767e+02 2.187025359 0.0000000 1.00000000 6 NA
#> error.message exec.time
#> 1 <NA> 0.113
#> 2 <NA> 0.076
#> 3 <NA> 0.101
#> 4 <NA> 0.103
#> 5 <NA> 0.101
#> 6 <NA> 0.108
The results can be visualized with function plotTuneMultiCritResult. The plot shows the false positive and false negative rates for all parameter settings evaluated during tuning. Points on the Pareto front are drawn slightly larger.
plotTuneMultiCritResult(res)
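Analogous to the single-criterion case, the Pareto-optimal parameter settings and the corresponding performance values can be inspected via the result object (a brief sketch; output omitted):
## list of parameter settings on the Pareto front
res$x
## matrix of fpr and fnr values for these settings
res$y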
Further comments
- Tuning works for all other tasks like regression, survival analysis and so on in a completely analogous fashion.
- In longer running tuning experiments it is very annoying if the computation stops due to numerical or other errors. Have a look at on.learner.error in configureMlr as well as the examples given in the section Configure mlr of this tutorial. You might also want to inform yourself about impute.val in TuneControl.
- As we continually optimize over the same data during tuning, the estimated performance value might be optimistically biased. A clean approach to ensure unbiased performance estimation is nested resampling, where we embed the whole model selection process into an outer resampling loop (see the sketch below).
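A minimal sketch of nested resampling (here with a random search in the inner loop; the number of iterations and folds are chosen only for illustration):
ps = makeParamSet(
  makeNumericParam("C", lower = -12, upper = 12, trafo = function(x) 2^x),
  makeNumericParam("sigma", lower = -12, upper = 12, trafo = function(x) 2^x)
)
ctrl = makeTuneControlRandom(maxit = 10L)
inner = makeResampleDesc("CV", iters = 3L)
outer = makeResampleDesc("CV", iters = 5L)
## the tuning is wrapped into the learner and performed on each outer training set
tuned.lrn = makeTuneWrapper("classif.ksvm", resampling = inner, par.set = ps, control = ctrl)
## the outer loop then yields an unbiased performance estimate of the tuned SVM
r = resample(tuned.lrn, iris.task, resampling = outer, extract = getTuneResult)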