Data Preprocessing

mlr offers several options for data preprocessing.

Some simple preprocessing methods were already mentioned in the section on Learning tasks, and distinct sections of this tutorial are devoted to further preprocessing topics.

Additionally, mlr allows you to fuse a Learner with any preprocessing method of your choice, such as data transformation, normalization, dimensionality reduction, or outlier removal.

Fusing a learner with data preprocessing

A Learner can be coupled with a preprocessing method by function makePreprocWrapper.

As described in the section about wrapped learners, wrappers are implemented using a train and a predict method. In the case of preprocessing wrappers, these methods specify how to transform the data before training and before prediction, and they are completely defined by the user.

The specified preprocessing steps then "belong" to the wrapped Learner. In contrast to the preprocessing options listed above, like normalizeFeatures, which are applied to the data set once beforehand, the preprocessing inside a wrapper becomes part of model fitting: during resampling it is redone on each training set, and the resulting parameters are applied to the corresponding test set, so no information leaks from the test data into training.

Let's see how to create a preprocessing wrapper using a simple example: some learning methods, e.g., k-nearest neighbors, support vector machines, or neural networks, usually work best with scaled features. Many, but not all, of their implementations have a built-in scaling option, where the training data set is scaled before model fitting and the test data set is scaled accordingly, i.e., using the scaling parameters from the training stage, before predictions are made. In the following we show how to add such a scaling option to a Learner by coupling it with the function scale.
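The key mechanic, reusing the training-stage scaling parameters at prediction time, can be illustrated with base R's scale, which attaches the column means and standard deviations as attributes of its result. This is a minimal sketch with made-up toy data:

```r
## Toy training and test data (made-up numbers)
train = data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
test  = data.frame(x = c(5, 6),       y = c(50, 60))

## Scale the training data; scale() attaches the parameters as attributes
train.sc = scale(as.matrix(train), center = TRUE, scale = TRUE)
ctr = attr(train.sc, "scaled:center")  # column means of the training data
scl = attr(train.sc, "scaled:scale")   # column standard deviations

## Scale the test data with the *training* parameters, not its own
test.sc = scale(as.matrix(test), center = ctr, scale = scl)
```

This is exactly what the train and predict functions defined below do for the numerical features of a task.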

Specifying the train method

The train function has to be a function with the following arguments:

- data: a data.frame with the feature columns and the target variable
- target: a string with the name of the target variable
- args: a list of further arguments and parameter values, as passed via the par.vals argument of makePreprocWrapper

It must return a list with elements $data and $control, where $data is the preprocessed data set and $control stores all information required to preprocess the data before prediction.
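The required return structure can be sketched as follows (a skeleton with a hypothetical name, not part of mlr):

```r
## Skeleton of a preproc train function: it must return a list with
## elements $data (the transformed data set) and $control (everything
## needed to redo the transformation at prediction time)
tr.skeleton = function(data, target, args) {
  ## ... transform 'data' here, using the settings passed in 'args' ...
  ctrl = list()  # e.g., estimated parameters such as column means
  list(data = data, control = ctrl)
}
```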

The train function for the scaling example is given below. It calls scale on the numerical features and returns the scaled training data and the corresponding scaling parameters. args contains the center and scale arguments of function scale and slot $control stores the scaling parameters.

tr.fun = function(data, target, args = list(center = TRUE, scale = TRUE)) {
  ## Identify the numerical features, excluding the target
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  ## Extract the numerical features and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  ## Store the scaling parameters in control: if center/scale is TRUE,
  ## keep the estimated column means/standard deviations
  ctrl = args
  if (is.logical(ctrl$center) && ctrl$center)
    ctrl$center = attr(x, "scaled:center")
  if (is.logical(ctrl$scale) && ctrl$scale)
    ctrl$scale = attr(x, "scaled:scale")
  ## Recombine the scaled numerical features with the remaining columns
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = ctrl))
}

Specifying the predict method

The predict function has the following arguments:

- data: a data.frame with the feature columns
- target: a string with the name of the target variable
- args: the list of arguments that was passed to the train function
- control: the object returned by the train function

It returns the preprocessed data.
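The corresponding contract can be sketched as follows (again a skeleton with a hypothetical name, not part of mlr):

```r
## Skeleton of a preproc predict function: apply the transformation
## stored in 'control' to the new data and return the transformed
## data.frame
pr.skeleton = function(data, target, args, control) {
  ## ... apply the stored transformation from 'control' to 'data' ...
  data
}
```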

In our running example the predict function scales the numerical features using the parameters from the training stage stored in control.

pr.fun = function(data, target, args, control) {
  ## Identify the numerical features (the target is not part of the data here)
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  ## Scale them using the parameters from the training stage
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  ## Recombine the scaled numerical features with the remaining columns
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}

Creating the preprocessing wrapper

Below we create a preprocessing wrapper with a regression neural network (which itself does not have a scaling option) as base learner.

The train and predict functions defined above are passed to makePreprocWrapper via its train and predict arguments. par.vals is a list of parameter values that is relayed to the args argument of the train function.

lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
lrn = makePreprocWrapper(lrn, train = tr.fun, predict = pr.fun,
  par.vals = list(center = TRUE, scale = TRUE))
lrn
#> Learner regr.nnet.preproc from package nnet
#> Type: regr
#> Name: ; Short name: 
#> Class: PreprocWrapper
#> Properties: numerics,factors,weights
#> Predict-Type: response
#> Hyperparameters: size=3,trace=FALSE,decay=0.01

Let's compare the cross-validated mean squared error (mse) on the Boston Housing data set with and without scaling.

rdesc = makeResampleDesc("CV", iters = 10)

r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r$aggr
#> mse.test.mean 
#>      20.98447

lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r$aggr
#> mse.test.mean 
#>      54.37792