Integrating another learner

In order to create a new learner in mlr, interface code to the R function must be written. Three functions are required for each learner. First, you must define the learner itself with a name, description, capabilities, parameters, and a few other things. Second, you need to provide a function that calls the learner function and builds the model given data. Finally, a prediction function that returns predicted values given new data is needed.

All learners should inherit from rlearner.classif, rlearner.multilabel, rlearner.regr, rlearner.surv, rlearner.costsens, or rlearner.cluster. While it is also possible to define a new type of learner that has special properties and does not fit into one of the existing schemes, this is much more advanced and not covered here.

Classification

We show how Linear Discriminant Analysis from package MASS has been integrated into the classification learner classif.lda in mlr as an example.

Definition of the learner

The minimal information required to define a learner is the mlr name of the learner, its package, the parameter set, and the set of properties of your learner. In addition, you may provide a human-readable name, a short name and a note with information relevant to users of the learner.

First, name your learner. The naming conventions in mlr are classif.<R_method_name> for classification, multilabel.<R_method_name> for multilabel classification, regr.<R_method_name> for regression, surv.<R_method_name> for survival analysis, costsens.<R_method_name> for cost-sensitive learning, and cluster.<R_method_name> for clustering. So in this example, the name starts with classif. and we choose classif.lda.

Second, we need to define the parameters of the learner. These are any options that can be set when running it to change how it learns, how input is interpreted, how and what output is generated, and so on. mlr provides a number of functions to define parameters; a complete list can be found in the documentation of LearnerParam in the ParamHelpers package.

In our example, we have discrete and numeric parameters, so we use makeDiscreteLearnerParam and makeNumericLearnerParam to incorporate the complete description of the parameters. We include all possible values for discrete parameters and lower and upper bounds for numeric parameters. Strictly speaking, it is not necessary to provide bounds for all parameters, and if this information is not available the bounds can be estimated, but providing accurate and specific information here makes it possible to tune the learner much better (see the section on tuning).

Next, we add information on the properties of the learner (see also the section on learners). Which types of features are supported (numerics, factors)? Are case weights supported? Are class weights supported? Can the method deal with missing values (NAs) in the features in a meaningful way, rather than simply dropping incomplete cases as na.omit does? Are one-class, two-class, multi-class problems supported? Can the learner predict posterior probabilities?

If the learner supports class weights, the name of the relevant learner parameter can be specified via the argument class.weights.param.
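
For illustration, the skeleton of a learner definition with class weight support might look like the following sketch. The learner name classif.example and the package examplePkg are placeholders, and we assume the underlying fitting function accepts a numeric vector of class weights via an argument class.weights (as, e.g., kernlab::ksvm does). Note that the property "class.weights" must be declared as well.

makeRLearnerClassif(
  cl = "classif.example",   # placeholder learner name
  package = "examplePkg",   # placeholder package
  par.set = makeParamSet(
    # assumed: the fitting function takes a vector of class weights
    makeNumericVectorLearnerParam(id = "class.weights", lower = 0)
  ),
  properties = c("twoclass", "multiclass", "numerics", "class.weights"),
  class.weights.param = "class.weights"
)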

Below is the complete code for the definition of the LDA learner. For training it has one discrete parameter, method, and two continuous ones, nu and tol; in addition there is the prediction-time parameter predict.method and the non-tunable flag CV. It supports classification problems with two or more classes and can deal with numeric and factor explanatory variables. It can predict posterior probabilities.

makeRLearner.classif.lda = function() {
  makeRLearnerClassif(
    cl = "classif.lda",
    package = "MASS",
    par.set = makeParamSet(
      makeDiscreteLearnerParam(id = "method", default = "moment", values = c("moment", "mle", "mve", "t")),
      makeNumericLearnerParam(id = "nu", lower = 2, requires = quote(method == "t")),
      makeNumericLearnerParam(id = "tol", default = 1e-4, lower = 0),
      makeDiscreteLearnerParam(id = "predict.method", values = c("plug-in", "predictive", "debiased"),
        default = "plug-in", when = "predict"),
      makeLogicalLearnerParam(id = "CV", default = FALSE, tunable = FALSE)
    ),
    properties = c("twoclass", "multiclass", "numerics", "factors", "prob"),
    name = "Linear Discriminant Analysis",
    short.name = "lda",
    note = "Learner param 'predict.method' maps to 'method' in predict.lda."
  )
}

Creating the training function of the learner

Once the learner has been defined, we need to tell mlr how to call it to train a model. The name of the function has to start with trainLearner., followed by the mlr name of the learner as defined above (classif.lda here). The prototype of the function looks as follows.

function(.learner, .task, .subset, .weights = NULL, ...) { }

This function must fit a model on the data of the task .task with regard to the subset defined in the integer vector .subset and the parameters passed in the ... arguments. Usually, the data should be extracted from the task using getTaskData. This will take care of any subsetting as well. It must return the fitted model. mlr assumes no special data type for the return value -- it will be passed to the predict function we are going to define below, so any special code the learner may need can be encapsulated there.
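
Not every learning method uses the formula interface. If the underlying function expects a feature matrix and a target vector instead, getTaskData can separate them directly via target.extra = TRUE, which returns a list with elements $data and $target. A minimal sketch with a hypothetical fitting function examplePkg::exampleFit (the rFerns learner at the end of this page uses the same pattern):

trainLearner.classif.example = function(.learner, .task, .subset, .weights = NULL, ...) {
  # d$data holds the features, d$target the target values
  d = getTaskData(.task, .subset, target.extra = TRUE)
  examplePkg::exampleFit(x = as.matrix(d$data), y = d$target, ...)  # hypothetical
}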

For our example, the definition of the function looks like this. In addition to the data of the task, we also need the formula that describes what to predict. We use the function getTaskFormula to extract this from the task.

trainLearner.classif.lda = function(.learner, .task, .subset, .weights = NULL, ...) {
  f = getTaskFormula(.task)
  MASS::lda(f, data = getTaskData(.task, .subset), ...)
}

Creating the prediction method

Finally, the prediction function needs to be defined. The name of this function starts with predictLearner., followed again by the mlr name of the learner. The prototype of the function is as follows.

function(.learner, .model, .newdata, ...) { }

It must predict for the new observations in the data.frame .newdata with the wrapped model .model, which is returned from the training function. The actual model the learner built is stored in the $learner.model member and can be accessed simply through .model$learner.model.

For classification, you have to return a factor of predicted classes if .learner$predict.type is "response", or a matrix of predicted probabilities if .learner$predict.type is "prob" and this type of prediction is supported by the learner. In the latter case the matrix must have the same number of columns as there are classes in the task and the columns have to be named by the class names.
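
If the underlying predict method does not return the probabilities in this format already, the class levels stored in the task description of the wrapped model can be used to construct it. A minimal sketch, assuming probs is a numeric matrix with one column of predicted probabilities per class, ordered by class level:

# class levels as stored in the wrapped model's task description
levs = .model$task.desc$class.levels
colnames(probs) = levs  # assumed: columns are already ordered by class level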

The definition for LDA looks like this. It is pretty much just a straight pass-through of the arguments to the predict function and some extraction of prediction data depending on the type of prediction requested.

predictLearner.classif.lda = function(.learner, .model, .newdata, predict.method = "plug-in", ...) {
  p = predict(.model$learner.model, newdata = .newdata, method = predict.method, ...)
  if (.learner$predict.type == "response") {
    return(p$class)
  } else {
    return(p$posterior)
  }
}
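
After sourcing these three functions, the new learner is picked up by mlr via S3 dispatch and can be used like any other learner. For example, using the example task iris.task that ships with mlr:

lrn = makeLearner("classif.lda", predict.type = "prob")
mod = train(lrn, iris.task)
pred = predict(mod, task = iris.task)
head(as.data.frame(pred))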

Regression

The main difference for regression is that the type of prediction is different (numeric values instead of labels or probabilities) and that not all of the properties are relevant. In particular, whether one-, two-, or multi-class problems and posterior probabilities are supported is not applicable.

Apart from this, everything explained above applies. Below is the definition for the earth learner from the earth package.

makeRLearner.regr.earth = function() {
  makeRLearnerRegr(
    cl = "regr.earth",
    package = "earth",
    par.set = makeParamSet(
      makeLogicalLearnerParam(id = "keepxy", default = FALSE, tunable = FALSE),
      makeNumericLearnerParam(id = "trace", default = 0, upper = 10, tunable = FALSE),
      makeIntegerLearnerParam(id = "degree", default = 1L, lower = 1L),
      makeNumericLearnerParam(id = "penalty"),
      makeIntegerLearnerParam(id = "nk", lower = 0L),
      makeNumericLearnerParam(id = "thres", default = 0.001),
      makeIntegerLearnerParam(id = "minspan", default = 0L),
      makeIntegerLearnerParam(id = "endspan", default = 0L),
      makeNumericLearnerParam(id = "newvar.penalty", default = 0),
      makeIntegerLearnerParam(id = "fast.k", default = 20L, lower = 0L),
      makeNumericLearnerParam(id = "fast.beta", default = 1),
      makeDiscreteLearnerParam(id = "pmethod", default = "backward",
        values = c("backward", "none", "exhaustive", "forward", "seqrep", "cv")),
      makeIntegerLearnerParam(id = "nprune")
    ),
    properties = c("numerics", "factors"),
    name = "Multivariate Adaptive Regression Splines",
    short.name = "earth",
    note = ""
  )
}
trainLearner.regr.earth = function(.learner, .task, .subset, .weights = NULL, ...) {
  f = getTaskFormula(.task)
  earth::earth(f, data = getTaskData(.task, .subset), ...)
}
predictLearner.regr.earth = function(.learner, .model, .newdata, ...) {
  # earth's predict returns a one-column matrix; extract it as a numeric vector
  predict(.model$learner.model, newdata = .newdata)[, 1L]
}

Again, most of the data is passed straight through to and from the train and predict functions of the learner.
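
Once the three functions are defined, the learner can be tried out, for example on the built-in Boston Housing task bh.task:

lrn = makeLearner("regr.earth", degree = 2)
mod = train(lrn, bh.task)
pred = predict(mod, task = bh.task)
performance(pred)  # default regression measure: mse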

Survival analysis

For survival analysis, you have to return so-called linear predictors in order to compute the default measure for this task type, the cindex (for .learner$predict.type == "response"). For .learner$predict.type == "prob", there is no substantially meaningful measure (yet). You may either ignore this case, as the example below does, or return something like predicted survival curves.

There are three properties that are specific to survival learners: "rcens", "lcens" and "icens", defining the type(s) of censoring a learner can handle -- right, left and/or interval censored.

Let's have a look at how the Cox Proportional Hazard Model from package survival has been integrated into the survival learner surv.coxph in mlr as an example:

makeRLearner.surv.coxph = function() {
  makeRLearnerSurv(
    cl = "surv.coxph",
    package = "survival",
    par.set = makeParamSet(
      makeDiscreteLearnerParam(id = "ties", default = "efron", values = c("efron", "breslow", "exact")),
      makeLogicalLearnerParam(id = "singular.ok", default = TRUE),
      makeNumericLearnerParam(id = "eps", default = 1e-09, lower = 0),
      makeNumericLearnerParam(id = "toler.chol", default = .Machine$double.eps^0.75, lower = 0),
      makeIntegerLearnerParam(id = "iter.max", default = 20L, lower = 1L),
      makeNumericLearnerParam(id = "toler.inf", default = sqrt(.Machine$double.eps^0.75), lower = 0),
      makeIntegerLearnerParam(id = "outer.max", default = 10L, lower = 1L),
      makeLogicalLearnerParam(id = "model", default = FALSE, tunable = FALSE),
      makeLogicalLearnerParam(id = "x", default = FALSE, tunable = FALSE),
      makeLogicalLearnerParam(id = "y", default = TRUE, tunable = FALSE)
    ),
    properties = c("missings", "numerics", "factors", "weights", "prob", "rcens"),
    name = "Cox Proportional Hazard Model",
    short.name = "coxph",
    note = ""
  )
}
trainLearner.surv.coxph = function(.learner, .task, .subset, .weights = NULL, ...) {
  f = getTaskFormula(.task)
  data = getTaskData(.task, subset = .subset)
  if (is.null(.weights)) {
    mod = survival::coxph(formula = f, data = data, ...)
  } else {
    mod = survival::coxph(formula = f, data = data, weights = .weights, ...)
  }
  mod
}
predictLearner.surv.coxph = function(.learner, .model, .newdata, ...) {
  # only "response" (linear predictors) is implemented; the "prob" case is ignored
  if (.learner$predict.type == "response") {
    predict(.model$learner.model, newdata = .newdata, type = "lp", ...)
  }
}
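
As a quick check, the learner can be run on the built-in survival task lung.task; calling performance on the prediction reports the default survival measure, the cindex:

lrn = makeLearner("surv.coxph")
mod = train(lrn, lung.task)
performance(predict(mod, task = lung.task))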

Clustering

For clustering, you have to return a numeric vector with the IDs of the clusters that the respective datum has been assigned to. The numbering should start at 1.

Below is the definition for the FarthestFirst learner from the RWeka package. Weka starts the IDs of the clusters at 0, so we add 1 to the predicted clusters. RWeka has a different way of setting learner parameters; we use the special Weka_control function to do this.

makeRLearner.cluster.FarthestFirst = function() {
  makeRLearnerCluster(
    cl = "cluster.FarthestFirst",
    package = "RWeka",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "N", default = 2L, lower = 1L),
      makeIntegerLearnerParam(id = "S", default = 1L, lower = 1L),
      makeLogicalLearnerParam(id = "output-debug-info", default = FALSE, tunable = FALSE)
    ),
    properties = c("numerics"),
    name = "FarthestFirst Clustering Algorithm",
    short.name = "farthestfirst"
  )
}
trainLearner.cluster.FarthestFirst = function(.learner, .task, .subset, .weights = NULL, ...) {
  ctrl = RWeka::Weka_control(...)
  RWeka::FarthestFirst(getTaskData(.task, .subset), control = ctrl)
}
predictLearner.cluster.FarthestFirst = function(.learner, .model, .newdata, ...) {
  as.integer(predict(.model$learner.model, .newdata, ...)) + 1L
}
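
Provided RWeka is installed, the learner can then be applied to the built-in cluster task mtcars.task, with N controlling the number of clusters:

lrn = makeLearner("cluster.FarthestFirst", N = 3L)
mod = train(lrn, mtcars.task)
head(getPredictionResponse(predict(mod, task = mtcars.task)))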

Multilabel classification

As stated in the multilabel section, multilabel classification methods can be divided into problem transformation methods and algorithm adaptation methods.

At the moment, the only problem transformation method implemented in mlr is the binary relevance method. Integrating more of these methods requires good knowledge of the architecture of the mlr package.

The integration of an algorithm adaptation multilabel classification learner is easier and works much like integrating a normal classification learner. In contrast to the multiclass case, not all of the learner properties are relevant. In particular, whether one-, two-, or multi-class problems are supported is not applicable. Furthermore, the prediction function must output a matrix with one column per label and the names of the labels as column names. If .learner$predict.type is "response", the predictions must be logical. If .learner$predict.type is "prob" and this type of prediction is supported by the learner, the matrix must consist of predicted probabilities.

Below is the definition of the rFerns learner from the rFerns package, which does not support probability predictions.

makeRLearner.multilabel.rFerns = function() {
  makeRLearnerMultilabel(
    cl = "multilabel.rFerns",
    package = "rFerns",
    par.set = makeParamSet(
      makeIntegerLearnerParam(id = "depth", default = 5L),
      makeIntegerLearnerParam(id = "ferns", default = 1000L)
    ),
    properties = c("numerics", "factors", "ordered"),
    name = "Random ferns",
    short.name = "rFerns",
    note = ""
  )
}
trainLearner.multilabel.rFerns = function(.learner, .task, .subset, .weights = NULL, ...) {
  d = getTaskData(.task, .subset, target.extra = TRUE)
  rFerns::rFerns(x = d$data, y = as.matrix(d$target), ...)
}
predictLearner.multilabel.rFerns = function(.learner, .model, .newdata, ...) {
  as.matrix(predict(.model$learner.model, .newdata, ...))
}
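
As with the other task types, the finished learner can be used directly, for example on the built-in multilabel task yeast.task:

lrn = makeLearner("multilabel.rFerns")
mod = train(lrn, yeast.task)
pred = predict(mod, task = yeast.task)
performance(pred)  # default multilabel measure: multilabel.hamloss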