Learning Tasks

Learning tasks encapsulate the data set and further relevant information about a machine learning problem, for example the name of the target variable.

Task types and creation

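Tasks are organized in a hierarchy with the generic Task at the top. The following tasks can be instantiated, and all inherit from the virtual superclass Task: RegrTask for regression, ClassifTask for classification, SurvTask for survival analysis, ClusterTask for cluster analysis, and CostSensTask for cost-sensitive classification with example-specific costs.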

To create a task, call make<TaskType>, e.g., makeClassifTask. All tasks require a data.frame (argument data) and accept an identifier (argument id). If no ID is provided, it is generated automatically from the variable name of the data. The ID is later used to name results, for example of benchmark experiments, and to label generated plots. Depending on the nature of the learning problem, additional arguments may be required; these are discussed in the following subsections.
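
The automatic ID can be checked with a small sketch. It assumes that mlr is installed and uses the built-in mtcars data with mpg as an arbitrary target; the names auto.task and named.task are illustrative only.

```r
library(mlr)

# No id is passed, so the ID is derived from the name of the data variable.
auto.task = makeRegrTask(data = mtcars, target = "mpg")
auto.id = getTaskDescription(auto.task)$id

# An explicit id replaces the automatically generated one.
named.task = makeRegrTask(id = "cars", data = mtcars, target = "mpg")
named.id = getTaskDescription(named.task)$id

print(c(auto.id, named.id))
```

Passing an explicit id is mostly useful when several tasks are built from data frames with uninformative variable names such as df.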

Regression

For supervised learning such as regression (as well as classification and survival analysis), we have to specify the name of the target variable in addition to data.

data(BostonHousing, package = "mlbench")
regr.task = makeRegrTask(id = "bh", data = BostonHousing, target = "medv")
regr.task
#> Supervised task: bh
#> Type: regr
#> Target: medv
#> Observations: 506
#> Features:
#> numerics  factors  ordered 
#>       12        1        0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE

As you can see, the Task records the type of the learning problem and basic information about the data set, e.g., the types of the features (numeric vectors, factors or ordered factors), the number of observations, or whether missing values are present.

Creating tasks for classification and survival analysis follows the same scheme; only the data type of the target variables included in data differs. Some specifics of each of these learning problems are described below.

Classification

For classification the target variable has to be a factor.

In binary classification it is customary to refer to the two classes as the positive and the negative class. This matters, for example, for performance measures such as the true positive rate. By default the first factor level of the target variable is selected as the positive class.

In the following example, we define a classification task for the BreastCancer data set and exclude the variable Id from all further model fitting and evaluation.

data(BreastCancer, package = "mlbench")
df = BreastCancer
df$Id = NULL
classif.task = makeClassifTask(id = "BreastCancer", data = df, target = "Class")
classif.task
#> Supervised task: BreastCancer
#> Type: classif
#> Target: Class
#> Observations: 699
#> Features:
#> numerics  factors  ordered 
#>        0        4        5 
#> Missings: TRUE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 2
#>    benign malignant 
#>       458       241 
#> Positive class: benign

By default the positive class is benign. To select malignant as the positive class instead, pass the argument positive:

classif.task = makeClassifTask(id = "BreastCancer", data = df, target = "Class", positive = "malignant")
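
The effect of the argument positive can also be seen on a toy data set. The following is a minimal sketch: the data frame toy and its values are made up, and it assumes that the task description of a classification task stores the positive class in the element positive.

```r
library(mlr)

# Hypothetical two-class data; the factor levels are sorted as "no", "yes".
toy = data.frame(
  x = c(1.2, 0.7, 3.1, 2.8, 0.4, 3.5),
  y = factor(c("no", "no", "yes", "yes", "no", "yes"))
)

# Default: the first factor level becomes the positive class.
task.default = makeClassifTask(data = toy, target = "y")
pos.default = getTaskDescription(task.default)$positive

# Explicit choice of the positive class.
task.yes = makeClassifTask(data = toy, target = "y", positive = "yes")
pos.yes = getTaskDescription(task.yes)$positive

print(c(pos.default, pos.yes))
```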

Cluster analysis

As cluster analysis is unsupervised, the only mandatory argument to construct a cluster analysis task is the data. Below we create a learning task from the data set mtcars.

data(mtcars, package = "datasets")
cluster.task = makeClusterTask(data = mtcars)
cluster.task
#> Unsupervised task: mtcars
#> Type: cluster
#> Observations: 32
#> Features:
#> numerics  factors  ordered 
#>       11        0        0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE

Survival analysis

Survival tasks use two target columns. For left- and right-censored problems, these consist of the survival time and a binary event indicator. For interval-censored data, the two target columns must be specified in the "interval2" format (see Surv).

data(lung, package = "survival")
lung$status = (lung$status == 2) # convert to logical
surv.task = makeSurvTask(data = lung, target = c("time", "status"))
surv.task
#> Supervised task: lung
#> Type: surv
#> Target: time,status
#> Observations: 228
#> Features:
#> numerics  factors  ordered 
#>        8        0        0 
#> Missings: TRUE
#> Has weights: FALSE
#> Has blocking: FALSE

The type of censoring can be specified via the argument censoring, which defaults to "rcens" for right-censored data.

Cost-sensitive classification

The standard objective in classification is to obtain a high prediction accuracy, i.e., to minimize the number of errors. All types of misclassification errors are thereby deemed equally severe. However, in many applications different kinds of errors incur different costs.

For class-dependent costs, which depend only on the actual and predicted class labels, it is sufficient to create an ordinary ClassifTask.

In order to handle example-specific costs it is necessary to generate a CostSensTask. In this scenario, each example (x, y) is associated with an individual cost vector of length K where K denotes the number of classes. The k-th component indicates the cost of assigning x to class k. Naturally, it is assumed that the cost of the intended class label y is minimal.

As the cost vector contains all relevant information about the intended class label y, only the feature values x and a cost matrix, which contains the cost vectors for all examples in the data set, are required to create the CostSensTask.
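
To make the relation between cost vectors and intended labels concrete, here is a small base R sketch. The cost values and the class names a, b and c are made up for illustration; each row of the matrix is the cost vector of one example, and the intended label is recovered as the class with minimal cost.

```r
# Hypothetical cost matrix for 4 examples and K = 3 classes.
# Row i is the cost vector of example i; column k is the cost of predicting class k.
cost = matrix(c(
  0, 5, 2,
  3, 0, 7,
  4, 1, 0,
  0, 9, 6
), nrow = 4, byrow = TRUE, dimnames = list(NULL, c("a", "b", "c")))

# The intended class label of each example is the class with minimal cost.
intended = colnames(cost)[apply(cost, 1, which.min)]
print(intended)
```

This is why a CostSensTask does not need a separate target column: the labels are implied by the cost matrix.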

In the following example we use the iris data and generate an artificial cost matrix (following Beygelzimer et al., 2005):

df = iris
cost = matrix(runif(150 * 3, 0, 2000), 150) * (1 - diag(3))[df$Species,]
df$Species = NULL

costsens.task = makeCostSensTask(data = df, cost = cost)
costsens.task
#> Supervised task: df
#> Type: costsens
#> Observations: 150
#> Features:
#> numerics  factors  ordered 
#>        4        0        0 
#> Missings: FALSE
#> Has blocking: FALSE
#> Classes: 3
#> y1, y2, y3

For more details see the section about cost-sensitive classification.

Further settings

The Task help page also lists several other arguments to describe further details of the learning problem.

For example, we could include a blocking factor in the task. It indicates that some observations "belong together" and should not be separated when the data are split into training and test sets during a resampling iteration.

Another possibility is to assign weights to observations according to their importance.
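
Both settings can be sketched as follows, assuming the arguments are called blocking and weights as on the Task help page. The grouping and the weights are invented for illustration and have no meaning for the mtcars data.

```r
library(mlr)

# Hypothetical grouping: every four consecutive rows of mtcars form one block.
blocks = factor(rep(1:8, each = 4))

# Hypothetical importance weights: rows alternate between weight 1 and weight 2.
w = rep(c(1, 2), times = 16)

task = makeRegrTask(data = mtcars, target = "mpg", blocking = blocks, weights = w)

# The task description records that blocking and weights are present.
desc = getTaskDescription(task)
print(c(desc$has.blocking, desc$has.weights))
```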

Accessing a learning task

mlr provides many accessor functions for the elements stored in a Task. For example, the task description can be retrieved with getTaskDescription:

getTaskDescription(regr.task)
#> $id
#> [1] "bh"
#> 
#> $type
#> [1] "regr"
#> 
#> $target
#> [1] "medv"
#> 
#> $size
#> [1] 506
#> 
#> $n.feat
#> numerics  factors  ordered 
#>       12        1        0 
#> 
#> $has.missings
#> [1] FALSE
#> 
#> $has.weights
#> [1] FALSE
#> 
#> $has.blocking
#> [1] FALSE
#> 
#> attr(,"class")
#> [1] "TaskDescRegr" "TaskDesc"

The most important operations are listed in the documentation of Task. Here are some more examples.

## Accessing the data set in classif.task
str(getTaskData(classif.task))
#> 'data.frame':    699 obs. of  10 variables:
#>  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
#>  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
#>  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
#>  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
#>  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
#>  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
#>  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
#>  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
#>  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
#>  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

## Get the number of observations in classif.task
getTaskSize(classif.task)
#> [1] 699

## Get the number of input variables in cluster.task
getTaskNFeats(cluster.task)
#> [1] 11

## Get the names of the input variables in cluster.task
getTaskFeatureNames(cluster.task)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

## Get the names of the target columns
getTaskTargetNames(surv.task)
#> [1] "time"   "status"

## Get the values of the target variable in regr.task
head(getTaskTargets(regr.task))
#> [1] 24.0 21.6 34.7 33.4 36.2 28.7

## Get the cost matrix in costsens.task
head(getTaskCosts(costsens.task))
#>      y1        y2         y3
#> [1,]  0 1589.5664  674.44434
#> [2,]  0 1173.4364  828.40682
#> [3,]  0  942.7611 1095.33713
#> [4,]  0 1049.5562  477.82496
#> [5,]  0 1121.8899   90.85237
#> [6,]  0 1819.9830  841.06686

Note that getTaskData provides many options to convert the data set into a convenient format. This comes in especially handy when you integrate a learner from another R package into mlr. The functions getTaskFormula and getTaskFormulaAsString are also useful for this purpose.
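
As a sketch of how these accessors can feed a model from outside mlr, the following splits a task into features and target and then fits lm from the stats package through the task's formula. It assumes that getTaskData accepts the argument target.extra and returns a list with the elements data and target; the mtcars task is only an illustration.

```r
library(mlr)

task = makeRegrTask(data = mtcars, target = "mpg")

# Separate features and target, as many modelling functions expect them.
parts = getTaskData(task, target.extra = TRUE)
str(parts$target)

# Formula interface: hand the task's formula and data to lm.
fit = lm(getTaskFormula(task), data = getTaskData(task))
print(length(coef(fit)))
```

The formula route suits functions with a formula/data interface, while the target.extra route suits functions that take a feature matrix and a response vector.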

Modifying a learning task

mlr provides several functions to alter an existing Task, which is often more convenient than creating a new Task from scratch. Here are some examples.

## Select observations (features can be selected via the argument features)
cluster.task = subsetTask(cluster.task, subset = 4:17)

## It may happen, especially after selecting observations, that features are constant.
## These should be removed.
removeConstantFeatures(cluster.task)
#> Removing 1 columns: am
#> Unsupervised task: mtcars
#> Type: cluster
#> Observations: 14
#> Features:
#> numerics  factors  ordered 
#>       10        0        0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE

## Remove selected features
dropFeatures(surv.task, c("meal.cal", "wt.loss"))
#> Supervised task: lung
#> Type: surv
#> Target: time,status
#> Observations: 228
#> Features:
#> numerics  factors  ordered 
#>        6        0        0 
#> Missings: TRUE
#> Has weights: FALSE
#> Has blocking: FALSE

## Rescale numerical features to the range [0, 1]
task = normalizeFeatures(cluster.task, method = "range")
summary(getTaskData(task))
#>       mpg              cyl              disp              hp        
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.3161   1st Qu.:0.5000   1st Qu.:0.1242   1st Qu.:0.2801  
#>  Median :0.5107   Median :1.0000   Median :0.4076   Median :0.6311  
#>  Mean   :0.4872   Mean   :0.7143   Mean   :0.4430   Mean   :0.5308  
#>  3rd Qu.:0.6196   3rd Qu.:1.0000   3rd Qu.:0.6618   3rd Qu.:0.7473  
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
#>       drat              wt              qsec              vs        
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.2672   1st Qu.:0.1275   1st Qu.:0.2302   1st Qu.:0.0000  
#>  Median :0.3060   Median :0.1605   Median :0.3045   Median :0.0000  
#>  Mean   :0.4544   Mean   :0.3268   Mean   :0.3752   Mean   :0.4286  
#>  3rd Qu.:0.7026   3rd Qu.:0.3727   3rd Qu.:0.4908   3rd Qu.:1.0000  
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
#>        am           gear             carb       
#>  Min.   :0.5   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.5   1st Qu.:0.0000   1st Qu.:0.3333  
#>  Median :0.5   Median :0.0000   Median :0.6667  
#>  Mean   :0.5   Mean   :0.2857   Mean   :0.6429  
#>  3rd Qu.:0.5   3rd Qu.:0.7500   3rd Qu.:1.0000  
#>  Max.   :0.5   Max.   :1.0000   Max.   :1.0000

Some of these functions are explained in more detail in the data preprocessing section.
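
subsetTask can restrict observations and features in one call. The sketch below assumes the argument features accepts a character vector of feature names; the task and the chosen columns are illustrative only.

```r
library(mlr)

task = makeRegrTask(data = mtcars, target = "mpg")

# Keep the first ten observations and only the features cyl and hp.
# The target column mpg is retained automatically.
small = subsetTask(task, subset = 1:10, features = c("cyl", "hp"))

print(getTaskSize(small))
print(getTaskFeatureNames(small))
```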

Example tasks

For your convenience, mlr provides pre-defined tasks for each type of learning problem. These are used throughout this tutorial to keep the code short and readable. A list of all example tasks can be found in the mlr documentation.
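
A pre-defined task can be used directly once mlr is loaded. The sketch below assumes that iris.task is among the example tasks shipped with the installed mlr version.

```r
library(mlr)

# iris.task is assumed to be one of the pre-defined example tasks.
print(iris.task)

n = getTaskSize(iris.task)
print(n)
```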