Integrating Another Filter Method
Many feature filter methods are already integrated in mlr; a complete list is given in the Appendix or can be obtained with listFilterMethods. You can add another filter, be it a brand new one or a method already implemented in another package, via the function makeFilter.
Filter objects
In mlr all filter methods are objects of class Filter and are registered in an environment called .FilterRegister (where listFilterMethods looks them up to compile the list of available methods).
To get to know their structure let's have a closer look at the "rank.correlation" filter, which interfaces function correls in package Rfast.
filters = as.list(mlr:::.FilterRegister)
filters$rank.correlation
#> Filter: 'rank.correlation'
#> Packages: 'Rfast'
#> Supported tasks: regr
#> Supported features: numerics
str(filters$rank.correlation)
#> List of 6
#> $ name : chr "rank.correlation"
#> $ desc : chr "Spearman's correlation between feature and target"
#> $ pkg : chr "Rfast"
#> $ supported.tasks : chr "regr"
#> $ supported.features: chr "numerics"
#> $ fun :function (task, nselect, ...)
#> - attr(*, "class")= chr "Filter"
filters$rank.correlation$fun
#> function (task, nselect, ...)
#> {
#>     d = getTaskData(task, target.extra = TRUE)
#>     y = Rfast::correls(d$target, d$data, type = "spearman")
#>     for (i in which(is.na(y[, "correlation"]))) {
#>         y[i, "correlation"] = cor(d$target, d$data[, i], use = "complete.obs",
#>             method = "spearman")
#>     }
#>     setNames(abs(y[, "correlation"]), getTaskFeatureNames(task))
#> }
#> <bytecode: 0x7fc2c1200e40>
#> <environment: namespace:mlr>
The core element is $fun, which calculates the feature importance.
For the "rank.correlation" filter it extracts the features and the target from the task, passes them on to the correls function, and returns the absolute values of the resulting correlations. Correlations that come back as NA are recomputed with cor on the complete observations.
Additionally, each Filter object has a $name, which should be short and is used, for example, to annotate graphics (cp. plotFilterValues), and a slightly more detailed description in slot $desc.
If the filter method is implemented by another package, its name is given in the $pkg member.
Moreover, the supported task types and feature types are listed.
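Because the register converts to a plain list of Filter objects, these slots can also be queried programmatically. As a minimal sketch (the variable name supports.regr is made up for illustration, and the result depends on the installed mlr version), the following collects the names of all registered filters that declare support for regression tasks:

```r
library(mlr)

# The register is an internal (unexported) environment, hence the `:::`.
filters = as.list(mlr:::.FilterRegister)

# TRUE for every filter whose $supported.tasks slot contains "regr".
supports.regr = vapply(filters, function(f) "regr" %in% f$supported.tasks, logical(1))

names(filters)[supports.regr]
```

The same pattern should work for the $supported.features slot, for example to find filters that accept factor features.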
Writing a new filter method
You can integrate your own filter method using makeFilter. This function generates a Filter object and also registers it in the .FilterRegister environment.
The arguments of makeFilter correspond to the slot names of the Filter object above.
Currently, feature filtering is only supported for supervised learning tasks and possible values for supported.tasks are "regr", "classif" and "surv".
supported.features can be "numerics", "factors" and "ordered".
fun must be a function with at least the following formal arguments:
task is an mlr learning Task.
nselect corresponds to the argument of generateFilterValuesData of the same name and specifies the number of features for which to calculate importance scores. Some filter methods have the option to stop after a certain number of top-ranked features have been found in order to save time and resources when the number of features is high. The majority of filter methods integrated in mlr don't support this, and thus nselect is ignored in most cases. An exception is the minimum redundancy maximum relevance filter from package mRMRe.
... is for additional arguments.
fun must return a named vector of feature importance values.
By convention the most important features receive the highest scores.
If you are making use of the nselect option, fun can either return a vector of nselect scores or a vector as long as the total number of features in the task, filled with NAs for all features whose scores weren't calculated.
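To illustrate the second return shape, here is a minimal sketch of a filter function that reports scores for only nselect features and pads the rest with NA. The filter name "toy.top.variance" and its scoring rule (plain feature variance) are made up for illustration. Note that this toy version still computes every score before discarding some, so it saves no time; it only demonstrates the expected format of the return value.

```r
library(mlr)

# Hypothetical filter, for illustration only: scores numeric features by
# their variance and reports just the `nselect` highest scores.
f = makeFilter(
  name = "toy.top.variance",
  desc = "Variance of the nselect highest-variance features, NA for the rest",
  pkg = "",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = "numerics",
  fun = function(task, nselect, ...) {
    d = getTaskData(task, target.extra = TRUE)$data
    # One score per feature, named by feature.
    scores = vapply(d, var, numeric(1))
    # Names of the nselect top-ranked features.
    keep = names(sort(scores, decreasing = TRUE))[seq_len(nselect)]
    # Pad all other features with NA instead of dropping them.
    scores[!(names(scores) %in% keep)] = NA
    scores
  }
)

# Call the function directly to inspect the shape of the result.
res = f$fun(iris.task, nselect = 2)
res
```

Returning the padded full-length vector keeps the names aligned with getTaskFeatureNames(task), which can make the result easier to inspect than a shorter vector of nselect scores.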
When writing fun many of the getter functions for Tasks come in handy, particularly getTaskData, getTaskFormula and getTaskFeatureNames.
It's worth having a closer look at getTaskData which provides many options for formatting the data and recoding the target variable.
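As a quick orientation, the sketch below calls the three getters on the iris.task example task that ships with mlr. The variable names d, feats and form are arbitrary; only the target.extra option of getTaskData is shown here.

```r
library(mlr)

# target.extra = TRUE splits the task data into a feature data.frame
# ($data) and a separate target vector ($target).
d = getTaskData(iris.task, target.extra = TRUE)
str(d$data)
str(d$target)

# Feature names, in the same order as the columns of d$data.
feats = getTaskFeatureNames(iris.task)
feats

# Model formula of the task (target on the left-hand side).
form = getTaskFormula(iris.task)
form
```

Filters that wrap a formula interface would typically use getTaskFormula together with getTaskData(task), while filters that need separate inputs, like the "rank.correlation" filter above, use target.extra = TRUE.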
As a short demonstration we write a totally meaningless filter that determines the importance of features according to alphabetical order, i.e., giving highest scores to features with names that come first (decreasing = TRUE) or last (decreasing = FALSE) in the alphabet.
makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
    # order() alone returns sorting indices; applying it twice yields
    # each feature's rank, which is what we want as its score
    imp = order(order(feats, decreasing = decreasing))
    names(imp) = feats
    imp
  }
)
#> Filter: 'nonsense.filter'
#> Packages: ''
#> Supported tasks: classif,regr,surv
#> Supported features: numerics,factors,ordered
The nonsense.filter is now registered in mlr and shown by listFilterMethods.
listFilterMethods()$id
#> [1] anova.test carscore
#> [3] cforest.importance chi.squared
#> [5] gain.ratio information.gain
#> [7] kruskal.test linear.correlation
#> [9] mrmr nonsense.filter
#> [11] oneR permutation.importance
#> [13] randomForest.importance randomForestSRC.rfsrc
#> [15] randomForestSRC.var.select rank.correlation
#> [17] relief symmetrical.uncertainty
#> [19] univariate.model.score variance
#> 23 Levels: anova.test carscore cforest.importance ... variance
You can use it like any other filter method already integrated in mlr (i.e., via the method argument of generateFilterValuesData or the fw.method argument of makeFilterWrapper; see also the page on feature selection).
d = generateFilterValuesData(iris.task, method = c("nonsense.filter", "anova.test"))
d
#> FilterValues:
#> Task: iris-example
#>           name    type nonsense.filter anova.test
#> 1 Sepal.Length numeric               2  119.26450
#> 2  Sepal.Width numeric               1   49.16004
#> 3 Petal.Length numeric               4 1180.16118
#> 4  Petal.Width numeric               3  960.00715
plotFilterValues(d)
iris.task.filtered = filterFeatures(iris.task, method = "nonsense.filter", abs = 2)
iris.task.filtered
#> Supervised task: iris-example
#> Type: classif
#> Target: Species
#> Observations: 150
#> Features:
#> numerics factors ordered
#> 2 0 0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 3
#> setosa versicolor virginica
#> 50 50 50
#> Positive class: NA
getTaskFeatureNames(iris.task.filtered)
#> [1] "Petal.Length" "Petal.Width"
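The fw.method route mentioned above works along the same lines. The following is a minimal sketch, assuming the rpart package is installed for the "classif.rpart" learner. To keep the snippet self-contained it uses the built-in "variance" filter; the name of any registered filter, including "nonsense.filter" once it has been created, could be passed instead.

```r
library(mlr)

# Wrap a learner so that filtering happens inside train():
# keep the 2 features with the highest filter scores.
lrn = makeFilterWrapper(learner = "classif.rpart", fw.method = "variance", fw.abs = 2)

mod = train(lrn, iris.task)

# Features that survived the filter step of the wrapped learner.
kept = getFilteredFeatures(mod)
kept
```

With the wrapper, the filter is re-applied on every training set, which matters when the learner is evaluated by resampling.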
You might also want to have a look at the source code of the filter methods already integrated in mlr for some more complex and meaningful examples.