Creating an Imputation Method
The function makeImputeMethod lets you create your own imputation method. To do so, you need to specify a learn function that extracts the necessary information and an impute function that does the actual imputation. Both the learn and the impute function take at least the following arguments:
data is a data frame with missing values in some features.
col indicates the feature to be imputed.
target indicates the target variable(s) in a supervised learning task.
Let's have a look at function imputeMean.
imputeMean
#> function ()
#> {
#>     makeImputeMethod(learn = function(data, target, col) mean(data[[col]],
#>         na.rm = TRUE), impute = simpleImpute)
#> }
#> <bytecode: 0x7fa7dbbea4e8>
#> <environment: namespace:mlr>
mlr:::simpleImpute
#> function (data, target, col, const)
#> {
#>     if (is.na(const))
#>         stopf("Error imputing column '%s'. Maybe all input data was missing?",
#>             col)
#>     x = data[[col]]
#>     if (is.factor(x) && const %nin% levels(x)) {
#>         levels(x) = c(levels(x), as.character(const))
#>     }
#>     replace(x, is.na(x), const)
#> }
#> <bytecode: 0x7fa7dc5fdff8>
#> <environment: namespace:mlr>
The learn function calculates the mean of the non-missing observations in column col. The mean is passed via argument const to the impute function, which replaces all NAs in feature col.
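To make the division of labour concrete, the following sketch reproduces the two steps by hand in base R, without mlr. The names learnMean and imputeConst are hypothetical stand-ins for the learn function of imputeMean and for a numeric-only version of mlr:::simpleImpute; when you use impute, both steps are called for you.

```r
# Hypothetical stand-in for the learn step: extract the column mean,
# ignoring missing values.
learnMean = function(data, target, col) {
  mean(data[[col]], na.rm = TRUE)
}

# Hypothetical stand-in for the impute step (numeric columns only):
# replace every NA in the column by the learned constant.
imputeConst = function(data, target, col, const) {
  x = data[[col]]
  replace(x, is.na(x), const)
}

# Toy data: feature "a" has one missing value, "y" plays the role of the target.
d = data.frame(a = c(1, NA, 3), y = c(0, 1, 0))

const = learnMean(d, target = "y", col = "a")
imputed = imputeConst(d, target = "y", col = "a", const = const)

print(const)    # mean of the observed values 1 and 3
print(imputed)  # column "a" with the NA replaced by that mean
```

Keeping the two steps separate means that the value extracted by the learn step can be stored and reused without recomputing it.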
Now let's write a new imputation method. For longitudinal data a frequently used technique is last observation carried forward (LOCF), where each missing value is replaced by the most recent observed value.
In the R code below the learn function determines the last observed value before each run of NAs (values) as well as the length of that run, i.e., the number of consecutive NAs (times). The impute function generates a vector in which the entries of values are replicated according to times and uses it to replace the NAs in feature col.
imputeLOCF = function() {
  makeImputeMethod(
    learn = function(data, target, col) {
      x = data[[col]]
      ind = is.na(x)
      dind = diff(ind)
      first = which(dind == 1)   # position of the last observed value previous to NA
      last = which(dind == -1)   # position of the last of potentially several consecutive NA's
      values = x[first]          # observed value previous to NA
      times = last - first       # number of consecutive NA's
      return(list(values = values, times = times))
    },
    impute = function(data, target, col, values, times) {
      x = data[[col]]
      replace(x, is.na(x), rep(values, times))
    }
  )
}
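The index arithmetic in the learn function is easiest to follow on a small vector. The sketch below runs the same steps in base R on a made-up vector x (not part of the tutorial data). Note that first and last only pair up when the vector starts and ends with an observed value; a leading or trailing NA is not handled by this approach.

```r
# Toy vector with two runs of missing values (lengths 2 and 1).
x = c(5, NA, NA, 7, NA, 9)

ind = is.na(x)     # FALSE TRUE TRUE FALSE TRUE FALSE
dind = diff(ind)   # +1 where a run of NAs starts, -1 where it ends

first = which(dind == 1)    # positions of the last observed value before each run
last = which(dind == -1)    # positions of the last NA in each run
values = x[first]           # values to carry forward
times = last - first        # length of each run of NAs

# Repeat each carried-forward value once per NA in its run and fill the gaps.
filled = replace(x, is.na(x), rep(values, times))

print(first)
print(last)
print(values)
print(times)
print(filled)
```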
In the following, the missing values in features Ozone and Solar.R in the airquality data set are imputed by LOCF.
data(airquality)
imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
  dummy.cols = c("Ozone", "Solar.R"))
head(imp$data, 10)
#>    Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
#> 1     41     190  7.4   67     5   1       FALSE         FALSE
#> 2     36     118  8.0   72     5   2       FALSE         FALSE
#> 3     12     149 12.6   74     5   3       FALSE         FALSE
#> 4     18     313 11.5   62     5   4       FALSE         FALSE
#> 5     18     313 14.3   56     5   5        TRUE          TRUE
#> 6     28     313 14.9   66     5   6       FALSE          TRUE
#> 7     23     299  8.6   65     5   7       FALSE         FALSE
#> 8     19      99 13.8   59     5   8       FALSE         FALSE
#> 9      8      19 20.1   61     5   9       FALSE         FALSE
#> 10     8     194  8.6   69     5  10        TRUE         FALSE
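The dummy columns can be cross-checked against the raw data. The following sketch uses only base R and the airquality data set that ships with R. It lists which of the first ten rows were missing before imputation and confirms that both columns start and end with an observed value, which the diff-based LOCF implementation relies on.

```r
data(airquality)

# Rows among the first ten that are missing in the raw data;
# these should match the TRUE entries in the dummy columns.
naOzone = which(is.na(airquality$Ozone[1:10]))
naSolar = which(is.na(airquality$Solar.R[1:10]))
print(naOzone)
print(naSolar)

# The LOCF sketch needs an observed value at both ends of each column.
n = nrow(airquality)
endsObserved = !anyNA(airquality[c(1, n), c("Ozone", "Solar.R")])
print(endsObserved)
```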