Is it possible to impute mixed numeric/categorical data within train's preProc argument? I want to impute within train's cross validation, thereby accounting for how uncertainty in imputations affects estimation of generalization error.
The ?preProcess help page suggests it is not possible to impute categorical variables:
x : a matrix or data frame. Non-numeric predictors are allowed but will be ignored.
However, the bagImpute method can handle mixed data, in theory. The following code runs, but I am not sure if it is actually imputing the missing factor or simply removing patients with missing factor values:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
data(iris)
nrow(iris)
#> [1] 150
iris.miss <- iris
iris.miss[1, 'Species'] <- NA
iris.miss[2, 'Petal.Length'] <- NA
set.seed(1)
fit <- train(
  Sepal.Length ~ .,
  data = iris.miss,
  method = 'lm',
  preProc = 'bagImpute',
  na.action = na.pass
)
fit
#> Linear Regression 
#> 
#> 150 samples
#>   4 predictor
#> 
#> Pre-processing: bagged tree imputation (5) 
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 
#> Resampling results:
#> 
#>   RMSE       Rsquared   MAE      
#>   0.3176759  0.8587222  0.2604171
#> 
#> Tuning parameter 'intercept' was held constant at a value of TRUE
Notice that the printed fit reports all 150 patients as included, which suggests the missing factor was imputed. However, I suspect that patient is simply being dropped from the model rather than imputed. Is that the case?
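One way to probe this outside of train is to rebuild the design matrix by hand and look at what happens to the row with the missing factor. The sketch below rests on an assumption that I have not confirmed against the caret source: that the formula interface expands Species into dummy columns via model.frame/model.matrix before pre-processing is applied. The names mf, x, x.df, pp and x.imp are only illustrative. The caret step is guarded so the base-R part still runs if caret or ipred is not installed.

```r
data(iris)
iris.miss <- iris
iris.miss[1, "Species"] <- NA
iris.miss[2, "Petal.Length"] <- NA

# Build the design matrix while keeping rows that contain NA
# (assumption: this mirrors what the formula method of train does internally).
mf <- model.frame(Sepal.Length ~ ., data = iris.miss, na.action = na.pass)
x <- model.matrix(attr(mf, "terms"), mf)
x <- x[, colnames(x) != "(Intercept)", drop = FALSE]

nrow(x)      # number of rows kept in the design matrix
colnames(x)  # numeric predictors plus the Species dummy columns
x[1:2, ]     # inspect the two rows that had missing values

# Optional: run bagged-tree imputation on the numeric design matrix.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("ipred", quietly = TRUE)) {
  set.seed(1)
  x.df <- as.data.frame(x)
  pp <- caret::preProcess(x.df, method = "bagImpute")
  x.imp <- predict(pp, x.df)
  print(x.imp[1:2, ])   # imputed values for the two affected rows
  print(anyNA(x.imp))   # TRUE would mean some values were left missing
}
```

If the first row is still present with NA in the Species dummy columns before imputation, and filled in afterwards, that would point to the factor being imputed as numeric dummy variables rather than the patient being dropped. The imputed dummy values need not be exactly 0 or 1, since they come from regression trees fitted to numeric columns.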
Created on 2023-07-23 by the reprex package (v2.0.1)