Never mind, I figured it out [Previous post: Imputation using "ignore" is failing sanity check] #474

KenleyData · 2022-02-27T04:11:47Z

Hi,

I am using MICE to impute missing data for a large dataset that we split into training and test cohorts (for machine learning purposes). We want to train an imputation model using the training cohort only, and then apply that imputation model to both the training and test cohorts. I will call the training cohort the "Early Cohort" and the test cohort the "Late Cohort" for clarity.

I have been using the "ignore" parameter to ignore the cases in the Late Cohort (setting the Late Cohort cases to TRUE in the boolean vector), and the code seems to run fine. The imputed results look reasonable in their distribution. But while debugging and sanity checking my code, I decided to confirm that I was using the "ignore" parameter correctly by running a second mice imputation that uses ONLY the Early Cohort as input (it does not see the Late Cohort at all) and does not use the "ignore" parameter. The idea of this sanity check was that I would expect the imputed values for the Early Cohort to be the same with either approach (given the same random seed), since in both cases the imputation model is trained on the Early Cohort and applied to the Early Cohort. Instead, I am getting completely different results. Still reasonable, but different.

Ideally I would like to use predictive mean matching. Neither setting donors=1L (to try to remove randomness) nor switching to method="norm" helped. I feel like I must be misunderstanding something basic about how "ignore" functions in the training and application of the imputation model. I have pasted my code below. Basically, I expected the "early_cohort" data frame to be equal to the "test_filled" data frame.

Thanks very much for any advice!

MY CODE
library(mice)

#Join early and late cohorts into one data frame
#The early cohort has 27,606 cases and the late cohort has 5,158 cases
total_cases <- rbind(early_cohort_miss,late_cohort_miss)

#Create boolean vector that has TRUE for cases to be ignored in creating the imputation model (everything in the late cohort)
#27606 FALSE values (early cohort) followed by 5158 TRUE values (late cohort)
bool_vec <- c(rep(FALSE, nrow(early_cohort_miss)), rep(TRUE, nrow(late_cohort_miss)))

#Call MICE, with bool_vec telling the algorithm to use only the early cohort data for training the imputation algorithm
imp.ignore <- mice(total_cases, m = 1, maxit = 20, ignore = bool_vec, printFlag=FALSE, seed=1)

#Fill in missing values
total <- complete(imp.ignore)

#Separate back into early and late cohorts
early_cohort <- total[!bool_vec,]
late_cohort <- total[bool_vec,]

#SANITY CHECKING CODE
early_sanity <- total_cases[!bool_vec,]
test_imp <- mice(early_sanity, m = 1, maxit = 20, printFlag=FALSE, seed=1)

#Fill in missing values
test_filled <- complete(test_imp)
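
The expectation stated above, that "early_cohort" equals "test_filled", can be checked column by column rather than by eye. The sketch below is a minimal base R example; `count_mismatches` is a hypothetical helper (not a mice function), and the toy data frames only stand in for the real completed data.

```r
# Hypothetical helper (not part of mice): counts, per column, how many cells
# differ between two completed data frames with the same shape and names.
count_mismatches <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)), identical(names(a), names(b)))
  vapply(names(a), function(col) {
    x <- a[[col]]
    y <- b[[col]]
    # Compare factors by label so differing level sets do not raise an error
    if (is.factor(x)) x <- as.character(x)
    if (is.factor(y)) y <- as.character(y)
    # NA vs. NA counts as a match; NA vs. a value counts as a mismatch
    same <- (is.na(x) & is.na(y)) | (!is.na(x) & !is.na(y) & x == y)
    sum(!same)
  }, integer(1))
}

# Toy stand-ins for early_cohort and test_filled
toy_a <- data.frame(age = c(50, 61, 47), bmi = c(22.1, 30.4, 27.0))
toy_b <- data.frame(age = c(50, 61, 47), bmi = c(22.1, 29.8, 27.0))
mismatches <- count_mismatches(toy_a, toy_b)
print(mismatches)
```

With the objects from the code above, the call would be count_mismatches(early_cohort, test_filled). Columns that had no missing values should report zero mismatches, which helps confirm that the rows line up before looking at the imputed cells.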

KenleyData · 2022-02-28T18:19:59Z

I can't figure out how to delete this, but wanted to note that I figured it out (basically, I had overlooked two ways in which the algorithm uses random numbers). Sorry to clutter the board!
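
The post does not say which two uses of random numbers were involved, so the following is an assumption rather than the author's finding. As the mice documentation describes it, rows flagged with "ignore" are left out of model fitting but still receive imputations. If filling those extra rows consumes random draws, the two runs stop sharing a random stream even with the same seed. The base R sketch below shows only that general effect; `fill_cells` is a hypothetical stand-in that uses one uniform draw per missing cell, which is far simpler than what mice actually does.

```r
# Hypothetical stand-in for a stochastic imputation step: one random draw
# per missing cell (base R only, no mice call).
fill_cells <- function(n_missing) {
  runif(n_missing)
}

# Run A: only 3 "early" rows have missing cells in x, then in y.
set.seed(1)
a_x <- fill_cells(3)
a_y <- fill_cells(3)

# Run B: 3 "early" plus 2 "late" rows have missing cells in x, then in y.
set.seed(1)
b_x <- fill_cells(5)
b_y <- fill_cells(5)

# The first variable agrees on the early rows ...
print(identical(a_x, b_x[1:3]))
# ... but the second does not, because the late rows used up two draws.
print(identical(a_y, b_y[1:3]))
# Run A's values for y are the draws that Run B spent on late rows of x,
# followed by Run B's first draw for y.
print(identical(a_y, c(b_x[4:5], b_y[1])))
```

Under this assumption, identical seeds do not imply identical imputations for the Early Cohort once the Late Cohort rows are present in the data, even though they do not influence the fitted model.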
