- has missing values
- all numerical values
takes a `pandas.DataFrame` as data input; internally a NumPy array is used to do the imputations.
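for illustration, a toy input matching those assumptions (the column names are made up):

```python
import numpy as np
import pandas as pd

# toy input: all-numerical DataFrame containing missing values
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [2.0, 5.0, np.nan, 1.0],
    "c": [0.5, 1.5, 2.5, np.nan],
})
X = df.to_numpy()  # the imputation itself happens on the NumPy array
```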
- 1.1. drop all rows that contain missing values
- 1.2. drop all columns that contain missing values
- 1.3. use `mean` to impute all missing values
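a minimal sketch of 1.1–1.3 with plain pandas, reusing the toy `df` above:

```python
# 1.1 drop all rows that contain a missing value
dropped_rows = df.dropna(axis=0)

# 1.2 drop all columns that contain a missing value
dropped_cols = df.dropna(axis=1)

# 1.3 impute every missing cell with the mean of its column
mean_imputed = df.fillna(df.mean())
```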
from left to right, search for the best combination of imputation methods. supported imputation methods:
- mean
- min
- max
- zero
a machine learning model & metric are used to judge which combination gives the best performance.
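a sketch of that search, under two assumptions the notes don't pin down: "left to right" is read as a greedy per-column pass (not an exhaustive search), and `evaluate` is a hypothetical user-supplied scorer (e.g. cross-validated accuracy of the downstream model):

```python
# the four supported per-column fill strategies
METHODS = {
    "mean": lambda s: s.fillna(s.mean()),
    "min":  lambda s: s.fillna(s.min()),
    "max":  lambda s: s.fillna(s.max()),
    "zero": lambda s: s.fillna(0),
}

def search_best_imputation(df, evaluate):
    """Greedy left-to-right search over columns; `evaluate` maps a
    fully-imputed DataFrame to a score (higher is better)."""
    best = df.copy()
    chosen = {}
    for col in df.columns:  # "from left to right"
        if not best[col].isna().any():
            continue
        scores = {}
        for name, fill in METHODS.items():
            candidate = best.copy()
            candidate[col] = fill(candidate[col])
            # mean-fill the not-yet-decided columns so the model can train
            scores[name] = evaluate(candidate.fillna(candidate.mean()))
        chosen[col] = max(scores, key=scores.get)
        best[col] = METHODS[chosen[col]](best[col])
    return best, chosen
```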
say the input data has n dimensions, m of which have missing values.
1. init all missing values with `mean` of the column
2. loop until the imputed values converge:
   for each_col in m:
       remove the imputed values in each_col, then use all other
       features to impute it (a regression model is trained and fit
       here)
in reality, the loop is capped at 30 iterations, an empirical number.
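a runnable sketch of the loop above; the notes don't name the regression model, so `LinearRegression` is an assumption, and the convergence check is replaced by the fixed 30 iterations just mentioned:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def iterative_regression_impute(df, n_iter=30):
    """Mean-init, then repeatedly re-fit a regression per missing
    column using all other features as predictors."""
    na_mask = df.isna()
    missing_cols = [c for c in df.columns if na_mask[c].any()]
    X = df.fillna(df.mean())                 # step 1: mean init
    for _ in range(n_iter):                  # step 2: 30 is empirical
        for col in missing_cols:
            others = X.drop(columns=[col])
            observed = ~na_mask[col]
            model = LinearRegression()
            model.fit(others[observed], X.loc[observed, col])
            # re-impute only the originally-missing entries
            X.loc[na_mask[col], col] = model.predict(others[na_mask[col]])
    return X
```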
now using fancyimpute-MICE, a relatively complex method... their method works like this:
- outer layer: get multiple imputation results (the number is a hyperparameter), and take the average as the final imputed values (there is also a hyperparameter that determines how many results, counting back from the last one, are averaged).
- for each imputation, they use:
  - init all with `mean` or `median`
  - to impute each column, only choose `n_nearest_columns` related to this column (e.g. using correlation + randomness to choose)
  - then impute the missing values; there are two options: by default "col" is used, which samples from the Posterior Predictive Distribution; it can also be "pmm" (predictive mean matching)
  - then run the loops...
- difference from the iterative regression above: it uses `n_nearest_columns`, which is useful when the number of columns is huge, and it samples from the Posterior Predictive Distribution.
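a usage sketch, assuming an older fancyimpute release where the `MICE` solver and its `.complete()` method existed (newer versions replaced it with `IterativeImputer`); the parameter names are a best-effort recollection of that API and worth verifying against the installed version:

```python
from fancyimpute import MICE  # present in older fancyimpute releases

mice = MICE(
    n_imputations=100,        # outer layer: results averaged for the final values
    n_burn_in=10,             # warm-up rounds discarded before averaging
    impute_type="col",        # "col" = Posterior Predictive Distribution,
                              # "pmm" = predictive mean matching
    n_nearest_columns=5,      # only use the 5 most related columns per column
    init_fill_method="mean",  # init all with mean (or "median")
)
X_filled = mice.complete(X_incomplete)  # X_incomplete: ndarray with NaNs
```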
now using fancyimpute-KNN, which also triggers knnimpute under the hood. according to their README:
> Nearest neighbor imputations which weights samples using the mean squared difference on features for which two rows both have observed data.
because it calculates the weights using mean squared error, this method requires each feature of the input data to be scaled (mean 0 and variance 1).
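a usage sketch with that scaling applied manually; as with MICE, the `KNN(k=...).complete(...)` call assumes an older fancyimpute API, and k=3 is an arbitrary choice:

```python
import numpy as np
from fancyimpute import KNN

# standardize each feature to mean 0 / variance 1 first, since the
# sample weights come from mean squared differences across features
col_mean = np.nanmean(X_incomplete, axis=0)
col_std = np.nanstd(X_incomplete, axis=0)
X_scaled = (X_incomplete - col_mean) / col_std

# k nearest rows, weighted by mean squared difference on the
# commonly observed features
X_filled_scaled = KNN(k=3).complete(X_scaled)

# undo the scaling to get values back on the original scale
X_filled = X_filled_scaled * col_std + col_mean
```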