How to deal with missing values?
data
methods
Why are certain values missing? Rubin DB (1976). "Inference and missing data"
missing-completely-at-random (MCAR)
not-missing-at-random (NMAR)
missing-at-random (MAR)
missingness can be fully accounted for by observed variables with complete information (not missing variables)
cannot be verified statistically and can induce bias
EXAMPLE: males are less likely to fill in depression surveys, but not because of their depression level
occurs completely at random
analysis of MCAR data is unbiased but MCAR is generally rare
missingness is independent of both observable variables and unobservable parameters
aka nonignorable nonresponse
the value of the missing variable is related to the reason why it is missing
EXAMPLE: men fail to fill in a depression survey BECAUSE of their level of depression
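The three mechanisms can be illustrated with a small simulation. This is a minimal sketch under invented assumptions: the sample size, the score distribution, and every missingness probability below are hypothetical and chosen only to make the effect visible.

```python
import random
import statistics

random.seed(0)
N = 20000  # assumed sample size

# Sex is always observed; the depression score is the value that may go missing.
is_male = [random.random() < 0.5 for _ in range(N)]
score = [random.gauss(10.0, 2.0) for _ in range(N)]  # independent of sex by construction


def observed_scores(prob_missing):
    """Return the scores that survive when row i is dropped with probability prob_missing(i)."""
    return [s for i, s in enumerate(score) if random.random() >= prob_missing(i)]


# MCAR: the same drop probability for every row.
mcar = observed_scores(lambda i: 0.3)
# MAR: the drop probability depends only on an observed variable (sex).
mar = observed_scores(lambda i: 0.6 if is_male[i] else 0.1)
# NMAR: the drop probability depends on the score itself.
nmar = observed_scores(lambda i: 0.8 if score[i] > 11.0 else 0.1)

true_mean = statistics.fmean(score)
mcar_mean = statistics.fmean(mcar)
mar_mean = statistics.fmean(mar)
nmar_mean = statistics.fmean(nmar)

print(f"full data: {true_mean:.2f}")
print(f"MCAR:      {mcar_mean:.2f}")
print(f"MAR:       {mar_mean:.2f}")
print(f"NMAR:      {nmar_mean:.2f}")
```

In this sketch the MAR mean stays close to the full-data mean only because the score was generated independently of sex. If scores differed by sex, an unbiased analysis would have to condition on sex. The NMAR mean is pulled downward because high scores are dropped more often.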
evaluation criteria
Root mean square error (RMSE)
Unsupervised classification error (UCE)
assessing preservation of internal structures by measuring how clustering of the imputed data set differs from clustering of the complete data set --> misclassified samples / all samples
Supervised classification error (SCE)
comparison of predicting subgroups in complete/imputed data set
EXAMPLE: linear discriminant analysis (LDA) and then compare SCE = 1 - AUC
e.g. hierarchical clustering with Pearson correlation
Schmitt P, Mandel J, Guedj M (2015) A Comparison of Six Methods for
Missing Data Imputation.
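RMSE and UCE can be computed with a few lines of code. This is a minimal sketch, not the procedure of the cited paper: the function names are hypothetical, and matching cluster labels by trying every relabeling is an assumption that is only practical for a small number of clusters.

```python
import math
from itertools import permutations


def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed values (same order, same length)."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


def unsupervised_classification_error(complete_labels, imputed_labels):
    """Misclassified samples / all samples, minimised over relabelings of the clusters.

    Cluster ids are arbitrary, so every permutation of the label set is tried
    and the one with the fewest disagreements is kept.
    """
    if len(complete_labels) != len(imputed_labels) or not complete_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(complete_labels) | set(imputed_labels))
    best = len(complete_labels)
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        errors = sum(c != mapping[i] for c, i in zip(complete_labels, imputed_labels))
        best = min(best, errors)
    return best / len(complete_labels)


# Values that were masked, and what an imputation method put in their place.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))

# Cluster assignments on the complete data versus on the imputed data.
print(unsupervised_classification_error([0, 0, 1, 1], [1, 1, 0, 1]))
```

SCE is not covered here because it requires a fitted classifier and an AUC computation.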
MCAR: f(r | y_obs, y_mis, θ) = f(r | θ)
MAR: f(r | y_obs, y_mis, θ) = f(r | y_obs, θ)
MAR values should not be spatially concentrated
imputing missing values
Machine Learning based approaches
Maximum Likelihood approach (Ibrahim et al., 2005)
listwise-/pairwise deletion (case deletion)
classical statistical approaches
standard polynomial regression
stochastic regression including error term
hot/cold-deck imputation (Jerez, Molina, 2010)
k-NN (Jerez, Molina, 2010)
naive Bayes classifier (Lin, Haug, 2008)
explicit modeling of "missingness" instead of imputation (Lin, Haug, 2008)
modeling of underlying distribution
Gaussian Mixture Model (Sovilj, Eirola, 2015)
ANNs (Seffens et al. (2015) Machine Learning Data Imputation and Classification in a Multicohort Hypertension Clinical Study. Bioinformatics and Biology Insights) (Richman, Trafalis, 2009)
Extreme Learning Machine (Sovilj, Eirola, 2015) (Huang, Liu, Yu)
decision-tree based
missForest (Stekhoven, Bühlmann, 2012) (Luo, 2016) (Shah, Bartlett, 2012)
Multivariate Imputation by Chained Equations (MICE) (Luo, 2016) (Shah, Bartlett, 2012)
mean/median imputation (Rahman, M. M. and Davis, D. N. (2013) "Machine Learning-Based Missing Value Imputation Method for Clinical Datasets")
DT (J48) (Rahman, Davis, 2013)
Fuzzy Unordered Rule Induction Algorithm (FURIA) (Rahman, Davis, 2013)
SVM (Richman, Trafalis, 2009) (Rahman, Davis, 2013)
C4.5 (Lakshminarayan, 1996)
unsupervised Bayesian clustering (Lakshminarayan, 1996)
Multi-layer perceptron (Jerez, Molina, 2010)
Self-organizing maps (Jerez, Molina, 2010)
EM algorithm
Amelia II, Hmisc, MICE (Colubri A, Silver T, Fradet T, Retzepi K, Fry B, Sabeti P (2016))
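Two of the listed approaches, mean imputation and k-NN imputation, are simple enough to sketch with the standard library. This is an illustration under stated assumptions, not the configuration used in any of the cited papers: missing entries are encoded as None, the distance is a Euclidean distance averaged over the columns both rows have observed, and the function names and toy data are hypothetical.

```python
import math
import statistics


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = [
        statistics.fmean([r[j] for r in rows if r[j] is not None])
        for j in range(n_cols)
    ]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=1):
    """Fill each None with the column mean over the k nearest donor rows.

    A donor is any other row that has the target column observed.
    Raises statistics.StatisticsError if a column has no donor at all.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # nothing to compare on, so treat as maximally far
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            new_row[j] = statistics.fmean([v for _, v in donors[:k]])
        completed.append(new_row)
    return completed


# Toy data: two tight groups, with the second column missing in the last two rows.
rows = [
    [1.0, 2.0],
    [1.1, 2.2],
    [5.0, 10.0],
    [5.2, None],
    [0.9, None],
]

by_mean = mean_impute(rows)
by_knn = knn_impute(rows, k=1)
print(by_mean[3], by_mean[4])
print(by_knn[3], by_knn[4])
```

Mean imputation gives both incomplete rows the same value, whereas k-NN borrows from the nearest complete row and so keeps the two groups apart.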
test for underlying missingness mechanism
Colubri A, Silver T, Fradet T, Retzepi K, Fry B, Sabeti P (2016)
MCAR test
Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association. 1988;83(404):1198–202.
Jamshidian M, Jalal S. Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. Psychometrika. 2010;75(4):649–74. pmid:21720450
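The idea behind these tests can be sketched with a much cruder check: if data are MCAR, an observed covariate should look the same in rows where another variable is missing and in rows where it is present. The code below is NOT Little's test or the Jamshidian-Jalal test. It is a single Welch t statistic on simulated data, and the covariate (age), the missingness probabilities, and the function names are all hypothetical.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means of two samples."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))


random.seed(1)
age = [random.gauss(50.0, 10.0) for _ in range(2000)]  # fully observed covariate

# Missingness indicators for some other variable.
# MCAR: a coin flip. Alternative: older subjects drop out more often.
missing_mcar = [random.random() < 0.3 for _ in age]
missing_by_age = [random.random() < (0.5 if a > 55.0 else 0.1) for a in age]


def t_for(indicator):
    """Compare age between rows flagged missing and rows flagged present."""
    missing = [a for a, m in zip(age, indicator) if m]
    present = [a for a, m in zip(age, indicator) if not m]
    return welch_t(missing, present)


t_mcar = t_for(missing_mcar)
t_by_age = t_for(missing_by_age)
print(f"MCAR indicator:          t = {t_mcar:.2f}")
print(f"age-dependent indicator: t = {t_by_age:.2f}")
```

A large statistic is evidence against MCAR. A small one does not establish MCAR, and no check on observed data can separate MAR from NMAR.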
feature selection
first preprocessing using Mirador; MINE, and then subsets of variables as inputs: Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology. 1992;45(2):265–82. doi: 10.1111/j.2044-8317.1992.tb00992.x.
Liu, Y.; Gopalakrishnan, V. An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.
bias, coverage, width of the confidence interval, and estimated proportion of the variance attributable to the missing data (Doove, van Buuren, Recursive partitioning for missing data imputation, 2013)
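Bias, coverage, and interval width are usually estimated by repeated simulation. The sketch below shows the bookkeeping only and is not the cited study's design: it uses single mean imputation under MCAR rather than multiple imputation, a normal-approximation 95% interval, and invented settings. The proportion of variance attributable to the missing data is omitted because it requires multiple imputation.

```python
import math
import random
import statistics

random.seed(2)
TRUE_MEAN, SD, N, P_MISSING, REPLICATES = 10.0, 2.0, 100, 0.4, 2000  # assumed settings
Z95 = 1.96  # normal approximation for a 95% interval


def interval(values):
    """Return (estimate, lower, upper) for the mean of values."""
    est = statistics.fmean(values)
    half = Z95 * statistics.stdev(values) / math.sqrt(len(values))
    return est, est - half, est + half


def summarise(results):
    """Bias, coverage of the true mean, and average interval width over all replicates."""
    bias = statistics.fmean([est for est, _, _ in results]) - TRUE_MEAN
    coverage = statistics.fmean(
        [1.0 if lo <= TRUE_MEAN <= hi else 0.0 for _, lo, hi in results]
    )
    width = statistics.fmean([hi - lo for _, lo, hi in results])
    return {"bias": bias, "coverage": coverage, "width": width}


complete_case, mean_imputed = [], []
for _ in range(REPLICATES):
    data = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    observed = [x for x in data if random.random() >= P_MISSING]  # MCAR deletion
    fill = statistics.fmean(observed)
    imputed = observed + [fill] * (N - len(observed))  # imputed values treated as real data
    complete_case.append(interval(observed))
    mean_imputed.append(interval(imputed))

cc = summarise(complete_case)
imp = summarise(mean_imputed)
print("complete cases:", cc)
print("mean imputed:  ", imp)
```

Both estimators are essentially unbiased here, but treating mean-imputed values as real observations understates the variance, so the intervals come out narrower and cover the true mean less often than the nominal 95%.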