How to deal with missing values?

data

methods

Why are certain values missing? Rubin DB (1976). "Inference and missing data"

missing-completely-at-random (MCAR)

not-missing-at-random (NMAR)

missing-at-random (MAR)

missingness can be fully accounted for by observed variables with complete information (i.e. variables that have no missing values)

cannot be verified statistically and can induce bias

EXAMPLE: males are less likely to fill in depression surveys, but not because of their depression level

missingness occurs completely at random

analysis of MCAR data is unbiased but MCAR is generally rare

missingness is independent from both observable variables and unobservable parameters

aka nonignorable nonresponse

the value of the missing variable is related to the reason why it is missing

EXAMPLE: men fail to fill in the depression survey BECAUSE of their level of depression
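The three mechanisms can be made concrete with a small simulation built on the depression-survey example. This is a minimal sketch, not taken from any of the cited papers: the field names `male` and `score`, the probabilities, and the helper names are all hypothetical, and sex and score are generated independently on purpose.

```python
import math
import random


def missing_probability(row, mechanism):
    """Probability that row['score'] is hidden (all numbers are illustrative)."""
    if mechanism == "MCAR":
        return 0.3                                   # same for every respondent
    if mechanism == "MAR":
        return 0.5 if row["male"] else 0.1           # depends only on observed sex
    if mechanism == "NMAR":
        # depends on the value that ends up missing: high scores are hidden more often
        return 1.0 / (1.0 + math.exp(-(row["score"] - 12.0)))
    raise ValueError(f"unknown mechanism: {mechanism}")


def apply_mechanism(rows, mechanism, seed=0):
    """Return copies of the rows with 'score' set to None where it is hidden."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        hidden = rng.random() < missing_probability(row, mechanism)
        result.append({**row, "score": None if hidden else row["score"]})
    return result


def make_rows(n=5000, seed=1):
    """Synthetic survey: sex and depression score are drawn independently."""
    rng = random.Random(seed)
    return [{"male": rng.randint(0, 1), "score": rng.gauss(10.0, 3.0)} for _ in range(n)]
```

Under these assumptions the mean of the observed scores stays close to the true mean for MCAR and MAR, while under NMAR it is pulled downwards because high scores are removed preferentially.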

evaluation criteria

Root mean square error (RMSE)
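A minimal sketch of RMSE as an imputation criterion, assuming the usual setup in which values are removed artificially from a complete data set so the truth is known. Restricting the error to the masked entries is an assumption made here; the function name is hypothetical.

```python
import math


def imputation_rmse(true_values, imputed_values, was_missing):
    """RMSE over the entries that were masked and then imputed."""
    squared = [
        (t - i) ** 2
        for t, i, m in zip(true_values, imputed_values, was_missing)
        if m
    ]
    if not squared:
        raise ValueError("no masked entries to evaluate")
    return math.sqrt(sum(squared) / len(squared))
```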

Unsupervised classification error (UCE)

assesses preservation of internal structure by measuring how the clustering of the imputed data set differs from that of the complete data set: UCE = misclassified samples / all samples
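The ratio "misclassified samples / all samples" can be sketched as follows. Cluster labels are arbitrary, so the two clusterings have to be aligned first; trying every relabelling, as done here, is an assumption of this sketch (it is only practical for a handful of clusters), and the function name is hypothetical. The clustering step itself is not shown.

```python
from itertools import permutations


def unsupervised_classification_error(labels_complete, labels_imputed):
    """Fraction of samples placed in a different cluster, after the best relabelling."""
    if len(labels_complete) != len(labels_imputed):
        raise ValueError("both clusterings must cover the same samples")
    if not labels_complete:
        raise ValueError("no samples given")
    names_complete = list(dict.fromkeys(labels_complete))
    names_imputed = list(dict.fromkeys(labels_imputed))
    # Pad with None so surplus imputed clusters can stay unmatched (always an error).
    padding = [None] * max(0, len(names_imputed) - len(names_complete))
    best = len(labels_complete)
    for perm in permutations(names_complete + padding, len(names_imputed)):
        mapping = dict(zip(names_imputed, perm))
        errors = sum(mapping[b] != a for a, b in zip(labels_complete, labels_imputed))
        best = min(best, errors)
    return best / len(labels_complete)
```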

Supervised classification error (SCE)

comparison of predicting subgroups in complete/imputed data set

EXAMPLE: fit a linear discriminant analysis (LDA) and then compare SCE = 1 - AUC
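A minimal sketch of SCE = 1 - AUC. It assumes the classifier (for instance an LDA fitted on the imputed data) has already produced one score per sample and that the subgroup labels are binary (0/1); the classifier is not shown. AUC is computed through its rank interpretation: the share of positive/negative pairs in which the positive sample gets the higher score, with ties counted as one half.

```python
def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def supervised_classification_error(scores, labels):
    """SCE as defined in the notes: one minus the AUC."""
    return 1.0 - auc(scores, labels)
```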

e.g. hierarchical clustering with Pearson correlation

Schmitt P, Mandel J, Guedj M (2015) A Comparison of Six Methods for Missing Data Imputation.

MCAR: f(r | y_obs, y_mis, θ) = f(r | θ)

MAR: f(r | y_obs, y_mis, θ) = f(r | y_obs, θ)

MAR values should not be spatially concentrated

imputing missing values

Machine Learning based approaches

Maximum Likelihood approach (Ibrahim et al. (2005))

listwise-/pairwise deletion (case deletion)
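The difference between the two deletion strategies, as a minimal sketch. It assumes rows are tuples with `None` marking a missing entry; the helper names are hypothetical.

```python
def listwise_delete(rows):
    """Listwise (case) deletion: keep only rows without any missing entry."""
    return [row for row in rows if all(value is not None for value in row)]


def pairwise_complete(rows, i, j):
    """Pairwise deletion: for one pair of variables, keep every row where both are observed."""
    return [
        (row[i], row[j])
        for row in rows
        if row[i] is not None and row[j] is not None
    ]
```

With `rows = [(1, 2, 3), (4, None, 6), (7, 8, None), (None, 11, 12)]`, listwise deletion keeps a single row, while the pairwise view of columns 0 and 2 still has two usable rows. Pairwise deletion therefore uses more of the data, at the price of each statistic being based on a different subset of samples.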

classical statistical approaches

standard polynomial regression

stochastic regression including error term
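A minimal sketch of stochastic regression imputation. It assumes a single, fully observed predictor `xs` and a straight-line fit (a degree-1 polynomial) estimated from the complete cases; the error term is drawn from a normal distribution with the residual standard deviation. Function names are hypothetical.

```python
import math
import random


def fit_line(pairs):
    """Least-squares line through (x, y) pairs; returns intercept, slope, residual sd."""
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three complete cases")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, sigma


def stochastic_regression_impute(xs, ys, seed=0):
    """Replace each missing y (None) by the regression prediction plus random noise."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope, sigma = fit_line(observed)
    return [
        y if y is not None else intercept + slope * x + rng.gauss(0.0, sigma)
        for x, y in zip(xs, ys)
    ]
```

Dropping the `rng.gauss` term gives plain (deterministic) regression imputation, which places every imputed point exactly on the fitted line and so understates the variance of the imputed variable.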

hot/cold-deck imputation (Jerez, Molina, 2010)

k-NN (Jerez, Molina, 2010)
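A minimal sketch of k-NN imputation, not the exact variant evaluated in the cited study. Assumptions made here: numeric features, `None` marks a missing entry, distance is the root mean squared difference over the features both rows have observed, and the imputed value is the unweighted mean over the k nearest donors.

```python
import math


def _distance(a, b):
    """Root mean squared difference over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of rows with each missing entry filled from its k nearest donors."""
    completed = [list(row) for row in rows]
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have column c observed.
            donors = [other for i, other in enumerate(rows) if i != r and other[c] is not None]
            donors.sort(key=lambda other: _distance(row, other))
            nearest = donors[:k]
            if nearest:
                completed[r][c] = sum(d[c] for d in nearest) / len(nearest)
    return completed
```

In practice the features would usually be standardized first so that no single variable dominates the distance.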

naive bayesian classifier (Lin,Haug,2008)

explicit modeling of "missingness" instead of imputation (Lin,Haug,2008)

modeling of underlying distribution

Gaussian Mixture Model (Sovilj,Eirola,2015)

ANNs (Seffens et al. Machine Learning Data Imputation and Classification in a Multicohort Hypertension Clinical Study. Bioinformatics and Biology Insights 2015) (Richman, Trafalis, 2009)

Extreme Learning Machine (Sovilj, Eirola, 2015) (Huang, Liu, Yu, DE19250500000199961566)

Decision Trees based

missForest (Stekhoven,Bühlmann, 2012) (Luo,2016) (Shah, Bartlett,2012)

Multivariate Imputation by Chained Equations (MICE) (Luo, 2016) (Shah, Bartlett, 2012)

mean/median imputation (Rahman, M. M. and Davis, D. N. (2013) “Machine Learning-Based Missing Value Imputation Method for Clinical Datasets”)
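Mean/median imputation is the simplest baseline: every missing entry of a variable is replaced by that variable's observed mean or median. A minimal sketch for one numeric column, assuming `None` marks a missing entry (the function name is hypothetical):

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Fill missing entries (None) with the mean or median of the observed ones."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]
```

The median is less sensitive to outliers: for `[1, None, 3, 100]` the median fill is 3, whereas the mean fill is about 34.7.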

decision tree (J48) (Rahman, Davis, 2013)

Fuzzy Unordered Rule Induction Algorithm (Rahman, Davis, 2013)

SVM (Richman, Trafalis, 2009) (Rahman, Davis, 2013)

C4.5 (Lakshminarayan,1996)

unsupervised Bayesian clustering (Lakshminarayan,1996)

Multi-layer perceptron (Jerez, Molina, 2010)

Self-organizing maps (Jerez, Molina, 2010)

EM algorithm

Amelia II, Hmisc, MICE (Colubri A, Silver T, Fradet T, Retzepi K, Fry B, Sabeti P (2016))

test for underlying missingness mechanism

Colubri A, Silver T, Fradet T, Retzepi K, Fry B, Sabeti P (2016)

MCAR test
Little RJA. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association. 1988;83(404):1198–202.

Jamshidian M, Jalal S. Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. Psychometrika. 2010;75(4):649–74. pmid:21720450

feature selection

first preprocessing using Mirador; Mine, and then subsets of variables as inputs: Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology. 1992;45(2):265–82. doi: 10.1111/j.2044-8317.1992.tb00992.x.


Liu, Y.; Gopalakrishnan, V. An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.

bias, coverage, width of the confidence interval, and estimated proportion of the variance attributable to the missing data (Doove, van Buuren, recursive partitioning for missing data imputation, 2013)
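The first three of these criteria can be sketched for a simulation study in which the true parameter value is known and each replicate yields a point estimate with a confidence interval. The input layout and the function name are assumptions of this sketch; the fourth criterion (proportion of variance attributable to the missing data) depends on the multiple-imputation variance components and is not computed here.

```python
def summarize_replicates(true_value, replicates):
    """Summarize (estimate, ci_lower, ci_upper) tuples from repeated simulations."""
    n = len(replicates)
    if n == 0:
        raise ValueError("no replicates given")
    bias = sum(est for est, _, _ in replicates) / n - true_value
    # Coverage: share of intervals that contain the true value.
    coverage = sum(lo <= true_value <= hi for _, lo, hi in replicates) / n
    ci_width = sum(hi - lo for _, lo, hi in replicates) / n
    return {"bias": bias, "coverage": coverage, "ci_width": ci_width}
```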
