Inferring missing network data
Inferring missing network data
Given a network up to time
, predict what new edges will occur in the future, based on observed links and nodes’ attributes in a network
D. Goldberg and F. Roth. Assessing experimentally derived interactions in a small world. Proceedings of the National Academy of Sciences, 100(8):4372, 2003.
models preferential attachment
Adamic-Adar (AA) score:
considers the number of common neighbors, giving each neighbor different weights
Methods often to fail to identify the links connecting different communities, resulting in
a poor reproduction of the topological and dynamical properties of the true network
Limitation has led to many community-based link prediction methods
P. Zhang, F. Wang, X. Wang, A. Zeng and J. Xiao. The reconstruction of complex networks with community structure. Nature Scientific Reports, 5:17287, 2015
Common Neighbor (CN) index:
directly computes the number of
overlapped neighbors between two nodes to determine their similarity
Area under the receiver operating characteristic curve (AUC) is usually used to quantify the quality of link prediction
AUC reflects the fraction of corrected links added to the network, but cannot capture whether the reconstructed network has the same or similar structural and dynamical properties as the true network
Resource allocation (RA)
Inferring hierarchical structure from network data e.g. communities
A. Clauset, C.Moore and M.E.J. Newman. Hierarchical structure and the prediction of missing
links in networks.
, 453,98–101, 2008
Fit hierarchical random graph model to observed data
Network hierarchical structure is capable of explaining a wide variety of other network features
Use phylogeny reconstruction to create a single consensus dendrogram
KronEM algorithm: Kronecker graphs used in a EM framework
Model estimates missing parts of the network. The complete model is then used to re-estimate the model and make improved inferences on the unobserved network until model parameters converge.
Kim and Leskovec, 2011. "The Network Completion Problem: Inferring Missing Nodes and Edges in Networks"
Stochastic block modeling (BM)
: assumes that each node belongs to a group and the relationship between he groups governs the connections
R. Guimer´a and M. Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. PNAS, 106(52), 2009.
Mixed membership models associate each unit of observation with multiple clusters rather
than a single cluster, via a membership probability-like vector however many assume that the data is conditionally independent
Mixed membership stochastic blockmodel
(1) For large scale networks, the model and parameter estimation must be scalable
(2) Statistical model necessary for probability distributions over missing parts of the network
(3) Is the missing data systematically missing, and if so, whether missing-ness is related to the values of observed variables?
Missing at random (MAR): probability of missing-ness is related to the observed data but not the missing ties
Missing not at random (MNAR): the probability of missing-ness is related to the unknown value of the missing ties > increases level of bias in analysis as there will be a systematic difference between respondents and non-respondents
Missing completely at random (MCAR): unrelated to missing ties and observed data
Trips produced at the origin and attracted to a destination are directly proportional to the total trip production at the origin and the total attraction at the destination
A friction factor (F) represents the reluctance on a link over various distances
Expectation Maximization (EM) framework:
Use the observed part of the network to fit a model of network structure and then estimate the missing part of the network using the model
Single imputation procedures > Schafer, J.L. and Graham, J.W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147-177.
Imputing from unconditional distributions
Imputing conditional means.
Imputing unconditional means
Imputing from conditional distributions
Unit non-response VS item non-response in cross sectional data
Item non-response: data on particular items are missing (i.e. particular ties or attributes)
Unit non-response: actors are completely missing (i.e. all outgoing ties and attribute scores
Examples of inferring underlying network with only information on infected individuals
S. Myers and J. Leskovec. On the convexity of latent social network inference. In NIPS ’10, 2010.
M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In KDD'10, 2010.
an incomplete network only conatins partial knowledge of the network structure. Based on this partial knowledge is it possible to predict the structure of the unobserved part of the network.
If so what proportion of the true network structure must be captured to accurately model the unobserved structure?
For networks models in livestock population what are the most useful attributes to add certainty to the predicted model?