02 Intensive Week
Intensive Week - Lecture 14: Record Linkage Overview and Blocking/indexing (1)
Record Linkage
Also known as data linkage, data matching, entity resolution, duplicate detection, object identification, etc;
A major challenge is that unique entity identifiers are often not available in the databases to be linked;
Record linkage techniques:
Deterministic matching;
Probabilistic record linkage;
Computer science approaches:
based on machine learning, data mining, database, and information retrieval techniques;
Supervised classification: requires training data of true matches
Unsupervised: clustering, collective, and graph based
Record Linkage Challenges
No unique entity identifiers available;
Dirty real world data;
Scalability:
Naive comparison of all record pairs has quadratic complexity;
Remove likely non-matches as efficiently as possible;
No training data in many linkage applications. No record pairs with known true match status;
Privacy and confidentiality;
Data Pre-processing for Record Linkage
Important to ensure the databases to be linked have similar structure and content;
Real world data are often dirty:
typographical and other errors;
Different coding schemes;
Missing values;
Data changing over time;
Names and addresses are especially prone to data entry errors:
scanned, hand-written, recorded over the telephone, or manually typed;
Different information provided by the same person;
Spelling variations: 'Gail' vs 'Gayle', 'Dixon' vs 'Dickson'
Data Pre-processing Steps
Clean input:
Remove unwanted characters and words ('missing', 'n/a')
Expand abbreviations and correct misspellings: 'Wolllongonng' -> 'Wollongong', 'St' -> 'Street'
Segment names (first name, middle, last) and addresses into well defined output fields;
Verify correctness if possible:
do gender and first name match;
does an address exist;
Why Blocking:
The number of record pair comparisons equals the product of the sizes of the two data sets;
e.g. data sets of 1 million and 5 million records yield 5 trillion record pair comparisons.
The performance bottleneck in a record linkage system is usually the (expensive) detailed comparison of attribute (field) values between record pairs;
Blocking / indexing / searching / filtering techniques are used to reduce the large number of record pair comparisons.
Aim of blocking: cheaply remove candidate record pairs which are obviously not matches (i.e. are non-matches)
Traditional Blocking:
Traditional blocking compares blocking key attributes to find records with the same values;
One or more attributes of the database are selected as the blocking key;
All records with the same blocking key value (BKV) will be inserted into the same block;
Candidate record pairs are formed from blocks whose BKV occurs in both databases;
Blocking key selection is a balance between the quality of the candidate record pairs generated (how many true matches are included) and the size of the blocks generated (the number of resulting candidate record pairs).
Problems with traditional blocking:
a wrong value in a blocking key attribute results in a record being inserted into the wrong block;
changed attribute values can cause true matching record pairs to be missed (e.g. surname or postcode changes);
missing values mean no BKV can be generated;
the frequency distribution of BKVs influences the number of record pairs generated;
Attributes Selected as Blocking Keys
Not change over time (i.e. be constant);
Be accurate (no errors or variations in them);
Be complete (no missing values);
Have a frequency distribution close to uniform;
Have a reasonable number of unique different values;
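A minimal Python sketch of traditional blocking (the attribute names and example records are hypothetical): an inverted index maps each BKV to the records holding it, and candidate pairs come only from blocks whose BKV occurs in both databases.

    from collections import defaultdict

    def build_index(records, blocking_key):
        # Inverted index: blocking key value (BKV) -> record identifiers
        index = defaultdict(list)
        for rec_id, rec in records.items():
            bkv = rec.get(blocking_key)
            if bkv:  # a record with a missing BKV cannot be blocked
                index[bkv].append(rec_id)
        return index

    def candidate_pairs(db_a, db_b, blocking_key):
        # Pairs are formed only from blocks whose BKV occurs in both databases
        index_a = build_index(db_a, blocking_key)
        index_b = build_index(db_b, blocking_key)
        for bkv in index_a.keys() & index_b.keys():
            for id_a in index_a[bkv]:
                for id_b in index_b[bkv]:
                    yield (id_a, id_b)

    db_a = {1: {'surname': 'smith', 'postcode': '2600'},
            2: {'surname': 'jones', 'postcode': '2601'}}
    db_b = {7: {'surname': 'smyth', 'postcode': '2600'}}
    print(list(candidate_pairs(db_a, db_b, 'postcode')))  # [(1, 7)]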
Intensive Week - Lecture 20: Record Linkage Evaluation
Evaluating the Record Linkage Process
Different techniques are available for each step of the record linkage process (cleaning and standardisation, blocking, comparison, and classification);
When employing a record linkage system, one wants to get the best possible results within operational constraints (linkage time, computational resources, minimum linkage quality, available software and human expertise, etc.);
Measures are required to evaluate the two main aspects of record linkage:
Linkage quality (effectiveness);
Linkage complexity (efficiency);
Measuring Linkage Quality
Achieving high linkage quality is a main goal of most record linkage projects / applications;
Ground truth data is needed to measure linkage quality:
A set of true matching record pairs
A set of true non-matching record pairs
How to obtain such ground truth data?
Results of a previous linkage
Manual clerical review (more in the next lecture) or manually classified (sampled) record pairs
Contact (call) all individuals in the databases and ask them?
How confident can one be such ground truth data is always correct?
2.1 Various difficulties with manually prepared ground truth data:
It is easy to classify record pairs that have totally different attribute values as non-matches
It is (generally) easy to classify record pairs that are very similar as matches (but what about twins, or if not enough information is available?)
It is difficult to classify record pairs where some attribute values are the same/similar while others are different
Studies have shown that manual classification of record pairs is never 100% correct
Domain expertise is often required (such as knowledge about names and their origins / cultures)
If randomly sampled, most record pairs will be non-matches
Measuring Linkage Quality with Ground Truth
Classifying record pairs into matches and non-matches has four possible outcomes:
True positives (TP): true matches correctly classified as matches (correct matches);
False negatives (FN): true matches incorrectly classified as non-matches (false non-matches);
True negatives (TN): true non-matches correctly classified as non-matches (correct non-matches);
False positives (FP): true non-matches incorrectly classified as matches (false matches);
3.1 Measuring Linkage Quality with Ground Truth
Due to the quadratic comparison space, the number of true non-matches is usually much larger than the number of true matches
As the number of records increases, the number of true matches increases linearly while the number of possible record pairs increases quadratically;
This holds even after blocking
Error or Confusion Matrix
Include:
true matches;
false matches;
false non-matches;
true non-matches;
Based on the values in the four cells of the error/confusion matrix, different linkage quality measures can be defined;
These binary classification measures are also used in other domains:
machine learning and data mining;
information retrieval (web search);
medical tests;
security;
There is a trade-off between false positives and false negatives;
Linkage Quality
5.2. Precision
Widely used in information retrieval to assess the quality of search results (how many documents retrieved by a query are relevant?)
For record linkage, it measures how many of the classified matches are true matches;
Precision = TP/(TP + FP);
5.3. Recall
Widely used in information retrieval to assess the quality of search results (how many relevant documents are retrieved by a query); for record linkage, it measures how many of the true matches are classified as matches.
reca = TP / (TP + FN)
5.4. F-measure: combining precision and recall
Precision and recall are often combined into the F-measure (or F-score):
fmeas = 2 * (prec * reca) / (prec + reca)
It is the harmonic mean of precision and recall;
It finds a 'best' combination of precision and recall;
Typically, as precision goes up, recall goes down, and vice versa;
However, recent research has shown that the F-measure can be misleading;
5.1 Accuracy
acc = (TP + TN) / (TP + FP + FN + TN)
widely used in machine learning and data mining;
considers both true positives and true negatives;
It is not a good measure if TN is very large (as in record linkage, where true non-matches dominate).
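A minimal sketch of these four quality measures computed from the confusion matrix counts; the example numbers are invented to show why accuracy misleads when true non-matches dominate.

    def linkage_quality(tp, fp, fn, tn):
        prec = tp / (tp + fp) if tp + fp else 0.0   # precision
        reca = tp / (tp + fn) if tp + fn else 0.0   # recall
        fmeas = (2 * prec * reca / (prec + reca)) if prec + reca else 0.0
        acc = (tp + tn) / (tp + fp + fn + tn)       # accuracy
        return prec, reca, fmeas, acc

    # With a huge number of true non-matches, accuracy looks excellent
    # even though precision, recall and F-measure are all only 0.5
    print(linkage_quality(tp=50, fp=50, fn=50, tn=1_000_000))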
Visualising Linkage Quality Results
Changing the classifier, or a classifier parameter, will produce a different error matrix;
Lowering a classification threshold t will usually increase the numbers of TP and FP, and lower the numbers of TN and FN;
Raising a classification threshold leads to the opposite;
For precision-recall curves, a higher area under the curve is better;
Plots are useful tools for understanding and comparing classifiers (e.g. different classification thresholds);
An F-measure graph shows precision, recall, and F-measure for different classifier thresholds;
A precision-recall graph shows precision versus recall for different classifier thresholds;
A ROC (receiver operating characteristic) graph shows the true positive rate versus the false positive rate for different classifier thresholds;
Measuring Linkage Complexity
Run-time and memory consumption of a linkage program / system can be easily measured;
Generally, platform independent measures are better:
Allows the performance of systems to be compared even when not run on the same computing platform (but same datasets and same parameter settings)
Linkage complexity is generally measured by the number of record pairs that need to be compared:
the number of candidate record pairs generated by blocking;
10.1 Reduction Ratio
Measures by how much a blocking technique is able to reduce the comparison space;
Compared to the full pair-wise comparison of all record pairs:
rr = 1 - (sM + sN) / (nM + nN)
sM and sN are the number of true matching and non-matching candidate record pairs generated by a blocking technique;
nM and nN are the total number of true matching and non-matching record pairs (in the pair-wise comparison space);
10.2 Pairs Completeness
Measures how many true matches 'pass' through a blocking process;
pc = sM / nM
It requires the true match status of all record pairs (as with the linkage quality measures);
A high pairs completeness is needed for the assignment;
10.3 Pairs Quality
Measures how many candidate record pairs generated by blocking are true matches;
It corresponds to the precision of blocking;
pq = sM / (sM + sN)
It requires the true match status of all record pairs (as with the linkage quality measures);
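The three complexity measures as a small sketch; the counts in the example are invented for illustration.

    def blocking_complexity(s_m, s_n, n_m, n_n):
        # s_m, s_n: true matching / non-matching candidate pairs after blocking
        # n_m, n_n: true matching / non-matching pairs in the full space
        rr = 1 - (s_m + s_n) / (n_m + n_n)   # reduction ratio
        pc = s_m / n_m                       # pairs completeness
        pq = s_m / (s_m + s_n)               # pairs quality
        return rr, pc, pq

    # Invented example: blocking keeps 90 of 100 true matches and reduces
    # 1,000,000 possible pairs to 10,000 candidates
    print(blocking_complexity(s_m=90, s_n=9_910, n_m=100, n_n=999_900))
    # -> (0.99, 0.9, 0.009)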
Intensive Week - Lecture 17: Record Comparison
Comparing Strings for Record Linkage
Strings are the most commonly used attributes (fields) when comparing records;
Names;
Addresses;
Telephone numbers;
The aim is to calculate a normalised approximate similarity between two strings: 0 <= sim_approx <= 1
Q-gram Based String Comparison
Convert a string into a set of q-grams:
q = 2 (bigrams): 'peter' → ['pe', 'et', 'te', 'er']
Calculate similarity between two strings based on counting the number of q-grams:
Jaccard similarity:
sim_Jacc(S1, S2) = |intersection(Q1, Q2)| / |union(Q1, Q2)|
Dice coefficient:
sim_Dice(S1, S2) = 2 * |intersection(Q1, Q2)| / (|Q1| + |Q2|)
To compare similarities, the same method needs to be used across the dataset;
Sets are used for the similarity calculation, so each element occurs only once;
Qx is the set of q-grams extracted from string sx
intersection(Q1,Q2) is the set of q-grams that occur in both strings
|..| denotes the number of elements in a set
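A sketch of q-gram based similarity using Python sets, following the Jaccard and Dice formulas above.

    def qgrams(s, q=2):
        # The set of q-grams of string s (q = 2 gives bigrams)
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def jaccard_sim(s1, s2, q=2):
        q1, q2 = qgrams(s1, q), qgrams(s2, q)
        return len(q1 & q2) / len(q1 | q2) if q1 | q2 else 0.0

    def dice_sim(s1, s2, q=2):
        q1, q2 = qgrams(s1, q), qgrams(s2, q)
        return 2 * len(q1 & q2) / (len(q1) + len(q2)) if q1 or q2 else 0.0

    print(sorted(qgrams('peter')))                 # ['er', 'et', 'pe', 'te']
    print(round(jaccard_sim('gail', 'gayle'), 2))  # 0.17
    print(round(dice_sim('gail', 'gayle'), 2))     # 0.29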
Edit Distance
Count how many edit operations are needed to convert one string into another:
Insertion of a character;
Deletion of a character;
Substitution of a character;
Transposition of two adjacent characters;
Convert an edit distance into a similarity 0 <= sim_edit_dist <= 1:
sim_edit_dist(s1, s2) = 1 - edit_dist(s1, s2) / max(len(s1), len(s2))
Main drawback of edit distance is its quadratic complexity in the lengths of the two strings, i.e. len(s1) * len(s2) computational steps;
The dynamic programming matrix shows the number of edits between sub-strings (prefixes) of the two strings.
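A sketch of plain edit (Levenshtein) distance via the dynamic programming matrix described above (the transposition operation is not included here), and its conversion into a similarity.

    def edit_dist(s1, s2):
        # Row-by-row dynamic programming; entry j holds the number of edits
        # between the current prefix of s1 and the first j characters of s2
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            curr = [i]
            for j, c2 in enumerate(s2, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (c1 != c2)))  # substitution
            prev = curr
        return prev[-1]

    def sim_edit_dist(s1, s2):
        if not s1 and not s2:
            return 1.0
        return 1 - edit_dist(s1, s2) / max(len(s1), len(s2))

    print(edit_dist('gail', 'gayle'))                # 2
    print(round(sim_edit_dist('gail', 'gayle'), 2))  # 0.6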
Bag Distance
A fast approximation of edit distance is the bag distance;
Defined as: bag_dist(s1, s2) = max(|x - y|, |y - x|), where x = bag(s1) and y = bag(s2) are the multisets (bags) of characters in the two strings; the bag distance is then divided by the length of the longer string to get a similarity measure between 0.0 and 1.0
It has been shown that always: bag_dist(s1, s2) ≤ edit_dist(s1, s2),
therefore: sim_bag_dist(s1, s2) ≥ sim_edit_dist(s1, s2)
If sim_bag_dist(s1, s2) is below a threshold then the edit distance does not need to be calculated
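A sketch of bag distance using multisets (collections.Counter), including its use as a cheap filter before the quadratic edit distance; sim_edit_dist refers to the edit distance sketch above, and the threshold value is arbitrary.

    from collections import Counter

    def bag_dist(s1, s2):
        # Bags (multisets) of characters; x - y keeps only positive counts
        x, y = Counter(s1), Counter(s2)
        return max(sum((x - y).values()), sum((y - x).values()))

    def sim_bag_dist(s1, s2):
        if not s1 and not s2:
            return 1.0
        return 1 - bag_dist(s1, s2) / max(len(s1), len(s2))

    def filtered_edit_sim(s1, s2, threshold=0.8):
        # Since sim_bag_dist >= sim_edit_dist, a pair below the threshold
        # here can never reach it with the (quadratic) edit distance
        if sim_bag_dist(s1, s2) < threshold:
            return 0.0
        return sim_edit_dist(s1, s2)  # from the edit distance sketch above

    print(sim_bag_dist('peter', 'pete'))  # 0.8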
Jaro-Winkler String Comparison
Developed by the US Census Bureau to compare personal name strings (first and last names);
A combination of q-gram and edit distance based string comparison;
Basic idea of the Jaro comparison function:
Count c, the number of agreeing (common) characters within distance d;
Count t, the number of transposed characters ('pe' versus 'ep') in the set of common characters;
sim_Jaro(s1, s2) = (c / len(s1) + c / len(s2) + (c - t) / c) / 3
Increase the similarity if the first few characters (p, with p ≤ 4) are the same:
sim_Jaro_Winkler(s1, s2) = sim_Jaro(s1, s2) + (1 - sim_Jaro(s1, s2)) * p / 10
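A sketch of the Jaro and Jaro-Winkler comparison; the matching window d is taken as half the longer string's length minus one, a common convention the notes leave unspecified.

    def jaro(s1, s2):
        if s1 == s2:
            return 1.0
        if not s1 or not s2:
            return 0.0
        d = max(len(s1), len(s2)) // 2 - 1   # matching window (assumed)
        match2 = [False] * len(s2)
        matched1, c = [], 0
        for i, ch in enumerate(s1):
            for j in range(max(0, i - d), min(len(s2), i + d + 1)):
                if not match2[j] and s2[j] == ch:
                    match2[j] = True
                    matched1.append(ch)
                    c += 1
                    break
        if c == 0:
            return 0.0
        matched2 = [s2[j] for j in range(len(s2)) if match2[j]]
        t = sum(a != b for a, b in zip(matched1, matched2)) / 2  # transpositions
        return (c / len(s1) + c / len(s2) + (c - t) / c) / 3

    def jaro_winkler(s1, s2):
        sim_j = jaro(s1, s2)
        p = 0  # length of the common prefix, capped at 4
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            p += 1
        return sim_j + (1 - sim_j) * p / 10

    print(round(jaro_winkler('gail', 'gayle'), 2))  # ≈ 0.83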
Statistical linkage key SLK-581
Developed by the Australian Institute of Health and Welfare (AIHW);
Aims to identify records that likely correspond to the same person;
Needs name, date of birth, and sex: 'marie miller', 13/04/1991, 'f' → 'ile ar 13041991 2';
Even with an encryption function applied, information about a person can potentially be retrieved from an SLK;
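A sketch of SLK-581 generation reproducing the example above; the '2' padding for short names and the sex coding (1 = male, 2 = female) follow the AIHW convention as I understand it.

    def slk_581(given, family, dob_ddmmyyyy, sex):
        fam = ''.join(ch for ch in family.lower() if ch.isalpha())
        giv = ''.join(ch for ch in given.lower() if ch.isalpha())
        # Pick letters by position, padding short names with '2'
        def pick(s, idxs):
            return ''.join(s[i] if i < len(s) else '2' for i in idxs)
        sex_code = '1' if sex.lower().startswith('m') else '2'
        # 2nd, 3rd, 5th letters of family name + 2nd, 3rd of given name
        return f"{pick(fam, [1, 2, 4])} {pick(giv, [1, 2])} {dob_ddmmyyyy} {sex_code}"

    print(slk_581('marie', 'miller', '13041991', 'f'))  # ile ar 13041991 2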
Intensive Week - Lecture 18: Record Pair Classification (2)
Classifying Record Pairs
The comparison step generates one vector of similarities (also known as a weight vector) for each compared record pair;
The elements of such vectors are the calculated similarities (exact or approximate);
Classifying record pairs can be based on (a) summing the calculated similarities into a single similarity value, or (b) using the full vector of similarities
Threshold Based Classification
One threshold t: 0 ≤ t ≤ sim_max, where sim_max is equal to the number of similarities in the vectors:
(a) Record pairs with a similarity of at least t → Classified match
(b) Record pairs with a similarity below t → Classified non-match
Two thresholds tl and tu: 0 ≤ tl < tu ≤ sim_max:
(a) Record pairs with a similarity of at least tu → Classified match
(b) Record pairs with a similarity below tl → Classified non-match
(c) Record pairs with a similarity between tl and tu → Classified potential match
2.1 Threshold Based Classification
If similarities are simply summed then each attribute has the same importance (or same weight)
Total similarity is then a weighted sum:
sim(reci, recj) = Σa sim(reci[a], recj[a]) * wa
wa = log(number of unique attribute values), where wa is the weight for attribute a
To normalise this similarity into 0..1 we can divide sim(reci, recj) by Σa wa
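A sketch of two-threshold classification with a weighted, normalised similarity sum; the attribute weights and thresholds are invented for illustration.

    import math

    def classify(sim_vec, weights, t_lower, t_upper):
        # Weighted sum of attribute similarities, normalised into 0..1
        total = sum(sim_vec[a] * weights[a] for a in weights)
        norm = total / sum(weights.values())
        if norm >= t_upper:
            return 'match'
        if norm < t_lower:
            return 'non-match'
        return 'potential match'

    # Hypothetical weights: wa = log(number of unique attribute values),
    # so surname (many values) outweighs gender (two values)
    weights = {'surname': math.log(10_000), 'gender': math.log(2)}
    print(classify({'surname': 0.9, 'gender': 1.0}, weights, 0.5, 0.8))  # match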
Probabilistic Classification
Known as probabilistic record linkage;
Basic idea:
Compare common record attributes (or fields) using approximate (string) comparison functions;
Calculate matching weights based on frequency ratios (global or value specific ratios) and error estimates;
The sum of the matching weights is used to classify a pair of records as a match, non-match, or potential match (using two thresholds);
Problems: estimating errors, finding optimal thresholds, the assumption of independence, and manual clerical review
3.1 Probabilistic Classification
A ratio R is calculated for each compared record pair r = (a, b) in the product space A × B:
R = P (γ ∈ Γ | r ∈ M ) / P (γ ∈ Γ | r ∈ U )
M is the set of true matches;
U is the set of true non-matches;
γ (gamma) is an agreement pattern;
Γ (Gamma) is the comparison space;
A × B = {(a, b) : a ∈ A, b ∈ B} for files (data sets) A and B
M = {(a, b) : a = b, a ∈ A, b ∈ B} True matches
U = {(a, b) : a ≠ b, a ∈ A, b ∈ B} True non-matches
3.2 Probabilistic Classification:
Fellegi and Sunter proposed the following decision rule:
R ≥ tu → r is classified as a match
tl < R < tu → r is classified as a potential match
R ≤ tl → r is classified as a non-match
Assuming conditional independence between attributes allows us to calculate individual attribute-wise probabilities:
mi = P([ai = bi , a ∈ A, b ∈ B] | r ∈ M) and
ui = P([ai = bi , a ∈ A, b ∈ B] | r ∈ U),
where ai and bi are the values of attribute i being compared
Based on these m- and u-probabilities, we calculate a matching weight wi for attribute i as:
wi = log2(mi / ui) if ai = bi (agreement weight)
wi = log2((1 - mi) / (1 - ui)) if ai ≠ bi (disagreement weight)
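A sketch of these agreement / disagreement weights; the month-of-birth m- and u-probabilities are illustrative values (u = 1/12 assumes uniformly distributed birth months).

    import math

    def match_weight(agree, m, u):
        # Agreement weight log2(m/u); disagreement weight log2((1-m)/(1-u))
        return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

    # Illustrative month-of-birth probabilities: m = 0.95 (5% error rate),
    # u = 1/12 (two random people agree on birth month ~8.3% of the time)
    m, u = 0.95, 1 / 12
    print(round(match_weight(True, m, u), 2))   # ≈ 3.51  (agreement)
    print(round(match_weight(False, m, u), 2))  # ≈ -4.2  (disagreement)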
Intensive Week - Lecture 16: Record Linkage Comparison
Comparing Record Pairs
The blocking process generates groups or clusters (blocks);
From each block, record pairs are generated;
Record pairs are compared based on common available attributes;
Exact comparison of attributes will not generate good linkage results;
Approximate comparison is required;
Similarities and Distances
Approximate matching functions generally calculate a numerical similarity value:
sim = 0, two values completely different;
sim = 1, two values are exactly the same;
0 < sim < 1, two values are somewhat similar;
Distances can be converted into similarities:
sim = 1 / dist, with sim = 1 if dist = 0;
dist = 1 - sim, if 0 <= dist <= 1
Distance functions:
dist(val1, val1) = 0, the distance from an object to itself is always 0;
dist(val1, val2) >= 0, distances are always non-negative;
dist(val1, val2) = dist(val2, val1), distance is symmetric;
dist(val1, val2) + dist(val1, val3) >= dist(val2, val3), triangular inequality must hold;
The triangular inequality does not hold for certain functions;
Some functions are not symmetric, e.g. comparing 'pete' and 'peter'.
Basic Comparison Functions
Exact comparison:
sim_exact(val1, val2) = 1 if val1 = val2, 0 otherwise;
val1 and val2 can be strings or numbers.
Truncated string comparison:
sim_trunc(val1, val2) = 1 if val1[:x] = val2[:x], 0 otherwise.
Phonetic encoded comparison:
strings only;
sim_encode(val1, val2) = 1 if encode(val1) = encode(val2), 0 otherwise;
encode() is a phonetic encoding function as discussed.
Numerical Comparison Functions
For numerical values, we also want to have a comparison that calculates similarity between 0 and 1.
Absolute maximum difference d_max and two values n1 and n2:
if abs(n1 - n2) >= d_max: sim_num_abs = 0
if abs(n1 - n2) < d_max: sim_num_abs = 1 - (abs(n1 - n2) / d_max)
Similarity for a maximum percentage difference, based on the relative difference (n1 - n2) / n1
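A sketch of both numeric similarity functions; for the percentage version, the difference is taken relative to the larger value, which is an assumption since the notes only give (n1 - n2) / n1.

    def sim_num_abs(n1, n2, d_max):
        # Similarity based on a maximum absolute difference d_max
        diff = abs(n1 - n2)
        return 0.0 if diff >= d_max else 1 - diff / d_max

    def sim_num_perc(n1, n2, pc_max):
        # Percentage difference taken relative to the larger value (assumed)
        if n1 == n2:
            return 1.0
        pc_diff = 100 * abs(n1 - n2) / max(abs(n1), abs(n2))
        return 0.0 if pc_diff >= pc_max else 1 - pc_diff / pc_max

    print(sim_num_abs(42, 44, d_max=5))      # 0.6
    print(sim_num_perc(100, 90, pc_max=20))  # 0.5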
Date and Time Comparison
Dates are often used when records are compared;
Comparing dates as strings is not a good idea;
Dates can be converted into numbers by counting the days since a certain fixed date;
Similarities of day, month, and year values can then be calculated;
Consider how dates were recorded (over the telephone, hand-written, etc.);
Specific issues arise with how dates are recorded:
day and month values can be reversed (e.g. DD/MM versus MM/DD);
Time values are also modulo (e.g. 23:59 and 00:01 are close);
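A sketch of date comparison via day counts, with a crude check for swapped day and month; the 0.5 partial similarity for a swap is an arbitrary choice.

    from datetime import date

    EPOCH = date(1900, 1, 1)  # any fixed date works

    def day_sim(d1, d2, d_max=30):
        # Convert dates to day counts, then apply the absolute-difference similarity
        diff = abs((d1 - EPOCH).days - (d2 - EPOCH).days)
        return 0.0 if diff >= d_max else 1 - diff / d_max

    def date_sim(d1, d2, d_max=30):
        # Swapped day/month is only possible when the day value is <= 12
        sim = day_sim(d1, d2, d_max)
        if d1.day <= 12 and d1.day != d1.month:
            if d1.replace(day=d1.month, month=d1.day) == d2:
                sim = max(sim, 0.5)  # arbitrary partial similarity for a swap
        return sim

    print(round(day_sim(date(1991, 4, 13), date(1991, 4, 20)), 2))  # 0.77
    print(date_sim(date(1991, 4, 3), date(1991, 3, 4)))             # 0.5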
Intensive Week - Lecture 18: Record Pair Classification (1)
Indigenous Definition and Identification
786,689 Indigenous Australians as at 30/6/16;
Survey data such as Indigenous population counts are not regularly updated, which may affect ratios such as early childhood education enrolment versus the Indigenous population.
The Australian Census longitudinal dataset
Linking is conducted using first and last names;
Binary indigenous status gives four categories for analysis:
Always identified as indigenous;
Never identified;
Newly identified;
Formerly identified;
Data Validation - Census Groupings by PIT (Personal Income Tax) income
Factors associated with a higher probability of having PIT income:
Relatively young or old;
Not employed;
Factors associated with a lower probability of having PIT income:
Middle-age;
Not employed;
Low education;
Income transitions in the PIT data from 2010/2011 to 2011/2012;
Variation in Mortality
There can be a link between socioeconomic position and mortality or disease.
Mortality vs education;
Conclusions about geographic variation in mortality can inform decisions such as where to locate hospitals.
Summary
Survey data is limited for particular segments of the Australian population;
Interesting characteristics of these populations make them hard to link;
Access to data remains difficult, especially data on linkage error;
Intensive Week - Lecture 15: Blocking/indexing (2)
Improving Traditional Blocking
Problems of traditional blocking:
some problems can be overcome by careful selection of the attributes to be used as blocking keys;
others can be addressed by using phonetic encoding;
Phonetic encoding:
techniques that convert strings into some form of code according to how a string is pronounced;
the earliest and still commonly used technique is Soundex;
various other techniques improve upon the drawbacks of Soundex;
Phonetic encoding for blocking:
name variations are common in databases used in linkage;
many name variations are valid;
e.g. 'gail' vs 'gayle' vs 'gale'
phonetic encoding can help bring together spelling variations of the same name for improved blocking;
Soundex Algorithm
A simple algorithm to convert (name) strings into codes made of one letter and three digits;
Steps:
1) Keep first letter of a string
2) Remove all following occurrences of: a, e, i, o, u, y, h, w;
3) Replace all consonants from position 2 onwards with digits using these rules:
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6
4) Only keep unique adjacent digits
5) If length of a code is less than 4 add zeros, if longer truncate at length 4;
Soundex problem:
A problem with most soundex-based systems is that they find too many approximate matches, most of which are so far off from the name being sought that they are obviously not useful hits.
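A sketch implementing exactly the simplified Soundex steps above:

    def soundex(name):
        # Digit codes from step 3 of the algorithm above
        codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
                 'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
                 's': '2', 'x': '2', 'z': '2',
                 'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6'}
        name = name.lower()
        # Steps 1 and 2: keep the first letter, drop a,e,i,o,u,y,h,w after it
        tail = [ch for ch in name[1:] if ch not in 'aeiouyhw']
        # Step 3: replace the remaining consonants by digits
        digits = [codes[ch] for ch in tail if ch in codes]
        # Step 4: only keep unique adjacent digits
        collapsed = []
        for dg in digits:
            if not collapsed or collapsed[-1] != dg:
                collapsed.append(dg)
        # Step 5: pad with zeros or truncate to one letter plus three digits
        return (name[0] + ''.join(collapsed) + '000')[:4]

    print(soundex('gail'), soundex('gayle'), soundex('gale'))  # g400 g400 g400
    print(soundex('dixon'), soundex('dickson'))                # d250 d250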
Alternative Blocking: Sorted Neighbourhood
Sorted neighbourhood is a popular approach as an alternative blocking technique;
Basic idea:
merge databases and sort them according to a sorting key;
slide a window over the sorted database;
compare records in the window;
use several passes with different sorting criteria;
Window size can be fixed or adaptive;
For a window of fixed size w, the number of comparisons becomes w * (|D1| + |D2|), where |D| is the number of records in a database;
e.g. the database is sorted using first and last name;
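A sketch of a single sorted neighbourhood pass with a fixed window size w (the sorting key values are hypothetical):

    def sorted_neighbourhood_pairs(db_a, db_b, key, w=3):
        # Merge both databases and sort on the sorting key value
        merged = [(rec[key], 'A', rid) for rid, rec in db_a.items()]
        merged += [(rec[key], 'B', rid) for rid, rec in db_b.items()]
        merged.sort()
        # Slide a window of fixed size w over the sorted list; records from
        # different databases in the same window become candidate pairs,
        # giving roughly w * (|D1| + |D2|) comparisons
        pairs = set()
        for i in range(max(0, len(merged) - w + 1)):
            window = merged[i:i + w]
            for _, src1, id1 in window:
                for _, src2, id2 in window:
                    if src1 == 'A' and src2 == 'B':
                        pairs.add((id1, id2))
        return pairs

    db_a = {1: {'sk': 'gail smith'}, 2: {'sk': 'peter jones'}}
    db_b = {7: {'sk': 'gayle smith'}, 8: {'sk': 'pete jones'}}
    print(sorted_neighbourhood_pairs(db_a, db_b, 'sk', w=2))  # {(1, 7), (2, 8)}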
Alternative Blocking: Canopy Clustering
Based on a computationally ‘cheap’ similarity measure such as Jaccard (set intersection based on q-grams):
Records will be inserted into several clusters / blocks
Algorithm steps:
1) Randomly select a record in database D as cluster centroid ci , i = 1, 2, . . .
2) Insert all records that have a similarity of at least s_loose with ci into cluster Ci
3) Remove from D all records rj ∈ Ci (including ci) that have a similarity of at least s_tight with ci, with s_tight ≥ s_loose;
4) If database D not empty go back to step 1);
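A sketch of these canopy steps, using bigram Jaccard as the cheap similarity (the thresholds are illustrative):

    import random

    def bigram_jaccard(s1, s2):
        # Cheap similarity: Jaccard on bigram sets
        q1 = {s1[i:i + 2] for i in range(len(s1) - 1)}
        q2 = {s2[i:i + 2] for i in range(len(s2) - 1)}
        return len(q1 & q2) / len(q1 | q2) if q1 | q2 else 0.0

    def canopy_clusters(db, sim=bigram_jaccard, s_loose=0.2, s_tight=0.8):
        remaining = set(db)
        clusters = []
        while remaining:                                       # step 4
            ci = random.choice(sorted(remaining))              # step 1
            cluster = {rj for rj in remaining
                       if sim(db[ci], db[rj]) >= s_loose}      # step 2
            clusters.append(cluster)
            remaining -= {rj for rj in cluster
                          if sim(db[ci], db[rj]) >= s_tight}   # step 3 (removes ci too)
        return clusters

    db = {1: 'gail smith', 2: 'gayle smith', 3: 'peter jones'}
    print(canopy_clusters(db))  # records may appear in several clusters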
Other Blocking Techniques
Q-gram based blocking (e.g. 2-grams / bigrams):
Convert attribute values into q-gram lists, then generate sub-lists;
Records with the same sub-list value are inserted into the same block;
Each record will be inserted into several blocks;
Works well for ‘dirty’ data but has high computational costs;
Mapping-based blocking:
Map strings into a multi-dimensional space such that distances between strings are preserved;
Many more advanced blocking techniques developed in recent years;
Intensive Week - Lecture 19: Record Pair Classification
Cost Based Classification
In record linkage classification we can make two types of mistakes:
(1) A record pair that is a true match (same entity) is classified as a non-match (false negative);
(2) A record pair that is a true non-match (different entities) is classified as a match (false positive);
Traditionally it is assumed both types of errors have the same costs;
Based on the probabilistic record linkage approach, for record pair r we can calculate the overall cost c as:
c(r) = cU,U P(r ∈ non-match, r ∈ U) + cU,M P(r ∈ non-match, r ∈ M) + cM,U P(r ∈ match, r ∈ U) + cM,M P(r ∈ match, r ∈ M)
where a record pair is classified as a match or non-match while its true match status is M or U;
The aim is to minimise the overall cost c for all record pairs
Rule Based Classification
Different approach compared to probabilistic record linkage;
A set of rules is used to classify a record pair as a match or non-match (and possibly a potential match);
The ordering of rules is important
Rules should have high accuracy and high coverage;
accuracy: how correct a rule's classifications are on the data;
coverage: how many record pairs a rule applies to (rules should generalise across the data);
Rule sets can be built manually or they can be learned:
Manually based on domain knowledge (time-consuming and expensive);
Learning based requires training data in the forms of true matching and non-matching record pairs;
Rule Based Classification (Example)
(sim(GivenName)[ri, rj] ≥ 0.9) ∧ (sim(Surname)[ri, rj] = 1.0) ∧ (sim(BMonth)[ri, rj] = 1.0) ∧ (sim(BYear)[ri, rj] = 1.0): [ri, rj] → Match
(sim(GivenName)[ri, rj] ≥ 0.7) ∧ (sim(Surname)[ri, rj] ≥ 0.8) ∧ (sim(StrName)[ri, rj] ≤ 0.6) ∧ (sim(Suburb)[ri, rj] ≤ 0.6): [ri, rj] → Non-Match
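The two example rules encoded as a sketch, applied in order, with 'potential match' when no rule fires:

    def rule_based_classify(sim):
        # Rules are applied in order; sim maps attribute name -> similarity
        if (sim['GivenName'] >= 0.9 and sim['Surname'] == 1.0
                and sim['BMonth'] == 1.0 and sim['BYear'] == 1.0):
            return 'match'
        if (sim['GivenName'] >= 0.7 and sim['Surname'] >= 0.8
                and sim['StrName'] <= 0.6 and sim['Suburb'] <= 0.6):
            return 'non-match'
        return 'potential match'  # no rule fired

    print(rule_based_classify({'GivenName': 0.95, 'Surname': 1.0,
                               'BMonth': 1.0, 'BYear': 1.0,
                               'StrName': 0.5, 'Suburb': 0.4}))  # match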
Machine Learning Based Classification
Machine learning algorithms learn patterns, classes, rules, or clusters from data;
Supervised techniques require training data in the form of ground truth:
Classification and regression techniques;
decision trees, support vector machines, neural networks, logistic regression, Bayesian classifiers, etc.
Unsupervised techniques do not require training data:
Cluster similar data points, or extract frequent patterns and rules from data;
clustering and association rule mining;
A major challenge for supervised techniques is to obtain training data of good quality and variety:
Actual truth often not known;
Easy to get clear true matches and non-matches;
Difficult to get borderline cases (many more non-matching record pairs compared to matching ones);
Managing Transitive Closure
When record pairs are classified individually, the result might be inconsistent with regard to transitivity.
If record a1 is classified as a match with record a2, and a2 is classified as a match with record a3, then a1 and a3 must also be a match;
Special post-processing and clustering techniques need to be applied
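A sketch of computing such a transitive closure with union-find; note the caveat above still applies, since naive closure can chain together records that should not be merged.

    def entity_clusters(match_pairs):
        # Union-find: every record points to a parent; the root identifies
        # its transitively connected group of classified matches
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for a, b in match_pairs:
            parent[find(a)] = find(b)  # union the two groups

        clusters = {}
        for x in list(parent):
            clusters.setdefault(find(x), set()).add(x)
        return list(clusters.values())

    # a1-a2 and a2-a3 classified as matches -> one cluster {a1, a2, a3}
    print(entity_clusters([('a1', 'a2'), ('a2', 'a3'), ('b1', 'b2')]))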
Intensive Week - Lecture 21: Record Linkage Evaluation (2)
Why Clerical Review:
The traditional (probabilistic) record linkage process classifies record pairs into the class of potential matches if no clear decision can be made;
These record pairs are given to a domain expert for manual classification;
Manual classification requires inspection of the attribute values of a record pair, and possibly also external information:
Other records about the same entities from related databases
Information found on the Internet
Information from the linkage process
Clerical Review Results to Improve Classification
A major challenge with record linkage classification is the lack of ground truth data in many applications;
The clerical review process generates training data;
Recent research has investigated crowd-based approaches to clerical review;
Test Databases and Benchmarks
To evaluate a record linkage system or software, it needs to be employed on a set of suitable databases from the same domain;
It is generally very difficult to obtain such test databases;
In other research areas (databases, machine learning, information retrieval) there are publicly available test data collections or even benchmarking systems;
In record linkage, individual research groups provide small test data collections;
Generating and using synthetic data (1)
Privacy issues prohibit publication of real personal information;
De-identified or encrypted data cannot be used for record linkage research or evaluation of record linkage systems;
Alternatively, create artificial / synthetic databases for research and evaluation, with several advantages;
Creating synthetic data is usually a two-step process:
First generate data based on lookup tables with real values and their distributions;
Then corrupt the generated data using realistic corruption functions;
Synthetic data can be generated together with its ground truth for evaluation (e.g. using the GeCo data generator);
Intensive Week - Lecture 22: Summary Intensive Week and Outlook
Record Linkage:
Record linkage is the process of linking records that represent the same entity in one or more databases;
Data cleaning and standardisation is important to make sure the databases to be linked have similar structure and content;
The aim of blocking is to cheaply remove candidate record pairs which are obviously not matches (i.e. are non-matches);
A crucial aspect of blocking is the selection of suitable blocking keys (accurate, complete, stable over time, uniformly distributed);
Record Linkage - 2
In the comparison step, the record pairs generated by blocking are compared in detail using appropriate comparison functions; (exact, numerical, approximate string matching functions, etc.);
For each compared record pair, a vector of similarities is calculated (with one similarity per compared attribute);
These are then classified into matches, non-matches, and (depending upon the classification technique used) potential matches, using one of the many classification techniques available, ranging from simple thresholds to sophisticated machine learning approaches;
The final evaluation considers both linkage quality and complexity (using recall, precision, reduction ratio, etc.)
Future challenges and research directions
Most linkage techniques assume static databases;
In the era of Big Data, we are dealing with data streams, and therefore real-time matching and dynamic approaches are needed;
Linking data across organisations raises privacy and confidentiality issues;
How do we link new complex types of data?
Multimedia data, relational data, unstructured data, groups of records, etc.