06 Post-intensive Week 6 - Lecture 25: Privacy aspects in data wrangling and privacy-preserving record linkage
- Privacy, confidentiality and security
- Privacy: the right of entities to make decisions about how their personal data are shared and used;
- Confidentiality: the obligation or responsibility of professionals and organisations who have access to data to hold it in confidence;
- Security: the means or tools used to protect the privacy of entities' data and to support professionals/organisations in holding data in confidence.
- Privacy by design
- Personal data are valuable for various applications, but their use, storage, sharing, exchange, or disclosure raises growing privacy concerns:
- important to have systems in place that provide data protection;
- allow applications and research studies to utilise the information available in the data;
- Standards and regulations are required:
- safe environments in which to handle data;
- proper handling procedures and outputs;
- safe storage;
- privacy laws, such as the European Union's GDPR;
- Privacy in data wrangling
- Preserve privacy and confidentiality of entities represented by data during the data wrangling pipeline;
- Privacy and confidentiality concerns arise when data are shared or exchanged between different organisations:
- mainly the data integration/record linkage tasks in the pipeline, which require data to be integrated from multiple sources held by different organisations;
- Require disclosure limitation to protect the privacy and confidentiality of sensitive data.
- Disclosure limitations
- Filter or mask raw data to block what is revealed or disclosed;
- Disclosure-limited masking:
- using masking (encoding) functions to transform data such that there exists a specific functional relationship between the masked and the original data;
- budget-constrained problem: the goal of masking functions is to achieve maximum utility under a fixed privacy budget;
- examples include noise addition, generalisation and probabilistic filters;
- Privacy-preserving record linkage (PPRL)
- Objective of PPRL is to perform linkage across organisations using masked (encoded) records:
- besides certain attributes of the matched records, no information about the sensitive original data can be learned by any party involved in the linkage, or by any external party;
- PPRL: example applications
- Health outbreak systems
- early detection of infectious diseases before they spread widely;
- requires data to be integrated across human health data, travel data, consumed-drug data, and even animal health data;
- National security applications:
- integrate data from law enforcement agencies, internet service providers, and financial institutions to identify crime and fraud, or terrorism suspects
- Business applications:
- compile mailing lists or integrate customer data from different sources for marketing activities and/or recommendation systems.
- None of the parties is willing, or allowed by law, to exchange or provide their data to the other parties;
- PPRL protocols:
- Three-party protocols: use a linkage unit to conduct/facilitate linkage;
- Two-party protocols: only two database owners participate in the linkage;
- Multi-party protocols: linking records from multiple databases, with the additional challenges of scalability and privacy risk;
- PPRL Adversary models:
- Honest-but-curious (HBC) or semi-honest model
- Parties follow the protocol while being curious to learn about another party's data
- Most existing PPRL protocols assume HBC model
- Malicious model:
- parties may behave arbitrarily, not following the protocol;
- evaluating privacy under malicious model is difficult;
- Advanced models:
- accountable computing and the covert model allow identifying, with a certain probability, whether a party has not followed the protocol;
- lower complexity than the malicious model, and more secure than HBC;
- Attack Models
- Dictionary attack;
- Frequency attack;
- Cryptanalysis attack;
- Collusion;
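A minimal sketch of the dictionary attack listed above, assuming the masked values are unsalted SHA-256 hash codes and that the attacker holds a public list of plausible values (the names here are purely illustrative):

```python
import hashlib

def encode(value: str) -> str:
    """Mask a value with an unsalted one-way hash (SHA-256 assumed here)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Dictionary attack: the attacker encodes every entry of a public list of
# plausible values and looks up the observed hash codes in that table.
public_names = ["anna", "peter", "maria"]
lookup = {encode(name): name for name in public_names}

observed_code = encode("peter")   # a masked value seen by the attacker
print(lookup.get(observed_code))  # re-identifies "peter"
```

A keyed hash (e.g. HMAC with a secret key shared only by the database owners) blocks this attack, since the attacker can no longer build the lookup table.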
- PPRL Techniques
- Several techniques developed:
- generalisation such as k-anonymity, noise addition and differential privacy;
- secure multi-party computation (SMC) such as homomorphic encryptions and secure summation;
- probabilistic filters such as bloom filters and variations;
- First generation (mid 90s): exact matching only;
- Second generation (early 00s): approximate matching but not scalable;
- Third generation (mid 00s): take scalability into account;
- Secure hash encoding
- First generation PPRL technique;
- Use a one-way hash encoding function (such as SHA) to encode values, then compare the hash-encoded values to identify matching records:
- only exact matching is possible;
- single character difference in two values results in a pair of completely different hash-encoded values;
- Having access only to the hash codes makes it nearly impossible to learn the original values:
- frequency attacks are still possible
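A short sketch of secure hash encoding (SHA-256 chosen as the one-way function for illustration), showing that exact matching works but a single-character difference produces a completely different code:

```python
import hashlib

def encode(value: str) -> str:
    """One-way hash encoding of an attribute value (SHA-256 assumed)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Exact matching on hash codes is possible:
print(encode("peter") == encode("peter"))  # True

# But a one-character difference yields a completely different code,
# so approximate matching is impossible:
print(encode("peter")[:16])
print(encode("pete")[:16])
```

Because equal input values always produce equal codes, the frequency distribution of the codes mirrors that of the original values, which is why frequency attacks remain possible.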
- Noise and differential privacy
- Add noise to overcome frequency attack at the cost of loss in linkage quality;
- Differential privacy is an alternative to random noise addition:
- the probability that any property holds on the perturbed database is approximately the same whether or not an individual's value is present in the database;
- magnitude of noise depends on privacy budget and sensitivity of data;
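The Laplace mechanism is one standard way to realise this: noise with scale sensitivity/epsilon is added to a query answer. A minimal sketch (the counting query and parameter values are illustrative):

```python
import math
import random

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Perturb a query answer with Laplace noise; the noise magnitude
    depends on the privacy budget (epsilon) and the query's sensitivity."""
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) by inverse-transform from a uniform in (-0.5, 0.5)
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_answer + noise

# A counting query (sensitivity 1) released under privacy budget epsilon = 1
random.seed(42)
print(laplace_mechanism(100, sensitivity=1.0, epsilon=1.0))
```

A smaller epsilon (stricter privacy budget) or a more sensitive query yields larger noise, i.e. more privacy at the cost of utility.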
- Generalisation techniques
- Generalise records to overcome frequency attacks;
- Example: k-anonymity
- ensure every combination of attribute values is shared by at least k records;
- Other techniques - value generalisation hierarchies and binning (as covered earlier in the course)
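A small sketch of checking k-anonymity on generalised records (the attribute names, binned age ranges, and masked postcodes are illustrative):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values is shared
    by at least k records."""
    combos = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Records generalised via binning (age ranges) and truncation (postcodes)
records = [
    {"age": "20-29", "postcode": "26**"},
    {"age": "20-29", "postcode": "26**"},
    {"age": "30-39", "postcode": "26**"},
    {"age": "30-39", "postcode": "26**"},
]
print(is_k_anonymous(records, ["age", "postcode"], k=2))  # True
```

With k=3 the same records fail the check, since each (age, postcode) combination occurs only twice.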
- Encryption and SMC
- Commutative and homomorphic encryption schemes are used;
- Computationally expensive;
- Secure scalar product, secure set intersection, secure set union, and secure summation are the most commonly used SMC techniques;
- Example: secure summation of values x1 = 6 and x2 = 5 using a linkage unit (LU)
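The secure summation example above can be sketched as follows (the chain order, in which the LU seeds a random value and each party adds its private input, is one common variant; the modulus is an illustrative parameter):

```python
import random

def secure_summation(x_values, modulus=10**6):
    """Secure summation via a linkage unit (LU): the LU seeds the chain
    with a random value r known only to itself, each party adds its private
    input modulo `modulus`, and the LU removes r at the end. No partial sum
    passed along the chain reveals any individual party's input."""
    r = random.randrange(modulus)           # known only to the LU
    partial = r
    for x in x_values:                      # each party in turn adds its value
        partial = (partial + x) % modulus
    return (partial - r) % modulus          # LU subtracts r to recover the sum

print(secure_summation([6, 5]))  # 11
```

With x1 = 6 and x2 = 5 from the lecture example, the LU learns only the total 11, and each party sees only a uniformly random-looking partial sum.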
6.1. Bloom filters
- Probabilistic data structure
- a bit vector of l bits (initially all set to 0);
- k independent hash functions are used to hash-map each element in a set S into a Bloom filter by setting the corresponding bits to 1.
- The Dice coefficient similarity of p Bloom filters is calculated as:
- dice_sim(b1, ..., bp) = p × z / (Σ xi), where z is the number of 1-bits common to all p Bloom filters and xi is the number of 1-bits in bi
- Example: peter (b1) = 1,1,0,1,1,0,0,1,1 and pete (b2) = 1,1,0,1,1,0,0,0,1
- x1 = 6, x2 = 5, z = 5
- sim = 2 × 5 / (6 + 5) ≈ 0.9
- Bloom filter-based matching:
- similarity of Bloom filters can be calculated using a token-based similarity function, such as Jaccard, Dice, or Hamming;
- Dice is mostly used, as it is insensitive to the many matching zeros;
- similarity of Bloom filters >= similarity of input values (due to the false positive rate of Bloom filters);
- False positive rate determines privacy and accuracy:
- the larger the false positive rate, the higher the privacy but lower the accuracy;
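The steps above can be sketched end-to-end: encode each value's q-grams into a Bloom filter with k hash functions, then compare filters with the Dice coefficient. The filter length l = 30, k = 2, and the use of bigrams are illustrative parameters, and the resulting bit patterns will differ from the lecture's example vectors:

```python
import hashlib

def qgrams(value: str, q: int = 2):
    """Split a string into its set of overlapping q-grams (bigrams here)."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def bloom_encode(value: str, l: int = 30, k: int = 2):
    """Hash-map each q-gram of `value` into an l-bit Bloom filter
    using k independent hash functions (simulated by salting SHA-256)."""
    bits = [0] * l
    for gram in qgrams(value):
        for i in range(k):
            h = hashlib.sha256(f"{i}:{gram}".encode()).digest()
            bits[int.from_bytes(h, "big") % l] = 1
    return bits

def dice_sim(b1, b2):
    """Dice similarity of two Bloom filters: 2z / (x1 + x2)."""
    z = sum(a & b for a, b in zip(b1, b2))
    return 2 * z / (sum(b1) + sum(b2))

b1, b2 = bloom_encode("peter"), bloom_encode("pete")
print(round(dice_sim(b1, b2), 2))  # high similarity despite no exact match
```

Because the bigrams of "pete" are a subset of those of "peter", the two filters share most of their 1-bits, so approximate matching succeeds where secure hash encoding would fail.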
6.2. Bloom filters
- Attacks on Bloom filters
- susceptible to cryptanalysis attacks: mapping bit patterns to q-grams and values based on frequency and co-occurrence information;
- several attack methods on basic Bloom filters have been developed;
- We have recently developed a new efficient attack method that allows re-identification of frequent attribute values:
- co-occurring patterns in Bloom filter bit positions are identified using pattern mining;
- the most successful attack method so far;
- Advanced Bloom filter hardening techniques are required;