German English

Privacy Preserving Record Linkage (PPRL)

Overview

Record linkage aims at linking records that refer to the same real-world entity, such as persons. Typically there is a lack of global identifiers, therefore the linkage can only be achieved by comparing available quasi-identifiers, such as name, address or date of birth. However, in many cases, data owners are only willing or allowed to provide their data for such data integration if there is sufficient protection of sensitive information to ensure the privacy of persons, such as patients or customers. For example, in medical research data of several sources (e. g., hospitals) has to be matched to investigate possible correlations between some diseases of the same patients without revealing the identity of patients.

Privacy Preserving Record Linkage (PPRL) addresses this problem by providing techniques to match records while preserving their privacy allowing the combination of data from different sources for improved data analysis and research. For this purpose, the linkage of person-related records is based on encoded values of the quasi-identifiers and the data needed for analysis (e. g., health data) is separated from these quasi-identifiers. The relevant data can be provided to a researcher without the identifying data. Those PPRL approaches can be applied in many areas, such as public health surveillance, demographical studies and marketing analysis. PPRL is confronted with several challenges needing to be solved to ensure its practical applicability. In particular, a high degree of privacy has to be ensured by suitable encoding of sensitive data and organizational structures, such as the use of a trusted linkage unit. PPRL must achieve a high linkage quality by avoiding false or missing matches. Furthermore, a high efficiency with fast linkage time and scalability to large data volumes are needed. A main problem for performance is the inherent quadratic complexity of the linkage problem when every record of the first source is compared with every record of the second source. For better efficiency, the number of comparisons can be reduced by adopting blocking or filtering approaches. Furthermore, the matching can be performed in parallel on multiple processing nodes.

Research Results

In the last few years, we concentrated on improving the performance and scalability of our PPRL methods. As the records are represented as bit vectors (using Bloom filters for encoding), we used metric space similarity measures for filtering. In particular, the pivot-based approach for metric spaces utilizing the triangle inequality to reduce the search space showed significant improvement of performance compared to previous filter techniques. One data source is indexed by determining some records as pivots and assigning the leftover records to them. We can save many similarity computations by comparing the records of the second source with only the pivots first and exclude most records as possible matches.

Recently, we parallelized this approach utilizing the modern framework Apache Flink, which supports efficient parallel and distributed in-memory processing. We developed parallel algorithms for both determining the pivots and the pivot-based matching process for different strategies to select the pivots. The distributed implementation outperforms the centralized version for large dataset and improves the scalability of this approach.

Additionally, we investigated locality-sensitive hashing (LSH) as blocking method and also developed a distributed algorithm using Apache Flink. LSH enables probabilistic blocking by applying hash functions on bit vectors (Bloom filters). LSH-based blocking supports a flexible configuration and can be applied on encoded data. An extensive evaluation has shown that LSH-based blocking can achieve high quality and high scalability up to millions of records. In contrast, phonetic blocking approaches based on Soundex are susceptible to cryptanalysis and more sensitive with respect to data quality problems. Moreover, phonetic blocking based on a single attribute is not feasible for large datasets due to the low number of blocks and data skew effect introduced by frequent names.

Furthermore, we analyzed methods for multi-party PPRL considering applications, where more than two data holder are involved in the process of privacy preserving record linkage. The most challenging case of multiparty PPRL is to find subset matches, i. e. sets of matching records that are not in all sources, but in subsets of them. The intuitive way is to link the sources sequentially using for example the metric space as a growing index structure. In each iteration a new source is linked with the other already matched sources, then indexed. At the end of this process resulting clusters need to be cleaned if we consider that the sources are deduplicated (only one records from each source). Solutions to this problem are still a research topic.

Research Goals

  • Improve scalability of PPRL to enable fast linkage of millions of records from two or more data owners.
  • Increase linkage quality of PPRL by
    • investigating different encoding techniques,
    • low-loss blocking and filtering methods,
    • advanced similarity functions and classification techniques and
    • post-processing methods to apply link constraints.
  • Quantify and guarantee privacy properties of the encoding techniques used.
  • Bring PPRL approaches in practical applications.

PRIMAT - PRIvate MAtching Toolbox

We developed an open source (ALv2) toolbox, named PRIMAT, to ease the definition and execution of tailored PPRL workflows. It offers several components for data owners and the central linkage unit that provide state-of-the-art PPRL methods, including Bloom-filter-based encoding and hardening techniques, LSH-based blocking, post-processing and more.

Source Code

GitLab

Cooperations & Practical Applications

As part of the Medical Informatics Initiative within the SMITH consortium we develop a PPRL infrastructure to be used at hospitals and other health care facilities to allow privacy-preserving linkage of patient records enhancing clinical research and analysis.

The BMBF-funded DE4L project is pursuing the development of an intelligent ecosystem as part of a platform for data exchange for logistics service companies. We are striving to build an innovative Blockchain/Distributed Ledger-based trading platform for sensor data from logistics. For this purpose sensors are developed and different types of data are recorded. Moreover, we build methods and techniques for privacy-preserving trading, machine learning and data exchange, especially of sensor and business partner data.

Project Members

Students

  • Caroline Mösler
  • Thomas Hoppe
  • Lennart Laverenz
  • Max Schrodt
  • Elias Messner
  • Duc Dung Dao
  • Manuel Friedrich
  • Tim Häntschel
  • Manh Le
  • Theresa Butenschön

Related Publications

PDF

Google Scholar
publication iconLange, Lucas; Degenkolb, Borislav; Rahm, Erhard
Privacy-Preserving Stress Detection Using Smartwatch Health Data
4. Interdisciplinary Privacy & Security at Large Workshop, INFORMATIK 2023
2023-09
PDF

Google Scholar
publication iconVogel, Felix; Lange, Lucas
Privacy-Preserving Sentiment Analysis on Twitter
SKILL 2023
2023-09
PDF
further information
Google Scholar
publication iconIntemann, T.; Kaulke, K.; Kipker, D.; Lettieri V.; Stallmann, C.; Schmidt, C. O.; Geidel, L.; Bialke, M.; Hampf, C.; Stahl, D.; Lablans, M.; Rohde, F.; Franke, Martin; Kraywinkel, K; Kieschke, J.; Bartholomäusn, S.; Näher, A.-F.; Tremper, G; et al.
White Paper: Verbesserung des Record Linkage für die Gesundheitsforschung in Deutschland

2023-08
PDF
further information
Google Scholar
Lange, Lucas; Schreieder, Tobias; Christen, Victor; Rahm, Erhard
Privacy at Risk: Exploiting Similarities in Health Data for Identity Inference
arXiv preprint arXiv:2308.08310
2023-08
PDF
further information
Google Scholar
Lange, Lucas; Schneider, Maja; Christen, Peter; Rahm, Erhard
Privacy in Practice: Private COVID-19 Detection in X-Ray Images
20th International Conference on Security and Cryptography (SECRYPT 2023)
2023-07
PDF

Google Scholar
publication iconRohde, Florens; Franke, Martin; Christen, Victor; Rahm, Erhard
Value-specific Weighting for Record-level Encodings in Privacy-Preserving Record Linkage
Conference on Database Systems for Business, Technology and Web (BTW) 2023
2023
PDF
further information
Google Scholar
publication iconChristen, Victor; Häntschel, Tim; Christen, Peter; Rahm, Erhard
Privacy-Preserving Record Linkage using Autoencoders
International Journal of Data Science and Analytics
2022-12-16
PDF
further information
Google Scholar
Lange, Lucas; Schneider, Maja; Christen, Peter; Rahm, Erhard
Privacy in Practice: Private COVID-19 Detection in X-Ray Images (Extended Version)
arXiv preprint arXiv:2211.11434
2022-11
PDF

Google Scholar
publication iconSchneider, M.; Gehrke, L.; Christen, P.; Rahm, E.
D-TOUR : Detour-based point of interest detection in privacy-sensitive trajectories
Conference: 3. Interdisciplinary Privacy & Security at Large Workshop (Privacy&Security@Large), INFORMATIK 2022
2022-09
PDF
further information
Google Scholar
Franke, Martin; Sehili, Ziad; Rohde, Florens; Rahm, Erhard
Evaluation of Hardening Techniques for Privacy-Preserving Record Linkage
Proceedings of the 24th International Conference on Extending Database Technology (EDBT), pp. 289-300
2021-03
PDF

Google Scholar
Sehili, Z.; Rohde, F.; Franke, M.; Rahm, E.
Multi-Party Privacy Preserving Record Linkage in Dynamic Metric Space
Proc. 19. GI-Fachtagung für Datenbanksysteme für Business, Technologie und Web (BTW), 2021
2021-02
PDF
further information
Google Scholar
Rohde, Florens; Franke, Martin; Sehili, Ziad; Lablans, Martin; Rahm, Erhard
Optimization of the Mainzelliste software for fast privacy-preserving record linkage
BMC Journal of Translational Medicine
2021-01
PDF
further information
Google Scholar
Franke, M.; Sehili, Z.; Rahm, E.
PRIMAT: A Toolbox for Fast Privacy-preserving Matching
PVLDB 12(12): 1826-1829
2019-08
PDF
further information
Google Scholar
Franke, Martin; Gladbach, Marcel; Sehili, Ziad; Rohde, Florens; Rahm, Erhard
ScaDS Research on Scalable Privacy-preserving Record Linkage
Datenbank-Spektrum
2019-02
PDF
further information
Google Scholar
Franke, Martin; Sehili, Ziad; Gladbach, Marcel; Rahm, Erhard
Post-processing Methods for High Quality Privacy-Preserving Record Linkage
Data Privacy Management, Cryptocurrencies and Blockchain Technology. DPM 2018, CBT 2018. pp. 263–278. Lecture Notes in Computer Science, vol 11025. Springer
2018-09
PDF

Google Scholar
Gladbach, Marcel; Sehili, Ziad; Kudraß, Thomas; Christen, Peter; Rahm, Erhard
Distributed Privacy-Preserving Record Linkage using Pivot-based Filter Techniques
Proceedings of the 2018 IEEE 34th International Conference on Data Engineering Workshops (ICDEW), pp. 33-38, 2018
2018-04
PDF
further information
Google Scholar
Franke, Martin; Sehili, Ziad; Rahm, Erhard
Parallel Privacy-Preserving Record Linkage using LSH-based blocking
3rd International Conference on Internet of Things, Big Data and Security (IoTBDS), pp. 195-203, 2018
2018-03
PDF

Google Scholar
Vatsalan, Dinusha; Sehili, Ziad; Peter Christen, Peter; Rahm, Erhard
Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges
In: Handbook of Big Data Technologies (eds.: A. Zomaya, S. Sakr) , Springer 2017
2017-02
PDF

Google Scholar
Vatsalan, D.; Christen, P.; Rahm, E.
Scalable privacy-preserving linking of multiple databases using Counting Bloom filters
Proc ICDM workshop on Privacy and Discrimination in Data Mining (PDDM)
2016-12
PDF
further information
Google Scholar
Sehili, Ziad; Rahm, Erhard
Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures
Datenbankspektrum 16, pp. 227-236
2016-11
PDF

Google Scholar
Sehili, Z.; Kolb, L.; Borgs, C.; Schnell, R.; Rahm, E.
Privacy Preserving Record Linkage with PPJoin
Proc. of 16. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW), 2015
2015-03