Description
Privacy-preserving record linkage (PPRL) aims to link person-related records from different data sources while protecting privacy. It is applied in medical research to link health data without revealing sensitive person-related data. We propose and evaluate a new parallel PPRL approach based on Apache Flink that targets high performance and scalability to large datasets. The approach supports a pivot-based filtering method for metric distance functions that saves many similarity computations. We describe our distributed approaches for determining pivots and for pivot-based linkage. We also demonstrate the high efficiency of the approach for different datasets and configurations.
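
To illustrate why pivot-based filtering saves comparisons for a metric distance function, the following is a minimal single-machine sketch, not the Flink implementation described here. It assumes that records are encoded as bit vectors (represented as Python integers), that the metric is the Hamming distance, and that a single fixed pivot is used. All function and parameter names (`hamming`, `pivot_filtered_link`, `brute_force_link`, `max_dist`) are hypothetical. By the triangle inequality, if |d(s, p) - d(t, p)| exceeds the distance threshold, then d(s, t) must exceed it as well, so the pair can be skipped without computing d(s, t).

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as ints (a metric)."""
    return bin(a ^ b).count("1")


def pivot_filtered_link(
    source: List[int], target: List[int], pivot: int, max_dist: int
) -> Tuple[List[Tuple[int, int]], int]:
    """Return (matches, number of full distance computations performed).

    Assumption: one fixed pivot; the approach in the text selects pivots
    in a distributed way, which this sketch does not reproduce.
    """
    # Precompute the distance of every target record to the pivot once.
    target_to_pivot = [(t, hamming(t, pivot)) for t in target]
    matches = []
    computed = 0
    for s in source:
        d_sp = hamming(s, pivot)
        for t, d_tp in target_to_pivot:
            # Triangle inequality: d(s, t) >= |d(s, p) - d(t, p)|.
            # If this lower bound already exceeds max_dist, skip the pair.
            if abs(d_sp - d_tp) > max_dist:
                continue
            computed += 1
            if hamming(s, t) <= max_dist:
                matches.append((s, t))
    return matches, computed


def brute_force_link(
    source: List[int], target: List[int], max_dist: int
) -> List[Tuple[int, int]]:
    """Reference result: compare every source record with every target record."""
    return [(s, t) for s in source for t in target if hamming(s, t) <= max_dist]


# Toy data (hypothetical 8-bit encodings, not real PPRL records).
source = [0b11110000, 0b00001111, 0b10101010]
target = [0b11110001, 0b00000000, 0b10101011]
matches, computed = pivot_filtered_link(source, target, pivot=0b00000000, max_dist=1)
```

In this toy run the filter discards every pair involving the all-zero target record, so fewer than the nine possible pairs are fully compared, while the set of matches is the same as with exhaustive comparison. How much is saved in practice depends on the choice of pivots and on the data distribution.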