Advancing data integration: Privacy and semantics for record linkage




The research project “Advancing data integration: Privacy and semantics for record linkage” is a collaboration between the Australian National University (ANU) and Leipzig University. The project is currently funded by Deutscher Akademischer Austauschdienst (DAAD) and Universities Australia within the Australia-Germany Joint Research Cooperation Scheme.

Project Objectives

Integrating data across disparate sources is an increasingly important task in many data mining and Big Data projects, with applications ranging from national security to health and social science research. Two major tasks in data integration are ontology matching (OM), where the aim is to identify matching concepts across diverse ontologies or to annotate entities with ontology concepts; and record linkage (RL, also known as data matching and entity resolution), where the aim is to identify all records that refer to the same real-world entity across several databases. Within this project we focus on the following two main topics:

Scalable parallel privacy-preserving record linkage (PPRL)

In the health and social sciences, dedicated methods for privacy-preserving record linkage (PPRL) play a key role. In this research we will adapt the PPRL protocols developed at the Australian National University (ANU) to our modern parallel computing systems (available in ScaDS Dresden/Leipzig, one of two German centers of excellence on Big Data) to enable privacy-preserving linking of massive-scale databases across many parties. We will address how the computational components of PPRL protocols can best be parallelized to improve their scalability to large datasets, how a data center can be used as a powerful linkage unit so that computations are centralized, and how to employ advanced classification techniques for PPRL in parallel environments.
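To make the role of the linkage unit more concrete, the following is a minimal sketch of one encoding that is common in the PPRL literature: Bloom filters built from character q-grams and compared with the Dice coefficient. The project description does not state which encoding the ANU protocols use, so the technique itself is an assumption here, as are the names `bloom_encode`, `dice` and `link`, the shared secret, and all parameter values.

```python
import hashlib


def qgrams(value, q=2):
    """Split a padded, lower-cased string into its set of character q-grams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value, secret, size=1024, num_hashes=4):
    """Encode a value as the set of bit positions set in a Bloom filter.

    Each q-gram is hashed num_hashes times with a secret shared by the
    database owners (illustrative assumption), so the linkage unit sees
    only bit positions and never the plain-text value.
    """
    bits = set()
    for gram in qgrams(value):
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{secret}|{k}|{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient of two bit-position sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def link(party_a, party_b, threshold=0.8):
    """Linkage-unit step: compare encodings from two parties pairwise."""
    matches = []
    for id_a, enc_a in party_a.items():
        for id_b, enc_b in party_b.items():
            if dice(enc_a, enc_b) >= threshold:
                matches.append((id_a, id_b))
    return matches


if __name__ == "__main__":
    secret = "shared-secret"  # hypothetical key agreed on by the database owners
    party_a = {"a1": bloom_encode("smith", secret)}
    party_b = {"b1": bloom_encode("smyth", secret),
               "b2": bloom_encode("jones", secret)}
    print(link(party_a, party_b, threshold=0.5))
```

Each pairwise comparison in `link` is independent of the others, so in a sketch like this the nested loop is the part that could be partitioned across workers in a parallel environment.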

Semantic matching using ontologies

In data integration, RL and OM have largely been investigated independently. However, we see great potential in combining entity and ontology matching techniques. In the health sciences such approaches are vital for annotating biomedical entities with ontology concepts in order to enrich their description and their usability for data analysis. We will investigate automatic annotation and indexing methods that use ontology concepts for large-scale matching of biomedical entities. We will incorporate ontology concepts and textual features into a novel clustering framework to enhance matching quality, and extend techniques to match entities from multiple (i.e. more than two) heterogeneous databases based on the developed indices on shared ontology concepts.
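The idea of an index on shared ontology concepts can be illustrated with a small sketch. It assumes that every record has already been annotated with a set of concept identifiers, and it builds an inverted index from concept to records so that only records from different databases that share at least one concept become candidate pairs. The function names, the record layout and the concept labels are hypothetical and are not taken from the project.

```python
from collections import defaultdict
from itertools import combinations


def build_concept_index(annotated_records):
    """Map each ontology concept to the (source, record_id) pairs annotated with it.

    annotated_records: iterable of (source, record_id, concepts) tuples.
    """
    index = defaultdict(set)
    for source, record_id, concepts in annotated_records:
        for concept in concepts:
            index[concept].add((source, record_id))
    return index


def candidate_pairs(index):
    """Return pairs of records from different sources that share a concept."""
    pairs = set()
    for members in index.values():
        for left, right in combinations(sorted(members), 2):
            if left[0] != right[0]:  # skip pairs from the same database
                pairs.add((left, right))
    return pairs


if __name__ == "__main__":
    # Made-up annotations across three databases (illustrative only).
    records = [
        ("db1", "r1", {"concept:diabetes"}),
        ("db2", "r7", {"concept:diabetes", "concept:insulin"}),
        ("db3", "r9", {"concept:insulin"}),
        ("db1", "r2", {"concept:asthma"}),
    ]
    for pair in sorted(candidate_pairs(build_concept_index(records))):
        print(pair)
```

Because the index is keyed by concept rather than by database, the same structure works for two sources or for many; a later comparison or clustering step would then only need to examine the candidate pairs.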

Project Members

Database Group - Universität Leipzig, ScaDS Dresden/Leipzig

Research School of Computer Science, The Australian National University