Entity Matching for Big Data

Overview

Automatically matching entities (objects) and ontologies are key technologies to semantically integrate heterogeneous data. These match techniques are needed to identify equivalent data objects (duplicates) or semantically equivalent metadata elements (ontology concepts, schema attributes). The proposed techniques demand very high resources that limit their applicability to large-scale (Big Data) problems unless a powerful cloud infrastructure can be utilized. This is because the (fuzzy) match approaches basically have a quadratic complexity to compare the all elements to be matched with each other. For sufficient match quality, multiple match algorithms need to be applied and combined within so-called match workflows adding further resource requirements as well as a significant optimization problem to select matchers and configure their combination.

Aims

Efficient parallel execution of match workflows in the cloud
Efficient application of machine learning models for entity / ontology matching

General Entity Matching Workflow

The common approach to improve efficiency is to reduce the search space by adopting blocking techniques. They utilize a blocking key on the values of one or several entity attributes to partition the input data into multiple partitions (called blocks) and restrict the subsequent matching to entities of the same block. For example, product entities may be partitioned by manufacturer values such that only products of the same manufacturer are evaluated to find matching entity pairs.

Entity Matching with MapReduce

The MapReduce model is well suited to execute blocking-based entity matching in parallel within several map and reduce tasks. In particular, several map tasks can read the input entities in parallel and redistribute them among several reduce tasks based on the blocking key. This guarantees that all entities of the same block are assigned to the same reduce task so that different blocks can be matched in parallel by multiple reduce tasks.

Skew Handling / Load Balancing: Basic MR implementation of entity matching is susceptible to severe load imbalances due to skewed blocks sizes since the match work of entire blocks is assigned to a single reduce task. We are developing effective load balancing approaches to data skew handling for MR-based entity matching (and, in general, all kind of pairwise similarity computation). For a quick overview see our CIKM 2011 poster (right).

Redundancy-free comparisons: Modern entity matching approaches assign entities to more than one block. For example, multi-pass approaches use several blocking keys to still achieve high recall in the presence of noisy data. Similarily, token-based matching approaches (e.g., PPJoin) generate a list of tokens (i.e., blocks) for each entity. Such overlapping blocks may lead to mutliple comparisons of the same pair of entity which in turn decreases efficieny. We therefore develop redundancy-free approaches that ensure that every entity pair is processed by one reduce task only.

Dedoop

We summarized our latest research results in a prototype, called Dedoop (Deduplication with Hadoop). It is a powerful and easy-to-use tool for MapReduce-based entity resolution of large datasets.

Dedoop…

Provides a Web interface to specify entity resolution strategies for match tasks
Automatically transforms the specification into an executable MapReduce workflow and manages its submission and execution on Hadoop clusters
Is designed to serve multiple users that may simultaneously execute multiple workflows on the same or different Hadoop clusters
Provides load balancing strategies to evenly utilize all nodes of the cluster
Includes a graphical HDFS and S3 file manager
Supports the repeated execution of jobs within a MapReduce workflow until a stop condition if fulfilled
Automatically spawns and terminates SOCKS proxy servers to pass connections to Hadoop nodes on Amazon EC2
Supports to configure and launch new Hadoop clusters on Amazon EC2
Can be adapted to configure, schedule, and monitor arbitrary MapReduce workflows

Acknowledgements

This project is supported by an AWS in Education research grant award.

Project Members

Publications

Sehili, Z.; Kolb, L.; Borgs, C.; Schnell, R.; Rahm, E.
Privacy Preserving Record Linkage with PPJoin
Proc. of 16. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW), 2015
2015-03

Kolb, L.
Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows
Dissertation, Universität Leipzig
2014-09

Kolb, L.; Sehili, Z.; Rahm, E.
Iterative Computation of Connected Graph Components with MapReduce
Datenbank-Spektrum 14 (2), 2014
2014-07

Kolb, L.; Thor, A.; Rahm, E.
Don't Match Twice: Redundancy-free Similarity Computation with MapReduce
Proc. 2nd Intl. Workshop on Data Analytics in the Cloud (DanaC), 2013
2013-06

Ngonga Ngomo A.-C.; Kolb, L.; Heino, N.; Hartung, M.; Auer, S.; Rahm, E.
When to Reach for the Cloud: Using Parallel Hardware for Link Discovery
Proc. 10th Intl. Extended Semantic Web Conference (ESWC), 2013
2013-05

Kolb, L.; Rahm, E.
Parallel Entity Resolution with Dedoop
Datenbank-Spektrum 13 (1), 2013
2013-02-23

Kolb, L.; Thor, A.; Rahm, E.
Dedoop: Efficient Deduplication with Hadoop
Proc. 38th Intl. Conference on Very Large Databases (VLDB) / Proc. of the VLDB Endowment 5(12), 2012
2012-08

Kolb, L.; Thor, A.; Rahm, E.
Load Balancing for MapReduce-based Entity Resolution
Proc. 28th Intl. Conference on Data Engineering (ICDE), 2012
2012-04

Kolb, L.; Thor, A.; Rahm, E.
Multi-pass Sorted Neighborhood Blocking with MapReduce
Computer Science - Research and Development 27(1), 2012
2012-02

Kolb, L.; Köpcke, H.; Thor, A.; Rahm, E.
Learning-based Entity Resolution with MapReduce
Proc. 3rd Intl. Workshop on Cloud Data Management (CloudDB), 2011
2011-10

Kolb, L; Thor, A.; Rahm, E.
Block-based Load Balancing for Entity Resolution with MapReduce
Proc. 20th Intl. Conference on Information and Knowledge Management (CIKM), 2011
2011-10

Kolb, L.; Thor, A.; Rahm, E.
Parallel Sorted Neighborhood Blocking with MapReduce
Proc. 14th GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW), 2011
2011-03

Other related publications

Fisher, Jeffrey; Christen, Peter; Wang, Qing; Rahm, Erhard
A Clustering-Based Framework to Control Block Sizes for Entity Resolution
Proc. 21st ACM SIGKDD Conf. on Knowledge Discovery and Mining (KDD), 279-288, 2015
2015-08

Groß, A.
Evolution von ontologiebasierten Mappings in den Lebenswissenschaften
Dissertation, Universität Leipzig
2014-03

Hartung, M.; Kolb, L.; Groß, A.; Rahm, E.
Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA
Proc. 9th Intl. Conference on Data Integration in the Life Sciences (DILS), 2013
2013-07

Kirsten, T.; Groß, A.; Hartung, M.; Rahm, E.
GOMMA: A Component-based Infrastructure for managing and analyzing Life Science Ontologies and their Evolution
Journal of Biomedical Semantics 2011, 2:6
2011-11

Groß, A.; Hartung, M.; Kirsten, T.; Rahm, E.
Mapping Composition for Matching Large Life Science Ontologies
2nd International Conference on Biomedical Ontology (ICBO 2011)
2011-07

Rahm, E.
Towards large-scale schema and ontology matching
Schema Matching and Mapping, Springer-Verlag
2011-02

Köpcke, H.; Thor, A.; Rahm, E.
Evaluation of entity resolution approaches on real-world match problems
Proc. 36th Intl. Conference on Very Large Databases (VLDB) / Proceedings of the VLDB Endowment 3(1), 2010
2010-09

Kirsten, T.; Kolb, L.; Hartung, M.; Groß, A.; Köpcke, H.; Rahm, E.
Data Partitioning for Parallel Entity Matching
Proc. 8th Intl. Workshop on Quality in Databases (QDB), 2010
2010-09

Groß, A.; Hartung, M.; Kirsten, T.; Rahm, E.
On Matching Large Life Science Ontologies in Parallel
7th International Conference on Data Integration in the Life Sciences (DILS 2010)
2010-08

Köpcke, H.; Thor, A.; Rahm, E.
Comparative evaluation of entity resolution approaches with FEVER
Proc. 35th Intl. Conference on Very Large Databases (VLDB), 2009 (demo)
2009-08

» Druckversion

Abteilung Datenbanken Leipzig

Inhalte

Neue Publikationen