Data Partitioning for Parallel Entity Matching

Kirsten, T.; Kolb, L.; Hartung, M.; Groß, A.; Köpcke, H.; Rahm, E.
Data Partitioning for Parallel Entity Matching
Proc. 8th Intl. Workshop on Quality in Databases (QDB), 2010
2010-09

Description

Entity matching is an important and difficult step in integrating web data. To reduce the typically high execution time of matching, we investigate how entity matching can be performed in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be executed independently. One of our strategies supports both blocking, to reduce the search space for matching, and parallel matching, to improve efficiency. Special attention is given to the number and size of data partitions, as they affect the overall communication overhead and the memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies on real-world product data from different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
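The idea of partitioning the input and deriving independent match tasks can be illustrated with a minimal sketch. It is not the implementation from the paper: the function names (`partition`, `generate_match_tasks`, `run_match_task`), the fixed `max_size` limit, and the toy case-insensitive matcher are all assumptions chosen for illustration. The sketch splits the input into partitions of bounded size and creates one task per unordered pair of partitions, so that every pair of entities is compared exactly once and each task only needs two partitions in memory.

```python
from itertools import combinations_with_replacement


def partition(entities, max_size):
    """Split a list of entities into consecutive partitions of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be positive")
    return [entities[i:i + max_size] for i in range(0, len(entities), max_size)]


def generate_match_tasks(partitions):
    """Return one task (i, j) with i <= j for every unordered pair of partitions.

    Tasks are independent of each other, so they could be handed to
    different workers; here they are just index pairs.
    """
    return list(combinations_with_replacement(range(len(partitions)), 2))


def run_match_task(partitions, task, is_match):
    """Compare the entity pairs covered by one task and return the matching pairs."""
    i, j = task
    left, right = partitions[i], partitions[j]
    if i == j:
        # Within a single partition, compare each unordered pair once.
        pairs = ((left[a], left[b])
                 for a in range(len(left))
                 for b in range(a + 1, len(left)))
    else:
        # Across two partitions, compare the full Cartesian product.
        pairs = ((a, b) for a in left for b in right)
    return [(a, b) for a, b in pairs if is_match(a, b)]


# Hypothetical product titles and a toy matcher (assumptions for illustration).
offers = ["ipod nano 8gb", "iPod Nano 8GB", "canon eos 450d",
          "Canon EOS 450D", "nokia n95"]
parts = partition(offers, max_size=2)
tasks = generate_match_tasks(parts)
matches = [m for t in tasks
           for m in run_match_task(parts, t, lambda a, b: a.lower() == b.lower())]
print(len(parts), "partitions,", len(tasks), "match tasks,", len(matches), "matches")
```

With p partitions this scheme yields p(p+1)/2 tasks, which shows the trade-off mentioned in the abstract: smaller partitions lower the memory needed per task but increase the number of tasks and therefore the communication overhead.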

Keywords

Distributed computing, Parallelism
Entity Resolution, Object matching, Similarity Join, Pair-wise comparison
Blocking, Clustering

BibTeX

@inproceedings{data_partitioning_for_parallel_entity_matching,
  author    = {Toralf Kirsten and
               Lars Kolb and
               Michael Hartung and
               Anika Gro{\ss} and
               Hanna K{\"o}pcke and
               Erhard Rahm},
  title     = {{Data Partitioning for Parallel Entity Matching}},
  booktitle = {8th International Workshop on Quality in Databases},
  year      = {2010}
}

Submitted by Lars Kolb | 6 August 2010, 14:49
