
Multi-pass Sorted Neighborhood Blocking with MapReduce

Kolb, L.; Thor, A.; Rahm, E.
Multi-pass Sorted Neighborhood Blocking with MapReduce
Computer Science - Research and Development 27(1), 2012
2012-02

Further information: http://www.springerlink.com/content/57h4677326nh4g27/

Description

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate the challenges of, and possible solutions for, using the MapReduce programming model for parallel entity resolution with Sorted Neighborhood blocking (SN). We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitioning approach for multi-pass SN to achieve load balancing. Our evaluation based on real-world datasets shows the high efficiency and effectiveness of the proposed approaches.
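
For readers unfamiliar with the underlying technique, the following is a minimal sequential sketch of Sorted Neighborhood blocking. It is not the MapReduce implementation proposed in the paper: it only illustrates the general idea of sorting records by a blocking key, sliding a fixed-size window over the sorted list, and, for the multi-pass variant, taking the union of the candidate pairs from several blocking keys. The record layout, the function names, and the sample data are assumptions made for illustration.

```python
def sorted_neighborhood_pairs(records, blocking_key, window):
    """Single-pass SN: candidate pairs of records that lie within
    `window` positions of each other after sorting by `blocking_key`."""
    if window < 2:
        raise ValueError("window must be at least 2")
    ordered = sorted(records, key=blocking_key)
    pairs = set()
    for i, left in enumerate(ordered):
        # Pair each record with the next (window - 1) records in sort order.
        for right in ordered[i + 1 : i + window]:
            # Store ids in a canonical order so (a, b) and (b, a) coincide.
            pairs.add(tuple(sorted((left["id"], right["id"]))))
    return pairs


def multi_pass_pairs(records, blocking_keys, window):
    """Multi-pass SN: union of the candidate pairs from one pass per key."""
    pairs = set()
    for blocking_key in blocking_keys:
        pairs |= sorted_neighborhood_pairs(records, blocking_key, window)
    return pairs


# Hypothetical sample records (illustrative only).
records = [
    {"id": 1, "name": "smith", "city": "leipzig"},
    {"id": 2, "name": "smyth", "city": "berlin"},
    {"id": 3, "name": "jones", "city": "leipzig"},
    {"id": 4, "name": "adams", "city": "dresden"},
]

by_name = sorted_neighborhood_pairs(records, lambda r: r["name"], window=2)
both = multi_pass_pairs(
    records, [lambda r: r["name"], lambda r: r["city"]], window=2
)
```

In this sketch only the candidate pairs are generated; a real entity resolution workflow would then apply a similarity function to each pair. Using a second blocking key adds pairs that the first sort order places too far apart, which is the motivation for multiple passes.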

Keywords

  • MapReduce, Hadoop
  • Entity Resolution, Object matching, Similarity Join, Pair-wise comparison
  • Blocking, Sliding Window, Sorted Neighborhood
  • Multi-pass, Load balancing

BibTeX

@article {springerlink:10.1007/s00450-011-0177-x,
   author = {Kolb, Lars and Thor, Andreas and Rahm, Erhard},
   affiliation = {Institut für Informatik, Fakultät für Mathematik und Informatik, Universität Leipzig, PF 100920, 04009 Leipzig, Germany},
   title = {{Multi-pass Sorted Neighborhood Blocking with MapReduce}},
   journal = {Computer Science - Research and Development},
   publisher = {Springer Berlin / Heidelberg},
   issn = {1865-2034},
   keyword = {Computer Science},
   pages = {45--63},
   volume = {27},
   number = {1},
   url = {http://dx.doi.org/10.1007/s00450-011-0177-x},
   note = {10.1007/s00450-011-0177-x},
   year = {2012}
}