Learning-based Entity Resolution with MapReduce


Kolb, L.; Köpcke, H.; Thor, A.; Rahm, E.
Learning-based Entity Resolution with MapReduce
Proc. 3rd Intl. Workshop on Cloud Data Management (CloudDB), 2011


Entity resolution is a crucial step for data quality and data integration. Learning-based approaches show high effectiveness at the expense of poor efficiency. To reduce the typically high execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.
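To make the idea of evaluating the Cartesian product in parallel concrete, the following is a minimal single-process sketch of one generic way to do it. It is not the paper's own strategies, which the abstract does not detail. The record fields (`id`, `title`, `year`), the block counts, and the weighted-sum `classify` function (a stand-in for a trained classifier) are all assumptions made for illustration. Each source is split into blocks, every block pair (i, j) is routed to its own reduce key, and the reducer compares only the records that meet there, so each pair of records is compared exactly once.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def map_phase(source_a, source_b, blocks_a, blocks_b):
    """Emit ((i, j), (tag, record)) so that every block pair meets in one reducer."""
    for idx, rec in enumerate(source_a):
        i = idx % blocks_a              # block of this A record
        for j in range(blocks_b):       # replicate to every B block
            yield (i, j), ("A", rec)
    for idx, rec in enumerate(source_b):
        j = idx % blocks_b              # block of this B record
        for i in range(blocks_a):       # replicate to every A block
            yield (i, j), ("B", rec)


def shuffle(pairs):
    """Group map output by key, as a MapReduce framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def similarity_features(a, b):
    """Pair-wise similarity vector (hypothetical features: title and year)."""
    title_sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    year_sim = 1.0 if a["year"] == b["year"] else 0.0
    return (title_sim, year_sim)


def classify(features, weights=(0.8, 0.2), threshold=0.75):
    """Assumed stand-in for a trained classifier: thresholded weighted sum."""
    score = sum(w * f for w, f in zip(weights, features))
    return score >= threshold


def reduce_phase(groups):
    """Compare all A x B pairs inside each block pair and emit the matches."""
    for values in groups.values():
        a_recs = [rec for tag, rec in values if tag == "A"]
        b_recs = [rec for tag, rec in values if tag == "B"]
        for a, b in product(a_recs, b_recs):
            if classify(similarity_features(a, b)):
                yield a["id"], b["id"]


def resolve(source_a, source_b, blocks_a=2, blocks_b=2):
    """Run map, shuffle and reduce in-process and return sorted match pairs."""
    mapped = map_phase(source_a, source_b, blocks_a, blocks_b)
    return sorted(reduce_phase(shuffle(mapped)))
```

Calling `resolve(source_a, source_b)` with two lists of record dictionaries returns the sorted list of matching `(id_a, id_b)` pairs. In this sketch, raising `blocks_a` and `blocks_b` increases the number of independent reduce groups at the cost of replicating each record more often during the map phase.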



  • MapReduce, Hadoop
  • Entity Resolution, Object matching, Similarity Join, Pair-wise comparison
  • Cartesian product
  • Machine Learning, Classification


  author = {Kolb, Lars and K\"{o}pcke, Hanna and Thor, Andreas and Rahm, Erhard},
  title = {{Learning-based Entity Resolution with MapReduce}},
  booktitle = {Proceedings of the third international workshop on Cloud data management},
  series = {CloudDB '11},
  year = {2011},
  isbn = {978-1-4503-0956-1},
  location = {Glasgow, Scotland, UK},
  pages = {1--6},
  numpages = {6},
  url = {},
  doi = {},
  acmid = {2064087},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {Cartesian product, Entity Resolution, Machine Learning, MapReduce},