Benchmark datasets for entity resolution

We offer several datasets for evaluating entity resolution that have been used in our own evaluations and that are made available to other researchers. The first set of datasets has been used for pairwise matching of entities from two sources. The second set of datasets is also usable for entity clustering, mostly for more than two sources.

Datasets for Binary Entity Resolution

In the VLDB 2010 paper [1] we present a first comparative evaluation of the relative match quality and runtime efficiency of entity resolution approaches using challenging real-world match tasks. The evaluation considers existing approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, a state-of-the-art commercial entity resolution implementation is considered. Our results indicate significant quality and efficiency differences between the approaches. We also find that some challenging resolution tasks, such as matching product entities from online shops, are not sufficiently solved with conventional approaches based on the similarity of attribute values.

Below you can download the datasets used in our evaluation. Each zip file contains three CSV files: two are the entity source files, and the third is the perfect mapping. Please refer to the papers to see how we determined the perfect mapping.

Task/source files	Domain	Attributes	#entities	#matches	Used in
DBLP-ACM	Bibliographic	title, authors, venue, year	2614+2294	2224	[0], [1], [2]
DBLP-Scholar	Bibliographic	title, authors, venue, year	2616+64263	5347	[0], [1], [2], [3]
Amazon-GoogleProducts	E-commerce	name, description, manufacturer, price	1363+3226	1300	[1], [2]
Abt-Buy	E-commerce	name, description, manufacturer, price	1081+1092	1097	[1], [2]
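
As an illustration of how the perfect mapping can be used, the following minimal Python sketch compares a set of predicted match pairs against a perfect mapping and computes precision, recall and F-measure. The column names (idA, idB), the entity ids and the inline sample data are assumptions made for this example only; the actual header and id format of each mapping file should be checked after unzipping.

```python
import csv
import io


def read_pairs(csv_file):
    """Read (id in source 1, id in source 2) pairs from a two-column CSV.

    Assumes the first row is a header and the first two columns hold the ids.
    """
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return {(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2}


def match_quality(predicted, gold):
    """Return (precision, recall, f1) of predicted pairs against the gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical stand-in for a perfect mapping file; with a real download,
# pass open("<mapping file>.csv", newline="") to read_pairs instead.
gold_csv = io.StringIO("idA,idB\na1,b1\na2,b2\na3,b3\n")
gold = read_pairs(gold_csv)

# Hypothetical output of some matcher: two correct pairs and one wrong pair.
predicted = {("a1", "b1"), ("a2", "b2"), ("a4", "b9")}

precision, recall, f1 = match_quality(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The pairs are treated as ordered (source 1 id, source 2 id), which fits matching between two distinct sources; a matcher that emits pairs in the opposite order would need to be normalized first.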

January 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and the VLDB 2010 paper [1].

Datasets for Entity Clustering

Entity clustering is commonly used to determine matching entities within a single data source. It is also needed for matching entities from multiple (>2) sources, in order to group all matching entities from the different sources into entity clusters. We provide one benchmark dataset for single-source entity clustering and several for multi-source entity clustering with more than two duplicate-free sources:

The Affiliations dataset contains affiliation strings extracted from the PDFs of database publications that appeared between 2000 and 2009. The strings often contain multiple address components, such as the name of the institution/company and often also the city name.
Geographic Settlements contains geographical real-world entities from four different data sources (DBpedia, Geonames, Freebase, NYTimes) and has already been used in the OAEI competition.
The Music Brainz dataset is based on real records about songs from the MusicBrainz database but uses the DAPO data generator to create duplicates with modified attribute values. The generated dataset consists of five sources and contains duplicates for 50% of the original records in two to five sources. All duplicates are generated with a high degree of corruption to stress-test the ER and clustering approaches.
The North Carolina Voters dataset is based on real person records from the North Carolina voter registry and on synthetically generated duplicates created with the tool GeCo. We consider two configurations with either 5 or 10 sources, each having 1 million entities; i.e., we process up to 10 million person records. Each source is duplicate-free, but 50% of the entities are replicated in all sources without any corruption. Moreover, 25% of the entities are corrupted and replicated in all sources, and the remaining 25% are corrupted but present in only some sources. For the generation of corrupted records we applied a moderate corruption rate of 20%, i.e., most attribute values remained unchanged.

Task/source files	Domain	Attributes	#sources	#entities	#matches	#clusters	used in	more information
Affiliations	Bibliographic, geography	affiliation string	1	2,260	32,816	330	[9], [10]
Geographic Settlements	geography	name, longitude, latitude	4	3,054	4,391	820	[4], [5], [6], [7], [8]
Music Brainz 20K	Music	artist, title, album, year, length	5	19,375	16,250	10,000	[4], [5], [6], [7], [8]	readMe
Music Brainz 200K			5	193,750	162,500	100,000
Music Brainz 2M			5	1,937,500	1,624,503	1,000,000	[7], [8]
Music Brainz 20M			5	19,375,000	16,250,000	10,000,000	[7]
North Carolina Voters 5M	Persons	name, surname, suburb, postcode	5	5,000,000	3,331,384	3,500,840	[4], [5], [6], [8]	readMe
North Carolina Voters 10M			10	10,000,000	14,995,973	6,625,848	[4], [6], [8]
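
To relate the #clusters and #matches columns, the following Python sketch expands a set of entity clusters into pairwise matches. It assumes that every pair of entities within a cluster counts as one match, so that a cluster of size k contributes k*(k-1)/2 pairs; whether the published counts follow exactly this convention should be checked against the papers and readMe files. The entity ids are made up for the example.

```python
from itertools import combinations


def clusters_to_pairs(clusters):
    """Expand clusters of entity ids into the set of all intra-cluster pairs.

    Each pair is emitted once, with its two ids in sorted order.
    """
    pairs = set()
    for cluster in clusters:
        members = sorted(set(cluster))  # drop repeated ids, fix the pair order
        pairs.update(combinations(members, 2))
    return pairs


# Hypothetical clusters over four sources s1..s4, ids written as "source:id".
clusters = [
    ["s1:7", "s2:3", "s4:9"],  # size 3 -> 3 pairs
    ["s1:8"],                  # singleton -> 0 pairs
    ["s2:5", "s3:1"],          # size 2 -> 1 pair
]

pairs = clusters_to_pairs(clusters)
print(len(pairs))  # 4
```

The resulting pair set can be compared with a gold set of pairs in the same way as for the binary match tasks, which allows clustering results to be scored with pairwise precision and recall.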

June 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and, for the multi-source datasets, to the ADBIS 2017 paper [4].