Benchmark datasets for entity resolution
We offer several datasets for evaluating entity resolution that have been used in our own evaluations and that are made available for other researchers. The first set of datasets has been used for pairwise matching of entities from two sources. The second set of datasets can also be used for entity clustering, mostly for more than two sources.
Datasets for Binary Entity Resolution
In the VLDB 2010 paper [1]
we present a first comparative evaluation on the relative match quality and runtime efficiency of entity resolution
approaches using challenging real-world match tasks.
The evaluation considers existing approaches both with and without using machine
learning to find suitable parameterization and combination of
similarity functions. In addition to approaches from the research
community, a state-of-the-art commercial entity
resolution implementation is considered. Our results indicate significant quality
and efficiency differences between different approaches. We also
find that some challenging resolution tasks such as matching
product entities from online shops are not sufficiently solved with
conventional approaches based on the similarity of attribute
values.
Below you can download the datasets used in our evaluation. Each zip file contains three CSV files: two are the entity source files, and the third is the perfect mapping. Please refer to the papers to see how we determined the perfect mapping.
Task/source files | Domain | Attributes | #entities | #matches | Used in |
---|---|---|---|---|---|
DBLP-ACM | Bibliographic | title, authors, venue, year | 2614+2294 | 2224 | [0], [1], [2] |
DBLP-Scholar | Bibliographic | title, authors, venue, year | 2616+64263 | 5347 | [0], [1], [2], [3] |
Amazon-GoogleProducts | E-commerce | name, description, manufacturer, price | 1363+3226 | 1300 | [1], [2] |
Abt-Buy | E-commerce | name, description, manufacturer, price | 1081+1092 | 1097 | [1], [2] |
January 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and the VLDB 2010 paper [1].
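As a minimal sketch of how a perfect mapping can be used, the following Python snippet scores a set of predicted match pairs against it using precision, recall, and F1. It assumes that the mapping file has a header row and stores the two entity IDs in its first two columns; the column names, IDs, and the `predicted` set are made up for illustration and are not taken from the actual files.

```python
import csv
import io


def read_pairs(csv_file):
    """Read (left_id, right_id) pairs from the first two columns of a CSV file."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (assumed to be present)
    return {(row[0], row[1]) for row in reader if len(row) >= 2}


def match_quality(predicted, perfect):
    """Return precision, recall and F1 of predicted pairs against the perfect mapping."""
    true_positives = len(predicted & perfect)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(perfect) if perfect else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy stand-in for a perfect-mapping file; IDs and column names are made up.
perfect_csv = io.StringIO("id_a,id_b\na1,b1\na2,b2\na3,b3\na4,b4\n")
perfect = read_pairs(perfect_csv)

# Hypothetical matcher output: two correct pairs and one wrong pair.
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

precision, recall, f1 = match_quality(predicted, perfect)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

To score a real run, `read_pairs` could be given an opened mapping file instead of the in-memory string, provided the file follows the assumed layout.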
Datasets for Entity Clustering
Entity clustering is commonly used to determine matching entities within a single data source. It is also needed for matching entities from multiple (>2) sources to group all matching entities from different sources within entity clusters.
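One simple way to turn pairwise matches into entity clusters is to take their transitive closure, i.e. the connected components of the match graph. The sketch below does this with a union-find structure. It is only a baseline illustration, not necessarily the method used in the cited publications, and it does not enforce any constraint that a cluster holds at most one entity per source. The entity IDs and source prefixes are invented.

```python
from collections import defaultdict


def cluster_matches(entity_ids, match_pairs):
    """Group entities into clusters via the transitive closure of match pairs.

    Every ID that occurs in match_pairs must also occur in entity_ids.
    """
    parent = {e: e for e in entity_ids}

    def find(e):
        # Walk up to the root, shortening the path on the way.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b in match_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for e in entity_ids:
        clusters[find(e)].add(e)
    return list(clusters.values())


# Made-up entities from three hypothetical sources s1, s2, s3.
entities = ["s1:1", "s2:7", "s3:4", "s1:2", "s2:9", "s3:5"]
matches = [("s1:1", "s2:7"), ("s2:7", "s3:4"), ("s1:2", "s2:9")]

clusters = cluster_matches(entities, matches)
print(sorted(sorted(c) for c in clusters))
```

Entities without any match, such as `s3:5` here, end up in singleton clusters.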
We provide one benchmark dataset for single-source entity clustering and several for multi-source entity clustering with more than two duplicate-free sources:
- The affiliation dataset contains affiliation strings extracted from the PDFs of database publications that appeared between 2000 and 2009. The strings often contain multiple address components such as the name of the institution/company and often also the city name.
- Geographic Settlements contains geographical real-world entities from four different data sources (DBpedia, Geonames, Freebase, NYTimes) and has already been used in the OAEI competition.
- The Music Brainz dataset is based on real records about songs from the MusicBrainz database but uses the DAPO data generator to create duplicates with modified attribute values. The generated dataset consists of five sources and contains duplicates for 50% of the original records in two to five sources. All duplicates are generated with a high degree of corruption to stress-test the ER and clustering approaches.
- The North Carolina Voters dataset is based on real person records from the North-Carolina voter registry and synthetically generated duplicates using the tool GeCo. We consider two configurations with either 5 or 10 sources each having 1 million entities; i.e. we process up to 10 million person records. Each source is duplicate free, but 50% of the entities are replicated in all sources without any corruption. Moreover, 25% of entities are corrupted and replicated in all sources, and the remaining 25% are corrupted but present in only some sources. For the generation of corrupted records we applied a moderate corruption rate of 20%, i.e., most attribute values remained unchanged.
Task/source files | Domain | Attributes | #sources | #entities | #matches | #clusters | Used in | More information |
---|---|---|---|---|---|---|---|---|
Affiliations | Bibliographic, geography | affiliation string | 1 | 2,260 | 32,816 | 330 | [9], [10] | |
Geographic Settlements | geography | name, longitude, latitude | 4 | 3,054 | 4,391 | 820 | [4], [5], [6], [7], [8] | |
Music Brainz 20K | Music | artist, title, album, year, length | 5 | 19,375 | 16,250 | 10,000 | [4], [5], [6], [7], [8] | readMe |
Music Brainz 200K | | | 5 | 193,750 | 162,500 | 100,000 | | |
Music Brainz 2M | | | 5 | 1,937,500 | 1,624,503 | 1,000,000 | [7], [8] | |
Music Brainz 20M | | | 5 | 19,375,000 | 16,250,000 | 10,000,000 | [7] | |
North Carolina Voters 5M | Persons | name, surname, suburb, postcode | 5 | 5,000,000 | 3,331,384 | 3,500,840 | [4], [5], [6], [8] | readMe |
North Carolina Voters 10M | | | 10 | 10,000,000 | 14,995,973 | 6,625,848 | [4], [6], [8] | |
June 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and, for the multi-source datasets, to the ADBIS 2017 paper [4].
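The relation between cluster sizes and the number of matching pairs can be illustrated with a short calculation. The sketch assumes that a match count includes every pair of entities within a cluster, so a cluster of k entities contributes k(k-1)/2 pairs; this page does not state how the counts were derived, and the cluster sizes below are invented.

```python
from math import comb


def match_count(cluster_sizes):
    """Number of entity pairs implied by clusters of the given sizes."""
    return sum(comb(size, 2) for size in cluster_sizes)


# Invented example: one singleton and clusters of size 2, 3 and 5.
sizes = [1, 2, 3, 5]
num_entities = sum(sizes)
num_clusters = len(sizes)
num_matches = match_count(sizes)
print(num_entities, num_clusters, num_matches)
```

Under this assumption, a few large clusters contribute far more pairs than many small ones, which is why a match count can exceed the entity count.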
Publications
Contact/Project Members