Anna Primpeli
Christian Bizer

In an effort to enhance the reproducibility and comparability of matching methods, we complement existing benchmark tasks for entity matching with fixed development and test sets. On this page we provide 21 complete benchmark tasks for entity matching for public download. In addition, we calculate and present baseline results for the 21 benchmark tasks using two standard classification methods. If you use our work, please cite our paper:

@inproceedings{10.1145/3340531.3412781,
author = {Primpeli, Anna and Bizer, Christian},
title = {Profiling Entity Matching Benchmark Tasks},
year = {2020},
isbn = {978-1-4503-6859-9/20/10},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3340531.3412781},
booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
location = {Virtual Event, Ireland},
series = {CIKM ’20} }


1. Motivation

Entity matching is an important task for data integration and has been the focus of much research. A large number of benchmark tasks for entity matching have been developed and made publicly available for evaluating, comparing, and reproducing matching methods. However, the lack of fixed development and test sets, of correspondence sets that contain both matching and non-matching record pairs, and of baseline results hinders reproducibility and comparability. We mitigate this problem by defining a standard procedure for complementing matching tasks, applying it to 21 benchmark matching tasks, and establishing baseline evaluation results.

2. Benchmark Matching Tasks - Download

Matching Task Train Set Valid. Set Test Set Feature Vector Data sources Acknowledgment
Database Group Leipzig
abt-buy gs_train (107KB) gs_val (31KB) gs_test (16KB) feature_vector.zip (850KB) records.zip (125KB) [1]
amazon-google gs_train (459KB) gs_val (132KB) gs_test (65KB) feature_vector.zip (1.5MB) records.zip (602KB)
dblp-acm gs_train (1.1MB) gs_val (343KB) gs_test (168KB) feature_vector.zip (4.6MB) records.zip (244KB)
dblp-scholar gs_train (2.6MB) gs_val (767KB) gs_test (375KB) feature_vector.zip (9.9MB) records.zip (3.9MB)
DuDe Toolkit Repository
restaurants (Fodors-Zagats) gs_train (7KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (61KB) records.zip (24KB) [5]
cora gs_train (3.3MB) gs_val (958KB) gs_test (472KB) feature_vector.zip (31.9MB) records.zip (33KB)
Magellan Data Repository
products (Walmart-Amazon) gs_train (169KB) gs_val (49KB) gs_test (24KB) feature_vector.zip (3.8MB) records.zip (13.5MB) [2]
baby products gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (62KB) records.zip (455KB)
beer gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (46KB) records.zip (162KB)
bikes gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (42KB) records.zip (241KB)
books (Goodreads-Barnes) gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (55KB) records.zip (1.4MB)
cosmetics gs_train (5KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (34KB) records.zip (247KB)
music (iTunes - Amazon) gs_train (8KB) gs_val (3KB) gs_test (2KB) feature_vector.zip (71KB) records.zip (1.3MB)
restaurants (Yellow - Yelp) gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (36KB) records.zip (376KB)
Web Data Commons Product Corpus
wdc_phones gs_train (2.0MB) gs_val (575KB) gs_test (284KB) feature_vector.zip (4.7MB) records.zip (49KB) [3]
wdc_headphones gs_train (2.0MB) gs_val (583KB) gs_test (288KB) feature_vector.zip (3.9MB) records.zip (43KB)
wdc_tvs gs_train (2.3MB) gs_val (661KB) gs_test (326KB) feature_vector.zip (6.2MB) records.zip (92KB)
Web Data Commons Product Corpus for Large-Scale Product Matching 2.0
wdc_xlarge_cameras gs_train (623KB) gs_val (156KB) gs_test (21KB) feature_vector.zip (6.3MB) records.zip (1.0MB) [4]
wdc_xlarge_watches gs_train (906KB) gs_val (227KB) gs_test (21KB) feature_vector.zip (8.3MB) records.zip (777KB)
wdc_xlarge_computers gs_train (1.1MB) gs_val (254KB) gs_test (20KB) feature_vector.zip (9.8MB) records.zip (821KB)
wdc_xlarge_shoes gs_train (627KB) gs_val (157KB) gs_test (20KB) feature_vector.zip (5.8MB) records.zip (577KB)

3. Baseline Matching Method

Feature Creation

In a binary entity matching setting, the attribute values of individual records need to be transformed into pairwise record features. We use data-type-specific similarity metrics to calculate similarity features for each pair of records. For short strings we apply the following similarity metrics: Levenshtein, Jaccard on the token level, Jaccard with inner Levenshtein, exact match, and containment. For long strings we apply the same metrics as for short strings, with Jaccard calculated on the word level, and additionally compute the cosine similarity with TF-IDF weighting per feature. For numeric attributes, the absolute difference is computed. Finally, an overall similarity score for each record pair is computed as the cosine similarity with TF-IDF weighting over the concatenated values of all attributes.

Here is an example of the transformation of record attribute values to pairwise symbolic features:

[Figure: Feature Vector Creation Example]
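The transformation above can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not the exact implementation from our repository: the Levenshtein and Jaccard metrics are implemented inline, and the attribute names ("title", "price") and example records are purely illustrative.

```python
# Minimal sketch of pairwise feature creation for a record pair.
# Similarity metrics are implemented inline to keep the example
# self-contained; attribute names and values are illustrative.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance normalized to a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard similarity on the token (whitespace-split) level."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def record_pair_features(r1: dict, r2: dict) -> dict:
    """Turn the attribute values of two records into pairwise features."""
    feats = {}
    for attr in ("title",):  # short-string attributes
        v1, v2 = r1[attr].lower(), r2[attr].lower()
        feats[f"{attr}_lev"] = levenshtein_sim(v1, v2)
        feats[f"{attr}_jaccard"] = jaccard_tokens(v1, v2)
        feats[f"{attr}_exact"] = float(v1 == v2)
        feats[f"{attr}_containment"] = float(v1 in v2 or v2 in v1)
    for attr in ("price",):  # numeric attributes: absolute difference
        feats[f"{attr}_absdiff"] = abs(r1[attr] - r2[attr])
    return feats

pair = record_pair_features(
    {"title": "apple iphone 7 32gb", "price": 399.0},
    {"title": "iphone 7 32 gb apple", "price": 389.0},
)
```

The resulting feature dictionary is what a row of the downloadable `feature_vector.zip` files conceptually corresponds to, with one row per record pair.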

Classification Methods

We apply supervised learning using two classifiers: SVM and Random Forest. We use the validation set of every task to optimize a subset of the model hyperparameters. We then apply the optimized model to the test set and report the evaluation scores on the test set. For all non-optimized parameters we use the default values of the Python scikit-learn library (version 0.22). For all our experiments we fix the random seeds (always set to 1) to allow the exact reproducibility of our results.
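The setup above can be sketched as follows. Note that this is an illustrative sketch, not the exact configuration from the paper: the hyperparameter grids and the synthetic feature vectors are placeholders, and `PredefinedSplit` is used so that tuning runs against the fixed validation set rather than cross-validation folds.

```python
# Sketch of the baseline setup: tune SVM and Random Forest on a fixed
# validation split with seeds set to 1, then evaluate F1 on the test set.
# Grids and data are illustrative, not the exact settings from the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

rng = np.random.RandomState(1)

# Stand-in for precomputed pairwise feature vectors: matches cluster
# near similarity 1.0, non-matches near 0.0 (label 1 = match).
def make_split(n_pos, n_neg):
    X = np.vstack([rng.normal(0.9, 0.05, (n_pos, 5)),
                   rng.normal(0.1, 0.05, (n_neg, 5))])
    y = np.array([1] * n_pos + [0] * n_neg)
    return X, y

X_train, y_train = make_split(40, 40)
X_val, y_val = make_split(10, 10)
X_test, y_test = make_split(10, 10)

# PredefinedSplit: -1 marks training rows, 0 marks the validation fold,
# so GridSearchCV tunes against the benchmark's fixed validation set.
X = np.vstack([X_train, X_val])
y = np.concatenate([y_train, y_val])
split = PredefinedSplit([-1] * len(y_train) + [0] * len(y_val))

models = {
    "SVM": GridSearchCV(SVC(random_state=1),
                        {"C": [0.1, 1, 10]}, cv=split, scoring="f1"),
    "Random Forest": GridSearchCV(RandomForestClassifier(random_state=1),
                                  {"n_estimators": [10, 100]},
                                  cv=split, scoring="f1"),
}
scores = {}
for name, gs in models.items():
    gs.fit(X, y)  # refits the best model after the validation search
    scores[name] = f1_score(y_test, gs.predict(X_test))
```

On the real benchmark tasks the feature vectors come from the downloadable `feature_vector.zip` files, and precision and recall are reported alongside F1.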

You can find the code for evaluating and profiling the benchmark matching tasks in the EntityMatchingTaskProfiler GitHub repository. Please use the Jupyter notebook MatchingTaskProfiler.ipynb and activate the flag parameters: summaryFeatures for basic dataset statistics, baselineResults for the SVM and RF baseline results, and profilingFeatures for task-related statistics including our profiling dimensions. You can find more details on the task-related profiling dimensions and their calculation in our paper.

4. Baseline Matching Results

Matching Task SVM (P / R / F1) Random Forest (P / R / F1)
abt-buy 0.96 0.71 0.81 0.95 0.77 0.85
amazon-google 0.79 0.73 0.76 0.82 0.76 0.79
dblp-acm 1.00 1.00 1.00 1.00 1.00 1.00
dblp-scholar 0.99 0.99 0.99 0.99 0.99 0.99
restaurants (Fodors-Zagats) 1.00 0.91 0.95 1.00 1.00 1.00
products (Walmart-Amazon) 0.97 0.87 0.92 0.93 0.89 0.93
wdc_phones 0.85 0.88 0.86 0.85 0.88 0.86
wdc_headphones 0.89 0.77 0.83 0.95 0.82 0.88
wdc_tvs 0.93 0.78 0.85 0.94 0.89 0.91
wdc_xlarge_cameras 0.71 0.61 0.65 0.75 0.67 0.71
wdc_xlarge_watches 0.86 0.71 0.78 0.82 0.73 0.81
wdc_xlarge_computers 0.74 0.67 0.70 0.78 0.78 0.78
wdc_xlarge_shoes 0.83 0.43 0.57 0.82 0.38 0.52
baby products 0.70 0.64 0.67 0.68 0.55 0.63
beer 1.00 0.86 0.92 1.00 1.00 1.00
bikes 0.92 0.92 0.92 0.92 0.92 0.92
books (Goodreads-Barnes) 0.80 0.89 0.84 0.73 0.89 0.80
cosmetics 1.00 0.77 0.87 0.90 0.69 0.78
music (iTunes-Amazon) 1.00 1.00 1.00 1.00 1.00 1.00
restaurants (Yellow-Yelp) 1.00 1.00 1.00 1.00 1.00 1.00

5. References

  1. Köpcke, H. et al.: Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment. 3, 1–2, 484–493 (2010).
  2. Konda, P. et al.: Magellan: Toward building entity matching management systems. Proceedings of the VLDB Endowment. 9, 12, 1197–1208 (2016).
  3. Petrovski, P. et al.: The WDC gold standards for product feature extraction and product matching. In: International Conference on Electronic Commerce and Web Technologies. Springer, Cham (2016).
  4. Primpeli, A. et al.: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In: Companion Proceedings of The 2019 World Wide Web Conference on - WWW ’19. pp. 381–386 ACM Press, San Francisco, USA (2019).
  5. Draisbach, U. and Naumann, F.: DuDe: The Duplicate Detection Toolkit. In: QDB Workshop 2010.