Anna Primpeli
Christian Bizer

In an effort to enhance the reproducibility and comparability of matching methods, we complement existing benchmark tasks for entity matching with fixed development and test sets. On this page we provide 21 complete benchmark tasks for entity matching for public download. In addition, we calculate and present baseline results for the 21 benchmark tasks using two standard classification methods. If you use our work, please cite our paper:

@inproceedings{10.1145/3340531.3412781,
author = {Primpeli, Anna and Bizer, Christian},
title = {Profiling Entity Matching Benchmark Tasks},
year = {2020},
isbn = {978-1-4503-6859-9/20/10},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3340531.3412781},
booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
location = {Virtual Event, Ireland},
series = {CIKM ’20} }


1. Motivation

Entity matching is an important task for data integration and has been the focus of much research. A large number of benchmark tasks for entity matching have been developed and made publicly available for evaluating, comparing, and reproducing matching methods. However, the lack of fixed development and test sets, of correspondence sets that contain both matching and non-matching record pairs, and of baseline results hinders reproducibility and comparability. We mitigate this problem by defining a standard procedure for complementing matching tasks, applying it to 21 benchmark matching tasks, and establishing baseline evaluation results.

2. Benchmark Matching Tasks - Download

Matching Task Train Set Valid. Set Test Set Feature Vector Data sources Acknowledgment
Database Group Leipzig
abt-buy gs_train (107KB) gs_val (31KB) gs_test (16KB) feature_vector.zip (850KB) records.zip (125KB) [1]
amazon-google gs_train (459KB) gs_val (132KB) gs_test (65KB) feature_vector.zip (1.5MB) records.zip (602KB)
dblp-acm gs_train (1.1MB) gs_val (343KB) gs_test (168KB) feature_vector.zip (4.6MB) records.zip (244KB)
dblp-scholar gs_train (2.6MB) gs_val (767KB) gs_test (375KB) feature_vector.zip (9.9MB) records.zip (3.9MB)
DuDe Toolkit Repository
restaurants (Fodors-Zagats) gs_train (7KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (61KB) records.zip (24KB) [5]
cora gs_train (3.3MB) gs_val (958KB) gs_test (472KB) feature_vector.zip (31.9MB) records.zip (33KB)
Magellan Data Repository
products (Walmart-Amazon) gs_train (169KB) gs_val (49KB) gs_test (24KB) feature_vector.zip (3.8MB) records.zip (13.5MB) [2]
baby products gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (62KB) records.zip (455KB)
beer gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (46KB) records.zip (162KB)
bikes gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (42KB) records.zip (241KB)
books (Goodreads-Barnes) gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (55KB) records.zip (1.4MB)
cosmetics gs_train (5KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (34KB) records.zip (247KB)
music (iTunes - Amazon) gs_train (8KB) gs_val (3KB) gs_test (2KB) feature_vector.zip (71KB) records.zip (1.3MB)
restaurants (Yellow - Yelp) gs_train (6KB) gs_val (2KB) gs_test (1KB) feature_vector.zip (36KB) records.zip (376KB)
Web Data Commons Product Corpus
wdc_phones gs_train (2.0MB) gs_val (575KB) gs_test (284KB) feature_vector.zip (4.7MB) records.zip (49KB) [3]
wdc_headphones gs_train (2.0MB) gs_val (583KB) gs_test (288KB) feature_vector.zip (3.9MB) records.zip (43KB)
wdc_tvs gs_train (2.3MB) gs_val (661KB) gs_test (326KB) feature_vector.zip (6.2MB) records.zip (92KB)
Web Data Commons Product Corpus for Large-Scale Product Matching 2.0
wdc_xlarge_cameras gs_train (623KB) gs_val (156KB) gs_test (21KB) feature_vector.zip (6.3MB) records.zip (1.0MB) [4]
wdc_xlarge_watches gs_train (906KB) gs_val (227KB) gs_test (21KB) feature_vector.zip (8.3MB) records.zip (777KB)
wdc_xlarge_computers gs_train (1.1MB) gs_val (254KB) gs_test (20KB) feature_vector.zip (9.8MB) records.zip (821KB)
wdc_xlarge_shoes gs_train (627KB) gs_val (157KB) gs_test (20KB) feature_vector.zip (5.8MB) records.zip (577KB)

3. Baseline Matching Method

Feature Creation

In a binary entity matching setting, the attribute values of individual records need to be transformed into pairwise record features. We use data-type-specific similarity metrics to calculate similarity features for each pair of records. For short strings we apply the following similarity metrics: Levenshtein, Jaccard on the token level, Jaccard with inner Levenshtein, exact match, and containment. For long strings we apply the same metrics as for short strings, with Jaccard calculated on the word level, and additionally compute the cosine similarity with TF-IDF weighting per feature. For numeric attributes, the absolute difference is computed. Finally, an overall similarity score for each record pair is computed as the cosine similarity with TF-IDF weighting over the concatenated values of all attributes.

Here is an example of the transformation of record attribute values to pairwise symbolic features:

[Figure: Feature Vector Creation Example]
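The transformation above can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not the exact implementation from our repository: the Levenshtein and Jaccard metrics are implemented inline, and the attribute names ("title", "price") and example records are purely illustrative.

```python
# Minimal sketch of pairwise feature creation for a record pair.
# Similarity metrics are implemented inline to keep the example
# self-contained; attribute names and values are illustrative.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance normalized to a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard similarity on the token (whitespace-split) level."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def record_pair_features(r1: dict, r2: dict) -> dict:
    """Turn the attribute values of two records into pairwise features."""
    feats = {}
    for attr in ("title",):  # short-string attributes
        v1, v2 = r1[attr].lower(), r2[attr].lower()
        feats[f"{attr}_lev"] = levenshtein_sim(v1, v2)
        feats[f"{attr}_jaccard"] = jaccard_tokens(v1, v2)
        feats[f"{attr}_exact"] = float(v1 == v2)
        feats[f"{attr}_containment"] = float(v1 in v2 or v2 in v1)
    for attr in ("price",):  # numeric attributes: absolute difference
        feats[f"{attr}_absdiff"] = abs(r1[attr] - r2[attr])
    return feats

pair = record_pair_features(
    {"title": "apple iphone 7 32gb", "price": 399.0},
    {"title": "iphone 7 32 gb apple", "price": 389.0},
)
```

The resulting feature dictionary is what a row of the downloadable `feature_vector.zip` files conceptually corresponds to, with one row per record pair.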

Classification Methods

We apply supervised learning using two classifiers: SVM and Random Forest. We use the validation set of every task to optimize a subset of the model hyperparameters. We then apply the optimized model to the test set and report the evaluation scores on the test set. For all non-optimized parameters we use the default values of the Python scikit-learn library (version 0.22). For all our experiments we fix the random seeds (always set to 1) to allow the exact reproducibility of our results.
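The setup above can be sketched as follows. Note that this is an illustrative sketch, not the exact configuration from the paper: the hyperparameter grids and the synthetic feature vectors are placeholders, and `PredefinedSplit` is used so that tuning runs against the fixed validation set rather than cross-validation folds.

```python
# Sketch of the baseline setup: tune SVM and Random Forest on a fixed
# validation split with seeds set to 1, then evaluate F1 on the test set.
# Grids and data are illustrative, not the exact settings from the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

rng = np.random.RandomState(1)

# Stand-in for precomputed pairwise feature vectors: matches cluster
# near similarity 1.0, non-matches near 0.0 (label 1 = match).
def make_split(n_pos, n_neg):
    X = np.vstack([rng.normal(0.9, 0.05, (n_pos, 5)),
                   rng.normal(0.1, 0.05, (n_neg, 5))])
    y = np.array([1] * n_pos + [0] * n_neg)
    return X, y

X_train, y_train = make_split(40, 40)
X_val, y_val = make_split(10, 10)
X_test, y_test = make_split(10, 10)

# PredefinedSplit: -1 marks training rows, 0 marks the validation fold,
# so GridSearchCV tunes against the benchmark's fixed validation set.
X = np.vstack([X_train, X_val])
y = np.concatenate([y_train, y_val])
split = PredefinedSplit([-1] * len(y_train) + [0] * len(y_val))

models = {
    "SVM": GridSearchCV(SVC(random_state=1),
                        {"C": [0.1, 1, 10]}, cv=split, scoring="f1"),
    "Random Forest": GridSearchCV(RandomForestClassifier(random_state=1),
                                  {"n_estimators": [10, 100]},
                                  cv=split, scoring="f1"),
}
scores = {}
for name, gs in models.items():
    gs.fit(X, y)  # refits the best model after the validation search
    scores[name] = f1_score(y_test, gs.predict(X_test))
```

On the real benchmark tasks the feature vectors come from the downloadable `feature_vector.zip` files, and precision and recall are reported alongside F1.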

You can find the code for evaluating and profiling the benchmark matching tasks in the EntityMatchingTaskProfiler GitHub repository. Please use the Jupyter notebook MatchingTaskProfiler.ipynb and activate the flag parameters: summaryFeatures for basic dataset statistics, baselineResults for the SVM and RF baseline results, and profilingFeatures for task-related statistics including our profiling dimensions. You can find more details on the task-related profiling dimensions and their calculation in our paper.

4. Baseline Matching Results

Matching Task SVM (P / R / F1) Random Forest (P / R / F1)
abt-buy 0.96 0.71 0.81 0.95 0.77 0.85
amazon-google 0.79 0.73 0.76 0.82 0.76 0.79
dblp-acm 1.00 1.00 1.00 1.00 1.00 1.00
dblp-scholar 0.99 0.99 0.99 0.99 0.99 0.99
restaurants (Fodors-Zagats) 1.00 0.91 0.95 1.00 1.00 1.00
products (Walmart-Amazon) 0.97 0.87 0.92 0.93 0.89 0.93
wdc_phones 0.85 0.88 0.86 0.85 0.88 0.86
wdc_headphones 0.89 0.77 0.83 0.95 0.82 0.88
wdc_tvs 0.93 0.78 0.85 0.94 0.89 0.91
wdc_xlarge_cameras 0.71 0.61 0.65 0.75 0.67 0.71
wdc_xlarge_watches 0.86 0.71 0.78 0.82 0.73 0.81
wdc_xlarge_computers 0.74 0.67 0.70 0.78 0.78 0.78
wdc_xlarge_shoes 0.83 0.43 0.57 0.82 0.38 0.52
baby products 0.70 0.64 0.67 0.68 0.55 0.63
beer 1.00 0.86 0.92 1.00 1.00 1.00
bikes 0.92 0.92 0.92 0.92 0.92 0.92
books (Goodreads-Barnes) 0.80 0.89 0.84 0.73 0.89 0.80
cosmetics 1.00 0.77 0.87 0.90 0.69 0.78
music (iTunes-Amazon) 1.00 1.00 1.00 1.00 1.00 1.00
restaurants (Yellow-Yelp) 1.00 1.00 1.00 1.00 1.00 1.00

5. References

  1. Köpcke, H. et al.: Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment. 3, 1–2, 484–493 (2010).
  2. Konda, P. et al.: Magellan: Toward building entity matching management systems. Proceedings of the VLDB Endowment. 9, 12, 1197–1208 (2016).
  3. Petrovski, P. et al.: The WDC gold standards for product feature extraction and product matching. In: International Conference on Electronic Commerce and Web Technologies. Springer, Cham (2016).
  4. Primpeli, A. et al.: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In: Companion Proceedings of The 2019 World Wide Web Conference on - WWW ’19. pp. 381–386 ACM Press, San Francisco, USA (2019).
  5. Draisbach, U. and Naumann, F.: DuDe: The Duplicate Detection Toolkit. In: QDB Workshop 2010.