1. Overview
On this website we provide access to the following resources relevant to our research:
- Code and application: we provide the full source code and a compiled application of the software we used to conduct our experiments.
- Datasets: we provide information on all relevant datasets used for this research, including how to download them and use them in our software. Where we use unpublished datasets, we provide them in full for download.
- Instructions for replicating experiments: we provide on this website detailed information and scripts for replicating our experiments. We also provide cached computation to facilitate easy replication.
- Full list of user-provided rules for weak supervision: finally, we provide the full list of all rules used in our work on weak supervision.
2. Used Datasets
In our research we make use of three datasets:
DBpedia 2014 Release
As the knowledge base to be extended we employ the DBpedia 2014 dataset [Download link]. This dataset is released in many individual parts. In Section 5.2. of this website we describe which specific datasets we use from this release. Generally, we only consider the English-language datasets. We also make use of the DBpedia as Tables [Download link] dataset to import the majority of DBpedia into our software, and of the DBpedia ontology provided with the release.
WDC Web Table Corpus 2012
We use a web table corpus provided by the Web Data Commons project as web tables from which we extract new long-tail entities. For this we utilize the 2012 English-Language Relational Web Table Corpus [Download link]. Before the tables are fed into our pipeline, we match individual tables within the corpus to classes in DBpedia. This is done using the T2K Matching Framework [Ritze2015] [Code1, Code2]. Below in Section 5.2 we provide a subset of those tables already matched to DBpedia classes, which can be used to replicate our work on long-tail entity extraction.
T4LTE Dataset (Web Tables for Long-Tail Entity Extraction)
Finally, we created the T4LTE gold standard and used it for evaluation and training purposes. The dataset provides annotations for a selection of web tables for the task of augmenting the DBpedia knowledge base with new long-tail entities from those web tables. More information about T4LTE, including download links, is available on its own website. We also provide the gold standard files as part of an archive that contains the file structure required to replicate our experiments. This is described further in Section 5.2. below.
3. Code and Compilation
Our software is written in Java 1.8. You can download the full source code here. The archive consists of a Maven multi-module project that can be compiled by running the following Maven command on the parent module in the root directory.
This will compile all modules, resolve all dependencies, and output the JARs to a newly created directory called app. This app directory has the structure expected by our software, which is described further in Section 5.2. below. We also provide the already compiled bytecode, allowing you to skip code compilation.
The majority of the code is released under the MIT license. See LICENSE files within the code archive for more information.
4. Full List of User-Provided Rules for Weak Supervision
User-provided rules for all classes (GF-Player, Song and Settlement) can be downloaded here.
5. Replicating Experiments
Below, we provide extensive instructions on how to replicate the experiments presented in the research listed above. We describe the required directory and file structure, briefly introduce the caches that are generated and used throughout the experiments, and introduce the scripts that need to be run to replicate our work.
5.1. Minimum System Specifications
These are the minimum system specifications required to run our experiments:
- Java 8 (or higher) Runtime Environment
- Multi-Core CPU, with at least 24 cores to allow computation in acceptable time
- At least 100 GB hard drive space
- At least 500 GB of RAM
- A Linux-based operating system, if you wish to use the shell scripts provided
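As a quick sanity check before starting, the requirements above can be compared against the machine with a few lines of Java. This is only a sketch: the class SystemCheck and its method are illustrative names and not part of the released software. Note also that Runtime.maxMemory() reports the heap limit of the JVM it runs in, not the installed RAM, so the memory figure depends on the -Xmx setting used.

```java
import java.io.File;

public class SystemCheck {
    // Thresholds copied from the list of minimum system specifications.
    static final int MIN_CORES = 24;
    static final long MIN_DISK_GB = 100;
    static final long MIN_RAM_GB = 500;

    /** Returns true if all three values reach the documented minimums. */
    public static boolean meetsMinimum(int cores, long diskGb, long ramGb) {
        return cores >= MIN_CORES && diskGb >= MIN_DISK_GB && ramGb >= MIN_RAM_GB;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        int cores = Runtime.getRuntime().availableProcessors();
        // Usable space on the partition that holds the current directory.
        long diskGb = new File(".").getUsableSpace() / gb;
        // JVM heap limit (assumption: used here as a stand-in for available memory).
        long heapGb = Runtime.getRuntime().maxMemory() / gb;

        System.out.printf("cores=%d, usable disk=%d GB, JVM heap limit=%d GB%n",
                cores, diskGb, heapGb);
        System.out.println(meetsMinimum(cores, diskGb, heapGb)
                ? "Minimum specifications met."
                : "Below the minimum specifications.");
    }
}
```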
5.2. File Structure and Dataset Downloads
The file structure described below is needed to replicate our work. You can download the complete structure by clicking on the link next to RootDirectory. The linked archive contains the full structure, all scripts, the T4LTE gold standard, miscellaneous configuration files, and the compiled application. It does not include the raw datasets or any caches; these can be downloaded separately and are linked within the description of the file structure below.
For some of the DBpedia datasets we had to convert the archive compression from BZIP2 to GZ, because the original BZIP2 files do not allow streaming. Streaming allows the datasets to be loaded faster and more efficiently. Regarding the DBpedia as Tables dataset, the TAR archive provided includes a subdirectory named csv. This subdirectory is not needed: the files within it must be extracted directly into the directory data/rawDBpediaDatasets/DBpediaAsTables within the file structure.
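To illustrate what streaming means here, the following Java sketch reads a GZ-compressed N-Triples file line by line without first decompressing it to disk. It is not taken from the released code: the class GzTripleCounter and its method are hypothetical, and the sketch only counts statement lines, skipping blank lines and lines starting with #, which N-Triples treats as comments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

public class GzTripleCounter {

    /** Counts statement lines while decompressing the GZ stream on the fly. */
    public static long countTriples(InputStream gzipped) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                // Skip blank lines and N-Triples comment lines.
                if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Expects the path to a .nt.gz file as the only argument.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(countTriples(in) + " statement lines");
        }
    }
}
```

Because only one line is held in memory at a time, the same pattern works for files that are far larger than the available heap.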
RootDirectory [Download file structure]
├─ app [Compiled code and libraries, included in file structure]
│ ├─ caches [Download ZIP file of complete caches, except FullRowClustering]
│ │ ├─ abstracts
│ │ ├─ FullRowClustering1_SimCache
│ │ ├─ FullRowClustering2_BeforeKLJ
│ │ ├─ FullRowClustering3_Done
│ │ ├─ ImplicitAttributes
│ │ ├─ indegree
│ │ ├─ kbs
│ │ ├─ models
│ │ ├─ PHI
│ │ ├─ tables_0
│ │ └─ tables_1
│ ├─ configs [Miscellaneous files, included in file structure download]
│ ├─ goldStandard [Gold standard, included in file structure download]
│ ├─ rawDBpediaDatasets
│ │ ├─ DBpediaAsTables [Download, Mirror]
│ │ │ ├─ Abbey.csv.gz
│ │ │ ├─ AcademicJournal.csv.gz
│ │ │ ├─ ......
│ │ │ └─ Zoo.csv.gz
│ │ ├─ others
│ │ │ ├─ infobox_properties_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ labels_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ long_abstracts_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ mappingbased_properties_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ page_links_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ redirects_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ SurfaceForms_LRD-CC_filter_thr3.80.tsv.gz [Download, Mirror]
│ │ │ └─ SurfaceForms_LRD-WAT_filter_thr2.60.tsv.gz [Download, Mirror]
│ │ └─ dbpedia_2014.owl [Included in file structure, Original Mirror]
│ ├─ rawTables [Download]
│ └─ results
├─ 01b_importRawTables.sh
├─ 01c_runColumnMatching.sh
├─ 01d_learnModels.sh
├─ 02a_goldStandardRun.sh
├─ 02b_weakSupervisionRun.sh
├─ 02c_fullRunGridironFootballPlayer.sh
5.2. Using Caches
Throughout the software we use various caches to speed up loading times and to store preliminary results. These caches are created automatically when running the scripts below. You can either download the raw datasets and import and process them yourself, or download the precomputed caches and models to speed up the experiments by skipping cache computation. If you use the precomputed caches, you can skip the setup scripts described in Section 5.4 and you do not need any of the raw datasets.
5.3. Setting up Configuration
Within the root directory, there is a file called 00_config.cfg. This file contains the configuration used by all provided shell scripts. To use the scripts, you need to change the configuration parameters within this file. These are mostly paths to certain directories, but also include settings for the available memory.
5.4. Setup Scripts: Importing Data and Column Matching
You can skip running these scripts if you use the caches provided.
The first setup script imports and processes DBpedia and its ontology from the various raw DBpedia datasets. It also applies some adaptations and fixes to the DBpedia dataset and its ontology. The exact code for this can be found in lines 547 to 610 of the following Java file: expansion.samples/src/main/java/de/uni_mannheim/informatik/dws/expansion/samples/setExtension/SetExtension.java. The knowledge base and the ontology are cached within caches/kbs.
01b_importRawTables.sh
This script imports the tables from their raw CSV format and transforms them into the format used throughout our experiments. The exact code can be found in lines 730 to 830 of the same file. The tables are cached in caches/tables_0.
01c_runColumnMatching.sh
This script performs the attribute-to-property matching on the imported tables. The tables with the correct correspondences are cached in caches/tables_1.
01d_learnModels.sh
This script learns and caches the models used for the large-scale profiling. The models are cached in caches/models.
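Since each setup script writes to its own cache directory (kbs, tables_0, tables_1 and models), a small check can show which setup steps still need to be run. The Java sketch below is illustrative only: the class SetupCacheCheck is a hypothetical helper, the location of the caches directory must be passed in, and it assumes that a non-empty directory means the corresponding cache has been created, which says nothing about whether the cache is complete.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class SetupCacheCheck {
    // Cache directories named in the descriptions of the setup scripts.
    static final String[] SETUP_CACHES = {"kbs", "tables_0", "tables_1", "models"};

    /** Returns true if dir exists, is a directory, and has at least one entry. */
    public static boolean isPopulated(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return entries.iterator().hasNext();
        } catch (IOException e) {
            return false; // Unreadable directories are treated as missing.
        }
    }

    /** Maps each setup cache name to whether it looks populated. */
    public static Map<String, Boolean> status(Path cachesDir) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : SETUP_CACHES) {
            result.put(name, isPopulated(cachesDir.resolve(name)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Path to the caches directory; the default is only a placeholder.
        Path cachesDir = Paths.get(args.length > 0 ? args[0] : "caches");
        for (Map.Entry<String, Boolean> e : status(cachesDir).entrySet()) {
            System.out.println(e.getKey() + ": "
                    + (e.getValue() ? "present" : "missing, run the matching setup script"));
        }
    }
}
```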
5.5. Main Scripts: Replicating Experiments
02a_goldStandardRun.sh
This script runs our pipeline for the task of finding new long-tail entities and tests its performance on the T4LTE gold standard. It also reports the performance of various methods for the individual row clustering and new detection components. The results are also output in CSV format at the end of the standard output of the application.
02b_weakSupervisionRun.sh
This script runs the various weak supervision approaches and tests them on the gold standard. The results are also output in CSV format at the end of the standard output of the application.
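Because the CSV results appear at the end of the standard output, they can be separated from the preceding log lines once the output has been saved to a file. The Java sketch below is a heuristic and not part of the released software: the class TrailingCsv is hypothetical, and it assumes the results are comma-separated, form one contiguous block at the very end of the output, and that the log line directly before them contains no comma.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TrailingCsv {

    /** Returns the last contiguous run of lines that contain a comma. */
    public static List<String> trailingCsvLines(List<String> lines) {
        int end = lines.size();
        // Ignore blank lines after the CSV block.
        while (end > 0 && lines.get(end - 1).trim().isEmpty()) {
            end--;
        }
        int start = end;
        // Walk backwards while the lines still look like CSV rows.
        while (start > 0 && lines.get(start - 1).contains(",")) {
            start--;
        }
        return new ArrayList<>(lines.subList(start, end));
    }

    public static void main(String[] args) throws IOException {
        // Expects the path to a file holding the captured standard output.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String row : trailingCsvLines(lines)) {
            System.out.println(row);
        }
    }
}
```

If the actual output uses a different delimiter or mixes log lines into the results, the comma test in the loop would need to be adapted.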
02c_fullRunGridironFootballPlayer.sh, 02d_fullRunSong.sh and 02e_fullRunSettlement.sh
These three scripts run the pipeline on all tables of the corpus. Various statistics are written to the standard output. A sample of 100 "new" entities per class is stored in the data/results directory for evaluation.
Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.
- [Cafarella2008] Cafarella, Michael J and Halevy, Alon Y and Zhang, Yang and Wang, Daisy Zhe and Wu, Eugene (2008), "Uncovering the Relational Web.", In Proceedings of the 11th International Workshop on the Web and Databases (WebDB '08).
- [Dong2014] Dong, Xin and Gabrilovich, Evgeniy and Heitz, Geremy and Horn, Wilko and Lao, Ni and Murphy, Kevin and Strohmann, Thomas and Sun, Shaohua and Zhang, Wei (2014), "Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion", In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD '14). New York, NY, USA, pp. 601-610. ACM.
- [Hoffart2013] Hoffart, Johannes and Suchanek, Fabian M. and Berberich, Klaus and Weikum, Gerhard (2013), "YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia", Artificial Intelligence. Vol. 194, pp. 28-61. Elsevier.
- [Lehmann2015] Lehmann, Jens and Isele, Robert and Jakob, Max and Jentzsch, Anja and Kontokostas, Dimitris and Mendes, Pablo N and Hellmann, Sebastian and Morsey, Mohamed and Van Kleef, Patrick and Auer, Sören and others (2015), "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia", Semantic Web. Vol. 6(2), pp. 167-195. IOS Press.
- [Oulabi2019a] Oulabi, Yaser and Bizer, Christian (2019), "Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data", In Advances in Database Technology - 22nd International Conference on Extending Database Technology (EDBT '19), Lisbon, Portugal, March 26-29, 2019.
- [Oulabi2019b] Oulabi, Yaser and Bizer, Christian (2019), "Using Weak Supervision to Identify Long-Tail Entities for Knowledge Base Completion", In Proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019.
- [Ratner2017] Ratner, Alexander and Bach, Stephen H. and Ehrenberg, Henry and Fries, Jason and Wu, Sen and Ré, Christopher (2017), "Snorkel: Rapid Training Data Creation with Weak Supervision", Proceedings of the VLDB Endowment, November, 2017. Vol. 11(3), pp. 269-282. VLDB Endowment.
- [Ritze2015] Ritze, Dominique and Lehmberg, Oliver and Bizer, Christian (2015), "Matching HTML Tables to DBpedia", In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (WIMS '15). New York, NY, USA, pp. 10:1-10:6. ACM.
- [Vrandecic2014] Vrandečić, Denny and
Krötzsch, Markus (2014), "Wikidata: A Free Collaborative
Knowledgebase", Communications of the ACM. Vol. 57(10), pp.