Extracting Long Tail Entities from Web Tables for Augmenting Cross-Domain Knowledge Bases

Yaser Oulabi and Christian Bizer, 01.07.2019 (last updated on 01.09.2020)


This website provides resources relevant to our research on the topic of long-tail entity extraction from web tables for the purpose of augmenting a cross-domain knowledge base with previously unknown entities.

Cross-domain knowledge bases like YAGO [Hoffart2013], DBpedia [Lehmann2015], Wikidata [Vrandecic2014], or the Google Knowledge Graph are being employed for an increasing range of applications, including natural language processing, web search, and question answering. YAGO, DBpedia, and Wikidata all rely on data extracted from Wikipedia and as a result cover mostly head instances that fulfill the Wikipedia notability criteria [Oulabi2019a]. Their coverage of less well-known instances from the long tail is rather low [Dong2014]. As the usefulness of a knowledge base increases with its completeness, adding long-tail instances to a knowledge base is an important task. Web tables [Cafarella2008], which are relational HTML tables extracted from the Web, contain large amounts of structured information, cover a wide range of topics, and describe very specific long-tail instances. Web tables are thus a promising source of information for the task of augmenting cross-domain knowledge bases with new long-tail entities.

Extracting long-tail entities from web tables for knowledge base augmentation is a non-trivial task. It consists of two subtasks: (1) identifying long-tail entities that are not yet part of the knowledge base and (2) compiling descriptions for those new entities from web table data according to the knowledge base schema. In our research, we suggest and evaluate the use of a pipeline, shown in Figure 1, that begins with web tables and ends by adding new entities to a cross-domain knowledge base. The pipeline consists of various components, including schema matching, row clustering, entity creation, and new detection. During row clustering, all rows that describe the same real-world instance are clustered together. This allows us to determine the total number of unique entities described by the input web tables, which equals the number of created clusters. From these clusters, we then compile entity descriptions using the entity creation component and the schema correspondences generated during schema matching. Finally, the new detection component determines which entities are new, given a specific target knowledge base. As a result, we are able to perform the two subtasks of identifying new entities and compiling their descriptions, albeit in reverse order. The detailed method is described and evaluated extensively in our published research on this topic.

Figure 1: Long-Tail Entity Extraction Pipeline

We have published two works in this area:

  1. In the first [Oulabi2019a], we introduce and extensively test the Long-Tail Entity Extraction Pipeline. We suggest and evaluate various implementations for the individual components of the pipeline, and evaluate the overall performance of the pipeline in finding previously unknown long-tail entities. We then use this pipeline to profile the potential of web tables for adding new entities to a knowledge base. For the evaluated classes, we find that we are able to add 14 thousand new gridiron football players and 187 thousand new songs to DBpedia, an increase of 67% and 356%, respectively.

    For evaluation and training purposes, we created the T4LTE Dataset, which stands for "Web Tables for Long-Tail Entity Extraction". The dataset provides annotations for a selection of web tables for the task of augmenting the DBpedia knowledge base with new long-tail entities from those web tables. Regarding training, both the row clustering and the new detection component require class-specific, manually labeled training data in the form of entity matches.

  2. Given that knowledge bases can have many classes, requiring manually labeled training data limits the usefulness of automatic knowledge base augmentation from web tables. In our second work [Oulabi2019b], we therefore investigate the possibility of reducing labeling effort through weak supervision. Weak supervision describes approaches that reduce labeling effort by exploiting supervision that is more abstract or noisier in nature than manually labeled training examples [Ratner2017]. We introduce weak supervision in the form of a small number of class-specific, user-provided bold matching rules. We then suggest a method that uses these rules and a set of unlabeled web tables to bootstrap a supervised learning algorithm. This approach achieves a performance close to that of using manually labeled training data, while allowing us to perform long-tail entity extraction at web scale.

Extended versions of both papers were published as part of [Oulabi2020].


Contents

1. Overview

Within this website we provide access to the following resources relevant to our research:
  • Code and application: we provide the full source code and a compiled application of the software we used to conduct our experiments.
  • Datasets: we provide information about all relevant datasets used in this research, including how to download them and use them in our software. Where we use previously unpublished datasets, we provide them in full for download.
  • Instructions for replicating experiments: we provide detailed information and scripts on this website for replicating our experiments, as well as precomputed caches to facilitate easy replication.
  • Full list of user-provided rules for weak supervision: finally, we provide the full list of all rules used in our work on weak supervision.

2. Used Datasets

In our research we make use of three datasets:

DBpedia 2014 Release

As the knowledge base to be extended, we employ the DBpedia 2014 release [Download link]. This release consists of many individual datasets; in Section 5.2 of this website we describe which specific datasets we use, and generally we only consider the English-language datasets. We also make use of the DBpedia as Tables dataset [Download link] to import the majority of DBpedia into our software, as well as the DBpedia ontology provided with the release.

WDC Web Table Corpus 2012

As the source of web tables from which we extract new long-tail entities, we use a web table corpus provided by the Web Data Commons project, specifically the 2012 English-Language Relational Web Table Corpus [Download link]. Before the tables are fed into our pipeline, individual tables within the corpus are matched to DBpedia classes using the T2K Matching Framework [Ritze2015] [Code1, Code2]. In Section 5.2 below we provide a subset of those tables already matched to DBpedia classes, which can be used to replicate our work on long-tail entity extraction.

T4LTE Dataset (Web Tables for Long-Tail Entity Extraction)

Finally, for evaluation and training purposes we created and use the T4LTE gold standard. The dataset provides annotations for a selection of web tables for the task of augmenting the DBpedia knowledge base with new long-tail entities from those web tables. More information about T4LTE and download links are available on its own website. The gold standard files are already included in an archive that contains the file structure required to replicate our experiments; this is described further in Section 5.2 below.

3. Code and Compilation

Our software is written in Java 1.8. You can download the full source code here. The archive contains a Maven multi-module project that can be compiled by running the following Maven command on the parent module in the root directory.

mvn -amd

This will compile all modules and resolve all dependencies, outputting the jars into a newly created directory called app. This app directory has the structure expected by our software, which is described further in Section 5.2 below. We also provide the already compiled bytecode, allowing you to skip code compilation.

The majority of the code is released under the MIT license. See LICENSE files within the code archive for more information.

4. Full List of User-Provided Rules for Weak Supervision

User-provided rules for all three classes (GF-Player, Song, and Settlement) can be downloaded here.

5. Replicating Experiments

Below, we provide detailed instructions on how to replicate the experiments presented in the research listed above. We describe the required directory and file structure, briefly introduce the caches that are generated and used throughout the experiments, and introduce the scripts that need to be run to replicate our work.

5.1. Minimum System Specifications

These are the minimum system specifications required to run our experiments:

  • Java 8 (or higher) Runtime Environment
  • Multi-core CPU with at least 24 cores, to allow computation in acceptable time
  • At least 100 GB of hard drive space
  • At least 500 GB of RAM
  • A Linux-based operating system, if you wish to use the shell scripts provided

5.2. File Structure, Project Archive, and Dataset Downloads

The file structure described below is needed to replicate our work. You can download the complete structure by clicking on the link next to RootDirectory. The linked archive contains the full structure, all scripts, the T4LTE gold standard, miscellaneous configuration files, and the compiled application. It does not include the raw datasets or any caches; these can be downloaded separately and are also linked within the description of the file structure below.

For some of the DBpedia datasets we had to convert the archive compression from BZIP2 to GZ, because the original BZIP2 files do not allow streaming, which lets the datasets be loaded faster and more efficiently. Regarding the DBpedia as Tables dataset, the TAR archive provided includes a subdirectory named csv. This directory is not needed; the files within it must be extracted directly into the directory data/rawDBpediaDatasets/DBpediaAsTables of the file structure.
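As an illustration, the recompression and extraction can be done with standard command-line tools. The following is only a sketch: it assumes GNU tar, and the archive name dbpedia_as_tables.tar is a placeholder for the actual DBpedia as Tables download.

# Recompress a DBpedia dump from BZIP2 to GZ
# (repeat for each dump file listed under others/ in the file structure below)
bzcat labels_en.nt.bz2 | gzip > data/rawDBpediaDatasets/others/labels_en.nt.gz

# Unpack DBpedia as Tables while dropping the leading csv/ subdirectory
# (the archive name is a placeholder for the downloaded TAR file)
tar -xf dbpedia_as_tables.tar --strip-components=1 -C data/rawDBpediaDatasets/DBpediaAsTables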

RootDirectory [Download project archive]
├─ app [Compiled code and libraries, included in project archive]
├─ data
│ ├─ caches [Download ZIP file of complete caches, except FullRowClustering]
│ │ ├─ abstracts
│ │ ├─ FullRowClustering1_SimCache
│ │ ├─ FullRowClustering2_BeforeKLJ
│ │ ├─ FullRowClustering3_Done
│ │ ├─ ImplicitAttributes
│ │ ├─ indegree
│ │ ├─ kbs
│ │ ├─ models
│ │ ├─ PHI
│ │ ├─ tables_0
│ │ └─ tables_1
│ ├─ configs [Miscellaneous files, included in project archive]
│ ├─ goldStandard [Gold standard, included in project archive]
│ ├─ rawDBpediaDatasets
│ │ ├─ DBpediaAsTables [Download, Mirror]
│ │ │ ├─ Abbey.csv.gz
│ │ │ ├─ AcademicJournal.csv.gz
│ │ │ ├─ ......
│ │ │ └─ Zoo.csv.gz
│ │ ├─ others
│ │ │ ├─ infobox_properties_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ labels_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ long_abstracts_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ mappingbased_properties_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ page_links_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ redirects_en.nt.gz [Download in required GZ format, Original BZIP2 file]
│ │ │ ├─ SurfaceForms_LRD-CC_filter_thr3.80.tsv.gz [Download, Mirror]
│ │ │ └─ SurfaceForms_LRD-WAT_filter_thr2.60.tsv.gz [Download, Mirror]
│ │ └─ dbpedia_2014.owl [Included in project archive, Original Mirror]
│ ├─ rawTables [Download]
│ └─ results
├─ 00_config.cfg
├─ 01a_importRawDBpediaAndOntology.sh
├─ 01b_importRawTables.sh
├─ 01c_runColumnMatching.sh
├─ 01d_learnModels.sh
├─ 02a_goldStandardRun.sh
├─ 02b_weakSupervisionRun.sh
├─ 02c_fullRunGridironFootballPlayer.sh
├─ 02d_fullRunSong.sh
└─ 02e_fullRunSettlement.sh

5.3. Using Caches

Throughout the software we use various caches to speed up loading times and to store preliminary results. These caches are created automatically when running the scripts below. You have the choice of either downloading the raw datasets and importing and processing them yourself, or downloading the precomputed caches and models, which speeds up the experiments by skipping cache computation. If you use the precomputed caches, you can skip the setup scripts described in Section 5.5 and you do not need any of the raw datasets.
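For example, assuming the downloaded ZIP archive contains the cache subdirectories directly (the archive name below is a placeholder), it could be unpacked as follows:

# Unpack the precomputed caches into the expected location
# (archive name and internal layout are assumptions; adjust to the actual download)
unzip caches.zip -d data/caches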

5.4. Setting up Configuration

Within the root directory there is a file called 00_config.cfg, which contains the configuration used by all provided shell scripts. Before using the scripts, you need to adjust the configuration parameters within this file. These are mostly paths to certain directories, but also settings for the available memory.
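As a purely illustrative sketch, the configuration might consist of shell variable assignments similar to the following; the parameter names used here are hypothetical, and the actual names are defined in the file shipped with the project archive.

# Hypothetical sketch of 00_config.cfg; see the file in the project archive
# for the actual parameter names.
ROOT_DIR=/path/to/RootDirectory   # location of the unpacked project archive
DATA_DIR="$ROOT_DIR/data"         # location of datasets, caches, and results
MEMORY_GB=500                     # memory made available to the Java application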

5.5. Setup Scripts: Importing Data and Column Matching

You can skip running these scripts entirely if you use the caches provided; otherwise, run them in order as sketched at the end of this section.

01a_importRawDBpediaAndOntology.sh

This script imports and processes DBpedia and its ontology from the various raw DBpedia datasets. It also applies some adaptations and fixes to the DBpedia dataset and its ontology. The exact code for this can be found in lines 547 to 610 of the following Java file: expansion.samples/src/main/java/de/uni_mannheim/informatik/dws/expansion/samples/setExtension/SetExtension.java. The knowledge base and the ontology are cached within caches/kbs.

01b_importRawTables.sh

This script imports the tables from their raw CSV format and transforms them into the format used throughout our experiments. The exact code can be found in lines 730 to 830 of the same file. The tables are cached in caches/tables_0.

01c_runColumnMatching.sh

This script performs the attribute-to-property matching on the imported tables. The tables with the correct correspondences are cached in caches/tables_1.

01d_learnModels.sh

This script learns and caches the models used for the large-scale profiling. The models are cached in caches/models.
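If you start from the raw datasets instead of the precomputed caches, the setup scripts are run from the root directory, in the order of their numeric prefixes:

# Run the setup scripts in order from the root directory
# (skip these entirely if you use the precomputed caches)
./01a_importRawDBpediaAndOntology.sh
./01b_importRawTables.sh
./01c_runColumnMatching.sh
./01d_learnModels.sh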

5.6. Main Scripts: Replicating Experiments

02a_goldStandardRun.sh

This script runs our pipeline on the T4LTE gold standard and tests its performance for the task of finding new long-tail entities. It also reports the performance of various methods for the individual row clustering and new detection components. The results are additionally output in CSV format at the end of the standard output of the application.

02b_weakSupervisionRun.sh

This script runs the various weak supervision approaches and tests them on the gold standard. The results are additionally output in CSV format at the end of the standard output of the application.

02c_fullRunGridironFootballPlayer.sh, 02d_fullRunSong.sh and 02e_fullRunSettlement.sh

These three scripts run the pipeline on all tables of the corpus. Various statistics are written to the standard output. A sample of 100 "new" entities per class is stored in the data/results directory for evaluation.
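Since the results and statistics are written to the standard output, it can be convenient to redirect the output of each run into a log file, for example:

# Capture the standard output, including the final CSV results, in log files
# (the log file names are arbitrary)
./02a_goldStandardRun.sh > goldStandardRun.log
./02c_fullRunGridironFootballPlayer.sh > fullRunGridironFootballPlayer.log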

6. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

7. References

  • [Cafarella2008] Cafarella, Michael J and Halevy, Alon Y and Zhang, Yang and Wang, Daisy Zhe and Wu, Eugene (2008), "Uncovering the Relational Web.", In Proceedings of the 11th International Workshop on the Web and Databases (WebDB '08).
  • [Dong2014] Dong, Xin and Gabrilovich, Evgeniy and Heitz, Geremy and Horn, Wilko and Lao, Ni and Murphy, Kevin and Strohmann, Thomas and Sun, Shaohua and Zhang, Wei (2014), "Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion", In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD '14). New York, NY, USA, pp. 601-610. ACM.
  • [Hoffart2013] Hoffart, Johannes and Suchanek, Fabian M. and Berberich, Klaus and Weikum, Gerhard (2013), "YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia", Artificial Intelligence. Vol. 194, pp. 28-61. Elsevier.
  • [Lehmann2015] Lehmann, Jens and Isele, Robert and Jakob, Max and Jentzsch, Anja and Kontokostas, Dimitris and Mendes, Pablo N and Hellmann, Sebastian and Morsey, Mohamed and Van Kleef, Patrick and Auer, Sören and others (2015), "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia", Semantic Web. Vol. 6(2), pp. 167-195. IOS Press.
  • [Oulabi2019a] Oulabi, Yaser and Bizer, Christian (2019), "Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data", In Advances in Database Technology - 22nd International Conference on Extending Database Technology (EDBT '19), Lisbon, Portugal, March 26-29, 2019.
  • [Oulabi2019b] Oulabi, Yaser and Bizer, Christian (2019), "Using Weak Supervision to Identify Long-Tail Entities for Knowledge Base Completion", In Proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019.
  • [Oulabi2020] Oulabi, Yaser (2020), "Augmenting Cross-Domain Knowledge Bases Using Web Tables", PhD thesis, University of Mannheim, Mannheim, Germany.
  • [Ratner2017] Ratner, Alexander and Bach, Stephen H. and Ehrenberg, Henry and Fries, Jason and Wu, Sen and Ré, Christopher (2017), "Snorkel: Rapid Training Data Creation with Weak Supervision", Proceedings of the VLDB Endowment, November, 2017. Vol. 11(3), pp. 269-282. VLDB Endowment.
  • [Ritze2015] Ritze, Dominique and Lehmberg, Oliver and Bizer, Christian (2015), "Matching HTML Tables to DBpedia", In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (WIMS '15). New York, NY, USA, pp. 10:1-10:6. ACM.
  • [Vrandecic2014] Vrandečić, Denny and Krötzsch, Markus (2014), "Wikidata: A Free Collaborative Knowledgebase", Communications of the ACM. Vol. 57(10), pp. 78-85. ACM.