Time-Aware Fusion for Web Table Data

Yaser Oulabi and Christian Bizer, 01.07.2019 (last updated on 30.07.2020)


This website provides resources relevant to our research on the topic of time-aware fusion for web table data, for the purpose of slot filling a cross-domain knowledge base with time-dependent data.

Cross-domain knowledge bases like YAGO [Hoffart2013], DBpedia [Lehmann2015], Wikidata [Vrandecic2014], or the Google Knowledge Graph are being employed for an increasing range of applications, including natural language processing, web search, and question answering. As the usefulness of a knowledge base increases with its completeness, slot filling missing values in cross-domain knowledge bases is an important task. Web tables [Cafarella2008], which are relational HTML tables extracted from the Web, contain large amounts of structured information, covering a wide range of topics. Web tables are thus a promising source of information for the task of augmenting cross-domain knowledge bases with new knowledge. 

In slot filling, a primary problem is truth discovery, i.e. finding the correct value given a set of conflicting inputs. This requires the use of fusion methods. However, in knowledge bases there also exists time-dependent data, where the validity of a fact is additionally dependent on a certain temporal scope [Kuzey2012], i.e. a point in time or a time range. Slot filling time-dependent data requires fusion methods that are time-aware [Dong2016]. Time-aware fusion is the task of finding, among a set of conflicting values, the value that is, in addition to being correct, valid for a given temporal scope. A primary concern of time-aware fusion is therefore the estimation of temporal scope annotations, which web table data lacks.

In our research on time-aware fusion, we have introduced two time-aware fusion approaches, described in two published works:

  1. In [Oulabi2016], we introduce TT-Weighting, where we extract timestamps from the table and its context and exploit them as temporal scopes, additionally introducing approaches to reduce the sparsity and noisiness of these timestamps.
  2. In [Oulabi2017], we introduce Timed-KBT, where we exploit a temporal knowledge base to propagate temporal scopes to web table data, reducing the dependence on noisy and sparse timestamps.

For both approaches, we link to our code and the datasets used, as well as to instructions for replicating our work.

As part of our research, we have introduced the Time-Dependent Ground Truth (TDGT). It is built using the schema of and entities from Wikidata. Within the ground truth, we integrate data from five different sources, covering seven topical domains and overall 19 time-dependent properties, with more than 180 thousand entities and more than one million temporal facts. TDGT can be used for a variety of tasks that make use of the temporal aspect of time-dependent data. It is released as part of the Web Data Commons project.


Contents

  1. TT-Weighting: Exploiting Timestamps for Time-Aware Fusion
  2. Timed-KBT: Estimating Temporal Scopes Using Knowledge-Based Trust
  3. Feedback
  4. References


1. TT-Weighting: Exploiting Timestamps for Time-Aware Fusion

As part of TT-Weighting [Oulabi2016], we first present a taxonomy of timestamp types. This taxonomy allows us to introduce methods that consider the locations from which individual timestamps were extracted. To reduce timestamp sparsity, we introduce an approach that propagates timestamp information along these timestamp types to individual web table values. We then introduce a regression approach that weights the importance of each timestamp type for a given property of the knowledge base schema. This can potentially discover a relationship between a property and specific locations of timestamps in and around the web table.
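The per-property weighting of timestamp types can be illustrated with a minimal sketch. The type names, weights, and decay function below are our own illustrative assumptions, not the actual taxonomy or learned regression model of [Oulabi2016]:

```java
import java.util.*;

// Illustrative sketch of weighting timestamp types per property. The
// type names, weights, and decay function are assumptions made for
// illustration, not the taxonomy or regression model of [Oulabi2016].
public class TTWeightingSketch {

    // Hypothetical timestamp types, standing in for extraction locations.
    enum TimestampType { TABLE_CAPTION, COLUMN_HEADER, PAGE_CONTEXT }

    // Score a candidate value for a queried year: each extracted
    // timestamp contributes its type's learned weight, decaying with
    // the distance between the timestamp and the queried year.
    static double score(Map<TimestampType, Integer> timestamps,
                        Map<TimestampType, Double> weights,
                        int queriedYear) {
        double s = 0.0;
        for (Map.Entry<TimestampType, Integer> e : timestamps.entrySet()) {
            double w = weights.getOrDefault(e.getKey(), 0.0);
            int diff = Math.abs(e.getValue() - queriedYear);
            s += w / (1.0 + diff);
        }
        return s;
    }

    public static void main(String[] args) {
        // Weights as they might be learned for one property.
        Map<TimestampType, Double> weights = new EnumMap<>(TimestampType.class);
        weights.put(TimestampType.TABLE_CAPTION, 0.7);
        weights.put(TimestampType.PAGE_CONTEXT, 0.3);

        Map<TimestampType, Integer> candidate = new EnumMap<>(TimestampType.class);
        candidate.put(TimestampType.TABLE_CAPTION, 2012);

        System.out.println(score(candidate, weights, 2012));
    }
}
```

The candidate whose weighted timestamp support for the queried temporal scope is highest would then be selected during fusion.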

1.1 Replicating our research

Technical requirements:
  • 280 GB of RAM
  • Java 8
  • Maven
Datasets required

The code can be downloaded here. It includes the ground truth and instructions on how to replicate our work. 

This code can be used to replicate our work in [Oulabi2016]. If you are interested in applying the TT-Weighting methods to other datasets, we recommend using the code provided for Timed-KBT, which also includes a full implementation of TT-Weighting.

2. Timed-KBT: Estimating Temporal Scopes Using Knowledge-Based Trust

With Timed-KBT [Oulabi2017], we introduce an approach that reduces the dependence on timestamp information for time-aware fusion. Using a temporal knowledge base, we propagate temporal scopes to web table columns by exploiting the overlap of web table data with data in the knowledge base. Combining Timed-KBT with timestamp information, by restricting the temporal scopes that can be propagated to web table data to those also found in extractable timestamps, yields a precision-oriented time-aware fusion method.
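The core idea of propagating scopes via overlap can be sketched as a simple vote over years. The data structures and the plain vote count below are illustrative assumptions; [Oulabi2017] uses a knowledge-based-trust model rather than simple counting:

```java
import java.util.*;

// Illustrative sketch of propagating a temporal scope to a web table
// column via overlap with a temporal knowledge base. The plain vote
// count is an assumption for illustration only.
public class ScopePropagationSketch {

    // columnRows: entity -> value as found in one web table column.
    // kb: entity -> value -> years for which the KB holds that value.
    // Returns, per year, how many rows the KB supports for that year.
    static Map<Integer, Integer> voteYears(
            Map<String, String> columnRows,
            Map<String, Map<String, Set<Integer>>> kb) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Map.Entry<String, String> row : columnRows.entrySet()) {
            Map<String, Set<Integer>> byValue = kb.get(row.getKey());
            if (byValue == null) continue;       // entity unknown to the KB
            Set<Integer> years = byValue.get(row.getValue());
            if (years == null) continue;         // value never valid per KB
            for (int y : years) votes.merge(y, 1, Integer::sum);
        }
        return votes; // the best-supported year(s) become the column's scope
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> germany = new HashMap<>();
        germany.put("80.6 million", new HashSet<>(Arrays.asList(2012, 2013)));
        Map<String, Map<String, Set<Integer>>> kb = new HashMap<>();
        kb.put("Germany", germany);

        Map<String, String> column = new HashMap<>();
        column.put("Germany", "80.6 million"); // overlaps the KB
        column.put("France", "65.6 million");  // no overlap, no votes

        System.out.println(voteYears(column, kb));
    }
}
```

Restricting the propagated scopes to years that also appear in extractable timestamps would then give the precision-oriented variant described above.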

2.1 Replicating our research

Technical requirements:
  • 320 GB of RAM
  • Java 8
  • Maven (when not using binaries)
Datasets used:

2.2 Code and Compilation

Our software is written in Java 1.8. You can download the full source code here. The archive contains a Maven multi-module project that can be compiled by running the following Maven command on the parent module in the root directory.

mvn install -amd

This compiles all modules and resolves all dependencies, writing the output jars to a newly created directory called app. This app directory has the structure expected by our software. We also provide the already compiled bytecode, allowing you to skip code compilation.

The majority of the code is released under the MIT license. See LICENSE files within the code archive for more information.

2.3 File Structure and Project Archive Download

The project has the following file structure. The included project archive contains all compiled code and datasets to run our experiments.

RootDirectory [Download project archive]
├─ app [Compiled binaries and required libraries, included in project archive]
├─ data
│ ├─ caches [all caches included in project archive]
│ │ ├─ kb [TDGT, see 2.3.1 below]
│ │ ├─ lucene [Lucene index for TDGT, see 2.3.1 below]
│ │ ├─ maxDiffs [learned maximum year differences per property, see 2.3.3 below]
│ │ ├─ ontology [schema of TDGT, see 2.3.1 below]
│ │ └─ tables [cached tables, see 2.3.2 below]
│ ├─ configs [Miscellaneous files, included in project archive]
│ ├─ results
│ ├─ t2k_output [Download matched tables in JSON format, see 2.3.2 below]
├─ config.cfg
└─ run.sh

2.3.1 Time-Dependent Ground Truth

The kb folder contains the TDGT dataset in a cached binary serialized format. This format can be read only from Java, and is more efficient to load than the TDGT in its raw JSON format.
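Reading such a cache uses standard Java object serialization. The helper below is a minimal sketch assuming plain `ObjectInputStream`-compatible files; the actual classes stored in the kb cache are defined by our project code, and the helper names are illustrative:

```java
import java.io.*;

// Minimal sketch of writing and reading a binary serialized cache
// file with standard Java object serialization. The classes actually
// stored in the kb cache are defined by the project code; the helper
// names here are illustrative.
public class CacheLoadSketch {

    // Write any serializable object as a binary cache file.
    static void save(File f, Serializable obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a binary serialized cache file back into an object. This
    // only works from Java, but avoids re-parsing the raw JSON.
    static Object load(File f) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        File f = new File(System.getProperty("java.io.tmpdir"), "tdgt-demo.bin");
        save(f, new java.util.ArrayList<>(java.util.Arrays.asList("fact1", "fact2")));
        System.out.println(load(f)); // round-trips the cached object
    }
}
```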

The lucene folder contains a Lucene index generated for the entities in TDGT. While we provide the Lucene index, it is automatically generated if missing.

The ontology folder contains the schema of TDGT in a binary serialized format.

2.3.2 Matched web tables

We provide two versions of web tables from the 2015 WDC web table corpus. Both are already matched to the schema and entities of the TDGT.

The first version is cached using a binary serialized format. It is fully processed, and all timestamps are already parsed using HeidelTime. In the second version, while also matched to the TDGT, the tables and their metadata are in a JSON format, which is the output format of T2K. The JSON format allows one to inspect the tables manually using any text editor, whereas the cached version can only be loaded from Java.

To match the raw web table corpus to the TDGT, you can use the MatchFull class in expansion.matching/src/main/java/de/uni_mannheim/informatik/dws/matching. It uses T2K as an underlying framework, but matches web tables to the TDGT instead of to DBpedia.

2.3.3 Learned maximum year differences for neighborhood scope estimation

For each property, we learn the maximum number of years to consider for neighborhood scope estimation, separately for Timed-KBT and the restricted Timed-KBT. While we provide cached versions, the maximum differences are learned automatically if the cached versions are not present.
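How such a learned maximum year difference gates neighborhood scope estimation can be sketched as follows; the method name and the simple distance check are illustrative assumptions, not our exact estimation procedure:

```java
import java.util.*;

// Illustrative sketch: a queried year without direct support is
// considered covered only if some supported year lies within the
// learned per-property maximum year difference (maxDiff).
public class NeighborhoodSketch {

    static boolean inNeighborhood(Set<Integer> supportedYears,
                                  int queriedYear, int maxDiff) {
        for (int y : supportedYears)
            if (Math.abs(y - queriedYear) <= maxDiff) return true;
        return false;
    }

    public static void main(String[] args) {
        Set<Integer> supported = new HashSet<>(Arrays.asList(2010, 2011));
        // With maxDiff = 2, as might be learned for a slowly changing
        // property, the year 2013 still falls within the neighborhood.
        System.out.println(inNeighborhood(supported, 2013, 2));
    }
}
```

A larger learned maxDiff thus extends scopes further for stable properties, while a small maxDiff keeps fast-changing properties tightly scoped.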

3. Feedback

Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.

4. References

  • [Cafarella2008] Cafarella, Michael J and Halevy, Alon Y and Zhang, Yang and Wang, Daisy Zhe and Wu, Eugene (2008), "Uncovering the Relational Web.", In Proceedings of the 11th International Workshop on the Web and Databases (WebDB '08).
  • [Dong2016] Dong, X. L., Kementsietsidis, A., and Tan, W.-C. (2016). A time machine for information: Looking back to look forward. SIGMOD Record, 45(2):23–32.
  • [Hoffart2013] Hoffart, Johannes and Suchanek, Fabian M. and Berberich, Klaus and Weikum, Gerhard (2013), "YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia", Artificial Intelligence. Vol. 194, pp. 28-61. Elsevier.
  • [Kuzey2012] Kuzey, E. and Weikum, G. (2012). Extraction of temporal facts and events from Wikipedia. In Proceedings of the 2nd Temporal Web Analytics Workshop, TempWeb '12, pages 25–32, New York, NY, USA. Association for Computing Machinery.
  • [Lehmann2015] Lehmann, Jens and Isele, Robert and Jakob, Max and Jentzsch, Anja and Kontokostas, Dimitris and Mendes, Pablo N and Hellmann, Sebastian and Morsey, Mohamed and Van Kleef, Patrick and Auer, Sören and others (2015), "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia", Semantic Web. Vol. 6(2), pp. 167-195. IOS Press.
  • [Oulabi2016] Yaser Oulabi, Robert Meusel, and Christian Bizer. (2016). Fusing time-dependent web table data. In Proceedings of the 19th International Workshop on Web and Databases (WebDB '16). ACM, New York, NY, USA, Article 3, 7 pages.
  • [Oulabi2017] Yaser Oulabi and Christian Bizer. (2017). Estimating Missing Temporal Meta-Information using Knowledge-Based Trust. In Proceedings of the 3rd International Workshop on Knowledge Discovery on the WEB (KDWeb '17). CEUR Workshop Proceedings, RWTH: Aachen.
  • [Vrandecic2014] Vrandečić, Denny and Krötzsch, Markus (2014), "Wikidata: A Free Collaborative Knowledgebase", Communications of the ACM. Vol. 57(10), pp. 78-85. ACM.