1. TT-Weighting: Exploiting Timestamps for Time-Aware Fusion
As part of TT-Weighting [Oulabi2016], we first present a taxonomy of timestamp types. This taxonomy allows us to introduce methods that consider the locations from which individual timestamps were extracted. To reduce timestamp sparsity, we introduce an approach that propagates timestamp information along these timestamp types to individual web table values. We then introduce a regression approach that weights the importance of each timestamp type given a property of the knowledge base schema. This potentially discovers relationships between a property and specific locations of timestamps in and around a web table.
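To illustrate the underlying idea, the following is a minimal, self-contained sketch (not our actual implementation; the class name, timestamp type names, and weight values are all made up) of how per-type weights could combine timestamps extracted from different locations into a single support score for a candidate value:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: scoring a candidate value for a queried year by
// weighting its timestamps according to the location (timestamp type)
// they were extracted from. In TT-Weighting, such weights would be
// learned per knowledge-base property by a regression model.
public class TimestampTypeWeighting {

    // Made-up weights per timestamp type
    static final Map<String, Double> WEIGHTS = new HashMap<>();
    static {
        WEIGHTS.put("tableCaption", 0.8);
        WEIGHTS.put("pageTitle", 0.5);
        WEIGHTS.put("surroundingText", 0.3);
    }

    /**
     * Each extracted timestamp that matches the queried year contributes
     * the weight of the type (location) it was extracted from.
     */
    static double score(Map<String, Integer> timestampsByType, int queriedYear) {
        double score = 0.0;
        for (Map.Entry<String, Integer> e : timestampsByType.entrySet()) {
            if (e.getValue() == queriedYear) {
                score += WEIGHTS.getOrDefault(e.getKey(), 0.0);
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Integer> timestamps = new HashMap<>();
        timestamps.put("tableCaption", 2014);
        timestamps.put("pageTitle", 2010);
        // Only the caption timestamp matches 2014, so the score is its weight
        System.out.println(score(timestamps, 2014));
    }
}
```

In this sketch, a timestamp found in a table caption counts for more than one found in surrounding text; the regression step of TT-Weighting learns such weights per property instead of fixing them by hand.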
1.1 Replicating our research
Technical requirements:
- 280 GB of RAM
- Java 8
- Matched web table corpus with timestamps
- Mappings for columns
- PageRank ranking file
- Ground truth (included with code)
The code can be downloaded here. It includes the ground truth and instructions on how to replicate our work.
This code can be used to replicate our work in [Oulabi2016]. If you are interested in using the TT-Weighting methods with other datasets, we recommend the code provided for Timed-KBT, which also includes a full implementation of TT-Weighting.
2. Timed-KBT: Estimating Temporal Scopes Using Knowledge-Based Trust
With Timed-KBT [Oulabi2017], we introduce an approach that reduces the dependence on timestamp information for time-aware fusion. Using a temporal knowledge base, we propagate temporal scopes to web table columns by exploiting the overlap of web table data with data in the knowledge base. Combining Timed-KBT with timestamp information, by restricting the temporal scopes propagated to web table data to those also described by extractable timestamps, yields a precision-oriented time-aware fusion method.
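The propagation step can be illustrated with a minimal sketch (not our actual implementation; the class name, data encoding, and example values are hypothetical): the temporal scope assigned to a column is the year whose knowledge-base facts overlap most with the column's values.

```java
import java.util.*;

// Hypothetical sketch: estimating the temporal scope (here, a year) of a
// web table column by counting how many of its entity/value pairs match
// facts the temporal knowledge base records as valid in that year.
public class ScopePropagation {

    /**
     * kbFacts maps a year to the set of "entity=value" facts valid in that
     * year. Returns the year whose facts overlap most with the column's
     * pairs, i.e. the temporal scope propagated to the column.
     */
    static int estimateScope(Map<Integer, Set<String>> kbFacts, List<String> columnPairs) {
        int bestYear = -1;
        int bestOverlap = -1;
        for (Map.Entry<Integer, Set<String>> e : kbFacts.entrySet()) {
            int overlap = 0;
            for (String pair : columnPairs) {
                if (e.getValue().contains(pair)) overlap++;
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                bestYear = e.getKey();
            }
        }
        return bestYear;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> kb = new HashMap<>();
        kb.put(2013, new HashSet<>(Arrays.asList("Berlin=3.4M", "Hamburg=1.7M")));
        kb.put(2015, new HashSet<>(Arrays.asList("Berlin=3.5M", "Hamburg=1.8M")));
        // The column's values agree with the 2015 state of the knowledge base
        System.out.println(estimateScope(kb, Arrays.asList("Berlin=3.5M", "Hamburg=1.8M")));
    }
}
```

The restricted variant described above would additionally require the winning year to also appear among the timestamps extractable for the table.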
2.1 Replicating our research
Technical requirements:
- 320 GB of RAM
- Java 8
- Maven (when not using binaries)
- 2015 WDC web table corpus (included in project archive)
- Time-Dependent Ground Truth (included in project archive)
2.2 Code and Compilation
Our software is written in Java 8. You can download the full source code here. The archive consists of a Maven multi-module project, which is compiled by running Maven on the parent module in the root directory.
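The exact build command is not reproduced on this page; for a multi-module Maven project, the standard invocation would be (an assumption, not verified against the archive):

```shell
# Run from the root directory containing the parent pom.xml;
# builds all modules and resolves their dependencies.
mvn clean package
```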
This compiles all modules and resolves all dependencies, outputting the jars into a newly created directory called app. This app directory has the structure required by our software. We also provide the already compiled bytecode, allowing you to skip compilation. The majority of the code is released under the MIT license; see the LICENSE files within the code archive for more information.
2.3 File Structure and Project Archive Download
The project has the following file structure. The included project archive contains all compiled code and datasets to run our experiments.
RootDirectory [Download project archive]
├─ app [Compiled binaries and required libraries, included in project archive]
│ ├─ caches [all caches included in project archive]
│ │ ├─ kb [TDGT, see 2.3.1 below]
│ │ ├─ lucene [Lucene index for TDGT, see 2.3.1 below]
│ │ ├─ maxDiffs [learned maximum year differences per property, see 2.3.3 below]
│ │ ├─ ontology [schema of TDGT, see 2.3.1 below]
│ │ └─ tables [cached tables, see 2.3.2 below]
│ ├─ configs [Miscellaneous files, included in project archive]
│ ├─ results
│ └─ t2k_output [Download matched tables in JSON format, see 2.3.2 below]
2.3.1 Time-Dependent Ground Truth
The kb folder contains the TDGT dataset in a cached, binary serialized format. This format can be read only from Java, but is more efficient to load than the TDGT in the raw JSON format.
The lucene folder contains a Lucene index generated for the entities in the TDGT. While we provide the index, it is automatically regenerated if missing.
The ontology folder contains the schema of TDGT in a binary serialized format.
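As a minimal illustration of such a cache (the class, method, and file names are hypothetical, not the project's actual code), Java's built-in serialization can write an object once and reload it directly on later runs instead of re-parsing raw JSON:

```java
import java.io.*;
import java.util.ArrayList;

// Hypothetical sketch of a binary serialized cache, in the spirit of the
// kb and ontology folders: serialize a loaded object to disk, then
// deserialize it on subsequent runs. Readable only from Java.
public class BinaryCache {

    static void write(Serializable object, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(object);
        }
    }

    @SuppressWarnings("unchecked")
    static <T> T read(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> facts = new ArrayList<>();
        facts.add("Berlin|population|3500000|2015"); // made-up fact encoding
        File cache = File.createTempFile("kb", ".bin");
        write(facts, cache);
        ArrayList<String> reloaded = read(cache);
        System.out.println(reloaded.get(0));
    }
}
```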
2.3.2 Matched web tables
We provide two versions of web tables from the 2015 WDC web table corpus. Both are already matched to both the schema and the entities of the TDGT.
The first is a version cached in a binary serialized format. It is fully processed, and all timestamps have already been parsed using HeidelTime. The second version is also matched to the TDGT, but the tables and their metadata are in a JSON format, which is the output format of T2K. This JSON format allows one to inspect the tables manually using any text editor, whereas the cached version can only be loaded in Java.
To match the raw web table corpus to the TDGT, you can use the MatchFull class in expansion.matching/src/main/java/de/uni_mannheim/informatik/dws/matching. It uses T2K as an underlying framework, but matches web tables to the TDGT instead of to DBpedia.
2.3.3 Learned maximum year differences for neighborhood scope estimation
For each property, we separately learn for Timed-KBT and for the restricted Timed-KBT the maximum number of years to consider for neighborhood scope estimation. While we provide cached versions, the maximum differences are learned automatically if the cached versions are not present.
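A minimal sketch of how such a learned threshold could be applied (class name, property names, and threshold values are made up for illustration): a fact from a nearby year may support a target year only if the gap does not exceed the maximum year difference learned for the property.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of neighborhood scope estimation: a fact observed
// in a nearby year supports a target year only if the year gap is within
// the maximum difference learned for the property.
public class NeighborhoodScope {

    // Made-up per-property thresholds; in Timed-KBT these are learned
    // automatically and cached in the maxDiffs folder.
    static final Map<String, Integer> MAX_DIFF = new HashMap<>();
    static {
        MAX_DIFF.put("population", 2); // population changes slowly
        MAX_DIFF.put("mayor", 0);      // office holders may change yearly
    }

    static boolean inNeighborhood(String property, int factYear, int targetYear) {
        int maxDiff = MAX_DIFF.getOrDefault(property, 0);
        return Math.abs(factYear - targetYear) <= maxDiff;
    }

    public static void main(String[] args) {
        System.out.println(inNeighborhood("population", 2013, 2015)); // gap of 2 is allowed
        System.out.println(inNeighborhood("mayor", 2013, 2015));      // gap of 2 is too large
    }
}
```

Learning a larger threshold for slowly changing properties lets more neighboring years contribute evidence, while a threshold of zero keeps fast-changing properties strict.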
Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.
- [Cafarella2008] Cafarella, M. J., Halevy, A. Y., Zhang, Y., Wang, D. Z., and Wu, E. (2008). Uncovering the relational web. In Proceedings of the 11th International Workshop on the Web and Databases (WebDB '08).
- [Dong2016] Dong, X. L., Kementsietsidis, A., and Tan, W.-C. (2016). A time machine for information: Looking back to look forward. SIGMOD Record, 45(2):23–32.
- [Hoffart2013] Hoffart, J., Suchanek, F. M., Berberich, K., and Weikum, G. (2013). YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, 194:28–61.
- [Kuzey2012] Kuzey, E. and Weikum, G. (2012). Extraction of temporal facts and events from Wikipedia. In Proceedings of the 2nd Temporal Web Analytics Workshop (TempWeb '12), pages 25–32. ACM, New York, NY, USA.
- [Lehmann2015] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al. (2015). DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.
- [Oulabi2016] Oulabi, Y., Meusel, R., and Bizer, C. (2016). Fusing time-dependent web table data. In Proceedings of the 19th International Workshop on Web and Databases (WebDB '16), Article 3, 7 pages. ACM, New York, NY, USA.
- [Oulabi2017] Oulabi, Y. and Bizer, C. (2017). Estimating missing temporal meta-information using Knowledge-Based Trust. In Proceedings of the 3rd International Workshop on Knowledge Discovery on the Web (KDWeb '17). CEUR Workshop Proceedings, RWTH: Aachen.