Task Description and Data
Heiko Paulheim
Robert Meusel

The extraction of data from unstructured or semi-structured web sources has been recognized as a suitable way of populating the data web. Since many pages already contain embedded structured data, e.g., as RDFa, Microformats, or Microdata, this information can be used to bootstrap and train supervised systems that extract structured data from the web. In this challenge, we want to compare systems that use such annotated pages for extracting information from the web.

The challenge will be held at the Linked Data for Information Extraction (LD4IE) workshop 2014, co-located with the International Semantic Web Conference. The authors of the best performing system are awarded a 250 € book voucher, kindly sponsored by Springer.


1. Task Description

The challenge task is to write an information extraction system that extracts structured information from HTML web pages. To train the system, training data is made available in the form of web pages containing markup, in particular Microformats annotations in the hCard format.
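As an illustration of what the hCard markup in the annotated training pages looks like, the following minimal sketch (not part of the challenge materials) reads hCard class attributes from a page using BeautifulSoup. The property names (vcard, fn, org, tel, email, url) are standard hCard class names; the example HTML and the function are purely hypothetical, and a learned extractor for the non-annotated test pages would replace this step.

```python
# Minimal illustrative sketch (not part of the challenge materials):
# reads hCard class attributes from an annotated training page.
# The class names (vcard, fn, org, tel, email, url) are standard hCard
# properties; the example HTML below is hypothetical.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_hcards(html):
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for card in soup.find_all(class_="vcard"):
        record = {}
        for prop in ("fn", "org", "tel", "email", "url"):
            node = card.find(class_=prop)
            if node is not None:
                record[prop] = node.get_text(strip=True)
        cards.append(record)
    return cards

example = ('<div class="vcard"><span class="fn">Jane Doe</span>'
           '<span class="tel">+49 621 181 0000</span></div>')
print(extract_hcards(example))  # [{'fn': 'Jane Doe', 'tel': '+49 621 181 0000'}]
```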

2. Datasets

To create a gold standard, we have used a subset of files from the Web Data Commons Microformats dataset, which contains web pages from the Common Crawl that carry Microformats annotations. This year's challenge focuses on data in the hCard format.

2.1 Training Data

The training data can be found here. It has been divided into four sets of files for easier handling; the union of all four sets may be used for training. For each set, we provide the file types described in Section 2.3.

2.2 Test Data

The test data can be found here. It consists of files which originally contained hCard Microformats. All of those annotations have been removed, so that the web pages can be treated as non-annotated pages. However, the system can expect each page to contain data that can be expressed in the hCard format.

2.3 File Formats

_*(.clean).html.txt.gz_: These files contain a collection of crawled HTML pages. Each page is represented in the file by three consecutive parts (a sketch for reading these files is given after the file descriptions below).

_*.nq_: These files contain the N-Quads representation of all Microformats hCard instances that could be extracted, using the Any23 library, from the original HTML code of the pages contained in the corresponding file.
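As a minimal sketch, one of the gzipped page collections can be read with Python's standard library; the file name is a placeholder, and splitting the text into individual pages must follow the per-page structure described above.

```python
# Minimal sketch: read one of the gzipped page collections with the
# standard library. The file name is a placeholder; splitting the text
# into individual pages must follow the structure described above.
import gzip

path = "training_set_1.html.txt.gz"  # placeholder file name
with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
    text = f.read()
print(len(text), "characters read")
```

Similarly, assuming the rdflib library is used, a provided .nq file can be loaded and its statements iterated together with their origin (the fourth element of each quad); again, the file name is a placeholder.

```python
# Minimal sketch, assuming rdflib: load one of the provided .nq files
# and iterate over its statements and their origin.
# The file name is a placeholder.
from rdflib import ConjunctiveGraph

g = ConjunctiveGraph()
g.parse("training_set_1.nq", format="nquads")

for subj, pred, obj, ctx in g.quads():
    # ctx is the named graph, i.e., the page the statement originates from
    print(subj, pred, obj, ctx.identifier)
```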

3. Submission Instructions & Dates

The following figure illustrates the files to be used and to be produced in the challenge. The training files are used to train an extraction system, which is then run on the test set of HTML files. The resulting extracted quads are submitted to the challenge organizers for evaluation.

To take part in the challenge, please submit an N-Quads file (.nq), formatted like the result files provided with the training sets, which contains the statements extracted from the test dataset of HTML pages, along with a short paper describing your approach as well as some of your own evaluation results and findings. The paper must be formatted in Springer LNCS style and must not exceed four pages.

To make the submission, place both the N-Quads file and the paper in a ZIP archive and submit it via EasyChair.
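A minimal sketch, with placeholder file and resource names, of how the extracted statements could be serialized as N-Quads with rdflib and packaged together with the paper; the vCard predicate shown is only illustrative, and submissions should follow the vocabulary used in the training .nq files.

```python
# Minimal sketch with placeholder names: serialize extracted statements
# as N-Quads and package them with the paper for submission.
# The vCard predicate below is only illustrative; follow the vocabulary
# used in the training .nq files.
import zipfile
from rdflib import ConjunctiveGraph, URIRef, Literal

g = ConjunctiveGraph()
page = URIRef("http://example.org/page.html")           # origin (hypothetical)
person = URIRef("http://example.org/page.html#person")  # extracted node (hypothetical)
g.get_context(page).add(
    (person, URIRef("http://www.w3.org/2006/vcard/ns#fn"), Literal("Jane Doe"))
)

g.serialize(destination="extracted.nq", format="nquads")

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as z:
    z.write("extracted.nq")
    z.write("approach_paper.pdf")  # the short LNCS-style description
```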

Timeline:

4. Evaluation

We will evaluate the submitted results based on recall, precision, and F-measure at the statement level. A statement in the submitted results is counted as a true positive if it matches a statement in the gold standard, i.e., its subject, predicate, object, and origin are identical. Two blank nodes are always considered identical, following the RDF Semantics specification's notion of equality between RDF documents.
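The following sketch approximates this statement-level evaluation. It assumes that both files serialize each statement in exactly the same way (e.g., produced by the same tool) and approximates the blank-node rule by mapping every blank node label to a single placeholder; the official evaluation may normalize more carefully, and the file names are placeholders.

```python
# Minimal sketch of the statement-level evaluation: assumes both files
# serialize each statement identically and approximates the blank-node
# rule by mapping every blank node label to one placeholder.
# File names are placeholders.
import re

def load_quads(path):
    quads = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            quads.add(re.sub(r"_:[^\s]+", "_:bnode", line))
    return quads

gold = load_quads("gold_standard.nq")
submitted = load_quads("submission.nq")

tp = len(gold & submitted)
precision = tp / len(submitted) if submitted else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```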

5. Contact

For questions about the challenge, please contact Heiko Paulheim.

6. Acknowledgements

The challenge has been kindly supported by the Web Data Commons project, and by Springer.