The extraction of data from unstructured or semi-structured web sources has been recognized as a suitable way of populating the data web. Since many pages already use embedded structured data, e.g., as RDFa, Microformats, or Microdata, this information can be used to bootstrap and train supervised systems extracting structured data from the web. In this challenge, we want to compare systems using such annotated pages for extracting information from the web.
The challenge will be held at the Linked Data for Information Extraction (LD4IE) workshop 2014, co-located with the International Semantic Web Conference. The authors of the best performing system are awarded a 250 € book voucher, kindly sponsored by Springer.
1. Task Description
The challenge task is to write an information extraction system that scrapes structured information from HTML web sites. For training the system, training data is provided as web pages containing markup, in particular Microformats data in the hcard format.
2. Datasets
To create a gold standard, we have used a subset of files from the Web Data Commons Microformats dataset, which contains web pages with Microformats annotations extracted from the Common Crawl. This year's challenge focuses on data in the hcard format.
2.1 Training Data
The training data can be found here. It has been divided into four sets of files for better handling. The union of all four sets may be used for training. For each set, we provide the following files:
- *.html.txt.gz: Includes the original HTML code with all annotations
- *.clean.html.txt.gz: Includes a cleaned version of the HTML code with no annotations. Files of this type are also given to the participants as the test dataset (see below).
- *.nq: Includes the statements as quads, which can be extracted from the *.html.txt.gz file using the Any23 library.
2.2 Test Data
The test data can be found here. It consists of files which originally contained hcard microformats. All of those annotations have been removed, so that the pages can be treated as non-annotated web pages. However, the system can expect each of the web sites to contain data which may be expressed in the hcard format.
2.3 File Formats
_*(.clean).html.txt.gz_: These files include a collection of crawled HTML pages. Each page is represented in the file by the following three lines:
- 1. URI: [URI of the crawled page]
- 2. Content-Type: [Content type, including the detected charset, of the crawled page]
- 3. Content: [The actual HTML content without line breaks. In the *.clean.* version, annotations and comments were removed from the HTML code]
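The three-line record layout above can be read with a small parser. This is a minimal sketch: the field prefixes (`URI:`, `Content-Type:`, `Content:`) are taken from the description above and may need adjusting against the real files.

```python
import gzip

def read_pages(path):
    """Yield (uri, content_type, html) records from a *(.clean).html.txt.gz file.

    Assumes the three-line record layout described above: a URI line,
    a Content-Type line, and a Content line holding the full HTML.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        uri = ctype = None
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("URI:"):
                uri = line[len("URI:"):].strip()
            elif line.startswith("Content-Type:"):
                ctype = line[len("Content-Type:"):].strip()
            elif line.startswith("Content:"):
                # A complete record ends with its Content line.
                yield uri, ctype, line[len("Content:"):].strip()
                uri = ctype = None
```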
_*.nq_: These files include the N-Quads representation of all hcard-related Microformats instances which could be extracted from the original HTML code of all pages included in the corresponding file, using the Any23 library.
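For inspecting the *.nq files, a simple tokenizer can split each statement into its four terms. This is a simplified sketch, not a full N-Quads grammar (it ignores comment lines and some escaping corner cases); for production use, a proper parser such as the one shipped with rdflib is safer.

```python
import re

# Matches one RDF term: an IRI in angle brackets, a blank node label,
# or a literal with an optional datatype or language tag.
TERM = re.compile(r'<[^>]*>'
                  r'|_:[^\s]+'
                  r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[\w-]+)?')

def parse_nquad(line):
    """Split one N-Quads line into (subject, predicate, object, graph),
    or return None if the line does not contain exactly four terms."""
    terms = TERM.findall(line)
    return tuple(terms) if len(terms) == 4 else None
```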
3. Submission Instructions & Dates
The following picture illustrates the files to be used and to be produced in the challenge. The training files are used to train an extraction system, which is executed on the test set of HTML files. The resulting extracted quads are submitted to the challenge organizers for evaluation.
To take part in the challenge, please submit an N-Quads (.nq) file, formatted like the result files provided with the training sets, which contains the statements extracted from the test dataset of HTML pages, along with a short paper describing your approach, as well as some of your own evaluation results and findings. The paper must be formatted in Springer LNCS style and must not exceed four pages.
To make the submission, place both the n-quads file and the paper in a ZIP archive, and submit it via Easychair.
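Packaging the submission can be scripted, for example with Python's standard zipfile module. The file names below are placeholders for your own result file and paper.

```python
import zipfile

def make_submission(archive, result_file, paper_file):
    """Bundle the N-Quads result file and the paper into one ZIP archive.

    'result_file' and 'paper_file' are placeholder names; substitute
    your own files before submitting via Easychair.
    """
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(result_file)
        z.write(paper_file)
```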
Timeline:
- September 12, 2014: Submission of results and system descriptions
- September 19, 2014: Announcement of challenge results
- September 26, 2014: Camera ready version of system descriptions due
- October 19 or 20, 2014: Presentation of results at the LD4IE workshop
4. Evaluation
We will evaluate the submitted results based on recall, precision, and F-measure on statement level. A statement in the submitted results is counted as a true positive if it matches a statement in the gold standard, i.e., its subject, predicate, object, and origin are identical. Two blank nodes are always considered identical, i.e., we follow the RDF semantics specification for equality of RDF documents.
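The scoring rule above can be sketched as follows, assuming quads are represented as 4-tuples of term strings (the function names are illustrative, not the organizers' actual code). Note that collapsing all blank-node labels to one token is a simplification of the rule that any two blank nodes compare as identical.

```python
def _normalize(quad):
    # Collapse every blank-node label ("_:...") to a single token, so that
    # any two blank nodes compare as identical, per the rule above.
    return tuple("_:" if t.startswith("_:") else t for t in quad)

def evaluate(submitted, gold):
    """Statement-level precision, recall, and F-measure over
    (subject, predicate, object, origin) tuples."""
    sub = {_normalize(q) for q in submitted}
    gld = {_normalize(q) for q in gold}
    tp = len(sub & gld)  # true positives: statements matching the gold standard
    precision = tp / len(sub) if sub else 0.0
    recall = tp / len(gld) if gld else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```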
5. Contact
For questions about the challenge, please contact Heiko Paulheim.
6. Acknowledgements
The challenge has been kindly supported by the Web Data Commons project, and by Springer.