*** Folders ***

* lrd-wat - contains surface forms for DBpedia entities extracted from DBpedia labels, redirects and disambiguations, and from anchor texts of internal Wikipedia links. Contains the unfiltered dataset and two datasets filtered at TF-IDF thresholds of 1.8 and 2.6.
* lrd-cc - contains surface forms for DBpedia entities extracted from anchor texts of external Wikipedia links from the Common Crawl corpus:
  http://blog.commoncrawl.org/2015/01/december-2014-crawl-archive-available/
  http://data.dws.informatik.uni-mannheim.de/structureddata/2014-12/wikianchor/
  Contains the unfiltered dataset and three datasets filtered at TF-IDF thresholds of 1.8, 2.6 and 3.8.
* gold - contains two annotated subsets of the unfiltered lrd-wat dataset.

*** File format ***

Each record contains the following fields:

* Wikipedia/DBpedia page
* surface form string
* TF-IDF score; empty if the surface form does not come from anchor texts
* whether the surface form comes from Wikipedia labels (L0), redirects (L1), disambiguations (L2) or the intersection of the last two (L3); empty if the surface form does not come from any of these sources
* page-surface form pair count; empty if the surface form does not come from anchor texts
* page count; empty if the surface form does not come from anchor texts
* surface form count; empty if the surface form does not come from anchor texts

*** Gold standard ***

"popular" set - 34 manually selected popular DBpedia entities
"random" set - 81 randomly selected entities, each having at least 5 surface forms

Annotation codes:

* correct ("ok"), meaning that the surface form can indeed be used as an alternative name for the corresponding entity ("the eternal city" for Rome or "red planet" for Mars);
* related, contained ("oi"), when the surface form is part of the entity ("Sao Paulo, Brazil" for Brazil or "Google Japan" for Google);
* related, contains ("og"), when the surface form contains the entity ("Turkey" for Istanbul);
* related, type of ("g"), when the surface form generalizes the entity ("the city" for Rome or "book" for The Da Vinci Code);
* related, partial ("p"), when the surface form is an ambiguous partial reference to the entity;
* related ("r"), for the numerous cases of related entities ("Google Blog" for Google, "Martian surface temperatures" for Mars);
* wrong ("w"), for surface forms that do not refer to the entity ("during World War I" for United States);
* wrong, formatting ("f"), e.g. surface forms with residual tags.
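
*** Example: reading a record ***

The fields listed under "File format" could be read with a short Python sketch like the one below. This README does not state the field separator or the file names, so the tab separator, the field order (taken from the order of the list above), and all names in the code (SurfaceFormRecord, parse_record, LABEL_SOURCES) are assumptions made for illustration only. Adjust them to the actual files.

```python
from dataclasses import dataclass
from typing import Optional

# Source codes as listed under "File format".
LABEL_SOURCES = {
    "L0": "label",
    "L1": "redirect",
    "L2": "disambiguation",
    "L3": "redirect and disambiguation",
}


@dataclass
class SurfaceFormRecord:
    """One record; hypothetical field names, order assumed from the README list."""
    page: str
    surface_form: str
    tfidf: Optional[float]          # None if not from anchor texts
    label_source: Optional[str]     # "L0".."L3", or None
    pair_count: Optional[int]       # page-surface form pair count
    page_count: Optional[int]
    surface_form_count: Optional[int]


def _opt(value, cast):
    """Convert a field with `cast`, mapping an empty field to None."""
    value = value.strip()
    return cast(value) if value else None


def parse_record(line, sep="\t"):
    """Parse one line. ASSUMPTION: fields are separated by `sep` (tab)."""
    # Strip only the line terminator so trailing empty fields survive.
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    page, form, tfidf, source, pair, pages, forms = fields
    return SurfaceFormRecord(
        page=page,
        surface_form=form,
        tfidf=_opt(tfidf, float),
        label_source=_opt(source, str),
        pair_count=_opt(pair, int),
        page_count=_opt(pages, int),
        surface_form_count=_opt(forms, int),
    )


# Made-up sample line; the numbers are invented and not taken from the dataset.
sample = "Rome\tthe eternal city\t2.7\t\t12\t3400\t15\n"
record = parse_record(sample)
from_anchor_text = record.tfidf is not None
```

A record whose TF-IDF and count fields are empty comes only from labels, redirects or disambiguations, which is why the sketch maps empty fields to None instead of zero.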