WD50K

WD50K dataset: a hyper-relational dataset derived from Wikidata statements. The dataset is constructed from the [Wikidata RDF dump](https://dumps.wikimedia.org/wikidatawiki/20190801/) of August 2019 by the following procedure:

- A set of seed nodes corresponding to entities from FB15K-237 that have a direct mapping in Wikidata (P646 "Freebase ID") is extracted from the dump.
- For each seed node, all statements whose main object and qualifier values correspond to wikibase:Item are extracted from the dump.
- All literals are filtered out from the qualifiers of the statements obtained above.
- All entities that have fewer than two mentions in the dataset are dropped. The statements corresponding to the dropped entities are also dropped.
- The remaining statements are randomly split into train, test, and validation sets.
- All statements in the train and validation sets that share the same main triple (s,p,o) with a test statement are removed.
- WD50k_33, WD50k_66, and WD50k_100 are then sampled from the above statements. Here 33, 66, and 100 denote the approximate percentage of hyper-relational facts (statements with qualifiers) in the dataset.
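The filtering steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' actual pipeline: the tuple layout `(s, p, o, qualifiers)`, the function names, and the choice to drop a whole statement when any of its entities is rare (applied in a single pass) are assumptions made for this example.

```python
from collections import Counter

# Assumed representation (hypothetical): a statement is
# (subject, predicate, object, qualifiers), where qualifiers is a tuple of
# (qualifier_relation, qualifier_entity) pairs.


def entity_mentions(statements):
    """Count how often each entity occurs as subject, object, or qualifier value."""
    counts = Counter()
    for s, _p, o, quals in statements:
        counts[s] += 1
        counts[o] += 1
        for _qp, qv in quals:
            counts[qv] += 1
    return counts


def drop_rare_entities(statements, min_mentions=2):
    """Keep only statements whose entities all have at least min_mentions mentions.

    Assumption: a statement is dropped entirely if any of its entities is rare,
    and the filter is applied once rather than repeated until stable.
    """
    counts = entity_mentions(statements)

    def keep(stmt):
        s, _p, o, quals = stmt
        entities = [s, o] + [qv for _qp, qv in quals]
        return all(counts[e] >= min_mentions for e in entities)

    return [stmt for stmt in statements if keep(stmt)]


def remove_test_leakage(train, valid, test):
    """Remove train/valid statements that share a main triple (s, p, o) with test."""
    test_triples = {(s, p, o) for s, p, o, _quals in test}

    def clean(split):
        return [st for st in split if (st[0], st[1], st[2]) not in test_triples]

    return clean(train), clean(valid)


def qualifier_ratio(statements):
    """Fraction of statements that carry at least one qualifier."""
    return sum(1 for st in statements if st[3]) / len(statements)
```

Comparing main triples only, and ignoring qualifiers, means that a training statement is removed even when it differs from a test statement solely in its qualifiers, which matches the leakage rule described in the procedure.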
The table below provides some basic statistics of our dataset and its three further variations:

| Dataset     | Statements | With qualifiers (%) | Entities | Relations | Entities only in qualifiers | Relations only in qualifiers | Train   | Valid  | Test   |
|-------------|------------|---------------------|----------|-----------|-----------------------------|------------------------------|---------|--------|--------|
| WD50K       | 236,507    | 32,167 (13.6%)      | 47,156   | 532       | 5,460                       | 45                           | 166,435 | 23,913 | 46,159 |
| WD50K (33)  | 102,107    | 31,866 (31.2%)      | 38,124   | 475       | 6,463                       | 47                           | 73,406  | 10,668 | 18,133 |
| WD50K (66)  | 49,167     | 31,696 (64.5%)      | 27,347   | 494       | 7,167                       | 53                           | 35,968  | 5,154  | 8,045  |
| WD50K (100) | 31,314     | 31,314 (100%)       | 18,792   | 279       | 7,862                       | 75                           | 22,738  | 3,279  | 5,297  |

When using the dataset, please cite:

    @inproceedings{StarE,
      title={Message Passing for Hyper-Relational Knowledge Graphs},
      author={Galkin, Mikhail and Trivedi, Priyansh and Maheshwari, Gaurav and Usbeck, Ricardo and Lehmann, Jens},
      booktitle={EMNLP},
      year={2020}
    }

For any further questions, please contact: mikhail.galkin@iais.fraunhofer.de

Funding sources

- SPEAKER: 01MK20011A
- JOSEPH: Fraunhofer Zukunftsstiftung
- Cleopatra: 812997
- ML2R: 01 15 18038 A/B/C
- MLwin: 01IS18050 D/F
- ScADS: 01IS18026A

Related Organizations

University of Bonn
Germany
TU Dresden
Germany

Keywords

Wikidata, Knowledge Graph, Link prediction, Hyper Relational Graph, Graph Convolutional Network, Natural Language Processing
