TweetNERD - End to End Entity Linking Benchmark for Tweets

This is the dataset described in the paper "TweetNERD - End to End Entity Linking Benchmark for Tweets", accepted to the Datasets and Benchmarks Track of the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS).

Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets spanning 2010-2021, for benchmarking NERD systems on Tweets. It is the largest and most temporally diverse open-source benchmark dataset for NERD on Tweets and can be used to facilitate research in this area.

UPDATE: The new version contains an additional ~125K Tweets, bringing the total dataset size to ~465K Tweets.

The TweetNERD dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. The license applies only to the data files present in this dataset. See the Data usage policy below.

More details are available at https://github.com/twitter-research/TweetNERD

Usage

We provide the dataset split across the following tab-separated files:

- OOD.public.tsv: OOD split of the data described in the paper.
- Academic.public.tsv: Academic split of the data described in the paper.
- part_*.public.tsv: Remaining data, split into parts in no particular order.

Each file is tab-separated and has the following format:

| tweet_id | phrase | start | end | entityId | score |
|---|---|---|---|---|---|
| 22 | twttr | 20 | 25 | Q918 | 3 |
| 21 | twttr | 20 | 25 | Q918 | 3 |
| 1457198399032287235 | Diwali | 30 | 38 | Q10244 | 3 |
| 1232456079247736833 | NO_PHRASE | -1 | -1 | NO_ENTITY | -1 |

For Tweets that do not contain any entity, the phrase, start, end, entityId, and score columns are set to NO_PHRASE, -1, -1, NO_ENTITY, and -1 respectively.
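The file format above can be read with Python's standard `csv` module. The following is a minimal sketch, not an official loader: the names `Annotation` and `read_annotations` are hypothetical, and the code assumes (without confirmation from the dataset description) that a file may begin with a header row matching the column names, which it skips if present. Missing-value markers are mapped to `None`.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Optional

# Column names as listed in the dataset description.
COLUMNS = ["tweet_id", "phrase", "start", "end", "entityId", "score"]


class Annotation(NamedTuple):
    """One row of a TweetNERD file (hypothetical container, not part of the dataset)."""
    tweet_id: str  # kept as a string, as in the column description
    phrase: Optional[str]
    start: Optional[int]
    end: Optional[int]
    entity_id: Optional[str]
    score: Optional[int]


def _int_or_none(value: str) -> Optional[int]:
    """Convert a numeric field, mapping the -1 missing-value marker to None."""
    number = int(value)
    return None if number == -1 else number


def read_annotations(lines: Iterable[str]) -> Iterator[Annotation]:
    """Yield annotations from tab-separated lines, skipping a header row if one is present."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row == COLUMNS:  # blank line or (assumed) header row
            continue
        tweet_id, phrase, start, end, entity_id, score = row
        yield Annotation(
            tweet_id=tweet_id,
            phrase=None if phrase == "NO_PHRASE" else phrase,
            start=_int_or_none(start),
            end=_int_or_none(end),
            entity_id=None if entity_id == "NO_ENTITY" else entity_id,
            score=_int_or_none(score),
        )


# In-memory sample built from two of the example rows shown above.
SAMPLE = (
    "tweet_id\tphrase\tstart\tend\tentityId\tscore\n"
    "22\ttwttr\t20\t25\tQ918\t3\n"
    "1232456079247736833\tNO_PHRASE\t-1\t-1\tNO_ENTITY\t-1\n"
)
rows = list(read_annotations(io.StringIO(SAMPLE)))
```

To read a real file, pass an open file handle instead of the in-memory sample, for example `open("OOD.public.tsv", newline="", encoding="utf-8")`. Keeping `tweet_id` as a string avoids precision loss if the rows are later passed through tools that store numbers as floating point.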
The file columns are described below:

| Column | Type | Missing Value | Description |
|---|---|---|---|
| tweet_id | string | | ID of the Tweet |
| phrase | string | NO_PHRASE | Entity phrase |
| start | int | -1 | Start offset of the phrase in the text using UTF-16BE encoding |
| end | int | -1 | End offset of the phrase in the text using UTF-16BE encoding |
| entityId | string | NO_ENTITY | Entity ID. If not missing, it can be NOT FOUND, AMBIGUOUS, or a Wikidata ID of the format Q{numbers}, e.g. Q918 |
| score | int | -1 | Number of annotators who agreed on the phrase, start, end, and entityId information |

To use the dataset, take the tweet_id column and retrieve the Tweet text through the Twitter API (see the Data usage policy section below).

Data stats

| Split | Number of Rows | Number of unique Tweets |
|---|---|---|
| OOD | 34102 | 25000 |
| Academic | 51685 | 30119 |
| part_0 | 11830 | 10000 |
| part_1 | 35681 | 25799 |
| part_2 | 34256 | 25000 |
| part_3 | 36478 | 25000 |
| part_4 | 37518 | 24999 |
| part_5 | 36626 | 25000 |
| part_6 | 34001 | 24984 |
| part_7 | 34125 | 24981 |
| part_8 | 32556 | 25000 |
| part_9 | 32657 | 25000 |
| part_10 | 32442 | 25000 |
| part_11 | 32033 | 24972 |
| part_12 | 76559 | 25000 |
| part_13 | 67240 | 24920 |
| part_14 | 67745 | 25000 |
| part_15 | 67652 | 25000 |
| part_16 | 65739 | 25000 |

Data usage policy

Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).
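Once the Tweet text has been retrieved, the start and end columns need some care, because they are expressed in UTF-16BE terms while Python strings are indexed by Unicode code point. The sketch below is one possible way to apply them. It assumes the offsets count UTF-16 code units and that end is exclusive, which is consistent with the twttr example (25 - 20 = 5) but is not stated explicitly in the description. The helper name `slice_utf16` and the example text are invented for illustration and are not taken from the dataset.

```python
def slice_utf16(text: str, start: int, end: int) -> str:
    """Return the substring covering UTF-16 code units [start, end).

    Assumption: offsets count 16-bit code units and `end` is exclusive.
    """
    encoded = text.encode("utf-16-be")  # two bytes per code unit, no byte order mark
    return encoded[2 * start : 2 * end].decode("utf-16-be")


# Invented example text: the emoji U+1F600 is one code point but two UTF-16 code units,
# so every UTF-16 offset after it is shifted by one relative to Python string indices.
example_text = "\U0001F600 hello twttr"

phrase = slice_utf16(example_text, 9, 14)  # UTF-16 offsets of "twttr" in this text
naive = example_text[9:14]                 # plain code point slicing, off by one here
```

For text made up entirely of characters in the Basic Multilingual Plane, the two slices agree. They diverge only after characters such as emoji, which Tweets frequently contain.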
Please cite the following if you use TweetNERD in your paper:

    @dataset{TweetNERD_Zenodo_2022_6617192,
      author    = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
      title     = {{TweetNERD - End to End Entity Linking Benchmark for Tweets}},
      month     = jun,
      year      = 2022,
      note      = {{Data usage policy Use of this dataset is subject to you obtaining lawful access to the [Twitter API](https://developer.twitter.com/en/docs/twitter-api), which requires you to agree to the [Developer Terms Policies and Agreements](https://developer.twitter.com/en/developer-terms/).}},
      publisher = {Zenodo},
      version   = {0.0.0},
      doi       = {10.5281/zenodo.6617192},
      url       = {https://doi.org/10.5281/zenodo.6617192}
    }

    @inproceedings{TweetNERDNeurips2022,
      author    = {Mishra, Shubhanshu and Saini, Aman and Makki, Raheleh and Mehta, Sneha and Haghighi, Aria and Mollahosseini, Ali},
      booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
      pages     = {},
      title     = {TweetNERD - End to End Entity Linking Benchmark for Tweets},
      volume    = {2},
      year      = {2022},
      eprint    = {arXiv:2210.08129},
      doi       = {10.48550/arXiv.2210.08129}
    }

Keywords

Twitter, Social Media, Tweet, Entity Linking, Named Entity Recognition, Wikidata

EOSC Subjects

Twitter Data
