Temporally-Informed Analysis of Named Entity Recognition

Name: Temporally-Informed Analysis of Named Entity Recognition
Keywords: twitter ner, temporal analysis, named entity recognition, ner, twitter, information extraction, tweets

Research data > Dataset | 17 Jun 2020 | English | Publisher: Zenodo

Authors: Rijhwani, Shruti; Preoțiuc-Pietro, Daniel

doi: 10.5281/zenodo.3899040, 10.5281/zenodo.3899039

Abstract

This repository contains the data set developed for the paper: “Shruti Rijhwani and Daniel Preoțiuc-Pietro. Temporally-Informed Analysis of Named Entity Recognition. In Proceedings of the Association for Computational Linguistics (ACL). 2020.”

It includes 12,000 tweets annotated for the named entity recognition (NER) task. The tweets are uniformly distributed over the years 2014-2019, with 2,000 tweets from each year. The goal is to have a temporally diverse corpus to account for data drift over time when building NER models.

The entity types annotated are locations (LOC), persons (PER) and organizations (ORG). The tweets are preprocessed to replace usernames and URLs with a unique token. Hashtags are left intact and can be annotated as named entities.

Format

The repository contains the annotations in JSON format. Each year-wise file has the tweet IDs along with token-level annotations. The Public Twitter Search API (https://developer.twitter.com/en/docs/tweets/search) can be used to extract the text of the tweet corresponding to each tweet ID.

Data Splits

Typically, NER models are trained and evaluated on annotations available at model building time, but are then used to make predictions on data from a future time period. This setup makes the model susceptible to temporal data drift, leading to lower performance on future data than on the test set. To examine this effect, we use tweets from the years 2014-2018 as the training set and random splits of the 2019 tweets as the development and test sets. These splits simulate the scenario of making predictions on data from a future time period. The development and test splits are provided in JSON format.

Use

Please cite the data set and the accompanying paper if you find the resources in this repository useful.
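The temporal split described under Data Splits can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the record layout (a dict with a `tweet_id` key), the function name `temporal_split`, the even dev/test division and the seed are all hypothetical, since the abstract does not specify the JSON schema or the split sizes. The repository already ships the official development and test splits, which should be preferred for comparable results.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: the real JSON schema is not described in the abstract.
Record = Dict[str, object]


def temporal_split(
    records_by_year: Dict[int, List[Record]],
    train_years: Sequence[int] = tuple(range(2014, 2019)),  # 2014-2018 inclusive
    heldout_year: int = 2019,
    dev_fraction: float = 0.5,  # assumption: the source does not state split sizes
    seed: int = 0,
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Train on earlier years; randomly split the held-out year into dev and test."""
    # Training set: every record from the earlier years, in year order.
    train = [rec for year in train_years for rec in records_by_year.get(year, [])]

    # Held-out year: shuffle a copy with a fixed seed so the split is reproducible.
    heldout = list(records_by_year.get(heldout_year, []))
    random.Random(seed).shuffle(heldout)

    cut = int(len(heldout) * dev_fraction)
    return train, heldout[:cut], heldout[cut:]


if __name__ == "__main__":
    # Synthetic stand-in data: four fake records per year, 2014-2019.
    fake = {
        year: [{"tweet_id": f"{year}-{i}"} for i in range(4)]
        for year in range(2014, 2020)
    }
    train, dev, test = temporal_split(fake)
    print(len(train), len(dev), len(test))
```

Keeping the held-out year entirely out of the training set is what makes the evaluation simulate prediction on future data; shuffling only within that year gives random development and test sets without leaking 2019 tweets into training.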

EOSC Subjects

Twitter Data

Impact by BIP!

- citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Usage by UsageCounts

- views: 122
- downloads: 24
