ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO, Datacite

CREXWET-SYNTH: CREXdata-Weather-Emergency-Twitter-SYNTHetic

Authors: Orama, Jonathan Ayebakuro; Juvillà Garcia, Marc; Melero, Maite

Abstract

Dataset Summary

The CREXWET-SYNTH dataset is a multilingual corpus of synthetically generated tweet-style texts, each labelled according to its relevance to a flash flood or a wildfire, for weather event detection. The dataset contains 37k unique sentences (9k German, 10k English, 9k Spanish, 8k Catalan).

Supported Tasks and Leaderboards

- text-classification: This dataset can be used to train models for flood and wildfire detection.

Languages

The languages included in the dataset are:
- English (`en`)
- German (`de`)
- Spanish (`es`)
- Catalan (`ca-ES`)

Dataset Structure

Data Fields

Each instance in the dataset contains the following fields:
- id: index id specific to this dataset.
- text: generated text in the form of a tweet.
- language: language of the text. Possible values: ENGLISH, GERMAN, SPANISH, CATALAN.
- label: event label. Possible values: fire, flood, none.
- label_quality: label quality score provided by Cleanlab.
- model: LLM used to generate the text.
- prompt_category: type of prompt used to generate the text.

Data Instances

An example instance:

    {
      "id": "0",
      "text": "Ufff, quin partit del Barça ahir! 😍 Sembla que la Lliga ja és nostra! #ForçaBarça #ViscaElBarça",
      "language": "CATALAN",
      "label": "none",
      "label_quality": "0.995009975117508",
      "model": "gemma_3",
      "prompt_category": "unrelated_to_crisis_discussing_random_topics"
    }

Dataset Creation

Curation Rationale

This dataset was created to augment real data used to train a weather emergency detection model within the CREXDATA project (Grant Agreement No. 101092749).

Source Data

Initial Data Collection and Normalization

The data was generated using the 8-bit quantized versions of Google’s Gemma 3 27B and MistralAI’s Mistral Small 24B. The models were prompted to generate texts in the following categories:
- `from_affected_persons_with_keywords`: The LLM was instructed to generate social media posts written from the perspective of individuals affected by a wildfire or flood.
  The prompt included guidance on tone and content, along with a list of keywords to incorporate. These are annotated as `fire` or `flood` depending on the incident used in the prompt.
- `from_government_and_meteorological_agencies_with_warning_alerts`: The LLM was instructed to simulate posts issued by government or meteorological agencies, particularly those providing public warnings and alerts. This introduces an institutional perspective into the dataset. These are annotated as `fire` or `flood` depending on the incident used in the prompt.
- `unrelated_to_crisis_discussing_random_topics`: To simulate unrelated social media posts, the LLM was prompted to produce posts about various topics such as politics, celebrities, cancer, music, sports, food, lifestyle, memes, breaking news, personal updates, and tourism. These topics were chosen to represent general social media noise. These are annotated as `none`.
- `related_to_crisis_but_lacking_useful_information`: The LLM was instructed to generate posts that mention the incident without providing actionable content, such as posts requesting donations, expressing sympathy, criticizing authorities, or promoting conspiracy theories. These are annotated as `none`.

The prompts for each category can be found at this repository.

Who are the source language producers?

The source language was produced by the LLMs used for generation.

Annotations

Annotation process

The annotations were produced by the LLMs. They were further cleaned using Cleanlab, following its documentation. Instances with label errors and instances with a label quality below 0.3 were dropped. Instances labelled `fire` or `flood` with a label quality below 0.7 were re-labelled as `none`. Our annotation and label cleaning process can be found at the repository.

Who are the annotators?

The annotators were the LLMs used for generation.
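The two thresholds described under Annotation process can be read as a small per-instance rule. The following is a minimal sketch of that rule only, not the code from the authors' repository: the function name `clean_label` and the `is_label_error` flag are hypothetical, the flag stands in for whatever error signal the label-cleaning step produced, and the handling of scores exactly equal to 0.3 or 0.7 is an assumption based on the wording "below" and "less than".

```python
from typing import Optional

DROP_BELOW = 0.3     # drop threshold stated in the dataset description
RELABEL_BELOW = 0.7  # re-label threshold stated in the dataset description


def clean_label(label: str, label_quality: float,
                is_label_error: bool = False) -> Optional[str]:
    """Return the cleaned label, or None if the instance should be dropped.

    `is_label_error` is a hypothetical flag representing an instance
    marked as a label error by the cleaning tool.
    """
    # Drop flagged errors and very low quality labels.
    if is_label_error or label_quality < DROP_BELOW:
        return None
    # Low-confidence event labels fall back to the negative class.
    if label in ("fire", "flood") and label_quality < RELABEL_BELOW:
        return "none"
    return label


# Illustrative values only, not taken from the dataset.
rows = [
    ("flood", 0.95, False),
    ("fire", 0.50, False),
    ("flood", 0.20, False),
    ("fire", 0.90, True),
]
for label, quality, flagged in rows:
    print(label, quality, flagged, "->", clean_label(label, quality, flagged))
```

Under this reading, an instance can only move from an event class to `none`, never the other way, so the cleaning step favours precision on the `fire` and `flood` classes.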
Personal and Sensitive Information

N/A

Considerations for Using the Data

Social Impact of Dataset

We hope this data can improve research into weather emergency detection in social media data.

Discussion of Biases

We are aware that, since the data comes from LLMs, it can contain biases, hate speech and toxic content. We have not applied any steps to reduce their impact.

Other Known Limitations

The dataset is fully generated by LLMs and should be used solely for augmenting real data.

Additional Information

Dataset Curators

Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center. This work has been developed under the EU-funded CREXDATA Project (Grant Agreement No. 101092749). Since part of this data was generated using Google's Gemma 3 model, its usage should follow the Gemma Terms of Use and Prohibited Use Policy.

Licensing Information

Creative Commons Attribution 4.0.
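For readers who want to consume the records, the layout listed under Data Fields can be mirrored in a small typed structure. This is a sketch under assumptions: the distribution file format is not stated here, so the code assumes each record is available as a JSON object shaped like the example instance, and the names `Instance` and `parse_instance` are hypothetical. Note that `label_quality` appears as a string in the example, so it is converted to a float.

```python
import json
from dataclasses import dataclass

# Allowed values as listed in the dataset description.
VALID_LABELS = {"fire", "flood", "none"}
VALID_LANGUAGES = {"ENGLISH", "GERMAN", "SPANISH", "CATALAN"}


@dataclass(frozen=True)
class Instance:
    id: str
    text: str
    language: str
    label: str
    label_quality: float
    model: str
    prompt_category: str


def parse_instance(raw: str) -> Instance:
    """Parse one JSON record and validate its label and language."""
    record = json.loads(raw)
    if record["label"] not in VALID_LABELS:
        raise ValueError(f"unexpected label: {record['label']!r}")
    if record["language"] not in VALID_LANGUAGES:
        raise ValueError(f"unexpected language: {record['language']!r}")
    return Instance(
        id=record["id"],
        text=record["text"],
        language=record["language"],
        label=record["label"],
        # Stored as a string in the example record, so convert it.
        label_quality=float(record["label_quality"]),
        model=record["model"],
        prompt_category=record["prompt_category"],
    )


# The example instance from the dataset description.
raw = (
    '{"id": "0", "text": "Ufff, quin partit del Barça ahir! 😍 Sembla que la '
    'Lliga ja és nostra! #ForçaBarça #ViscaElBarça", "language": "CATALAN", '
    '"label": "none", "label_quality": "0.995009975117508", '
    '"model": "gemma_3", "prompt_category": '
    '"unrelated_to_crisis_discussing_random_topics"}'
)
instance = parse_instance(raw)
print(instance.language, instance.label, instance.label_quality)
```

Validating against the documented value sets makes a mismatch between the description and the actual files surface at load time.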

Funded by
Related to Research communities
Cancer Research