CREXWET-SYNTH: CREXdata-Weather-Emergency-Twitter-SYNTHetic

Dataset Summary

The CREXWET-SYNTH dataset is a multilingual corpus of synthetically generated texts in the form of tweets, labelled according to their relevance to a flash flood or wildfire, for weather event detection. The dataset contains 37k (9k German, 10k English, 9k Spanish, 8k Catalan) unique sentences.

Supported Tasks and Leaderboards

text-classification: This dataset can be used to train models for flood and wildfire detection.

Languages

The languages included in the dataset are:
- English (`en`)
- German (`de`)
- Spanish (`es`)
- Catalan (`ca-ES`)

Dataset Structure

Data Instances

Each instance in the dataset contains the following fields:
- id: index id specific to this dataset.
- text: generated text in the form of a tweet.
- language: language of the text. Possible values: ENGLISH, GERMAN, SPANISH, CATALAN.
- label: event label. Possible values: fire, flood, none.
- label_quality: label quality score provided by Cleanlab.
- model: LLM used to generate the text.
- prompt_category: type of prompt used to generate the text.

Data Fields

{
  "id": "0",
  "text": "Ufff, quin partit del Barça ahir! 😍 Sembla que la Lliga ja és nostra! #ForçaBarça #ViscaElBarça",
  "language": "CATALAN",
  "label": "none",
  "label_quality": "0.995009975117508",
  "model": "gemma_3",
  "prompt_category": "unrelated_to_crisis_discussing_random_topics"
}

Dataset Creation

Curation Rationale

This dataset was created to augment real data used to train a weather emergency detection model within the CREXDATA project (Grant Agreement No. 101092749).

Source Data

Initial Data Collection and Normalization

The data was generated using the 8-bit quantized versions of Google’s Gemma 3 27B and MistralAI’s Mistral Small 24B. The models were prompted to generate texts in the following categories:
- `from_affected_persons_with_keywords`: In this case, the LLM was instructed to generate social media posts written from the perspective of individuals affected by a wildfire or flood.
The prompt included guidance on tone and content, along with a list of keywords to incorporate. These are annotated as 'fire' or 'flood' depending on the incident used in the prompt.
- `from_government_and_meteorological_agencies_with_warning_alerts`: The LLM was instructed to simulate posts issued by government or meteorological agencies, particularly those providing public warnings and alerts. This helped introduce an institutional perspective into the dataset. These are annotated as 'fire' or 'flood' depending on the incident used in the prompt.
- `unrelated_to_crisis_discussing_random_topics`: To simulate unrelated social media posts, the LLM was prompted to produce posts about various topics such as politics, celebrities, cancer, music, sports, food, lifestyle, memes, breaking news, personal updates, and tourism. These topics were chosen to represent general social media noise. These are annotated as 'none'.
- `related_to_crisis_but_lacking_useful_information`: Here, the LLM was instructed to generate posts that mention the incident without providing actionable content, such as those requesting donations, expressing sympathy, criticizing authorities, or promoting conspiracy theories. These are annotated as 'none'.

The prompts for each category can be found at this repository.

Who are the source language producers?

The source language was produced by the LLMs used for generation.

Annotations

Annotation process

The annotations were produced by the LLMs. They were further cleaned using Cleanlab, following its documentation. Labels flagged as errors and low-quality labels (label quality below 0.3) were dropped. Instances labelled `fire` or `flood` with a label quality below 0.7 were then re-labelled as `none`. Our annotation and label cleaning process can be found at the repository.

Who are the annotators?

The annotators were the LLMs used for generation.

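The two label-quality thresholds mentioned in the annotation process can be illustrated with a short sketch. This is not the authors' pipeline, which lives in their repository: the function name `apply_quality_thresholds` and the sample records are hypothetical, and the sketch assumes each record is a dict carrying the `label` and `label_quality` fields listed under Data Instances, with `label_quality` stored as a string as in the example record. It covers only the score thresholds (0.3 and 0.7), not Cleanlab's own detection of label errors.

```python
from typing import Iterable

DROP_BELOW = 0.3      # records scoring under this are discarded
RELABEL_BELOW = 0.7   # fire/flood records scoring under this become "none"
EVENT_LABELS = {"fire", "flood"}


def apply_quality_thresholds(records: Iterable[dict]) -> list[dict]:
    """Apply the two label-quality thresholds described in the text."""
    cleaned = []
    for record in records:
        # label_quality is a string in the example record, so convert it.
        quality = float(record["label_quality"])
        if quality < DROP_BELOW:
            continue  # low-quality label: drop the record
        updated = dict(record)  # copy so the input is left untouched
        if updated["label"] in EVENT_LABELS and quality < RELABEL_BELOW:
            updated["label"] = "none"
        cleaned.append(updated)
    return cleaned


# Hypothetical records, made up for illustration only.
sample = [
    {"id": "a", "label": "flood", "label_quality": "0.95"},
    {"id": "b", "label": "fire", "label_quality": "0.55"},
    {"id": "c", "label": "none", "label_quality": "0.10"},
    {"id": "d", "label": "none", "label_quality": "0.40"},
]
result = apply_quality_thresholds(sample)
print([(r["id"], r["label"]) for r in result])
```

In this made-up sample, record `c` is dropped for scoring below 0.3, record `b` is re-labelled from `fire` to `none` for scoring below 0.7, and records `a` and `d` pass through unchanged. Both comparisons are strict, matching the wording "below 0.3" and "less than 0.7".
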
Personal and Sensitive Information

N/A

Considerations for Using the Data

Social Impact of Dataset

We hope this data can improve research into weather emergency detection in social media data.

Discussion of Biases

We are aware that, since the data comes from LLMs, it can contain biases, hate speech and toxic content. We have not applied any steps to reduce their impact.

Other Known Limitations

The dataset is fully generated by LLMs and should be used solely for augmenting real data.

Additional Information

Dataset Curators

Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center. This work has been developed under the EU-funded CREXDATA Project (Grant Agreement No. 101092749). Since part of this data was generated using Google's Gemma 3 model, its usage should follow the Gemma Terms of Use and Prohibited Use Policy.

Licensing Information

Creative Commons Attribution 4.0.
