ZENODO
Dataset . 2026
License: CC BY
Data sources: Datacite

HazMiner dataset

Authors: Valkenborg, Bram; Dewitte, Olivier; Smets, Benoît

Abstract

The HazMiner dataset contains the location, timing and impact of geo-hydrological hazard events (floods, landslides and flash floods) at the global scale. The data are extracted with HazMiner, a paragraph-based text mining method that uses large language models to extract information from online news articles. The method is specifically designed to improve the documentation of geo-hydrological hazards in the Global South. The current version covers 2017 through 2024 and contains 21,411 flood, 7,659 landslide and 3,606 flash flood events, extracted from 6,366,905 news articles in 58 languages.

More information on HazMiner:
More information on the code:

General information

The dataset contains information on three levels: articles, paragraphs and events.

  • Articles (Level 1): the articles used to extract the geo-hydrological events
  • Paragraphs (Level 2): the paragraphs of the corresponding articles, with their extracted information on location, timing and impact
  • Events (Level 3): geo-hydrological hazard events, each representing a cluster of paragraphs that occur around the same time and place

The three levels are linked to each other by ids: each article, paragraph and event has its own id. All articles were retrieved from the GDELT Global Knowledge Graph (GDELT, 2025).

How to get started

More to follow soon.
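As a rough illustration of how the three levels link through their ids, the sketch below groups paragraphs by EventID and attaches each paragraph's article through ArticleID. It is a minimal sketch, not the authors' code: the records are made-up placeholders, the helper name link_levels is hypothetical, and the file format of the published tables is not stated here, so the example works on in-memory records only. The column names (ArticleID, ParagraphID, EventID, language_iso) are taken from the Structure tables.

```python
from collections import defaultdict

# Illustrative placeholder records only; these are NOT real HazMiner data.
articles = [
    {"ArticleID": "A1", "title": "Example flood report", "language_iso": "en"},
    {"ArticleID": "A2", "title": "Example landslide report", "language_iso": "fr"},
]
paragraphs = [
    {"ParagraphID": "P1", "ArticleID": "A1", "EventID": "E1", "Hazard type": "flood"},
    {"ParagraphID": "P2", "ArticleID": "A2", "EventID": "E1", "Hazard type": "flood"},
    {"ParagraphID": "P3", "ArticleID": "A2", "EventID": "E2", "Hazard type": "landslide"},
]


def link_levels(articles, paragraphs):
    """Group paragraphs by EventID and attach each paragraph's article.

    Hypothetical helper for illustration; assumes every record is a dict
    keyed by the column names listed in the Structure tables.
    """
    articles_by_id = {a["ArticleID"]: a for a in articles}
    events = defaultdict(list)
    for p in paragraphs:
        # .get() returns None if a paragraph points at an unknown article.
        article = articles_by_id.get(p["ArticleID"])
        events[p["EventID"]].append({"paragraph": p, "article": article})
    return dict(events)


linked = link_levels(articles, paragraphs)
for event_id, members in linked.items():
    languages = sorted(
        {m["article"]["language_iso"] for m in members if m["article"]}
    )
    print(event_id, len(members), languages)
```

With the real tables, the same id-based grouping could be done with any tabular tool once the files are loaded; the dictionary lookup here only stands in for a join on ArticleID and EventID.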
Structure

Articles

Column                Description
ArticleID             The id of the article
title                 The title of the article
url                   The url of the article
domain                The domain of the article
sourcecountry         The source country of the article
Publication_time      The publication time (YYYY-MM-DD HH:MM:SS)
Location              The place names extracted from the article by a NER large language model
NER score             The score assigned to the output of the large language model; each location has its own score (listed in the same order as 'Location')
language_iso          The language of the article (ISO 639-1)

Paragraphs

Column                Description
ParagraphID           The id of the paragraph
Hazard type           The hazard type (flood, landslide or flash flood), identified by a zero-shot classification model
Hazard type score     The score given by the zero-shot classification model
Time                  The time of the hazard, extracted from a time reference in the text; when there is no time reference, it equals the publication time of the article (YYYY-MM-DD HH:MM:SS)
Publication_time      The publication time of the article (YYYY-MM-DD HH:MM:SS)
Location              The place names extracted from the article by a NER large language model
NER score             The score assigned to the output of the large language model; each location has its own score (listed in the same order as 'Location')
lat                   The latitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°)
lon                   The longitude of the paragraph, a weighted average of all locations mentioned in the paragraph (°)
minLat                The southern boundary of the paragraph location (°)
maxLat                The northern boundary of the paragraph location (°)
minLon                The western boundary of the paragraph location (°)
maxLon                The eastern boundary of the paragraph location (°)
number_death          The number of deaths extracted by a large language model (Q&A)
score_death           The score of the answer for the number of deaths returned by the large language model
answer_death          The answer for the number of deaths returned by the large language model
number_homeless       The number of homeless extracted by a large language model (Q&A)
score_homeless        The score of the answer for the number of homeless returned by the large language model
answer_homeless       The answer for the number of homeless returned by the large language model
number_injured        The number of injured extracted by a large language model (Q&A)
score_injured         The score of the answer for the number of injured returned by the large language model
answer_injured        The answer for the number of injured returned by the large language model
number_affected       The number of affected extracted by a large language model (Q&A)
score_affected        The score of the answer for the number of affected returned by the large language model
answer_affected       The answer for the number of affected returned by the large language model
number_missing        The number of missing extracted by a large language model (Q&A)
score_missing         The score of the answer for the number of missing returned by the large language model
answer_missing        The answer for the number of missing returned by the large language model
number_evacuated      The number of evacuated extracted by a large language model (Q&A)
score_evacuated       The score of the answer for the number of evacuated returned by the large language model
answer_evacuated      The answer for the number of evacuated returned by the large language model
ArticleID             The id of the corresponding article
title                 The title of the corresponding article
domain                The domain of the corresponding article
sourcecountry         The source country of the corresponding article
language_iso          The language of the article (ISO 639-1)
EventID               The id of the corresponding event

Events

Column                Description
EventID               The id of the event
Hazard type           The hazard type of the event
hazard_score          The score given by the zero-shot classification model (average over paragraphs)
lat                   The latitude of the event (medoid of paragraphs) (°)
lon                   The longitude of the event (medoid of paragraphs) (°)
min_lat               The southern boundary of the event (southernmost paragraph) (°)
max_lat               The northern boundary of the event (northernmost paragraph) (°)
min_lon               The western boundary of the event (westernmost paragraph) (°)
max_lon               The eastern boundary of the event (easternmost paragraph) (°)
Start                 The start of the event (timing of the first paragraph) (YYYY-MM-DD)
End                   The end of the event (timing of the last paragraph) (YYYY-MM-DD)
Time                  The timing of the event (median time of all paragraphs) (YYYY-MM-DD)
Duration              The duration of the event (days)
Paragraphs            The paragraph ids of all paragraphs of the event
Articles              The article ids of all articles of the event
n_paragraphs          The number of paragraphs
n_articles            The number of articles
n_language            The number of languages
n_sourcecountry       The number of source countries
n_domain              The number of domains
mostfreq_death        The most frequently reported number of deaths
n_mostfreq_death      The number of times the most frequently reported number of deaths is reported
median_death          The median number of deaths (median of all paragraphs)
mostfreq_homeless     The most frequently reported number of homeless
n_mostfreq_homeless   The number of times the most frequently reported number of homeless is reported
median_homeless       The median number of homeless (median of all paragraphs)
mostfreq_injured      The most frequently reported number of injured
n_mostfreq_injured    The number of times the most frequently reported number of injured is reported
median_injured        The median number of injured (median of all paragraphs)
mostfreq_affected     The most frequently reported number of affected
n_mostfreq_affected   The number of times the most frequently reported number of affected is reported
median_affected       The median number of affected (median of all paragraphs)
mostfreq_missing      The most frequently reported number of missing
n_mostfreq_missing    The number of times the most frequently reported number of missing is reported
median_missing        The median number of missing (median of all paragraphs)
mostfreq_evacuated    The most frequently reported number of evacuated
n_mostfreq_evacuated  The number of times the most frequently reported number of evacuated is reported
median_evacuated      The median number of evacuated (median of all paragraphs)

Disclaimer

The dataset is part of a preprint; once the preprint is published, the data will be available in open access. The HazMiner database was created through lawful text and data mining (TDM) in accordance with Article 3 of Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market. All data contained in this database are the result of automated extraction and synthesis from lawfully accessible sources, including publicly available news articles indexed in the GDELT database. The dataset contains only factual information (e.g., time, location, type of event, reported impacts) and does not reproduce any protected expression or copyrighted content from the original sources.
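
The event-level impact columns in the Events table (mostfreq_*, n_mostfreq_* and median_*) summarise the paragraph-level number_* values of an event. The sketch below shows one way such a summary could be computed for a single impact type. It is an illustration under assumptions, not the HazMiner implementation: the helper name summarise_impact is hypothetical, the input numbers are made up, and how HazMiner treats missing values or ties between equally frequent numbers is not stated in this description (here missing values are skipped and ties go to the value seen first).

```python
from collections import Counter
from statistics import median


def summarise_impact(values):
    """Summarise paragraph-level impact numbers for one event.

    Hypothetical helper: returns the most frequently reported number,
    how often it is reported, and the median over all reported numbers.
    Paragraphs without a reported number (None) are ignored, which is
    an assumption of this sketch.
    """
    reported = [v for v in values if v is not None]
    if not reported:
        return {"mostfreq": None, "n_mostfreq": 0, "median": None}
    # most_common(1) keeps the first-seen value when counts are tied.
    value, count = Counter(reported).most_common(1)[0]
    return {"mostfreq": value, "n_mostfreq": count, "median": median(reported)}


# Made-up number_death values for the paragraphs of one event.
number_death = [12, 12, 15, None, 12, 20]
summary = summarise_impact(number_death)
print(summary)
```

Reporting both the most frequent number and the median is useful because news reports of the same event often disagree: the most frequent number shows where sources converge, while the median is less sensitive to a single outlying report.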

Keywords

Landslide, Text Mining, Flash flood, News Articles, Flood, Natural Hazards
