ZENODO
Dataset . 2023
License: CC BY
Data sources: Datacite
ZENODO
Dataset . 2023
License: CC BY
Data sources: ZENODO
ZENODO
Dataset . 2023
License: CC BY SA
Data sources: Datacite
ZENODO
Dataset . 2023
License: CC BY SA
Data sources: ZENODO

The Vuk'uzenzele South African Multilingual Corpus

Authors: Marivate, Vukosi; Njini, Daniel; Madodonga, Andani; Lastrucci, Richard; Dzingirai, Isheanesu; Rajab, Jenalea


Abstract

# The Vuk'uzenzele South African Multilingual Corpus

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7598539.svg)](https://doi.org/10.5281/zenodo.7598539)

GitHub: https://github.com/dsfsi/vukuzenzele-nlp

## About dataset

The dataset contains editions of the South African government magazine Vuk'uzenzele. The data was scraped from PDFs, which have been placed in the [data/raw](data/raw/) folder. The PDFs were obtained from the [Vuk'uzenzele website](https://www.vukuzenzele.gov.za/).

The dataset contains government magazine editions in 11 languages, namely:

| Language   | Code  | Language   | Code  |
|------------|-------|------------|-------|
| English    | (eng) | Sepedi     | (sep) |
| Afrikaans  | (afr) | Setswana   | (tsn) |
| isiNdebele | (nbl) | Siswati    | (ssw) |
| isiXhosa   | (xho) | Tshivenda  | (ven) |
| isiZulu    | (zul) | Xitsonga   | (tso) |
| Sesotho    | (nso) |            |       |

### Number of Aligned Pairs with Cosine Similarity Score >= 0.65

| src_lang | trg_lang | num_aligned_pairs |
|----------|----------|-------------------|
| ven      | zul      | 186               |
| ssw      | xho      | 1965              |
| sep      | xho      | 279               |
| nbl      | zul      | 227               |
| nso      | tsn      | 1279              |
| nso      | tso      | 1491              |
| tsn      | zul      | 1346              |
| afr      | eng      | 1369              |
| eng      | ssw      | 1601              |
| afr      | ssw      | 1496              |
| nbl      | ssw      | 264               |
| tso      | zul      | 1758              |
| afr      | zul      | 1384              |
| eng      | zul      | 1888              |
| ssw      | tsn      | 1263              |
| sep      | tsn      | 302               |
| nso      | xho      | 1248              |
| sep      | tso      | 324               |
| ssw      | tso      | 1657              |
| tsn      | ven      | 235               |
| eng      | nbl      | 153               |
| nso      | sep      | 349               |
| afr      | nbl      | 359               |
| nbl      | ven      | 657               |
| eng      | ven      | 243               |
| afr      | ven      | 281               |
| tso      | ven      | 256               |
| ven      | xho      | 215               |
| eng      | tsn      | 1380              |
| afr      | tsn      | 1076              |
| nso      | ssw      | 1132              |
| eng      | tso      | 2016              |
| afr      | tso      | 1139              |
| xho      | zul      | 1895              |
| tsn      | xho      | 1209              |
| sep      | zul      | 223               |
| nbl      | xho      | 204               |
| ssw      | zul      | 2161              |
| afr      | xho      | 1363              |
| eng      | xho      | 1354              |
| tso      | xho      | 1485              |
| sep      | ssw      | 219               |
| nbl      | tso      | 215               |
| tsn      | tso      | 1570              |
| nso      | zul      | 1247              |
| nbl      | tsn      | 140               |
| eng      | sep      | 276               |
| afr      | sep      | 394               |
| ssw      | ven      | 217               |
| sep      | ven      | 1140              |
| afr      | nso      | 962               |
| eng      | nso      | 1721              |
| nbl      | nso      | 151               |
| nbl      | sep      | 843               |
| nso      | ven      | 262               |

The dataset is present in several forms in the repo. Generally, the dataset is split by edition, e.g. `2020-01-ed1`.

The data directory is broken down as follows:

```
./data
├── external               # Data external to this repo
├── interim                # Intermediate data (purpose not clearly documented; appears to sit between raw and processed)
├── processed              # The data from scraping the raw PDFs
├── raw                    # The raw PDFs of the Vuk'uzenzele magazine
├── sentence_align_output  # The output (csv) of the sentence alignment with LASER language encoders
└── simple_align_output    # The output (csv) of a simple one-to-one sentence alignment
```

The dataset is split by edition in the [data/processed](data/processed/) folder.

Authors
-------

- Vukosi Marivate - [@vukosi](https://twitter.com/vukosi)
- Andani Madodonga
- Daniel Njini
- Richard Lastrucci

Citation
--------

Vukosi Marivate, Andani Madodonga, Daniel Njini, Richard Lastrucci, Isheanesu Dzingirai. **The Vuk'uzenzele South African Multilingual Corpus**, 2023

> @inproceedings{lastrucci-etal-2023-preparing,
>   title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
>   author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
>   booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
>   month = may,
>   year = "2023",
>   address = "Dubrovnik, Croatia",
>   publisher = "Association for Computational Linguistics",
>   url = "https://aclanthology.org/2023.rail-1.3",
>   pages = "18--25"
> }

> @dataset{marivate_vukosi_2023_7598540,
>   author = {Marivate, Vukosi and Njini, Daniel and Madodonga, Andani and Lastrucci, Richard and Dzingirai, Isheanesu},
>   title = {The Vuk'uzenzele South African Multilingual Corpus},
>   month = feb,
>   year = 2023,
>   publisher = {Zenodo},
>   doi = {10.5281/zenodo.7598539},
>   url = {https://doi.org/10.5281/zenodo.7598539}
> }

Licences
-------

* Licence for Data - [CC 4.0 BY SA](LICENSE.data.md)
* Licence for Code - [MIT License](LICENSE.md)
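
The aligned-pair counts reported in this README use a cosine similarity cutoff of 0.65 on sentence embeddings. The following is a minimal Python sketch of that kind of filtering, not the repository's own code. The function names, the tuple layout of `pairs`, and the toy vectors are assumptions made for illustration; in practice the vectors would come from a sentence encoder such as LASER, and the real CSV columns in `sentence_align_output` may differ.

```python
import math

# Cutoff quoted in the README heading ("Cosine Similarity Score >= 0.65").
THRESHOLD = 0.65


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def filter_aligned_pairs(pairs, threshold=THRESHOLD):
    """Keep candidate pairs whose embeddings score at or above the threshold.

    `pairs` is a hypothetical iterable of
    (src_sentence, trg_sentence, src_vector, trg_vector) tuples.
    Returns a list of (src_sentence, trg_sentence, score) tuples.
    """
    kept = []
    for src, trg, src_vec, trg_vec in pairs:
        score = cosine_similarity(src_vec, trg_vec)
        if score >= threshold:
            kept.append((src, trg, score))
    return kept


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real sentence embeddings.
    candidates = [
        ("src sentence 1", "trg sentence 1", [1.0, 0.0], [1.0, 0.1]),
        ("src sentence 2", "trg sentence 2", [1.0, 0.0], [0.0, 1.0]),
    ]
    for src, trg, score in filter_aligned_pairs(candidates):
        print(f"{src} <-> {trg}: {score:.3f}")
```

Counting the tuples returned for each language pair would give a figure comparable to `num_aligned_pairs`, assuming the same encoder and candidate-generation step as the original pipeline.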

Keywords

natural language processing

  • Impact by BIP!: selected citations 0; popularity Average; influence Average; impulse Average
  • Usage by OpenAIRE UsageCounts: 34 views; 5 downloads