ZENODO
Dataset . 2025
License: CC BY
Data sources: Datacite

MAPE: A Dataset of Correspondence from the Portuguese Empire

Authors: Błoch, Agata; Vasques Filho, Demival; Bojanowski, Michał; Santana, Clodomir; Hussain, Saddam;

Abstract

We present the MAPE dataset (Mapping the Atlantic Portuguese Empire), a large-scale historical resource curated from archival material. The dataset is made available in different versions, together with its detailed description:

  • MAPE Dataset: Raw Archival Materials (version 1)
  • MAPE Dataset: Bilingual Version (Portuguese-English) (version 2)
  • MAPE Dataset: Bilingual Version with Senders and Recipients (version 3)
  • MAPE Dataset: Bilingual Version with Senders and Recipients with assigned topics (version 4)

The MAPE dataset comprises 182,491 historical correspondence records from the Arquivo Histórico Ultramarino de Lisboa (Portuguese Overseas Archives of Lisbon, hereafter AHU), in particular from the collection of the Conselho Ultramarino (Overseas Council), covering the period from 1581 to 1859. The AHU holds an extensive archive of correspondence covering the administrative, diplomatic and commercial activities of the Portuguese Empire. The records of the Conselho Ultramarino, a body created in 1642, represent formal bureaucratic communication between Lisbon and its overseas dominions and cover topics such as colonial administration, trade, diplomacy and social developments.

Originally, these materials were available only as unstructured PDF files, which posed a major challenge for data analysis and large-scale retrieval. The PDF documents contained not only the core correspondence registers but also a variety of non-essential metadata, such as cataloging details, pagination markers, section headings, and record summaries. This mixing of primary records with additional metadata hindered effective content analysis, searching and visualization. To overcome these challenges, we converted the PDFs into a structured format (CSV) that isolates the main data elements, improving searchability, navigation and analytical potential. This restructuring allows researchers to work directly with the primary correspondence records without the noise of the surrounding archival metadata.
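As a minimal sketch of how the structured records might be read, the snippet below parses rows that follow the column names listed in the version 1 description (doc_id, doc_source, ..., Doc_Text). The sample rows, the example.org links, the reference codes and the helper names `to_int_or_none`, `parse_record` and `load_records` are illustrative assumptions, not taken from the dataset. It also assumes that missing year, month or day values appear as empty cells.

```python
import csv
import io
from typing import Optional

# Hypothetical sample in the version 1 column layout; all values are invented.
SAMPLE_CSV = """doc_id,doc_source,doc_box,doc_number,doc_type,year,month,day,reference_code,doc_link,Doc_Text
1,BAHIA,Cx.1,00001,CARTA,1690,3,12,REF-1,https://example.org/1,Texto de exemplo
2,ALAGOAS,Cx.1,00002,CONSULTA,1701,,,REF-2,https://example.org/2,Outro exemplo
"""


def to_int_or_none(value: str) -> Optional[int]:
    """Convert a cell to int, mapping blank cells to None."""
    value = value.strip()
    return int(value) if value else None


def parse_record(row: dict) -> dict:
    """Cast the numeric columns; keep the others as strings.

    doc_number stays a string so that zero-padding (e.g. 00001) is preserved.
    """
    record = dict(row)
    record["doc_id"] = int(row["doc_id"])
    for field in ("year", "month", "day"):
        record[field] = to_int_or_none(row[field])
    return record


def load_records(handle) -> list:
    """Read every row of an open CSV handle into a list of typed records."""
    return [parse_record(row) for row in csv.DictReader(handle)]


records = load_records(io.StringIO(SAMPLE_CSV))

# Example query: letters (CARTA) dated in the 1690s.
cartas_1690s = [
    r for r in records
    if r["doc_type"] == "CARTA" and r["year"] is not None and 1690 <= r["year"] <= 1699
]
```

With the real file, `load_records` would be given a handle from `open(path, newline="", encoding="utf-8")`, where the path and the encoding are again assumptions to check against the published file.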
The data for this study were obtained from the Arquivo Histórico Ultramarino, where historical documents were preserved primarily in unstructured formats, predominantly as PDF files. These documents were publicly available for free download at https://actd.iict.pt/collection/actd:CU. The collections were originally divided into large sections such as Portugal, Africa, Brazil, etc. Within each section there are further subdivisions corresponding to the different colonies of the Portuguese empire at that time. The correspondence register of each colony is stored in individual PDF files, which are organized chronologically. However, these files also contain extraneous metadata such as headings, page numbers, cataloging details and document summaries, which make it difficult to extract the relevant content. As a rule, a header is followed by a short summary of the correspondence, and each document is accompanied by details of its archival source.

Our research focuses primarily on specific collections within the AHU, arranged chronologically and geographically, which include the following:

Africa:
  • The Angola Collection (Série Angola), whose cataloging was financially supported by the Portuguese Fundação para a Ciência e Tecnologia as part of the project África Atlântica: da documentação ao conhecimento, sécs. XVII-XIX (Atlantic Africa: from documentation to knowledge, seventeenth to nineteenth centuries).
  • The Cabo Verde and Guinea Collection (Série Cabo Verde, Série Guiné), which was cataloged as part of two separate projects: the aforementioned África Atlântica and the Resgate do acervo histórico de Cabo Verde em Portugal (Rescue of the historical collection of Cape Verde in Portugal), funded by Camões, Instituto da Cooperação e da Língua (ICL).
  • The São Tomé Collection (Série S. Tomé e Príncipe), also cataloged within the África Atlântica project.

Brazil:
  • The "Barão do Rio Branco" Historical Documentation Rescue Project, known as Projeto Resgate (Bertoletti et al. 2022; Boschi 2018), includes 26 catalogues of documents referring to Brazilian regions, cataloged at different times and by different researchers. The Projeto Resgate collection is currently managed by the National Library of Rio de Janeiro in Brazil, but is housed in the AHU.

Portugal: Madeira-CA and Madeira.

Rio da Prata: Nova Colónia do Sacramento, Montevideu, Buenos Aires, Paraguai

Oriente

Macau

Timor

1. MAPE Dataset: Raw Archival Materials (version 1)

| Column | Type | Description |
|---|---|---|
| doc_id | Integer | Unique identifier for each record. |
| doc_source | String | Archival origin (e.g. ALAGOAS, BAHIA, Cabo Verde). |
| doc_box | String | Physical box code within the archive (e.g. Cx.1). |
| doc_number | String | Document number within the box (zero-padded, e.g. 00001). |
| doc_type | String | Type of register (e.g. INFORMACAO, CONSULTA, CARTA, PROPOSTA, REQUERIMENTO, PARECER). |
| year | Integer | Four-digit year of the correspondence (e.g. 1690). |
| month | Integer | Month of the register (1–12). Blank if not recorded in the original. |
| day | Integer | Day of the month (1–31). Blank if not recorded. |
| reference_code | String | Unique identifier for each record. |
| doc_link | URL | Direct link to the AHU catalog entry for the document. |
| Doc_Text | String | Original Portuguese summary of the correspondence, as transcribed from the archival register. |

The MAPE dataset is provided as a single CSV file in the repository root. It consolidates all correspondence registers extracted from the AHU PDFs into a uniform tabular structure.

2. MAPE Dataset: Bilingual Version (Portuguese-English) (version 2)

This is an updated version of MAPE Dataset: Raw Archival Materials (version 1).

Multilingual Adaptation of Consolidated Data Files

The consolidated dataset originally contained correspondence in Portuguese, which was a significant barrier for a global audience.
To overcome this limitation, we translated the original content into English using Google Gemini 1.5 Flash, a lightweight transformer-based model optimized for multilingual text processing and translation. Google Gemini 1.5 Flash supports over 100 languages and is designed to strike a balance between speed, computational efficiency and high-quality text generation. With a context window of up to 1 million tokens, it can process large volumes of text in a single prompt and is therefore well suited to the translation of historical documents. As our dataset consists of colonial-era correspondence, it was important to maintain historical accuracy and linguistic integrity. To achieve this, we carefully crafted the following translation prompt:

Prompt: "You are a skilled historical linguist and translator with deep knowledge of both colonial-era Portuguese and archaic/historical English usage. Your task is to translate the following Portuguese text into an English style that reflects the era in which it was originally written. Please: Maintain the historical tone. Avoid modern terms and slang. Capture the nuanced formality of the original text."

The translated dataset is structured in the same format as the original. It is intended to preserve linguistic and historical authenticity while making the correspondence accessible to a wider audience.

3. MAPE Dataset: Bilingual Version with Senders and Recipients (version 3)

Version 3 is an update of version 2 that adds the extracted senders and recipients.

4. MAPE Dataset: Bilingual Version with Senders and Recipients with assigned topics (version 4)

This version enriches the previous bilingual versions of the dataset by adding automatically assigned topics to each correspondence record. Topics were generated through a multi-stage framework combining large language models, multilingual embeddings, and clustering, producing both concise document-level tags and macro-level thematic groups. At the highest level, the corpus is organized into seven thematic clusters, providing a scalable structure for thematic exploration and navigation across the 182,000+ records of the Portuguese Overseas Historical Archive.

| Topic | Number of records | % of the corpus |
|---|---|---|
| Colonial Administration, Trade & Revenue | 32,298 | 19.2% |
| Petitions, Appointments & Royal Permissions | 30,881 | 18.4% |
| Military Personnel, Discipline & Logistics | 28,617 | 17.0% |
| Maritime Trade, Naval Operations & Logistics | 27,976 | 16.6% |
| Civil Administration, Justice & Royal Governance | 24,764 | 14.7% |
| Land Grants, Passports & Travel | 13,917 | 8.3% |
| Military Appointments, Ranks & Confirmations | 9,592 | 5.7% |

Observations: In the next step, the raw recipient data should be refined. In some cases, particularly for "passports", the recipient fields may contain errors.
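The figures in the topic table can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts and percentages reported in the table and the corpus size of 182,491 stated in the abstract; the variable and function names are illustrative. The seven counts sum to 168,045, and the reported percentages match shares of that sum rather than shares of the full corpus. The description does not explain the difference, so this is an arithmetic observation only, not a statement about how the topics were assigned.

```python
# Counts and percentages as reported in the topic table.
TOPIC_COUNTS = {
    "Colonial Administration, Trade & Revenue": 32298,
    "Petitions, Appointments & Royal Permissions": 30881,
    "Military Personnel, Discipline & Logistics": 28617,
    "Maritime Trade, Naval Operations & Logistics": 27976,
    "Civil Administration, Justice & Royal Governance": 24764,
    "Land Grants, Passports & Travel": 13917,
    "Military Appointments, Ranks & Confirmations": 9592,
}
REPORTED_PERCENT = {
    "Colonial Administration, Trade & Revenue": 19.2,
    "Petitions, Appointments & Royal Permissions": 18.4,
    "Military Personnel, Discipline & Logistics": 17.0,
    "Maritime Trade, Naval Operations & Logistics": 16.6,
    "Civil Administration, Justice & Royal Governance": 14.7,
    "Land Grants, Passports & Travel": 8.3,
    "Military Appointments, Ranks & Confirmations": 5.7,
}
CORPUS_SIZE = 182491  # total number of records stated in the abstract


def share(count: int, total: int) -> float:
    """Percentage share of count in total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Total number of records covered by the seven clusters.
assigned_total = sum(TOPIC_COUNTS.values())

# Shares relative to the table total and relative to the full corpus.
shares_of_assigned = {t: share(c, assigned_total) for t, c in TOPIC_COUNTS.items()}
shares_of_corpus = {t: share(c, CORPUS_SIZE) for t, c in TOPIC_COUNTS.items()}

# True when every reported percentage matches the share of the table total.
matches_assigned = all(
    abs(shares_of_assigned[t] - REPORTED_PERCENT[t]) < 0.05 for t in TOPIC_COUNTS
)

print(f"records in table: {assigned_total} of {CORPUS_SIZE}")
print(f"reported percentages match table total: {matches_assigned}")
```

Users who need shares of the full corpus can read them from `shares_of_corpus` instead.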
