
handle: 10486/718859
We present the process of compiling a parallel corpus from financial reports in Spanish and their English translations, downloaded from the websites of the IBEX-35 companies. Our aim is to create a segmented, aligned bilingual corpus for linguistic and translation studies and for building linguistic resources for AI. Extracting and structuring the information poses the biggest challenge when compiling a corpus from PDF documents, as the information is presented in several columns with a non-linear organisation, which hinders automated text extraction. We showcase our method for extracting the narrative elements, the subsequent cleaning of the text, and the alignment of the paragraphs in Spanish and English. The result is a CSV file containing both languages. We used 15 bilingual reports, resulting in 1,678,426 words in Spanish and 1,452,636 words in English, and 56,170 segments in Spanish and 56,813 segments in English.
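As a rough illustration of the cleaning and CSV-export steps described above, the following minimal Python sketch pairs paragraphs that are assumed to be already aligned one-to-one and writes them to a two-column CSV. It is not the authors' pipeline: the function names `clean_paragraph` and `write_aligned_csv`, the column headers `es` and `en`, and the sample sentences are all hypothetical. The reported segment counts differ between the two languages, so the real alignment is evidently not a simple index-based pairing.

```python
import csv
import io
import re


def clean_paragraph(text):
    """Rejoin words split by a hyphen at a line break, then collapse whitespace.

    Assumption: any hyphen directly before a newline is a line-break hyphen;
    this would wrongly merge genuine hyphenated compounds split at that point.
    """
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def write_aligned_csv(es_paragraphs, en_paragraphs, out):
    """Write one CSV row per paragraph pair (hypothetical 1:1 alignment)."""
    if len(es_paragraphs) != len(en_paragraphs):
        raise ValueError("paragraph lists must have the same length")
    writer = csv.writer(out)
    writer.writerow(["es", "en"])  # hypothetical column names
    for es, en in zip(es_paragraphs, en_paragraphs):
        writer.writerow([clean_paragraph(es), clean_paragraph(en)])


# Invented sample data, for illustration only.
buffer = io.StringIO()
write_aligned_csv(
    ["El beneficio neto\naumentó un 5 %.", "La deuda se redu-\njo."],
    ["Net profit\nrose by 5%.", "Debt fell."],
    buffer,
)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch self-contained; passing an open file handle instead would produce the CSV on disk.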
This publication is part of the project "Computational linguistic methods for the readability and simplification of financial narratives", CLARA-FINT (PID2020-116001RB-C31), funded by the Spanish Ministry of Science and Innovation and the State Research Agency.
The dataset that supports the findings of this study is archived in the Universidad Autónoma de Madrid data repository e-cienciaDatos at https://doi.org/10.21950/85MWYP
Computer science, bilingual corpus, parallel corpus, compilation, financial domain
