Compilation and Exploitation of Parallel Corpora

Publication: Article, 01 Jan 2003, English. Publisher: Faculty of Electrical Engineering and Computing, Univ. of Zagreb. Journal: Journal of Computing and Information Technology, volume 11, page 93 (ISSN: 1330-1136, eISSN: 1846-3908)

Authors: Tomaž Erjavec

doi: 10.2498/cit.2003.02.02


Abstract

With more and more text being available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, for translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and wide availability of the produced resources. We explain the corpus annotation chain, which involves normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two exploitation results over our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.
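To make the idea of a concordancer over a sentence-aligned corpus concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names `tokenise` and `concordance`, the naive regular-expression tokeniser, and the three hand-aligned English/Slovene sentence pairs are all assumptions made up for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical toy corpus: (source sentence, aligned translation) pairs.
# Real corpora would come out of the segmentation and alignment steps.
PAIRS: List[Tuple[str, str]] = [
    ("The dog sleeps.", "Pes spi."),
    ("The cat sleeps.", "Mačka spi."),
    ("The dog barks.", "Pes laja."),
]


def tokenise(text: str) -> List[str]:
    """Split text into lower-cased word tokens (a deliberately naive stand-in)."""
    return re.findall(r"\w+", text.lower())


def concordance(pairs: List[Tuple[str, str]], query: str) -> List[Tuple[str, str]]:
    """Return every aligned pair whose source sentence contains the query word."""
    q = query.lower()
    return [(src, tgt) for src, tgt in pairs if q in tokenise(src)]


if __name__ == "__main__":
    # Show each hit together with its aligned translation.
    for src, tgt in concordance(PAIRS, "dog"):
        print(f"{src}\t{tgt}")
```

Searching for a source word returns each matching sentence together with its translation, which is the basic service a bilingual concordancer offers to translators and learners; a real system would additionally search lemmas and tags produced by the annotation chain.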

Related Organizations

Jožef Stefan Institute
Slovenia
