Semi-Automatic Parallel Corpora Extraction from Comparable News Corpora

Publication: Article, Other literature type
Date: 30 Jun 2010
Publisher: Centro de Innovacion y Desarrollo Tecnologico en Computo
Journal: Polibits, volume 41, pages 11-17 (ISSN: 1870-9044, eISSN: 2395-8618)

Authors: Thoudam Doren Singh; Sivaji Bandyopadhyay

doi: 10.17562/pb-41-2


Abstract

A parallel corpus is a necessary resource in many multilingual and cross-lingual natural language processing applications, including Machine Translation and Cross-Lingual Information Retrieval. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, a technique has been developed that extracts a parallel corpus between Manipuri, a morphologically rich and resource-constrained Indian language, and English from comparable news corpora collected from the web. A medium-sized Manipuri–English bilingual lexicon and a separate list of Manipuri–English transliterated entities have been developed and used in this work. Using morphological information for the agglutinative and inflective Manipuri language further improves the alignment quality based on the similarity measure. A high level of performance is desirable since errors in sentence alignment cause further errors in systems that use the aligned text. The system has been evaluated and an error analysis has been carried out. The technique proves effective for the Manipuri–English language pair and is extendable to other resource-constrained, agglutinative and inflective Indian languages.
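To make the idea in the abstract concrete, the following is a minimal Python sketch of lexicon-based sentence similarity with suffix stripping. It is not the authors' implementation: the abstract does not specify the similarity measure, the morphological analysis, or any threshold, so the overlap score, the `strip_suffixes` helper, the `best_alignment` function, and the threshold of 0.5 are all assumptions made for illustration. The lexicon entries, transliteration entries, and suffixes are invented placeholders, not real Manipuri data.

```python
from typing import Dict, Iterable, List, Optional, Set

# Invented placeholder resources (NOT real Manipuri data).
LEXICON: Dict[str, Set[str]] = {
    "srcword1": {"minister"},
    "srcword2": {"visit", "visited"},
}
TRANSLITERATIONS: Dict[str, str] = {"srcname1": "imphal"}
SUFFIXES = ("_sfxa", "_sfxb")  # hypothetical suffix markers


def strip_suffixes(token: str, suffixes: Iterable[str] = SUFFIXES) -> str:
    """Repeatedly strip known suffixes to approximate a stem (assumed heuristic)."""
    suffixes = tuple(suffixes)
    changed = True
    while changed:
        changed = False
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) > len(suffix):
                token = token[: -len(suffix)]
                changed = True
    return token


def translate(token: str) -> Set[str]:
    """Collect target-side candidates for the surface form and its stem."""
    candidates: Set[str] = set()
    for form in (token, strip_suffixes(token)):
        candidates |= LEXICON.get(form, set())
        if form in TRANSLITERATIONS:
            candidates.add(TRANSLITERATIONS[form])
    return candidates


def similarity(src_tokens: List[str], tgt_tokens: List[str]) -> float:
    """Share of source tokens with a candidate in the target sentence,
    normalised by the longer sentence length (an assumed measure)."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    target = {t.lower() for t in tgt_tokens}
    matched = sum(1 for tok in src_tokens if translate(tok) & target)
    return matched / max(len(src_tokens), len(tgt_tokens))


def best_alignment(
    src_tokens: List[str],
    candidates: List[List[str]],
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the index of the best-scoring candidate, or None if no
    candidate reaches the (assumed) threshold."""
    scores = [similarity(src_tokens, cand) for cand in candidates]
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None
```

In this sketch, stripping suffixes before the lexicon lookup lets an inflected source token match an entry stored under its stem, which is one plausible way morphological information could raise the similarity score for true translations. Source sentences for which `best_alignment` returns `None` could be set aside for manual checking.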

Related Organizations

Jadavpur University
India


Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering

