doi: 10.18653/v1/w15-3402
handle: 2117/76611
Multiple approaches for gathering comparable data from the Web have been developed to date. Nevertheless, producing a high-quality comparable corpus on a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. To demonstrate the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English–Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, we show that these corpora can help when translating out-of-domain texts.
Peer Reviewed
comparable corpora, parallel corpora, translation, multilingual, Wikipedia, Computational linguistics, Computational linguistics -- Methodology (Lingüística computacional -- Metodologia), UPC subject areas::Computer science (Àrees temàtiques de la UPC::Informàtica)
| Indicator | Description | Value |
| --- | --- | --- |
| selected citations | These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 11 |
| popularity | This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% |
| influence | This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% |
| impulse | This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
| views | | 98 |
| downloads | | 95 |

Views provided by UsageCounts
Downloads provided by UsageCounts