Publication . Conference object . Article . 2012

Beyond SoNaR: towards the facilitation of large corpus building efforts

Reynaert, M.; Schuurman, A.K.; Hoste, V.; Oostdijk, N.H.J.; Gompel, M. van
Open Access
Published: 01 Jan 2012
Publisher: Paris
In this paper we report on the experience gained in the recent construction of the SoNaR corpus, a 500-million-word (500 MW) reference corpus of contemporary written Dutch. The paper shows what can realistically be done within the confines of a project setting, with its limits on both duration and budget, when employing current state-of-the-art tools, standards and best practices. In doing so we aim to pass on insights that may benefit anyone considering building a large, varied yet balanced corpus for use by the wider research community. We discuss various issues that come into play when compiling a large corpus, including approaches to acquiring texts, the arrangement of IPR, the choice of text formats, and the steps to be taken in preprocessing data from widely different origins. We describe FoLiA, a new XML format geared towards rich linguistic annotation. We also explain the rationale behind the investment in the high-quality semi-automatic enrichment of a relatively small (1 MW) subset with very rich syntactic and semantic annotations. Finally, we present some ideas about future developments and the direction corpus development may take, such as setting up an integrated workflow between web services and the potential role of ISOcat. We list tips for potential corpus builders, tricks they may want to try, and further recommendations regarding technical developments that future corpus builders may hope for.

Is part of: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), vol. 8, pages 2897-2904
Conference: International Conference on Language Resources and Evaluation (LREC)
Location: Istanbul (Turkey)
Date: 21 May - 27 May 2012
Status: published
Subjects by Vocabulary

ACM Computing Classification System: General Literature, Reference (e.g., dictionaries, encyclopedias, glossaries)


NLP, corpus annotation

Providers: Lirias