publication . Doctoral thesis . 2015

Ad hoc and general-purpose corpus construction from web sources

Barbaresi, Adrien
  • Published: 19 Jun 2015
  • Publisher: HAL CCSD
  • Country: France
The first chapter begins by introducing the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics. Then, the notion of corpus is put into focus. Existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are outlined. In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of t...
free text keywords: web corpus construction, corpus linguistics, construction de corpus web, linguistique de corpus, web crawling, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.1: Content Analysis and Indexing/H.3.1.3: Linguistic processing, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.6: Library Automation/H.3.6.0: Large text archives, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.7: Digital Libraries/H.3.7.0: Collection, ACM: I.: Computing Methodologies/I.7: DOCUMENT AND TEXT PROCESSING/I.7.5: Document Capture/I.7.5.0: Document analysis, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.5: Online Information Services/H.3.5.1: Data sharing, ACM: I.: Computing Methodologies/I.7: DOCUMENT AND TEXT PROCESSING/I.7.2: Document Preparation/I.7.2.9: Standards, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Funded by
Common Language Resources and Technology Infrastructure
  • Funder: European Commission (EC)
  • Project Code: 212230
  • Funding stream: FP7 | SP4 | INFRA