publication . Doctoral thesis . 2015

Ad hoc and general-purpose corpus construction from web sources

Barbaresi, Adrien
  • Published: 19 Jun 2015
  • Publisher: HAL CCSD
  • Country: France
The first chapter begins by introducing the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics. The notion of corpus is then put into focus, and existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are laid out. In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of t...
free text keywords: web corpus construction, corpus linguistics, construction de corpus web, linguistique de corpus, web crawling, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.1: Content Analysis and Indexing/H.3.1.3: Linguistic processing, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.6: Library Automation/H.3.6.0: Large text archives, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.7: Digital Libraries/H.3.7.0: Collection, ACM: I.: Computing Methodologies/I.7: DOCUMENT AND TEXT PROCESSING/I.7.5: Document Capture/I.7.5.0: Document analysis, ACM: H.: Information Systems/H.3: INFORMATION STORAGE AND RETRIEVAL/H.3.5: Online Information Services/H.3.5.1: Data sharing, ACM: I.: Computing Methodologies/I.7: DOCUMENT AND TEXT PROCESSING/I.7.2: Document Preparation/I.7.2.9: Standards, [SHS.LANGUE]Humanities and Social Sciences/Linguistics, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Common Language Resources and Technology Infrastructure
  • Funder: European Commission (EC)
  • Project Code: 212230
  • Funding stream: FP7 | SP4 | INFRA
