5 Research products, page 1 of 1

  • Open Access English
    Authors: 
    Thierry Chanier; Céline Poudat; Benoît Sagot; Georges Antoniadis; Ciara Wigham; Linda Hriba; Julien Longhi; Djamé Seddah;
    Publisher: HAL CCSD
    Country: France

    Final version for the Special Issue of the Journal of Language Technology and Computational Linguistics (JLCL, http://jlcl.org/): BUILDING AND ANNOTATING CORPORA OF COMPUTER-MEDIATED DISCOURSE: Issues and Challenges at the Interface of Corpus and Computational Linguistics (ed. by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer & Henk van den Heuvel); International audience; The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, as well as mono- and multimodal, synchronous and asynchronous communications. Corpora are assembled in a standard way, using the TEI (Text Encoding Initiative) format. This implies extending, through a European endeavor, the TEI model of text, in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to decisions made concerning which annotations to make for which units and describe the processing pipeline for adding these. All CoMeRe corpora are verified through a staged quality control process, designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus, and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions made concerning the OpenData perspective.

  • English
    Authors: 
    Beisswenger, Michael; Chanier, Thierry; Ehrhardt, Eric; Herold, Axel; Lüngen, Harald; Poudat, Céline; Storrer, Angelika;
    Publisher: HAL CCSD
    Country: France

    International audience; The panel presents results and ongoing work from corpus projects in which TEI-P5 has been adopted for the representation and linguistic annotation of genres of social media and computer-mediated communication (CMC). It relates to the work of the TEI-SIG “computer-mediated communication” which is developing TEI models for the representation of CMC genres and testing these models for a broad range of genres (ranging from “text-only” genres such as chat and SMS to multimodal genres such as learning environments and Second Life) and in corpus building initiatives for various European languages. The goal of the panel is to give an overview of models and practices in representing CMC in TEI on the example of German and French CMC corpora. A documentation and ODD files of the schemas developed by the group will be made available in the TEI wiki and be announced via the TEI mailing list before the conference so that everybody who is interested in participating in the discussion can examine the CMC models in advance. The discussion in the panel shall serve as an opportunity for collecting feedback on these models and schema drafts from a broader community within the TEI who is interested in adapting TEI-P5 for the representation of new (digital) genres. This feedback will be taken into consideration when revising the models and – as a next step after the conference – preparing feature requests for adapting the TEI for CMC.

  • Open Access French
    Authors: 
    Chanier, Thierry;
    Publisher: HAL CCSD
    Country: France

    Invited talk; see the video of the presentation: http://videocampus.univ-bpclermont.fr/?v=SXX3gjZtTjYZ, starting at 00:17:46; The academic world produces data of various kinds. Opening and sharing each type of data raises specific issues. This variety is explained first of all by the particular circumstances that governed the creation of the data. But the stakes of their use, by university communities, research communities or society in general, also differ for each type of data. We will briefly mention a first type of data, educational data, in connection with the open-access movement known as Open Educational Resources (OER). The second type of data, this time part of the output of research, concerns publications. In order to distinguish them better from the last type of data, our talk will briefly recall the particular constraints that motivated the development of open access to publications, the different routes followed, and the current state of affairs after more than 10 years of existence. Most of our talk will be devoted to the sharing of research data, which may or may not be linked to publications. We will describe the motivations of this OpenData movement, what is at stake for researchers, and the particular conditions of availability that these data must meet in order to be truly OpenData. Finally, we will discuss the profound transformations of the researcher's profession that may result, drawing on examples taken mainly from the humanities.

  • Open Access English
    Authors: 
    Thierry Chanier; Ciara R. Wigham;
    Publisher: HAL CCSD
    Country: France

    International audience; This chapter gives an overview of one possible staged methodology for structuring LCI data by presenting a new scientific object, LEarning and TEaching Corpora (LETEC). Firstly, the chapter clarifies the notion of corpora, used in so many different ways in language studies, and underlines how corpora differ from raw language data. Secondly, using examples taken from actual online learning situations, the chapter illustrates the methodology that is used to collect, transform and organize data from online learning situations in order to make them sharable through open-access repositories. The ethics and rights for releasing a corpus as OpenData are discussed. Thirdly, the authors suggest how the transcription of interactions may become more systematic, and what benefits may be expected from analysis tools, before opening the CALL research perspective applied to LCI towards its applications to teacher-training in Computer-Mediated Communication (CMC), and the common interests the CALL field shares with researchers in the field of Corpus Linguistics working on CMC.

  • Publication . Conference object . Other literature type . 2019
    Open Access English
    Authors: 
    Beißwenger, Michael; Lüngen, Harald; Herzberg, Laura; Wigham, Ciara R.;
    Publisher: HAL CCSD
    Country: France

    International audience
