ZENODO
Audiovisual . 2022
License: CC BY
Data sources: Datacite

PxCorpus: A Spoken Drug Prescription Dataset in French for Spoken Language Understanding

Authors: Kocabiyikoglu, Ali Can; Portet, François; Gibert, Prudence; Blanchon, Hervé; Babouchkine, Jean-Marc; Gavazzi, Gaëtan


Abstract

PxSLU: A Spoken Drug Prescription Dataset in French for Spoken Language Understanding

PxSLU is, to the best of our knowledge, the first corpus of spoken medical drug prescriptions to be distributed. It contains 4 hours of transcribed and annotated dialogues of drug prescriptions in French, acquired through an experiment with 55 participants, both experts and non-experts in drug prescription. The automatic transcriptions were verified by humans and aligned with semantic labels to allow training of NLP models. The data acquisition protocol was reviewed by medical experts and permits free distribution without breach of privacy or regulation.

Overview of the Corpus

The experiment was conducted in the wild, with naive participants and medical experts. In total, the dataset includes 1981 recordings from 55 participants (38% non-experts, 25% doctors, 36% medical practitioners), manually transcribed and semantically annotated.

    Category          Sessions  Recordings  Time (min)
    Medical experts        258         434       94.83
    Doctors                230         570      105.21
    Non-experts            415         977       62.13
    Total                  903        1981      262.27

License

We hope that the community will be able to benefit from the dataset, which is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

How to cite this corpus

If you use the corpus or need more details, please refer to the following paper: A Spoken Drug Prescription Dataset in French for Spoken Language Understanding

    @InProceedings{Kocabiyikoglu2022,
      author = "Alican Kocabiyikoglu and Fran{\c c}ois Portet and Prudence Gibert and Hervé Blanchon and Jean-Marc Babouchkine and Gaëtan Gavazzi",
      title = "A Spoken Drug Prescription Dataset in French for Spoken Language Understanding",
      booktitle = "13th Language Resources and Evaluation Conference (LREC 2022)",
      year = "2022",
      location = "Marseille, France"
    }

Project Structure

The project contains the following elements:

.
├── LICENSE
├── metadata.txt
├── paths.txt
├── PxSLU_conll.txt
├── readme.md
├── recordings
├── seq.in
├── seq.label
├── seq.out
├── Demo.ipynb
└── verifications.py

Recordings

The recordings directory contains the 903 recording sessions. Each session can contain several recordings. For instance, the directory recordings/J7aVvWb67L contains the files recording_0.wav and recording_2.wav, which represent two attempts to record a drug prescription. All recordings are stored as mono-channel WAV files (16 kHz, 16-bit signed PCM).

Paths

paths.txt contains the list of all the WAV files in the recordings directory:

    00MYcyVK0t/recording_0.wav
    00MYcyVK0t/recording_2.wav
    02Qp6ICj9Q/recording_0.wav
    02Qp6ICj9Q/recording_1.wav
    ...

All other files (metadata.txt, seq.*) refer to this list to describe the recordings.

Metadata

metadata.txt contains the information about the participants:

    48,60+,F,non-expert
    48,60+,F,non-expert
    24,18–28,F,doctor
    24,18–28,F,doctor
    ...

The first column is the participant's unique id, the second is the age range, the third is the gender, and the last is the participant's category, one of {doctor, expert, non-expert}. doctor corresponds to a physician, expert to another kind of expert (a pharmacist or a biologist specialized in drugs), and non-expert to anyone who does not fall into these categories. The lines are synchronised with the lines of paths.txt.

Labels

The three files seq.label, seq.in and seq.out contain, respectively, the intent, the transcript and the entities in BIO format.

    seq.label             seq.in                                 seq.out
    medical_prescription  flagyl 500 milligrammes euh qu/ en...  B-drug B-d_dos_val B-d_dos_up O O ...
    medical_prescription  3 comprimés par jour matin midi ...    B-dos_val B-dos_uf O O B-rhythm_tdte B-rhythm_tdte O B-rhythm_tdte ...
    ...                   ...                                    ...

These lines are synchronised with the lines of paths.txt.

Another file, PxSLU_conll.txt, is provided in a format inspired by the CoNLL format (https://universaldependencies.org/format.html). However, this file is *not* aligned with the list of recordings in paths.txt.
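Because paths.txt, metadata.txt and the seq.* files are line-aligned, one record per recording can be assembled by reading them in parallel. The following is a minimal sketch, not the dataset's own tooling: the function names (read_lines, load_corpus, bio_spans) are hypothetical, and it assumes that the files are UTF-8 encoded, that seq.in and seq.out are whitespace-tokenized with exactly one tag per token, and that each metadata line has the four comma-separated fields described above.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def read_lines(path: Path) -> List[str]:
    """Read a text file line by line, stripping only the line terminator."""
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def load_corpus(root: Path) -> List[Dict[str, object]]:
    """Zip the line-aligned files into one dict per recording.

    Assumption: every file has the same number of lines and line i of
    each file describes the same recording.
    """
    names = ["paths.txt", "metadata.txt", "seq.label", "seq.in", "seq.out"]
    columns = [read_lines(root / name) for name in names]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("parallel files differ in length")

    records = []
    for wav, meta, intent, text, tags in zip(*columns):
        # Assumed field order: participant id, age range, gender, category.
        participant, age_range, gender, category = meta.split(",", 3)
        records.append({
            "wav": wav,
            "participant": participant,
            "age_range": age_range,
            "gender": gender,
            "category": category,
            "intent": intent,
            "tokens": text.split(),
            "tags": tags.split(),
        })
    return records


def bio_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO tags into (label, text) spans.

    An I- tag that does not continue the current span is treated like O:
    it closes the open span and its token is not attached to any span.
    """
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans
```

With the corpus unpacked in a local directory, load_corpus(Path("path/to/corpus")) would return one dict per line of paths.txt, and bio_spans(record["tokens"], record["tags"]) would list the entities of a given utterance.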
Scripts

verifications.py checks the alignment of the seq.*, paths.txt and metadata files. Users of the dataset do not need this script unless they plan to extend the dataset with their own data.

Demo.ipynb is a Jupyter notebook that a user can run to search through the dataset. It is intended to give a quicker and smoother view of the dataset.

Data splits

In the data_splits folder, you can find a split of this dataset organized as follows:

- train.txt: medical experts + non-experts (80%) = 1128 samples
- dev.txt: medical experts + non-experts (20%) = 283 samples
- test.txt: doctors (100%) = 570 samples

Each file contains references to line numbers of the corpus. For example, the first line of test.txt is 904, and line 904 of seq.in contains the utterance "nicopatch". Users can access the labels, slots and metadata using the same line number 904 in the parallel files (paths.txt, seq.out, seq.label, ...).
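The split files can be resolved against any of the parallel files with a few lines of code. The sketch below uses hypothetical helper names (read_split, select_lines) and assumes that each split file holds one integer per line. Whether those integers count from 1 or from 0 is not stated here, so the one_based flag is an assumption to verify, for example by checking that the first entry of test.txt (904) selects the utterance "nicopatch" in seq.in.

```python
from pathlib import Path
from typing import List


def read_split(path: Path) -> List[int]:
    """Read one line number per line, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [int(line) for line in handle if line.strip()]


def select_lines(lines: List[str], numbers: List[int],
                 one_based: bool = True) -> List[str]:
    """Pick the entries of a parallel file referenced by a split.

    Assumption: line numbers are 1-based unless one_based is False.
    """
    offset = 1 if one_based else 0
    selected = []
    for number in numbers:
        index = number - offset
        if not 0 <= index < len(lines):
            raise IndexError(f"line number {number} is out of range")
        selected.append(lines[index])
    return selected
```

The same list of numbers can then be applied to paths.txt, metadata.txt, seq.label, seq.in and seq.out to obtain the aligned audio paths, participant information, intents, transcripts and tags of a split.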

Keywords

speech corpora, spoken dialogue systems, natural language understanding, biomedical nlp, health informatics
