PxCorpus : A Spoken Drug Prescription Dataset in French for Spoken Language Understanding and Dialogue

Kocabiyikoglu, Ali Can; Portet, François; Gibert, Prudence; Blanchon, Hervé; Babouchkine, Jean-Marc; Gavazzi, Gaëtan

ZENODO

Audiovisual . 2023

License: CC BY

Data sources: ZENODO

Other research product · Audiovisual · 08 Nov 2023 · French · Publisher: Zenodo

doi: 10.5281/zenodo.10080490 , 10.5281/zenodo.6482586

Abstract

PxCorpus : A Spoken Drug Prescription Dataset in French

PxCorpus is, to the best of our knowledge, the first corpus of spoken medical drug prescriptions to be distributed. It contains 4 hours of transcribed and annotated dialogues of drug prescriptions in French, acquired through an experiment with 55 participants, both experts and non-experts in drug prescription. The automatic transcriptions were verified by hand and aligned with semantic labels to allow the training of NLP models. The data acquisition protocol was reviewed by medical experts and permits free distribution without breach of privacy or regulation.

Overview of the Corpus

The experiment was performed in the wild, with naive participants and medical experts. In total, the dataset includes 2067 recordings of 55 participants (38% non-experts, 25% doctors, 36% medical practitioners), manually transcribed and semantically annotated.

| Category        | Sessions | Recordings | Time (min) |
|-----------------|----------|------------|------------|
| Medical experts | 258      | 434        | 94.83      |
| Doctors         | 230      | 570        | 105.21     |
| Non experts     | 415      | 977        | 62.13      |
| Total           | 903      | 1981       | 262.27     |

License

We hope that the community will be able to benefit from the dataset, which is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

How to cite this corpus

If you use the corpus or need more details, please refer to the following paper: A spoken drug prescription dataset in French for spoken Language Understanding

    @InProceedings{Kocabiyikoglu2022,
      author = "Alican Kocabiyikoglu and Fran{\c c}ois Portet and Prudence Gibert and Hervé Blanchon and Jean-Marc Babouchkine and Gaëtan Gavazzi",
      title = "A spoken drug prescription dataset in French for spoken Language Understanding",
      booktitle = "13th Language Resources and Evaluation Conference (LREC 2022)",
      year = "2022",
      location = "Marseille, France"
    }

A more complete description of the corpus acquisition is available on arXiv:

    @misc{kocabiyikoglu2023spoken,
      title={Spoken Dialogue System for Medical Prescription Acquisition on Smartphone: Development, Corpus and Evaluation},
      author={Ali Can Kocabiyikoglu and François Portet and Jean-Marc Babouchkine and Prudence Gibert and Hervé Blanchon and Gaëtan Gavazzi},
      year={2023},
      eprint={2311.03510},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }

Project Structure

The project contains the following elements:

    .
    ├── LICENSE
    ├── PxDialogue/
    ├── PxSLU/
    └── readme.md

PxSLU : Prescription Corpus for Spoken Language Understanding

Directory Structure

    .
    ├── LICENSE
    ├── metadata.txt
    ├── paths.txt
    ├── PxSLU_conll.txt
    ├── readme.md
    ├── recordings
    ├── seq.in
    ├── seq.label
    ├── seq.out
    ├── Demo.ipynb
    └── verifications.py

Recordings

The recordings directory contains the 903 recording sessions. Each session can contain several recordings. For instance, the directory recordings/J7aVvWb67L contains the recordings recording_0.wav and recording_2.wav, which represent two attempts to record a drug prescription.

All recordings are stored as mono-channel wav files (16 kHz, 16-bit signed PCM).

Paths

paths.txt contains the list of all the .wav files in the recordings directory:

    00MYcyVK0t/recording_0.wav
    00MYcyVK0t/recording_2.wav
    02Qp6ICj9Q/recording_0.wav
    02Qp6ICj9Q/recording_1.wav
    ...

All other files (metadata.txt, seq.*) refer to this list to describe the recordings.

Metadata

metadata.txt contains the information about the participants:

    48,60+,F,non-expert
    48,60+,F,non-expert
    24,18–28,F,doctor
    24,18–28,F,doctor
    ...

The first column is the participant's unique id, the second is the age range, the third is the gender and the last is the category of the participant, one of {doctor, expert, non-expert}. doctor corresponds to a physician, expert to another kind of expert (a pharmacist or a biologist specialized in drugs), and non-expert to anyone who falls outside these categories. The lines are synchronised with the lines of paths.txt.

Labels

The three files seq.label, seq.in and seq.out contain, respectively, the intent, the transcript and the entities in BIO format.

| seq.label | seq.in | seq.out |
|-----------|--------|---------|
| medical_prescription | flagyl 500 milligrammes euh qu/ en... | B-drug B-d_dos_val B-d_dos_up O O ... |
| medical_prescription | 3 comprimés par jour matin midi ... | B-dos_val B-dos_uf O O B-rhythm_tdte B-rhythm_tdte O B-rhythm_tdte ... |
| ... | ... | ... |

These lines are synchronised with the lines of paths.txt.

Another file, PxSLU_conll.txt, is provided in a format inspired by the CoNLL format (https://universaldependencies.org/format.html). However, this file is *not* aligned with the list of audio files in paths.txt.

Scripts

verifications.py checks the alignment of all the seq.* files, paths.txt and metadata.txt. Users of the dataset do not need this script unless they plan to extend the dataset with their own data.

Demo.ipynb is a Jupyter notebook that can be run to search through the dataset. It is intended to give a quicker and smoother overview of the dataset.

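
The line-by-line alignment between paths.txt, metadata.txt and the seq.* files can be exploited with a few lines of Python. The following is a minimal sketch, not the corpus's own tooling: it assumes that transcripts and tags are whitespace-separated with one tag per token, and that each metadata line has exactly four comma-separated fields. The names `bio_spans`, `align_records` and `load_pxslu`, as well as the record keys, are chosen here for illustration.

```python
from pathlib import Path

# Parallel files: line i of each file describes the same recording.
FILES = ("paths.txt", "metadata.txt", "seq.label", "seq.in", "seq.out")


def bio_spans(tokens, tags):
    """Group BIO tags into (label, text) spans.

    A stray I- tag that does not continue the open span is ignored.
    """
    spans, label, words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == label:
            words.append(token)  # continue the open span
            continue
        if label is not None:  # any other tag closes the open span
            spans.append((label, " ".join(words)))
            label, words = None, []
        if tag.startswith("B-"):  # start a new span
            label, words = tag[2:], [token]
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


def align_records(columns):
    """Zip the parallel files into one dict per recording.

    `columns` maps each name in FILES to its list of lines.
    """
    sizes = {name: len(columns[name]) for name in FILES}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"files are not line-aligned: {sizes}")
    records = []
    for path, meta, intent, text, tags in zip(*(columns[n] for n in FILES)):
        # Assumed layout: participant id, age range, gender, category.
        participant, age_range, gender, category = meta.split(",")
        records.append({
            "path": path,
            "participant": participant,
            "age_range": age_range,
            "gender": gender,
            "category": category,
            "intent": intent,
            "text": text,
            "entities": bio_spans(text.split(), tags.split()),
        })
    return records


def load_pxslu(root):
    """Read the parallel files from the PxSLU directory and align them."""
    root = Path(root)
    return align_records({
        name: (root / name).read_text(encoding="utf-8").splitlines()
        for name in FILES
    })
```

With the corpus unpacked locally, `load_pxslu("PxSLU")` (the path is hypothetical) would return one record per line of paths.txt, and a record could then be filtered by `category` or `intent` before opening the corresponding wav file.
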
Data splits

In the data_splits folder, you can find a split of this dataset organized as follows:

- train.txt: medical experts + non experts (80%) = 1128 samples
- dev.txt: medical experts + non experts (20%) = 283 samples
- test.txt: doctors (100%) = 570 samples

Each file contains references to line numbers of the corpus. For example, the first line of test.txt is 904; at that line, the seq.in file contains the utterance "nicopatch". Users can access the labels, slots and metadata using the same line number 904 in the parallel files (paths.txt, seq.out, seq.label, ...).

PxDialogue : Prescription recording corpus for dialogue systems

The PxDialogue corpus is an extension of the PxSLU corpus and provides additional information about the dialogues that were collected through spoken interaction. This corpus includes two additional files:

    ├── events.txt
    ├── dialogue_annotations.txt

Events.txt: For each dialogue session, all dialogue events are given in this text file, which can be used to train or evaluate dialogue systems.

Usage example:

PxSLU (paths.txt)

- 00MYcyVK0t/recording_0.wav
- 00MYcyVK0t/recording_2.wav

PxDialogue (events.txt)

- (-1, 'START', 'APP', None, 0) (1, 'user', 'ASR', 'flagyl 500 mg en cachet pendant 8 jours', 30) (1, 'system', 'TTS', 'Choisissez le médicament correspondant à votre recherche', 34) (2, 'user', 'UI', 'listview_item_clicked', 40) (2, 'system', 'TTS', 'Pourriez vous préciser la posologie pour le patient?', 40) (3, 'user', 'ASR', '3 comprimés par jour matin midi et soir pendant 10 jours', 64) (3, 'system', 'TTS', "Est-ce que vous confirmez l'ajout de cette prescription sur la liste?", 66) (4, 'user', 'UI', '/inform{"validate":"validate"}', 73) (4, 'system', 'TTS', 'Prescription validée avec succès. Traitement ajouté sur le dossier du patient', 73) (-1, 'END', 'APP', '', 73)
- N/A

For example, this dialogue session (00MYcyVK0t) contains two recordings. The events of a dialogue session are given once, in a single row, at the first recording of the session (recording_0).

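
A row such as the one in the usage example above can be read back into Python tuples. The following is a minimal sketch under stated assumptions: it supposes that a row is a sequence of Python-style tuples separated by whitespace, exactly as printed above, and that rows without events read N/A. The `Event` field names (`turn`, `actor`, `kind`, `content`, `timestamp`) are labels chosen here, not names defined by the corpus.

```python
import ast
import re
from typing import NamedTuple, Optional


class Event(NamedTuple):
    turn: int
    actor: str
    kind: str
    content: Optional[str]
    timestamp: int


# Gap between two tuples: ")" + whitespace + "(" followed by a turn number.
_TUPLE_GAP = re.compile(r"\)\s+\((?=-?\d+\s*,)")


def parse_events(row):
    """Parse one row of events.txt into a list of Event tuples.

    Rows that are empty or read "N/A" yield an empty list. Each tuple is
    expected to have exactly five elements, otherwise a TypeError is raised.
    """
    row = row.strip()
    if row in ("", "N/A"):
        return []
    # Turn "(...) (...)" into "[(...), (...)]" so it is a valid list literal.
    literal = "[" + _TUPLE_GAP.sub("), (", row) + "]"
    return [Event(*item) for item in ast.literal_eval(literal)]


if __name__ == "__main__":
    sample = (
        "(-1, 'START', 'APP', None, 0) "
        "(1, 'user', 'ASR', 'flagyl 500 mg en cachet pendant 8 jours', 30) "
        "(-1, 'END', 'APP', '', 73)"
    )
    for event in parse_events(sample):
        if event.kind == "ASR":
            print(event.turn, event.content)
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals, so a malformed row raises an error instead of executing code. If the real file wraps each row in brackets or uses another separator, the regular expression would need adjusting.
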
Dialogues are described as a sequence of events, where each action taken by the user or the system is considered a dialogue turn and is stored as a tuple:

    (-1, 'START', 'APP', None, 0)

- The first element is the dialogue turn number. -1 means that the application is initialized; the END event also carries -1.
- The second element describes who initiated the event: user, system, START or END.
- The third element describes the type of the event: APP (start and end events), ASR (automatic speech recognition), TTS (text-to-speech) or UI (user interface).
- The fourth element is the content of the event: the recognised utterance for ASR, the spoken prompt for TTS, or the name of the action for UI. User clicks on buttons sometimes trigger explicit intent recognition. For example, (4, 'user', 'UI', '/inform{"validate":"validate"}') describes the explicit intent of validating the prescription.
- The fifth element is the timestamp (in seconds).

Dialogue annotations

We also include a manual annotation of the dialogues (dialogue_annotations.txt) which indicates, for each recording, whether the system gave the correct answer given the utterance. Each line contains a keyword, either [Fail] or [OK]. The following example shows a dialogue sample with annotations:

| dialogue_annotations.txt | paths.txt | seq.in |
|--------------------------|-----------|--------|
| OK | 14yHtAe555/recording_0.wav | oxytetracycline solution euh |
| OK | 14yHtAe555/recording_1.wav | oxytetracycline solution 5 gouttes matin et soir pendant 14 jours |
| Fail | 14yHtAe555/recording_2.wav | oxytetracycline solution 5 gouttes matin et soir pendant 14 jours |

For these 3 recordings, the dialogue annotations are OK, OK and Fail, respectively. [OK] means that the dialogue system reacted correctly to the input. [Fail] means that the action of the system after this utterance should not be used for evaluation or training.

Note that in the first utterance some information is missing; after the second utterance, however, the system should have all of the slots required to validate the prescription. Note also that free comments added through the ASR system were marked as Fail, as these utterances did not enter the dialogue state tracking. In this example, the last utterance was recorded as a free comment by the prescriber and was therefore annotated as Fail.

Linking audio records to ASR events

In order to link audio records to ASR events, the user has to use both paths.txt and events.txt. For example, for the following dialogue session (lines 1:2 of paths.txt):

    00MYcyVK0t/recording_0.wav
    00MYcyVK0t/recording_2.wav

events.txt includes two ASR events:

    (1, 'user', 'ASR', 'flagyl 500 mg en cachet pendant 8 jours', 30)
    (3, 'user', 'ASR', '3 comprimés par jour matin midi et soir pendant 10 jours', 64)

These ASR events correspond to the recording files that can be found in the recordings folder.

Available user action annotations

events.txt includes the following annotations:

- /inform{"validate":"validate"} : User clicks on the validate button after seeing the prescription
- /inform{"validate":"refuse"} : User clicks on the refuse button after seeing the prescription
- ASR : User clicks on the push-to-talk button to record an utterance
- listview_item_clicked : User clicks on the list to choose a drug
- listview_cancel_clicked : User clicks on the cancel button after seeing a list of drugs
- FREE_COMMENT_ADDED : User records a free-form utterance by clicking the "add free comment" button
- EMPTY_UTTERANCE : Recording containing an empty utterance
- APP_CRASH : An application crash that happened in the dialogue turn
- EVAL_FINISH_APPROVED : User clicks on the final upload button to finish the experiment
- RESTART_CONVERSATION_SESSION : User clicks on the restart conversation button
- RESTART_CANCEL_CLICKED : User cancels the restart process by clicking on the cancel button
- EVAL_FINISH_CANCELED : User cancels the final upload process by clicking on the cancel button

**Free comments:** Users had the possibility of recording a speech-to-text message upon viewing a prescription. These messages were not added to the dialogue state tracking, but were displayed on the interface and saved in the database. Free comments can be found by searching for FREE_COMMENT_ADDED events in events.txt.

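
The linking of audio records to ASR events, and the search for annotations such as FREE_COMMENT_ADDED, can be sketched as follows. This is an illustration under assumptions that the description above does not confirm: that events.txt is line-aligned with paths.txt, that the n-th ASR event of a session corresponds to the n-th recording of that session in paths.txt order, and that events are already parsed into 5-element tuples. Sessions where the counts differ are left unlinked rather than guessed. The function names are hypothetical.

```python
def session_id(path):
    """Return the session directory of a paths.txt entry."""
    return path.split("/", 1)[0]


def link_asr_to_audio(paths, events_per_line):
    """Pair each recording of a session with an ASR transcript.

    `paths` holds the lines of paths.txt; `events_per_line` holds, for each
    of those lines, the parsed events (an empty list for N/A rows).
    Returns {session: [(path, asr_text), ...]}, or None for a session whose
    number of ASR events differs from its number of recordings.
    """
    sessions = {}
    for path, events in zip(paths, events_per_line):
        entry = sessions.setdefault(
            session_id(path), {"recordings": [], "asr": []}
        )
        entry["recordings"].append(path)
        # Third element is the event type, fourth is its content.
        entry["asr"].extend(e[3] for e in events if e[2] == "ASR")
    links = {}
    for session, entry in sessions.items():
        if len(entry["recordings"]) == len(entry["asr"]):
            links[session] = list(zip(entry["recordings"], entry["asr"]))
        else:
            links[session] = None  # positional pairing would be unsafe
    return links


def find_events(events, keyword):
    """Return the events in which any string field contains `keyword`.

    Every string field is checked because the field that carries a given
    annotation is not specified in the description above.
    """
    return [
        e for e in events
        if any(isinstance(field, str) and keyword in field for field in e)
    ]


if __name__ == "__main__":
    paths = ["00MYcyVK0t/recording_0.wav", "00MYcyVK0t/recording_2.wav"]
    events_per_line = [
        [
            (-1, "START", "APP", None, 0),
            (1, "user", "ASR", "flagyl 500 mg en cachet pendant 8 jours", 30),
            (3, "user", "ASR",
             "3 comprimés par jour matin midi et soir pendant 10 jours", 64),
            (-1, "END", "APP", "", 73),
        ],
        [],  # N/A row for the second recording of the session
    ]
    for path, text in link_asr_to_audio(paths, events_per_line)["00MYcyVK0t"]:
        print(path, "->", text)
```

Calling `find_events(events, "FREE_COMMENT_ADDED")` on the events of a session would return the free-comment events, if any. Sessions that contain free comments or empty utterances may well have more recordings than ASR events, which is why the sketch returns None for them instead of forcing a pairing.
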

Related Organizations

Centre Hospitalier Universitaire de Bordeaux
France

Keywords

speech corpora, spoken dialogue systems, natural language understanding, biomedical nlp, health informatics
