Low-resource automatic speech recognition and error analyses of oral cancer speech

Publication: Article, 01 Jun 2022, Netherlands, English

Publisher: Elsevier BV

Journal: Speech Communication, volume 141, pages 14-27 (ISSN: 0167-6393)

Funded by: EC | TAPAS

Authors: Bence Halpern; Siyuan Feng; Rob van Son; Michiel van den Brekel; Odette Scharenborg

doi: 10.1016/j.specom.2022.04.006

handle: 11245.1/3ef4097c-670f-4cb3-a393-b70f40d19f8b

Abstract

In this paper, we introduce a new corpus of oral cancer speech and present our study on the automatic recognition and analysis of oral cancer speech. The corpus is a two-hour English oral cancer speech dataset collected from YouTube. Formulating the problem as a low-resource oral cancer ASR task, we investigate three acoustic modelling approaches that have previously worked well in low-resource scenarios, using two different architectures, a hybrid architecture and a transformer-based end-to-end (E2E) model: (1) a retraining approach; (2) a speaker adaptation approach; and (3) a disentangled representation learning approach (only using the hybrid architecture). Compared to a baseline system that is not trained on oral cancer speech, the approaches achieve absolute word error rate reductions of (1) 4.7% (hybrid) and 7.5% (E2E); (2) 7.7%; and (3) 2.0%, respectively. A detailed analysis of the speech recognition results shows that: (1) plosives and certain vowels are the most difficult sounds to recognise in oral cancer speech, a problem that is successfully alleviated by our proposed approaches; (2) however, these sounds are also relatively poorly recognised in healthy speech, with the exception of /p/; (3) recognition performance of certain phonemes is strongly data-dependent; (4) in terms of the manner of articulation, E2E performs better with the exception of vowels, although vowels contribute substantially to overall performance; as for the place of articulation, vowels, labiodentals, dentals and glottals are better captured by hybrid models, whereas E2E is better on bilabial, alveolar, postalveolar, palatal and velar information; and (5) our analysis provides some guidelines for selecting words that can be used as voice commands for ASR systems for oral cancer speakers.
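The results above are reported as absolute word error rate (WER) reductions. As a minimal sketch of what that metric means, the Python below computes WER as the word-level edit distance divided by the reference length, and the absolute reduction as a simple difference between two WER values. The function names and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own scoring tooling is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Absolute reduction: the plain difference in WER, not a relative percentage."""
    return baseline_wer - adapted_wer


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    reference = "turn the light on"
    baseline = word_error_rate(reference, "turn a light")      # 2 errors over 4 words
    adapted = word_error_rate(reference, "turn the light in")  # 1 error over 4 words
    print(baseline, adapted, absolute_wer_reduction(baseline, adapted))
```

Under this definition, an absolute reduction of 4.7% means the WER dropped by 4.7 percentage points relative to the baseline, independent of how high the baseline WER was.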

Country

Netherlands

Related Organizations

Delft University of Technology
Netherlands
Amsterdam UMC
Netherlands
Antoni van Leeuwenhoek Hospital
Netherlands
University of Amsterdam
Netherlands

Keywords

Oral cancer, Automatic speech recognition, Pathological speech, Phoneme analysis, Low-resource, 410, 004

Impact by BIP!

	selected citations: 10. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
	influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
	impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.