
arXiv: 2506.18532
Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners’ speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partialcascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Candidate feedback of their mistakes is an essential step to improving performance. In E2E systems the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed which aims to remove hypothesised edits resulting from fluent transcription errors. Finally, these approaches are combined with an edit-confidence estimation approach, to exclude low confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I) corpus show that the proposed approaches significantly boost end-to-end SGEC performance.
Machine Learning, FOS: Computer and information sciences, Sound (cs.SD), Sound, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language, Computation and Language (cs.CL), Audio and Speech Processing, Machine Learning (cs.LG)
Machine Learning, FOS: Computer and information sciences, Sound (cs.SD), Sound, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language, Computation and Language (cs.CL), Audio and Speech Processing, Machine Learning (cs.LG)
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
