Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Name: Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks
Keywords: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

Yang Ai; Zhen-Hua Ling

Found an issue? Give us feedback

arXiv.org e-Print Ar...arrow_drop_down

arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

IEEE/ACM Transactions on Audio Speech and Language Processing

Article . 2024 . Peer-reviewed

License: IEEE Copyright

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2024

License: arXiv Non-Exclusive Distribution

Data sources: Datacite

Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

descriptionPublicationkeyboard_double_arrow_right Article , Preprint 01 Jan 2024Embargo end date: 01 Jan 2024Publisher:Institute of Electrical and Electronics Engineers (IEEE)Journal:IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 32, pages 2,283-2,296 (issn: 2329-9290, eissn: 2329-9304,

Copyright policy )

Authors: Yang Ai; Zhen-Hua Ling;

doi: 10.1109/taslp.2024.3385285 , 10.48550/arxiv.2403.17378

arXiv: 2403.17378

Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

- Summary
- Subjects
- Related research
  (2)
- Metrics

Abstract

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms the iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.

Accepted by IEEE Transactions on Audio, Speech and Language Processing. arXiv admin note: substantial text overlap with arXiv:2211.15974

Related Organizations

University of Science and Technology of China
China (People's Republic of)

Keywords

FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing

2 Research products, page 1 of 1

hifi-gan software on GitHub
IsRelatedTo
LL-NSPP software on GitHub
IsRelatedTo

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	1
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

1

Average

Green

Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

2 Research products, page 1 of 1

hifi-gan software on GitHub

LL-NSPP software on GitHub