Publication · Preprint · Conference object · 2020

Phase recovery with Bregman divergences for audio source separation

Magron, Paul; Vial, Pierre-Hugo; Oberlin, Thomas; Févotte, Cédric
Open Access · English
  • Published: 10 Dec 2020
  • Publisher: HAL CCSD
  • Country: France
Abstract
Time-frequency audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source, and then applying a phase recovery algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has shown good performance in several recent works. This algorithm minimizes a quadratic reconstruction error between magnitude spectrograms. However, this loss does not properly account for some perceptual properties of audio, and alternative discrepancy measures such as beta-divergences have been preferred in many settings. In this paper, we propose to...
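
To make the abstract's ingredients concrete: the beta-divergence it refers to is the standard family (see [15]),

    d_\beta(x \mid y) = \frac{x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta - 1}}{\beta (\beta - 1)}, \qquad \beta \in \mathbb{R} \setminus \{0, 1\},

which yields the quadratic loss at \beta = 2 and the Kullback-Leibler and Itakura-Saito divergences in the limits \beta \to 1 and \beta \to 0. Below is a minimal NumPy/librosa sketch of the baseline MISI iteration [13] with its usual quadratic loss; it does not reproduce the paper's Bregman-divergence extension, and the function name and STFT parameters are illustrative choices, not taken from the paper.

    import numpy as np
    import librosa

    def misi(mixture, magnitudes, n_iter=20, n_fft=1024, hop=256):
        """Multiple input spectrogram inversion (quadratic loss) [13].

        mixture    : time-domain mixture, shape (n_samples,)
        magnitudes : list of target STFT magnitudes, each (1 + n_fft // 2, n_frames)
        """
        n_src = len(magnitudes)
        # Initialize every source with the mixture phase.
        phase = [np.angle(librosa.stft(mixture, n_fft=n_fft, hop_length=hop))] * n_src
        for _ in range(n_iter):
            # Resynthesize each source from its target magnitude and current phase.
            sources = [librosa.istft(v * np.exp(1j * p), hop_length=hop, length=len(mixture))
                       for v, p in zip(magnitudes, phase)]
            # Distribute the mixing error equally across the sources ...
            err = (mixture - sum(sources)) / n_src
            # ... and keep only the phase of the corrected re-analysis.
            phase = [np.angle(librosa.stft(s + err, n_fft=n_fft, hop_length=hop))
                     for s in sources]
        return [librosa.istft(v * np.exp(1j * p), hop_length=hop, length=len(mixture))
                for v, p in zip(magnitudes, phase)]

Each iteration enforces the target magnitudes, projects back to the time domain, and redistributes the residual so that the estimates sum to the mixture; this is the consistency-plus-mixing constraint that MISI trades off under the quadratic loss.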
Subjects
arXiv: Computer Science::Sound
free text keywords: Phase recovery, Bregman divergences, projected gradient descent, audio source separation, speech enhancement, [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD], [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
References (1-15 of 30 shown)

[1] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010.

[2] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines,” in Proc. Interspeech, September 2018.

[3] E. Cano, D. FitzGerald, A. Liutkus, M. D. Plumbley, and F. Stöter, “Musical source separation: An introduction,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 31-40, January 2019.

[4] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702-1726, October 2018.

[5] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256-1266, August 2019.

[6] Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2020.

[7] D. Ditter and T. Gerkmann, “A multi-phase gammatone filterbank for speech separation via TasNet,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2020.

[8] P. Magron, K. Drossos, S. I. Mimilakis, and T. Virtanen, “Reducing interference with phase recovery in DNN-based monaural singing voice separation,” in Proc. Interspeech, September 2018.

[9] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, “Phase processing for single-channel speech enhancement: History and recent advances,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 55-66, March 2015.

[10] Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Proc. Interspeech, September 2018.

[11] G. Wichern and J. Le Roux, “Phase reconstruction with learned time-frequency representations for single-channel speech separation,” in Proc. International Workshop on Acoustic Signal Enhancement (IWAENC), September 2018.

[12] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable consistency constraints for improved deep speech enhancement,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019.

[13] D. Gunawan and D. Sen, “Iterative phase estimation for the synthesis of separated sources from single-channel mixtures,” IEEE Signal Processing Letters, vol. 17, no. 5, pp. 421-424, May 2010.

[14] R. Gray, A. Buzo, A. Gray, and Y. Matsuyama, “Distortion measures for speech processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 367-376, August 1980.

[15] R. Hennequin, B. David, and R. Badeau, “Beta-divergence as a subclass of Bregman divergence,” IEEE Signal Processing Letters, vol. 18, no. 2, pp. 83-86, February 2011.
