Publication · Preprint · Conference object · 2018

Reducing Interference with Phase Recovery in DNN-based Monaural Singing Voice Separation

Paul Magron; Konstantinos Drossos; Stylianos Ioannis Mimilakis; Tuomas Virtanen
Open Access · English
  • Published: 02 Sep 2018
  • Publisher: ISCA
Abstract
State-of-the-art methods for monaural singing voice separation estimate the magnitude spectrum of the voice in the short-time Fourier transform (STFT) domain by means of deep neural networks (DNNs). The resulting magnitude estimate is then combined with the mixture's phase to retrieve the complex-valued STFT of the voice, which is synthesized into a time-domain signal. However, when the sources overlap in time and frequency, the STFT phase of the voice differs from the mixture's phase, which results in interference and artifacts in the estimated signals. In this paper, we investigate recent phase recovery algorithm...
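To make the pipeline concrete, the sketch below implements the two reconstruction strategies the abstract contrasts: the common baseline that pairs the DNN magnitude estimate with the mixture's phase, and an iterative phase recovery scheme, MISI (multiple input spectrogram inversion), which refines the source phases under the constraint that the separated signals sum back to the mixture. This is a minimal illustration only: the function names, STFT parameters, and the use of librosa are assumptions, not the paper's exact implementation, and the paper's own variants (e.g. sinusoidal-model phase initialization) are not reproduced here.

    import numpy as np
    import librosa

    def mixture_phase_baseline(magnitude, mix_stft):
        """Baseline: pair a DNN magnitude estimate with the mixture's phase."""
        return magnitude * np.exp(1j * np.angle(mix_stft))

    def misi(mix, magnitudes, n_fft=1024, hop=256, n_iter=50):
        """Multiple Input Spectrogram Inversion (illustrative sketch):
        iteratively refine the source phases so the time-domain estimates
        sum back to the mixture, keeping each magnitude fixed."""
        mix_stft = librosa.stft(mix, n_fft=n_fft, hop_length=hop)
        # Initialise every source with the mixture phase (the baseline above).
        stfts = [mixture_phase_baseline(m, mix_stft) for m in magnitudes]
        for _ in range(n_iter):
            # Resynthesize and measure how far the estimates are from
            # summing to the observed mixture.
            signals = [librosa.istft(s, hop_length=hop, length=len(mix))
                       for s in stfts]
            error = mix - np.sum(signals, axis=0)
            err_stft = librosa.stft(error, n_fft=n_fft, hop_length=hop)
            # Distribute the error equally across sources, then restore the
            # target magnitudes so that only the phases are actually updated.
            stfts = [s + err_stft / len(stfts) for s in stfts]
            stfts = [m * np.exp(1j * np.angle(s))
                     for m, s in zip(magnitudes, stfts)]
        return [librosa.istft(s, hop_length=hop, length=len(mix)) for s in stfts]

Keeping the target magnitudes fixed while only the phases move is what distinguishes this from plain magnitude masking; in practice a few dozen iterations are typically enough for the mixing error to become small.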
Subjects
Free-text keywords: monaural singing voice separation, phase recovery, deep neural networks, MaD TwinNet, Wiener filtering, [SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing, sinusoidal model, Fourier transform, Wiener filter, short-time Fourier transform, speech recognition, monaural, interference (wave propagation), magnitude (mathematics), redundancy (engineering), computer science
Funded by
EC| MacSeNet
Project
MacSeNet
Machine Sensing Training Network
  • Funder: European Commission (EC)
  • Project Code: 642685
  • Funding stream: H2020 | MSCA-ITN-ETN
Validated by funder
EC| EVERYSOUND
Project
EVERYSOUND
Computational Analysis of Everyday Soundscapes
  • Funder: European Commission (EC)
  • Project Code: 637422
  • Funding stream: H2020 | ERC | ERC-STG
Validated by funder