Réseaux de neurones profonds pour la séparation des sources et la reconnaissance robuste de la parole (Deep neural networks for source separation and robust speech recognition)

Doctoral thesis | English | Open access
Nugraha, Aditya Arie
(2017)
  • Publisher: HAL CCSD
  • Subject: Deep neural networks | Multichannel audio source separation | Multichannel Gaussian model | [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing

This thesis addresses the problem of multichannel audio source separation by exploiting deep neural networks (DNNs). We build upon the classical expectation-maximization (EM) based source separation framework employing a multichannel Gaussian model, in which the sources...