Publication · Conference object · Preprint · Other literature type · 2018

Sample-Level CNN Architectures for Music Auto-Tagging Using Raw Waveforms

Kim, Taejun; Lee, Jongpil; Nam, Juhan
Open Access
  • Published: 01 Apr 2018
  • Publisher: IEEE
Abstract
Recent work has shown that end-to-end approaches using convolutional neural networks (CNNs) are effective in various types of machine learning tasks. For audio signals, the approach takes raw waveforms as input using a 1-D convolution layer. In this paper, we improve the 1-D CNN architecture for music auto-tagging by adopting building blocks from state-of-the-art image classification models, ResNets and SENets, and by adding multi-level feature aggregation. We compare different combinations of these modules in building CNN architectures. The results show that they achieve significant improvements over previous state-of-the-art models on the MagnaTagATune dataset…
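The squeeze-and-excitation (SE) recalibration that the abstract says is adopted from SENets can be sketched in a few lines. This is a minimal, illustrative pure-Python version operating on a 1-D (channels × time) feature map; the function name, weight shapes, and reduction setup are assumptions for illustration, not the paper's actual configuration.

```python
import math


def se_recalibrate(feature_map, w1, w2):
    """Squeeze-and-Excitation gating on a 1-D feature map (illustrative sketch).

    feature_map: list of C channels, each a list of T activations
    w1: C x (C/r) weight matrix (squeeze -> bottleneck), r = reduction ratio
    w2: (C/r) x C weight matrix (bottleneck -> excitation)
    """
    # Squeeze: global average pooling over the time axis, one scalar per channel
    z = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation, step 1: bottleneck layer with ReLU
    hidden = [max(0.0, sum(z[i] * w1[i][j] for i in range(len(z))))
              for j in range(len(w1[0]))]
    # Excitation, step 2: expand back to C channels and squash with sigmoid
    gates = [1.0 / (1.0 + math.exp(-sum(hidden[j] * w2[j][k]
                                        for j in range(len(hidden)))))
             for k in range(len(z))]
    # Scale: channel-wise recalibration of the original feature map
    return [[gates[c] * x for x in feature_map[c]]
            for c in range(len(feature_map))]
```

With all-zero weights every gate is sigmoid(0) = 0.5, so the feature map is uniformly halved; trained weights instead learn to emphasize informative channels and suppress uninformative ones, which is the recalibration effect the SENet block contributes.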
Subjects
free text keywords: Computer Science - Sound, Computer Science - Learning, Computer Science - Multimedia, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing
23 references, page 1 of 2

[1] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho, “Convolutional recurrent neural networks for music classification,” in ICASSP. IEEE, 2017, pp. 2392-2396.

[2] Jongpil Lee and Juhan Nam, “Multi-level and multi-scale feature aggregation using pre-trained convolutional neural networks for music auto-tagging,” IEEE Signal Processing Letters, vol. 24, no. 8, pp. 1208-1212, 2017.

[3] Sander Dieleman and Benjamin Schrauwen, “Multiscale approaches to music audio feature learning,” in International Society for Music Information Retrieval Conference (ISMIR), 2013, pp. 116-121.

[4] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen, “Transfer learning by supervised pre-training for audio-based music classification,” in International Society for Music Information Retrieval Conference (ISMIR), 2014.

[5] Keunwoo Choi, György Fazekas, and Mark B. Sandler, “Automatic tagging using deep convolutional neural networks,” in International Society for Music Information Retrieval Conference (ISMIR), 2016.

[6] Sander Dieleman and Benjamin Schrauwen, “End-to-end learning for music audio,” in ICASSP. IEEE, 2014, pp. 6964-6968.

[7] Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samarjit Das, “Very deep convolutional neural networks for raw waveforms,” in ICASSP. IEEE, 2017, pp. 421-425.

[8] Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam, “Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms,” in Sound and Music Computing Conference (SMC), 2017.

[9] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” ICLR, 2015.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 630-645.

[12] Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.

[13] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin W. Wilson, “CNN architectures for large-scale audio classification,” in ICASSP. IEEE, 2017, pp. 131-135.

[14] Jongpil Lee and Juhan Nam, “Multi-level and multi-scale feature aggregation using sample-level deep convolutional neural networks for music classification,” in Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML), 2017.

[15] Yi Sun, Xiaogang Wang, and Xiaoou Tang, “Deep learning face representation from predicting 10,000 classes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1891-1898.
