
In deep learning research, many melody extraction models rely on redesigning neural network architectures to improve performance. In this paper, we propose an input feature modification and a training objective modification based on two assumptions. First, harmonics in the spectrograms of audio data decay rapidly along the frequency axis. To enhance the model's sensitivity to the trailing harmonics, we modify the Combined Frequency and Periodicity (CFP) representation using the discrete z-transform. Second, vocal and non-vocal segments of extremely short duration are uncommon. To ensure a more stable melody contour, we design a differentiable loss function that discourages the model from predicting such segments. We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, adapted from a piano transcription network. Our experimental results demonstrate that the proposed modifications are empirically effective for singing melody extraction.
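The second idea can be illustrated with a minimal sketch. This is an assumption-laden example, not the paper's actual loss: it supposes frame-wise voicing probabilities of shape `[batch, time]` and a hypothetical `short_segment_penalty` helper. Very short vocal/non-vocal flips deviate strongly from a moving average of the sequence, so penalizing that deviation is one differentiable way to discourage them:

```python
import torch
import torch.nn.functional as F

def short_segment_penalty(probs: torch.Tensor, kernel_size: int = 5) -> torch.Tensor:
    """Illustrative differentiable penalty on very short segments.

    probs: frame-wise voicing probabilities, shape [batch, time].
    A segment shorter than the kernel produces a large local gap
    between the raw sequence and its moving average; this term
    penalizes that gap, encouraging a more stable contour.
    """
    # Replicate-pad the edges so a constant sequence incurs zero penalty.
    pad = kernel_size // 2
    padded = F.pad(probs.unsqueeze(1), (pad, pad), mode="replicate")
    # Moving average over time smooths out brief flips.
    kernel = torch.ones(1, 1, kernel_size) / kernel_size
    smoothed = F.conv1d(padded, kernel).squeeze(1)
    # Mean squared disagreement between raw and smoothed sequences.
    return ((probs - smoothed) ** 2).mean()
```

In practice such a term would be added, with a small weight, to the main frame-wise melody loss; a perfectly constant (hence flip-free) sequence contributes nothing, while rapidly alternating predictions are penalized.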
7 pages, 4 figures, 2 tables, Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Machine Learning (stat.ML); Audio and Speech Processing (eess.AS)
