Transformer and LSTM Models for Automatic Counterpoint Generation using Raw Audio

This study investigated Transformer and LSTM models applied to raw audio for automatic counterpoint generation. The dataset was a collection of raw audio waveforms of pieces by Bach, played on different instruments. Each musical sentence was composed of four voices, and the aim was for the models to predict a missing voice from any subset of the remaining three voices. The research demonstrated the efficacy and behaviour of two deep learning (DL) architectures, the LSTM and the Transformer, when applied to raw audio data, which typically involve much longer sequences than symbolic music representations such as MIDI. The LSTM has so far been the quintessential DL model for sequence-based tasks, such as generative audio modelling, but this study shows that the Transformer can achieve competitive results.
The mean squared (MSE) and mean absolute (MAE) errors were as follows:
- Transformer: 1.0404±0.003e-5 (MSE) and 7.6733±0.2410e-4 (MAE)
- LSTM: 1.0388±0.004e-5 (MSE) and 7.9989±0.5274e-4 (MAE)
Both models achieved very small MSE and MAE values. The LSTM yielded a slightly smaller MSE on the test set, while the Transformer performed better with regard to MAE. Because the differences between the two were so small, it was difficult to conclude that either model was the better one. Spectral plots of the targets and predictions were also examined, and the corresponding audio files were listened to, for a couple of randomly selected test samples and one out-of-distribution sample. These showed that the models could generate predictions that were difficult to distinguish from the target samples, even for a musical piece that was not taken from the original dataset.
Overall, we propose a novel application of the Transformer model for automatic counterpoint generation, which achieved results on par with the current state of the art, represented by the LSTM model. Furthermore, the study investigates the prediction capabilities of both models and proposes new areas of research that appear particularly interesting, such as analysing attention weights to improve human-computer interaction in musical systems. We demonstrated the competitiveness of a non-recurrent deep learning model against recurrent architectures for raw audio modelling. Having several models to choose from for a particular application is desirable, as certain features of particular architectures might be advantageous for different research problems.
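
As a reading aid for the MSE and MAE figures reported above, the following is a minimal sketch of how the two metrics are computed between a target waveform and a predicted waveform. It is not the study's evaluation code: the function names, the 8 kHz sample rate, the 440 Hz sine tone and the constant 0.001 offset are all assumptions chosen purely for illustration.

```python
import math


def mean_squared_error(target, prediction):
    """Average of squared sample-wise differences between two waveforms."""
    if len(target) != len(prediction) or not target:
        raise ValueError("waveforms must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)


def mean_absolute_error(target, prediction):
    """Average of absolute sample-wise differences between two waveforms."""
    if len(target) != len(prediction) or not target:
        raise ValueError("waveforms must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(target, prediction)) / len(target)


# Hypothetical data: one second of a 440 Hz sine at an assumed 8 kHz sample rate.
sample_rate = 8000
target = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# Hypothetical "prediction": the target shifted by a small constant offset.
offset = 0.001
prediction = [t + offset for t in target]

mse = mean_squared_error(target, prediction)
mae = mean_absolute_error(target, prediction)
print(f"MSE: {mse:.3e}")
print(f"MAE: {mae:.3e}")
```

Because MSE squares each error, it weights large deviations more heavily than MAE does, which is one reason two models can rank differently under the two metrics, as the LSTM and Transformer do here.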

Country

Norway

Keywords

000, 004
