doi: 10.5281/zenodo.8328558 , 10.5281/zenodo.8328559 , 10.48550/arxiv.2311.03621 , 10.5281/zenodo.8345314
arXiv: 2311.03621
handle: 10216/154771
https://aimc2023.pubpub.org/pub/latent-spaces-tonal-music

Variational Autoencoders (VAEs) have proven to be effective models for producing latent representations of cognitive and semantic value. We assess the degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach chorales define latent spaces representative of the circle of fifths and the hierarchical relation of each key's component pitches, as drawn in music cognition. In detail, we compare the latent spaces of different VAE corpus encodings (Piano roll, MIDI, ABC, Tonnetz, and DFT of pitch and pitch-class distributions) in providing a pitch space for key relations that aligns with cognitive distances. We evaluate the model performance of these encodings using objective metrics that capture accuracy, mean squared error (MSE), KL divergence, and computational cost. The ABC encoding performs best in reconstructing the original data, while the Pitch DFT seems to capture more information in the latent space. Furthermore, an objective evaluation of 12 major and minor transpositions per piece is adopted to quantify the alignment of 1) intra- and inter-segment distances per key and 2) the key distances to cognitive pitch spaces. Our results show that Pitch DFT VAE latent spaces align best with cognitive spaces and provide a common-tone space in which overlapping objects within a key form fuzzy clusters, which impose a well-defined order of structural significance or stability, i.e., a tonal hierarchy. Tonal hierarchies of different keys can be used to measure key distances and the relationships of their in-key components at multiple hierarchical levels (e.g., notes and chords). The implementation of our VAE and the encodings framework are made available online.
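The "Pitch DFT" encoding referred to in the abstract can be illustrated with a minimal sketch: taking the discrete Fourier transform of a 12-bin pitch-class distribution, whose coefficient magnitudes encode interval content (the fifth coefficient relates to the circle of fifths). The distribution values below are hypothetical, not taken from the paper's corpus; the authors' actual pipeline is in their released framework.

```python
import numpy as np

# Hypothetical 12-bin pitch-class distribution (C, C#, ..., B),
# e.g. note-duration weights of one chorale segment, normalized to sum to 1.
pc_dist = np.array([0.25, 0.0, 0.12, 0.0, 0.13, 0.13,
                    0.0, 0.2, 0.0, 0.1, 0.0, 0.07])

# DFT of the pitch-class distribution. Coefficient 0 is just the total mass;
# coefficients 1..6 carry the non-redundant interval-content information.
spectrum = np.fft.fft(pc_dist)
magnitudes = np.abs(spectrum[1:7])  # the six informative coefficient magnitudes
```

Because the input is real-valued, coefficients 7..11 mirror 1..5, so the six magnitudes above summarize the distribution; transposing the segment changes only the coefficients' phases, not these magnitudes.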
FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG), Multimedia (cs.MM)
| Indicator | Value |
| --- | --- |
| Selected citations (derived from selected sources; an alternative to the "Influence" indicator) | 1 |
| Popularity (the "current" attention an article receives in the research community, based on the underlying citation network) | Average |
| Influence (the overall/total impact of an article in the research community, based on the underlying citation network, diachronically) | Average |
| Impulse (the initial momentum of an article directly after its publication, based on the underlying citation network) | Average |
| Views | 21 |
| Downloads | 15 |

Views and downloads provided by UsageCounts.