
The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders - Audio and Embedding Files

This dataset contains the audio files and the embeddings for the paper "The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders" by Sauter et al. (2026). The corresponding code is available at https://github.com/adrian-sauter/visual_grounding_speech_analysis.

audio_files.zip contains the audio files from MALD [1] and LibriSpeech [2] that were used in our work.

fast_vgs_plus_librispeech_audioslicing.pkl.zip and w2v2_LibriSpeech_audioslicing.pkl.zip contain the embeddings for words from LibriSpeech (obtained via audio-slicing) for the FaST-VGS+ [3] and wav2vec2 [4] models, respectively.

FULL_DF_MALD.pkl.zip contains the embeddings for words from MALD for FaST-VGS+ [3], wav2vec2 [4], GloVe [5], BERT [6], and VG-BERT [7].

References

[1] Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadić, F., & Sims, M. (2019). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 51, 1187-1204.
[2] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206-5210.
[3] Peng, P., & Harwath, D. (2022). Self-supervised representation learning for speech using visual grounding and masked language modeling. Proceedings of the AAAI Symposium on AI for Speech and Audio Processing.
[4] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.
[5] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
[6] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186.
[7] Zhang, Y., Choi, M., Han, K., & Liu, Z. (2021). Explainable semantic space by grounding language to vision with cross-modal contrastive learning. Advances in Neural Information Processing Systems, 34, 18513-18526.
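
The embedding archives described above are zipped pickle files, so one possible way to open them from Python is sketched below. This is a minimal sketch, not the authors' loading code: the helper name `load_pickle_from_zip` is hypothetical, and it assumes each archive holds a single `.pkl` member. The type and layout of the stored object are not documented here; the name FULL_DF_MALD hints at a pandas DataFrame, in which case pandas would need to be installed for unpickling to succeed.

```python
import pickle
import zipfile


def load_pickle_from_zip(zip_path):
    """Load the first .pkl member of a zip archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds one pickle file. Skip the metadata
        # entries that macOS sometimes adds under "__MACOSX/".
        members = [
            name
            for name in archive.namelist()
            if name.endswith(".pkl") and not name.startswith("__MACOSX/")
        ]
        if not members:
            raise FileNotFoundError(f"no .pkl file found in {zip_path}")
        # Read the pickle straight from the archive without extracting it.
        with archive.open(members[0]) as handle:
            return pickle.load(handle)
```

Calling `load_pickle_from_zip("FULL_DF_MALD.pkl.zip")` on a downloaded archive and inspecting `type(...)` of the result is a reasonable first step before relying on any particular structure. Pickle files should only be loaded from sources you trust.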
