Multimodal Variational Autoencoder: A Barycentric View

Name: Multimodal Variational Autoencoder: A Barycentric View
Keywords: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Information Theory, Computer Vision and Pattern Recognition (cs.CV), Information Theory (cs.IT), Computer Science - Computer Vision and Pattern Recognition, Article, Machine Learning (cs.LG)

Qiu, Peijie; Zhu, Wenhui; Kumar, Sayantan; Chen, Xiwen; Sun, Xiaotong; Yang, Jin; Razi, Abolfazl; Wang, Yalin; Sotiras, Aristeidis


PubMed Central

Other literature type . 2025

Data sources: PubMed Central

arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

Proceedings of the AAAI Conference on Artificial Intelligence

Article . 2025 . Peer-reviewed

Data sources: Crossref

https://dx.doi.org/10.48550/ar...

Article . 2024

License: arXiv Non-Exclusive Distribution

Data sources: Datacite


Publication type: Article, Other literature type, Preprint
Date: 11 Apr 2025
Embargo end date: 01 Jan 2024
Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
Journal: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20,060-20,068 (issn: 2159-5399, eissn: 2374-3468)
Funded by: NIH | HIGH PERFORMANCE BIOMEDIC..., NIH | Advanced machine learning..., NIH | Acquisition of a next-gen... +1 projects


doi: 10.1609/aaai.v39i19.34209 , 10.48550/arxiv.2412.20487

arXiv: 2412.20487


Abstract

Multiple signal modalities, such as vision and sound, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoders (VAEs), for multimodal representation learning, especially in the case of missing modalities. The primary goal of these models is to learn modality-invariant and modality-specific representations that characterize information across multiple modalities. Previous multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative, generic, and theoretical formulation of the multimodal VAE through the lens of barycenters. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which, compared to the KL divergence, better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations. Empirical studies on three multimodal benchmarks demonstrate the effectiveness of the proposed method.
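To make the three aggregation rules named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each unimodal inference distribution is a diagonal Gaussian given as a list of means and a list of variances, and that the weights are non-negative and sum to 1. The function names (`poe_barycenter`, `moe_sample`, `wasserstein_barycenter`) are hypothetical. Under these assumptions, the weighted product of Gaussians has a closed form (precision-weighted fusion), the mixture is sampled by first picking a component, and the 2-Wasserstein barycenter of diagonal Gaussians averages the means and the standard deviations.

```python
import math
import random

# Illustrative sketch only: names and the diagonal-Gaussian setup are assumptions.
# mus[i] and variances[i] hold the per-dimension mean and variance of modality i.
# weights are assumed non-negative and summing to 1.


def poe_barycenter(mus, variances, weights):
    """Weighted product of diagonal Gaussians (normalized geometric mean)."""
    dim = len(mus[0])
    out_mu, out_var = [], []
    for d in range(dim):
        # Precisions add, each scaled by its modality weight.
        precision = sum(w / v[d] for w, v in zip(weights, variances))
        weighted_mean = sum(
            w * m[d] / v[d] for w, m, v in zip(weights, mus, variances)
        )
        out_var.append(1.0 / precision)
        out_mu.append(weighted_mean / precision)
    return out_mu, out_var


def moe_sample(mus, variances, weights, rng):
    """Draw one sample from the mixture sum_i w_i N(mu_i, var_i)."""
    # Pick a component by weight, then sample from that Gaussian.
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mus[i], variances[i])]


def wasserstein_barycenter(mus, variances, weights):
    """2-Wasserstein barycenter of diagonal Gaussians.

    For diagonal (commuting) covariances, the barycenter is Gaussian with
    the weighted average of the means and of the standard deviations.
    """
    dim = len(mus[0])
    out_mu = [sum(w * m[d] for w, m in zip(weights, mus)) for d in range(dim)]
    out_std = [
        sum(w * math.sqrt(v[d]) for w, v in zip(weights, variances))
        for d in range(dim)
    ]
    return out_mu, [s * s for s in out_std]


if __name__ == "__main__":
    # Two hypothetical one-dimensional unimodal posteriors.
    mus = [[0.0], [2.0]]
    variances = [[1.0], [4.0]]
    weights = [0.5, 0.5]
    print("PoE:", poe_barycenter(mus, variances, weights))
    print("MoE sample:", moe_sample(mus, variances, weights, random.Random(0)))
    print("Wasserstein:", wasserstein_barycenter(mus, variances, weights))
```

In this toy setting the product-style fusion is pulled toward the lower-variance modality, while the Wasserstein barycenter interpolates both location and spread. Full-covariance Gaussians would need a fixed-point iteration for the barycenter covariance, which this sketch does not cover.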

Related Organizations

Clemson University
Washington State University
United States
University of Arkansas at Fayetteville
United States
Clemson University
Washington University in St. Louis
United States



Impact by BIP!

selected citations: 2. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Green

Funded by

NIH| HIGH PERFORMANCE BIOMEDICAL IMAGING COMPUTER RESOURCES, NIH| Advanced machine learning algorithms that integrate multi-modal neuroimaging to quantify the heterogeneity in Alzheimer's Disease, NIH| Acquisition of a next-generation computing cluster, NIH| GPU COMPUTING RESOURCE TO ENABLE INNOVATION IN IMAGING AND NETWORK BIOLOGY