Robustness of Continuous vs. Discrete Action Representations in Multimodal Video-Language Models under Synthetic Visual Occlusion

SOVEREIGN Research Kernel

Data sources: ZENODO

Publication type: Report. Status: Under curation. Language: English. Publisher: Zenodo

Authors: SOVEREIGN Research Kernel

doi: 10.5281/zenodo.20651581

Abstract

Self-supervised learning has gained popularity because it avoids the cost of annotating large-scale datasets. It adopts self-defined pseudolabels as supervision and uses the learned representations for several downstream tasks. Specifically, contrastive learning has recently become a dominant component of self-supervised learning for computer vision, natural language processing (NLP), and other domains. It aims to embed augmented versions of the same sample close to each other while pushing apart the embeddings of different samples. This paper provides an e...

Research goal: How does the robustness of continuous latent action representations compare to discrete tokenization in multimodal video-language models under varying levels of synthetic visual occlusion?

Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.0/10.
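The contrastive objective described in the abstract (pull two augmented views of the same sample together, push views of different samples apart) can be illustrated with a minimal sketch. The abstract does not name a specific loss, so the InfoNCE-style formulation below, the function names `cosine` and `info_nce`, the temperature of 0.1, and the toy embeddings are all assumptions chosen for illustration, not the report's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean one-directional InfoNCE loss (assumed formulation).

    view_a[i] and view_b[i] are embeddings of two augmentations of sample i.
    For each anchor view_a[i], view_b[i] is the positive and every other
    view_b[j] is treated as a negative.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings (hypothetical): matching views point in similar directions.
aligned_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]
# Swapping the second view breaks the positive pairing.
shuffled_b = [aligned_b[1], aligned_b[0]]

loss_aligned = info_nce(aligned_a, aligned_b)
loss_shuffled = info_nce(aligned_a, shuffled_b)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is small when each anchor is closest to its own augmented view and grows when the pairing is broken, which is the behavior the abstract describes. A symmetric variant would average the loss in both directions (a to b and b to a); that choice is left out here for brevity.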
