
This article focuses on cross-modal video retrieval, a technology with wide-ranging applications across media networks, security organizations, and even individuals managing large personal video collections. The authors discuss the concept of cross-modal video learning and survey deep neural network architectures in the literature, focusing on methods that combine visual and textual representations for cross-modal video retrieval. They also examine the impact of vision transformers, a learning paradigm that has significantly improved cross-modal learning performance. In addition, they present a novel cross-modal network architecture for free-text video retrieval called T×V+Objects. This method extends an existing state-of-the-art network by incorporating object-based video encoding using transformers. It leverages multiple latent spaces and combines detected objects with textual features, creating a joint embedding space for improved text-video similarity, as sketched in the example below.
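
The following is a minimal, illustrative sketch (not the authors' T×V+Objects implementation) of the general idea of a joint text-video embedding with object-based video encoding: pre-extracted detected-object features are contextualized by a transformer, both modalities are projected into a shared latent space, and retrieval is driven by cosine similarity. All module names, dimensions, and hyperparameters below are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointTextVideoEmbedding(nn.Module):
    """Sketch of a single joint embedding space; the actual method uses multiple latent spaces."""
    def __init__(self, text_dim=768, obj_dim=512, joint_dim=256, n_heads=4):
        super().__init__()
        # Transformer encoder over per-video detected-object features (assumed pre-extracted)
        enc_layer = nn.TransformerEncoderLayer(d_model=obj_dim, nhead=n_heads, batch_first=True)
        self.obj_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Linear projections of each modality into the shared (joint) latent space
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.video_proj = nn.Linear(obj_dim, joint_dim)

    def forward(self, text_feat, obj_feats):
        # text_feat: (batch, text_dim)            sentence-level textual features
        # obj_feats: (batch, n_objects, obj_dim)  detected-object features per video
        video_feat = self.obj_encoder(obj_feats).mean(dim=1)   # pool contextualized objects
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        return t @ v.t()   # cosine-similarity matrix used to rank videos per text query

# Usage: rank 4 candidate videos for 4 free-text queries (random features as placeholders)
model = JointTextVideoEmbedding()
sims = model(torch.randn(4, 768), torch.randn(4, 16, 512))
print(sims.shape)  # (4, 4) query-video similarity scores
```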
