DataBank, Bodleian Libraries, University of Oxford
Doctoral thesis, 2022
License: All Rights Reserved (rioxx)
Data source: Datacite

Self-supervised video representation learning

Author: Han, T.


Abstract

Videos are an appealing source of data for training computer vision models. An almost infinite supply of videos is available online, but exhaustive manual annotation is infeasible. The goal of this thesis is to learn strong video representations efficiently via self-supervised learning: a method that learns from the data itself rather than from human annotations. The thesis is structured around three themes: (1) self-supervised learning for short-term videos, (2) efficient video representation learning, and (3) self-supervised learning for long-term videos. For short-term videos lasting only a few seconds, we show that predicting a video's future content is a strong learning signal at large scale. We further show that strong video representations can be learned by taking two complementary modalities, namely RGB and optical flow, and using them to teach each other. For efficient video representation learning, we show that large-scale pre-trained vision-language models can be effectively adapted via a prompt-tuning technique. We also show that dropping image patches can accelerate the fine-tuning of classification models and the pre-training of video-language models. For long-term videos lasting more than a few minutes, we show that temporal alignment networks can be trained from the weak visual-textual correspondence within instructional videos; the resulting networks can automatically clean up natural videos for effective vision-language training. In addition, we show that movie description models can be trained by leveraging pre-trained vision-language models.

Country: United Kingdom
Keywords: Machine learning; Computer vision

  • BIP! impact indicators:
    Selected citations (derived from selected sources; an alternative to the "Influence" indicator): 0
    Popularity (the "current" impact/attention, the "hype", of an article in the research community at large, based on the underlying citation network): Average
    Influence (the overall/total impact of an article in the research community at large, based on the underlying citation network, diachronically): Average
    Impulse (the initial momentum of an article directly after its publication, based on the underlying citation network): Average
Open Access route: Green