
handle: 10214/28683
This thesis introduces two approaches to Temporal Action Localization (TAL) in video understanding. The first, the Long-Short-range Adapter (LoSA), is a memory-efficient backbone adapter for untrimmed videos: it adapts the backbone's intermediate layers over different temporal ranges to improve the video features, which enables end-to-end adaptation of billion-parameter models such as VideoMAEv2. The second, the OVFormer framework, addresses Open-Vocabulary TAL: it generates rich class descriptions with a language model, aligns these descriptions with video features through cross-attention, and uses a two-stage training strategy to generalize to novel categories. LoSA makes state-of-the-art video models efficient to use for TAL, while OVFormer extends the set of recognizable actions beyond predefined categories. Together, these contributions make TAL both more capable and more flexible, and pave the way for more versatile video understanding systems.
action recognition, Temporal Action Localization, video understanding
