Conference papers

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

Abstract : Action detection is a significant and challenging task, especially in densely-labelled datasets of untrimmed videos. Such data contain complex temporal relations, including composite or co-occurring actions. To detect actions in these complex settings, it is critical to capture both short-term and long-term temporal information efficiently. To this end, we propose a novel 'ConvTransformer' network for action detection: MS-TCT. This network comprises three main components: (1) a Temporal Encoder module, which explores global and local temporal relations at multiple temporal resolutions; (2) a Temporal Scale Mixer module, which effectively fuses multi-scale features into a unified feature representation; and (3) a Classification module, which learns a center-relative position of each action instance in time and predicts frame-level classification scores. Our experimental results on multiple challenging datasets, such as Charades, TSU and MultiTHUMOS, validate the effectiveness of the proposed method, which outperforms the state-of-the-art methods on all three datasets.
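
The three components named in the abstract can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: the pooling-based pyramid, the nearest-neighbour upsampling, the averaging fusion, the Gaussian form of the center-relative target, and every function name below are assumptions chosen only to make the data flow concrete. The actual Temporal Encoder combines convolution and Transformer layers, which are omitted here.

```python
import math

def downsample(frames, stride=2):
    """Average-pool per-frame feature vectors along time (hypothetical stand-in
    for moving to a coarser temporal resolution)."""
    out = []
    for start in range(0, len(frames) - stride + 1, stride):
        window = frames[start:start + stride]
        out.append([sum(col) / stride for col in zip(*window)])
    return out

def upsample(frames, length):
    """Nearest-neighbour upsampling back to `length` frames."""
    n = len(frames)
    return [frames[min(i * n // length, n - 1)] for i in range(length)]

def mix_scales(frames, num_scales=3):
    """Build a temporal pyramid and fuse it by averaging at the finest resolution.
    Assumes len(frames) >= 2 ** (num_scales - 1) so no level is empty."""
    length = len(frames)
    pyramid = [frames]
    for _ in range(num_scales - 1):
        pyramid.append(downsample(pyramid[-1]))
    restored = [upsample(level, length) for level in pyramid]
    return [
        [sum(vals) / num_scales for vals in zip(*(level[t] for level in restored))]
        for t in range(length)
    ]

def frame_scores(fused, weights, bias):
    """Frame-level multi-label scores: one independent sigmoid per action class,
    so co-occurring actions can be active in the same frame."""
    return [
        [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, feat)) + b)))
         for row, b in zip(weights, bias)]
        for feat in fused
    ]

def center_heatmap(length, start, end):
    """Assumed center-relative target: 1.0 at the midpoint of an action instance,
    decaying towards its boundaries (Gaussian shape is an illustrative choice)."""
    center = (start + end) / 2.0
    sigma = max((end - start) / 2.0, 1e-6)
    return [math.exp(-((t - center) ** 2) / (2 * sigma ** 2)) for t in range(length)]
```

For example, with eight frames of two-dimensional features, `mix_scales` returns eight fused vectors, `frame_scores` maps each to one probability per class, and `center_heatmap(8, 2, 6)` peaks at frame 4, the midpoint of the instance.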
Contributor : Rui DAI
Submitted on : Tuesday, May 31, 2022 - 2:14:29 PM
Last modification on : Friday, November 18, 2022 - 9:28:17 AM


  • HAL Id : hal-03682969, version 1


Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael Ryoo, Francois F Bremond. MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection. CVPR - Conference on Computer Vision and Pattern Recognition, Jun 2022, New Orleans, United States. ⟨hal-03682969⟩


