MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection - 3IA Côte d’Azur – Interdisciplinary Institute for Artificial Intelligence
Conference paper. Year: 2022

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

Abstract

Action detection is a significant and challenging task, especially in densely-labelled datasets of untrimmed videos. Such data consist of complex temporal relations including composite or co-occurring actions. To detect actions in these complex settings, it is critical to capture both short-term and long-term temporal information efficiently. To this end, we propose a novel 'ConvTransformer' network for action detection: MS-TCT. This network comprises three main components: (1) a Temporal Encoder module which explores global and local temporal relations at multiple temporal resolutions, (2) a Temporal Scale Mixer module which effectively fuses multi-scale features, creating a unified feature representation, and (3) a Classification module which learns a center-relative position of each action instance in time, and predicts frame-level classification scores. Our experimental results on multiple challenging datasets, such as Charades, TSU and MultiTHUMOS, validate the effectiveness of the proposed method, which outperforms the state-of-the-art methods on all three datasets.
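To make the idea of a "center-relative position" concrete, the sketch below builds a per-frame target that peaks at the temporal midpoint of each action instance and decays toward its boundaries. The abstract does not specify the encoding, so the Gaussian form, the function name `center_relative_heatmap`, and the `sigma_scale` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import math


def center_relative_heatmap(num_frames, instances, sigma_scale=0.5):
    """Return one score per frame in [0, 1], peaking at each instance's center.

    instances: list of (start, end) frame indices for a single action class.
    sigma_scale: hypothetical knob; the Gaussian width is sigma_scale * duration.
    """
    heatmap = [0.0] * num_frames
    for start, end in instances:
        center = (start + end) / 2.0
        duration = max(end - start, 1)  # avoid a zero-width Gaussian
        sigma = sigma_scale * duration
        for t in range(num_frames):
            value = math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
            # Overlapping instances of the same class keep the strongest response.
            heatmap[t] = max(heatmap[t], value)
    return heatmap


# One instance spanning frames 2..6 in a 10-frame clip; its center is frame 4.
heatmap = center_relative_heatmap(num_frames=10, instances=[(2, 6)])
```

Under these assumptions, frames equidistant from the center receive equal scores, so a model trained against such a target would get a signal about where each frame sits relative to the middle of an action, alongside the frame-level class scores.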
Main file
CVPR2022_MSTCT.pdf (2.83 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03682969, version 1 (31-05-2022)

Identifiers

  • HAL Id: hal-03682969, version 1

Cite

Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael Ryoo, Francois F Bremond. MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection. CVPR - Conference on Computer Vision and Pattern Recognition, Jun 2022, New Orleans, United States. ⟨hal-03682969⟩
