Structured modeling and recognition of human actions in video

Guilhem Chéron 1, 2
1 WILLOW - Models of visual object recognition and scene understanding
DI-ENS - Département d'informatique de l'École normale supérieure, Inria de Paris
2 Thoth - Apprentissage de modèles à partir de données massives
LJK - Laboratoire Jean Kuntzmann, Inria Grenoble - Rhône-Alpes
Abstract: Automatic video understanding is expected to impact our lives through many applications such as autonomous driving, domestic robots, content search and filtering, gaming, defense and security. The volume of video content grows faster each year, for example on platforms such as YouTube, Twitter and Facebook. Automatic analysis of this data is required to enable future applications. Video analysis, especially in uncontrolled environments, presents several difficulties such as intra-class variability (samples of the same concept look very different) and inter-class confusion (examples from two different activities look similar). While these problems can be addressed with supervised learning algorithms, fully-supervised methods are often associated with a high annotation cost. Depending on both the task and the level of supervision required, the annotation cost can be prohibitive. For example, in action localization, a fully-supervised approach demands that person bounding boxes be annotated in every frame where an activity is performed. The cost of obtaining such annotation prevents scaling and limits the number of training samples. Another issue is the difficulty of reaching a consensus between annotators, which leads to labeling ambiguities (where does the action start? where does it end? what should be included in the bounding box? etc.). This thesis addresses the above problems in the context of two tasks, namely human action classification and localization. The former aims at recognizing the type of activity performed in a short video clip trimmed to the temporal extent of the action. The latter additionally extracts the space-time locations of potentially multiple activities in much longer videos. Our approach to action classification leverages information from human pose and integrates it with appearance and motion descriptors for improved performance.
Our approach to action localization models the temporal evolution of actions in the video with a recurrent network trained at the level of person tracks. Finally, the third method in this thesis aims to avoid the prohibitive cost of video annotation and adopts discriminative clustering to analyze and combine different levels of supervision.

Cited literature: 9 references
Contributor: Guilhem Chéron
Submitted on: Wednesday, January 9, 2019 - 12:02:06 PM
Last modification on: Tuesday, January 29, 2019 - 3:05:42 PM




  • HAL Id: tel-01975247, version 1



Guilhem Chéron. Structured modeling and recognition of human actions in video. Computer Vision and Pattern Recognition [cs.CV]. École normale supérieure - ENS PARIS, 2018. English. ⟨tel-01975247⟩


