for all types of supervision on UCF101-24. These results are presented in Table 6.2 (our w. tracks of [...], 2016).
we observe only a small difference in performance between the "Temporal" and the "Temporal + 1 BB" setups. This suggests that spatial supervision may not always be necessary for action localization. Returning to Table 6.1, note that, although it is not the main focus of our work, our performance in the fully supervised setting is on par with recent work ([...], 2016).
This consolidates the conclusions drawn above about the alternative way of annotating videos ('temporal click') and about which information is important (is spatial supervision always necessary?). However, we are still below the current state of the art
Unsupervised learning from narrated instruction videos, CVPR. 43, vol.44, p.139, 2016. ,
DOI : 10.1109/cvpr.2016.495
URL : https://hal.archives-ouvertes.fr/hal-01171193
Joint Discovery of Object States and Manipulation Actions, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01676084
2D human pose estimation: New benchmark and state of the art analysis, 2014. ,
Multiscale combinatorial grouping, CVPR, vol.38, p.111, 2014. ,
Sequential deep learning for human action recognition, International Workshop on Human Behavior Understanding, p.34, 2011. ,
DOI : 10.1007/978-3-642-25446-8_4
URL : https://hal.archives-ouvertes.fr/hal-01354493
DIFFRAC: A discriminative and flexible framework for clustering, NIPS. 42, vol.47, p.117, 2007. ,
Social scene understanding: End-to-end multi-person action localization and collective activity recognition, CVPR, vol.38, p.76, 2017. ,
Neural machine translation by jointly learning to align and translate, 2014. ,
Speeded-up robust features (SURF), Comput. Vis. Image Underst, p.61, 2008. ,
SURF: Speeded up robust features, ECCV, p.27, 2006. ,
POOF: Part-based one-vs-one features for fine-grained categorization, face verification, and attribute estimation, 2013. ,
Fully-convolutional siamese networks for object tracking, BMVC, p.86, 2016. ,
Finding actors and actions in movies, ICCV. 42, vol.48, p.139, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00904991
Weakly supervised action labeling in videos under ordering constraints, ECCV. 43, vol.49, p.141, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01053967
Weakly-supervised alignment of video with text, ICCV, p.43, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01154523
Online learning and stochastic approximations. On-line learning in neural networks, p.33, 1998. ,
High accuracy optical flow estimation based on a theory for warping, ECCV, vol.29, p.84, 2004. ,
SST: Single-stream temporal action proposals, 2017. ,
Fast temporal activity proposals for efficient detection of human actions in untrimmed videos, 2016. ,
Deep clustering for unsupervised learning of visual features, 2018. ,
Quo vadis, action recognition? A new model and the Kinetics dataset, CVPR, vol.34, p.118, 2017. ,
The devil is in the details: an evaluation of recent feature encoding methods, BMVC, vol.59, p.61, 2011. ,
Return of the devil in the details: Delving deep into convolutional nets, 2014. ,
Action Detection by implicit intentional Motion Clustering, ICCV, vol.44, p.112, 2015. ,
Articulated pose estimation by a graphical model with image dependent pairwise relations, NIPS. 33, vol.54, p.137, 2014. ,
Mixing body-part sequences for human pose estimation, CVPR, vol.35, p.70, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00978643
A flexible model for training action localization with varying levels of supervision, NIPS. 20, vol.24, p.140, 2018. ,
P-CNN: Pose-based CNN features for action recognition, ICCV. 20, vol.21, p.138, 2015. ,
Modeling spatio-temporal human track structure for action localization, vol.20, p.23, 2018. ,
Detecting parts for action localization, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01573629
Learning phrase representations using RNN encoder-decoder for statistical machine translation, vol.75, p.80, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01433235
Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01110036
Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014. ,
Support-vector networks. Machine learning, p.30, 1995. ,
Visual categorization with bags of keypoints, 2004. ,
Compact representation of bidirectional texture functions, 2001. ,
Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol.73, p.79, 2012. ,
Histograms of oriented gradients for human detection, CVPR, vol.27, p.61, 2005. ,
URL : https://hal.archives-ouvertes.fr/inria-00548512
Human detection using oriented histograms of flow and appearance, ECCV, vol.29, p.61, 2006. ,
URL : https://hal.archives-ouvertes.fr/inria-00548587
Sympathy for the details: dense trajectories and hybrid classification architectures for action recognition, vol.33, p.76, 2016. ,
ImageNet: A large-scale hierarchical image database, vol.57, p.137, 2009. ,
Long-term recurrent convolutional networks for visual recognition and description, vol.71, p.74, 2015. ,
The Yael library, ACM Multimedia, p.61, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01020695
Discovering localized attributes for fine-grained recognition, 2012. ,
Automatic annotation of human actions in video, vol.112, p.139, 2009. ,
DAPs: Deep action proposals for action understanding, 2016. ,
Two-frame motion estimation based on polynomial expansion, SCIA, p.61, 2003. ,
A Bayesian hierarchical model for learning natural scene categories, 2005. ,
Spatiotemporal residual networks for video action recognition, NIPS, vol.33, p.76, 2016. ,
Convolutional two-stream network fusion for video action recognition, CVPR, vol.33, p.76, 2016. ,
Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM, vol.30, p.61, 1981. ,
An algorithm for quadratic programming, Naval Research Logistics Quarterly, vol.50, p.118, 1956. ,
Actom sequence models for efficient action detection, 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00575217
Fast R-CNN, ICCV, vol.75, p.85, 2015. ,
Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, vol.33, p.34, 2014. ,
Finding action tubes, CVPR. 38, vol.57, p.111, 2015. ,
THUMOS challenge: Action recognition with a large number of classes, p.76, 2015.
Speech recognition with deep recurrent neural networks, 2013. ,
AVA: A video dataset of spatiotemporally localized atomic visual actions, CVPR. 39, 98, 99, vol.100, p.127, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01764300
A combined corner and edge detector, Alvey vision conference, p.27, 1988. ,
Deep residual learning for image recognition, CVPR, vol.97, p.118, 2016. ,
ActivityNet: A large-scale video benchmark for human activity understanding, vol.17, p.40, 2015. ,
Joint segmentation and classification of human actions in video, 2011. ,
Long short-term memory, 1997. ,
Determining optical flow, Artificial Intelligence, vol.29, 1981. ,
DOI : 10.1016/0004-3702(81)90024-2
Tube convolutional neural network (T-CNN) for action detection in videos, ICCV, vol.39, p.111, 2017. ,
Connectionist Temporal Modeling for Weakly Supervised Action Labeling, ECCV. 43, vol.112, p.141, 2016. ,
DOI : 10.1007/978-3-319-46493-0_9
URL : http://arxiv.org/pdf/1607.08584
The THUMOS challenge on action recognition for videos "in the wild", Computer Vision and Image Understanding, p.40, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01431525
Action localization with tubelets from motion, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00996844
Aggregating local image descriptors into compact codes, 2012. ,
DOI : 10.1109/tpami.2011.235
URL : https://hal.archives-ouvertes.fr/inria-00633013
Towards understanding action recognition, ICCV. 35, vol.36, p.137, 2013. ,
DOI : 10.1109/iccv.2013.396
URL : https://hal.archives-ouvertes.fr/hal-00906902
3D convolutional neural networks for human action recognition, p.33, 2010. ,
Discriminative clustering for image cosegmentation, CVPR, vol.48, p.113, 2010. ,
DOI : 10.1109/cvpr.2010.5539868
URL : http://www.di.ens.fr/%7Efbach/cosegmentation_cvpr2010.pdf
Action tubelet detector for spatio-temporal action localization, ICCV. 39, 40, 75, vol.126, p.136, 2017. ,
DOI : 10.1109/iccv.2017.472
URL : https://hal.archives-ouvertes.fr/hal-01519812
Joint learning of object and action detectors, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01575804
ContextLocNet: Context-aware deep network models for weakly supervised localization, 2016. ,
DOI : 10.1007/978-3-319-46454-1_22
URL : https://hal.archives-ouvertes.fr/hal-01421772
Deep visual-semantic alignments for generating image descriptions, CVPR, vol.74, p.79, 2015. ,
Large-scale video classification with convolutional neural networks, 2014. ,
The kinetics human action video dataset, vol.16, p.137, 2017. ,
Efficient visual event detection using volumetric features, ICCV, vol.38, p.111, 2005. ,
Adam: A method for stochastic optimization, 2014. ,
A spatio-temporal descriptor based on 3D-gradients, BMVC, p.29, 2008. ,
URL : https://hal.archives-ouvertes.fr/inria-00514853
ImageNet classification with deep convolutional neural networks, NIPS, vol.32, p.57, 2012. ,
HMDB: a large video database for human motion recognition, ICCV. 13, vol.17, p.54, 2011. ,
Block-coordinate Frank-Wolfe optimization for structural SVMs, ICML. 51, vol.113, p.118, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00720158
On space-time interest points, IJCV, vol.27, p.28, 2005. ,
Modeling and visual recognition of human actions and interactions. Habilitation à diriger des recherches, 2013. ,
URL : https://hal.archives-ouvertes.fr/tel-01064540
Local velocity-adapted motion events for spatio-temporal recognition. Computer vision and image understanding, vol.27, p.28, 2007. ,
Local descriptors for spatio-temporal recognition, ECCV workshop, p.27, 2004. ,
Learning realistic human actions from movies, CVPR, vol.55, p.61, 2008. ,
URL : https://hal.archives-ouvertes.fr/inria-00548659
Retrieving actions in movies, ICCV. 29, vol.38, p.111, 2007. ,
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, CVPR, vol.29, p.30, 2006. ,
URL : https://hal.archives-ouvertes.fr/inria-00548585
Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.33, p.54, 1998. ,
Efficient backprop, Neural networks: Tricks of the trade, vol.33, 1998. ,
Microsoft COCO: Common objects in context, 2014. ,
Spatio-temporal LSTM with trust gates for 3D human action recognition, 2016. ,
SSD: Single shot multibox detector, vol.76, p.97, 2016. ,
Object recognition from local scale-invariant features, ICCV, p.27, 1999. ,
Visual relationship detection with language priors, 2016. ,
An iterative image registration technique with an application to stereo vision, IJCAI, vol.29, p.118, 1981. ,
Learning activity progression in LSTMs for activity detection and early detection, CVPR, vol.38, p.76, 2016. ,
Learning object representations for visual object class recognition, 2007. ,
Face detection without bells and whistles, 2014. ,
Localizing Actions from Video labels and Pseudo-Annotations, BMVC, vol.44, p.126, 2017. ,
Spot on: Action localization from pointly-supervised proposals, ECCV. 45, 99, vol.110, p.125, 2016. ,
Learning from video and text via large-scale discriminative clustering, vol.113, p.118, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01569540
Learning a text-video embedding from incomplete and heterogeneous data, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01975102
A performance evaluation of local descriptors, p.27, 2005. ,
URL : https://hal.archives-ouvertes.fr/inria-00548227
Stacked hourglass networks for human pose estimation, ECCV. 33, vol.54, p.137, 2016. ,
Beyond short snippets: Deep networks for video classification, CVPR, vol.34, p.91, 2015. ,
Multiple granularity analysis for fine-grained action detection, 2014. ,
Sampling strategies for bag-of-features image classification, 2006. ,
URL : https://hal.archives-ouvertes.fr/hal-00203752
Spatio-temporal object detection proposals, ECCV, vol.38, p.111, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-01021902
Action and event recognition with Fisher vectors on a compact feature set, ICCV, vol.30, p.66, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00873662
Learning and transferring mid-level image representations using convolutional neural networks, 2014. ,
URL : https://hal.archives-ouvertes.fr/hal-00911179
Minding the gaps for block Frank-Wolfe optimization of structured SVMs, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01323727
Multi-region two-stream R-CNN for action detection, HAL. 38, 75, vol.76, p.111, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01349107
Action recognition with stacked Fisher vectors, 2014. ,
Large-scale image retrieval with compressed Fisher vectors, 2010. ,
Improving the Fisher kernel for large-scale image classification, ECCV, vol.30, p.61, 2010. ,
URL : https://hal.archives-ouvertes.fr/inria-00548630
Weakly-supervised learning of visual relations, vol.17, p.138, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-01576035
Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video, IJCV, vol.34, p.76, 2017. ,
Parsing videos of actions with segmental grammars, 2014. ,
Poselet conditioned pictorial structures, CVPR, vol.35, p.55, 2013. ,
Explicit modeling of human-object interactions in realistic videos, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00720847
Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, vol.85, p.118, 2015. ,
Weakly supervised action learning with RNN based fine-to-coarse modeling, CVPR. 43, vol.112, p.141, 2017. ,
Action MACH: a spatio-temporal maximum average correlation height filter for action recognition, 2008. ,
A database for fine grained activity detection of cooking activities, CVPR. 13, vol.36, p.68, 2012. ,
Recognizing fine-grained and composite activities using hand-centric features and script data, 2016. ,
Learning internal representations by error propagation, Parallel distributed processing, p.33, 1985. ,
ImageNet large scale visual recognition challenge, IJCV, vol.32, p.118, 2015. ,
AMTnet: Action-micro-tube regression by end-to-end trainable deep architecture, ICCV, vol.39, p.111, 2017. ,
Deep learning for detecting multiple space-time action tubes in videos, BMVC. 38, 39, 41, 75, vol.76, p.136, 2016. ,
MODEC: Multimodal decomposable models for human pose estimation, CVPR, vol.35, p.59, 2013. ,
Parsing human motion with stretchable models, CVPR, vol.35, p.55, 2011. ,
Local grayvalue invariants for image retrieval, TPAMI, p.27, 1997. ,
URL : https://hal.archives-ouvertes.fr/inria-00548358
Recognizing human actions: A local SVM approach, ICPR, vol.27, p.55, 2004. ,
Temporal action localization in untrimmed videos via multi-stage CNNs, CVPR, vol.38, p.139, 2016. ,
Asynchronous temporal fields for action recognition, 2017. ,
Fisher vector faces in the wild, BMVC, vol.29, p.31, 2013. ,
Two-stream convolutional networks for action recognition in videos, NIPS. 33, vol.34, p.73, 2014. ,
A multi-stream bi-directional recurrent neural network for fine-grained action detection, CVPR, vol.34, p.76, 2016. ,
Online real time multiple spatiotemporal action localisation and prediction, vol.111, p.118, 2017. ,
DOI : 10.1109/iccv.2017.393
URL : http://arxiv.org/pdf/1611.08563
Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization, ICCV, vol.43, p.112, 2017. ,
Weakly Supervised Action Detection, BMVC, vol.44, p.112, 2011. ,
DOI : 10.5244/c.25.65
URL : http://www.bmva.org/bmvc/2011/proceedings/paper65/paper65.pdf
Video google: A text retrieval approach to object matching in videos, ICCV, p.29, 2003. ,
Unsupervised action discovery and localization in videos, ICCV, vol.44, p.112, 2017. ,
UCF101: A dataset of 101 human actions classes from videos in the wild, vol.17, p.137, 2012. ,
Unsupervised learning of video representations using LSTMs, 2016. ,
Generating text with recurrent neural networks, 2011. ,
DeepFace: Closing the gap to human-level performance in face verification, 2014. ,
Motion words for videos, 2014. ,
Convolutional learning of spatio-temporal features, p.33, 2010. ,
Learning video object segmentation with visual memory, ICCV, vol.80, p.89, 2017. ,
DOI : 10.1109/iccv.2017.480
URL : https://hal.archives-ouvertes.fr/hal-01511145
Joint training of a convolutional network and a graphical model for human pose estimation, NIPS, vol.54, p.55, 2014. ,
DeepPose: Human pose estimation via deep neural networks, CVPR, vol.33, p.137, 2014. ,
DOI : 10.1109/cvpr.2014.214
URL : http://arxiv.org/pdf/1312.4659
Learning spatiotemporal features with 3D convolutional networks, p.33, 2015. ,
DOI : 10.1109/iccv.2015.510
URL : http://arxiv.org/pdf/1412.0767
Selective search for object recognition, vol.38, p.111, 2013. ,
DOI : 10.1007/s11263-013-0620-5
URL : https://pure.uva.nl/ws/files/19494140/UijlingsIJCV2013.pdf
APT: Action localization proposals from dense trajectories, BMVC, vol.38, p.111, 2015. ,
Long-term temporal convolutions for action recognition, TPAMI, vol.33, p.76, 2017. ,
DOI : 10.1109/tpami.2017.2712608
URL : https://hal.archives-ouvertes.fr/hal-01241518
Show and tell: A neural image caption generator, 2015. ,
DOI : 10.1109/cvpr.2015.7298935
URL : http://arxiv.org/pdf/1411.4555
Action recognition by dense trajectories, CVPR, vol.30, p.135, 2011. ,
URL : https://hal.archives-ouvertes.fr/inria-00583818
Dense trajectories and motion boundary descriptors for action recognition, IJCV, vol.29, p.61, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00725627
Action recognition with improved trajectories, ICCV, vol.27, p.66, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00873267
UntrimmedNets for Weakly Supervised Action Recognition and Detection, CVPR, vol.43, p.112, 2017. ,
Convolutional pose machines, CVPR, vol.33, p.137, 2016. ,
Learning to track for spatiotemporal action localization, ICCV, vol.38, p.111, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01159941
Human action localization with sparse spatial supervision, vol.127, p.136, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01317558
DeepFlow: Large displacement optical flow with deep matching, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00873592
Can humans fly? action understanding with multiple classes of actors, 2015. ,
Maximum margin clustering, NIPS. 43, vol.47, p.114, 2004. ,
Common action discovery and localization in unconstrained videos, ICCV, vol.44, p.112, 2017. ,
Articulated pose estimation with flexible mixtures-of-parts, CVPR, vol.35, p.65, 2011. ,
End-to-end learning of action detection from frame glimpses in videos, vol.74, p.76, 2016. ,
Temporal action localization with pyramid of score distribution features, CVPR, vol.38, p.76, 2016. ,
A duality based approach for realtime TV-L1 optical flow, Joint Pattern Recognition Symposium, p.29, 2007. ,
Temporal action detection with structured segment networks, 2017. ,
Interaction part mining: A mid-level approach for fine-grained action recognition, vol.67, p.68, 2015. ,
Pipelining localized semantic features for fine-grained action recognition, 2014. ,
Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection, vol.99, p.111, 2017. ,