. Weinzaepfel, for all types of supervision on UCF101-24. These results are presented in Table 6.2 (our w. tracks of, 2016.

. Weinzaepfel, we observe only a small difference in performance between the "Temporal" and the "Temporal + 1 BB" setups. This suggests an interesting conclusion that the spatial supervision for action localization may not always be necessary. Back to Table 6.1, note that, even if not the main focus of our work, our performance on the fully supervised setting is on par with the recent work, vol.37, p.4, 2016.

This consolidates the conclusions drawn above about the alternative way of annotating videos ('temporal click') and what information is important (is the spatial supervision always necessary?). However, we are still below the current state of the art.

J. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic et al., Unsupervised learning from narrated instruction videos, CVPR. 43, vol.44, p.139, 2016.
DOI : 10.1109/cvpr.2016.495
URL : https://hal.archives-ouvertes.fr/hal-01171193

J. Alayrac, J. Sivic, I. Laptev, and S. Lacoste-julien, Joint Discovery of Object States and Manipulation Actions, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01676084

M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, 2d human pose estimation: New benchmark and state of the art analysis, 2014.

P. Arbeláez, J. Pont-tuset, J. T. Barron, F. Marques, and J. Malik, Multiscale combinatorial grouping, CVPR, vol.38, p.111, 2014.

M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, Sequential deep learning for human action recognition, International Workshop on Human Behavior Understanding, p.34, 2011.
DOI : 10.1007/978-3-642-25446-8_4
URL : https://hal.archives-ouvertes.fr/hal-01354493

F. Bach and Z. Harchaoui, DIFFRAC: A discriminative and flexible framework for clustering, NIPS. 42, vol.47, p.117, 2007.

T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese, Social scene understanding: End-to-end multi-person action localization and collective activity recognition, CVPR, vol.38, p.76, 2017.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

H. Bay, A. Ess, T. Tuytelaars, and L. Van-gool, Speeded-up robust features (SURF), Comput. Vis. Image Underst, p.61, 2008.

H. Bay, T. Tuytelaars, and L. Van-gool, Surf: Speeded up robust features, ECCV, p.27, 2006.

T. Berg and P. N. Belhumeur, POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation, 2013.

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, Fully-convolutional siamese networks for object tracking, BMVC, p.86, 2016.

P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid et al., Finding actors and actions in movies, ICCV. 42, vol.48, p.139, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00904991

P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce et al., Weakly supervised action labeling in videos under ordering constraints, ECCV. 43, vol.49, p.141, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01053967

P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev et al., Weakly-supervised alignment of video with text, ICCV, p.43, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01154523

L. Bottou, Online learning and stochastic approximations. On-line learning in neural networks, p.33, 1998.

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High accuracy optical flow estimation based on a theory for warping, ECCV, vol.29, p.84, 2004.

S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles, SST: Single-stream temporal action proposals, 2017.

F. Caba-heilbron, J. Carlos-niebles, and B. Ghanem, Fast temporal activity proposals for efficient detection of human actions in untrimmed videos, 2016.

M. Caron, P. Bojanowski, A. Joulin, and M. Douze, Deep clustering for unsupervised learning of visual features, 2018.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, CVPR, vol.34, p.118, 2017.

K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, BMVC, vol.59, p.61, 2011.

K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, 2014.

W. Chen and J. J. Corso, Action Detection by implicit intentional Motion Clustering, ICCV, vol.44, p.112, 2015.

X. Chen and A. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, NIPS. 33, vol.54, p.137, 2014.

A. Cherian, J. Mairal, K. Alahari, and C. Schmid, Mixing body-part sequences for human pose estimation, CVPR, vol.35, p.70, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00978643

G. Chéron, J. Alayrac, I. Laptev, and C. Schmid, A flexible model for training action localization with varying levels of supervision, NIPS. 20, vol.24, p.140, 2018.

G. Chéron, I. Laptev, and C. Schmid, , 2015.

G. Chéron, I. Laptev, and C. Schmid, P-CNN: Pose-based CNN features for action recognition, ICCV. 20, vol.21, p.138, 2015.

G. Chéron, A. Osokin, I. Laptev, and C. Schmid, Modeling spatio-temporal human track structure for action localization, vol.20, p.23, 2018.

N. Chesneau, G. Rogez, K. Alahari, and C. Schmid, Detecting parts for action localization, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01573629

K. Cho, B. Van-merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning phrase representations using rnn encoder-decoder for statistical machine translation, vol.75, p.80, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01433235

M. Cho, S. Kwak, C. Schmid, and J. Ponce, Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01110036

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014.

C. Cortes and V. Vapnik, Support-vector networks. Machine learning, p.30, 1995.

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, 2004.

O. G. Cula and K. J. Dana, Compact representation of bidirectional texture functions, 2001.

G. E. Dahl, D. Yu, L. Deng, and A. Acero, Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol.73, p.79, 2012.

N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR, vol.27, p.61, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548512

N. Dalal, B. Triggs, and C. Schmid, Human detection using oriented histograms of flow and appearance, ECCV, vol.29, p.61, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00548587

C. R. De-souza, A. Gaidon, E. Vig, and A. M. López, Sympathy for the details: dense trajectories and hybrid classification architectures for action recognition, vol.33, p.76, 2016.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., Imagenet: A large-scale hierarchical image database, vol.57, p.137, 2009.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, vol.71, p.74, 2015.

M. Douze and H. Jégou, The Yael library, ACM Multimedia, p.61, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01020695

K. Duan, D. Parikh, D. Crandall, and K. Grauman, Discovering localized attributes for fine-grained recognition, 2012.

O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, vol.112, p.139, 2009.

V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem, Daps: Deep action proposals for action understanding, 2016.

G. Farnebäck, Two-frame motion estimation based on polynomial expansion, SCIA, p.61, 2003.

L. Fei-fei and P. Perona, A bayesian hierarchical model for learning natural scene categories, 2005.

C. Feichtenhofer, A. Pinz, and R. P. Wildes, Spatiotemporal residual networks for video action recognition, NIPS, vol.33, p.76, 2016.

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional two-stream network fusion for video action recognition, CVPR, vol.33, p.76, 2016.

M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM, vol.30, p.61, 1981.

M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly, vol.50, p.118, 1956.

A. Gaidon, Z. Harchaoui, and C. Schmid, Actom sequence models for efficient action detection, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00575217

R. Girshick, Fast r-cnn, ICCV, vol.75, p.85, 2015.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, vol.33, p.34, 2014.

R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, , vol.112, p.118, 2018.

G. Gkioxari and J. Malik, Finding action tubes, CVPR. 38, vol.57, p.111, 2015.

A. Gorban, H. Idrees, Y. Jiang, A. R. Roshan-zamir, I. Laptev et al., THUMOS challenge: Action recognition with a large number of classes, p.76, 2015.

A. Graves, A. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, 2013.

C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross et al., AVA: A video dataset of spatiotemporally localized atomic visual actions, CVPR. 39, 98, 99, vol.100, p.127, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01764300

C. Harris and M. Stephens, A combined corner and edge detector, Alvey vision conference, p.27, 1988.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, vol.97, p.118, 2016.

F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, vol.17, p.40, 2015.

M. Hoai, Z. Lan, and F. De-la-torre, Joint segmentation and classification of human actions in video, 2011.

S. Hochreiter and J. Schmidhuber, Long short-term memory, 1997.

B. K. Horn and B. G. Schunck, Determining optical flow, In Artificial intelligence, vol.29, 1981.
DOI : 10.1016/0004-3702(81)90024-2

R. Hou, C. Chen, and M. Shah, Tube convolutional neural network (T-CNN) for action detection in videos, ICCV, vol.39, p.111, 2017.

D. Huang, L. Fei-fei, and J. C. Niebles, Connectionist Temporal Modeling for Weakly Supervised Action Labeling, ECCV. 43, vol.112, p.141, 2016.
DOI : 10.1007/978-3-319-46493-0_9
URL : http://arxiv.org/pdf/1607.08584

H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev et al., The THUMOS challenge on action recognition for videos "in the wild", Computer Vision and Image Understanding, p.40, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01431525

M. Jain, J. Van-gemert, H. Jégou, P. Bouthemy, and C. G. Snoek, Action localization with tubelets from motion, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00996844

H. Jegou, F. Perronnin, M. Douze, J. Sánchez, P. Perez et al., Aggregating local image descriptors into compact codes, 2012.
DOI : 10.1109/tpami.2011.235
URL : https://hal.archives-ouvertes.fr/inria-00633013

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, Towards understanding action recognition, ICCV. 35, vol.36, p.137, 2013.
DOI : 10.1109/iccv.2013.396
URL : https://hal.archives-ouvertes.fr/hal-00906902

S. Ji, W. Xu, M. Yang, and K. Yu, 3D convolutional neural networks for human action recognition, p.33, 2010.

A. Joulin, F. Bach, and J. Ponce, Discriminative clustering for image cosegmentation, CVPR, vol.48, p.113, 2010.
DOI : 10.1109/cvpr.2010.5539868
URL : http://www.di.ens.fr/%7Efbach/cosegmentation_cvpr2010.pdf

V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, Action tubelet detector for spatio-temporal action localization, ICCV. 39, 40, 75, vol.126, p.136, 2017.
DOI : 10.1109/iccv.2017.472
URL : https://hal.archives-ouvertes.fr/hal-01519812

V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, Joint learning of object and action detectors, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01575804

V. Kantorov, M. Oquab, M. Cho, and I. Laptev, Contextlocnet: Context-aware deep network models for weakly supervised localization, 2016.
DOI : 10.1007/978-3-319-46454-1_22
URL : https://hal.archives-ouvertes.fr/hal-01421772

A. Karpathy and L. Fei-fei, Deep visual-semantic alignments for generating image descriptions, CVPR, vol.74, p.79, 2015.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, 2014.

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The kinetics human action video dataset, vol.16, p.137, 2017.

Y. Ke, R. Sukthankar, and M. Hebert, Efficient visual event detection using volumetric features, ICCV, vol.38, p.111, 2005.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

A. Kläser, M. Marszałek, and C. Schmid, A spatio-temporal descriptor based on 3d-gradients, BMVC, p.29, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00514853

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, vol.32, p.57, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: a large video database for human motion recognition, ICCV. 13, vol.17, p.54, 2011.

S. Lacoste-julien, M. Jaggi, M. Schmidt, and P. Pletscher, Block-coordinate Frank-Wolfe optimization for structural SVMs, ICML. 51, vol.113, p.118, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00720158

I. Laptev, On space-time interest points, IJCV, vol.27, p.28, 2005.

I. Laptev, Modeling and visual recognition of human actions and interactions. Habilitation à diriger des recherches, 2013.
URL : https://hal.archives-ouvertes.fr/tel-01064540

I. Laptev, B. Caputo, C. Schüldt, and T. Lindeberg, Local velocity-adapted motion events for spatio-temporal recognition. Computer vision and image understanding, vol.27, p.28, 2007.

I. Laptev and T. Lindeberg, Local descriptors for spatio-temporal recognition, ECCV workshop, p.27, 2004.

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, CVPR, vol.55, p.61, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00548659

I. Laptev and P. Pérez, Retrieving actions in movies, ICCV. 29, vol.38, p.111, 2007.

S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, CVPR, vol.29, p.30, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00548585

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.33, p.54, 1998.

Y. Lecun, L. Bottou, G. B. Orr, and K. Müller, Efficient backprop, Neural networks: Tricks of the trade, vol.33, 1998.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft coco: Common objects in context, 2014.

J. Liu, A. Shahroudy, D. Xu, and G. Wang, Spatio-temporal LSTM with trust gates for 3D human action recognition, 2016.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed et al., Ssd: Single shot multibox detector, vol.76, p.97, 2016.

D. G. Lowe, Object recognition from local scale-invariant features, ICCV, p.27, 1999.

C. Lu, R. Krishna, M. Bernstein, and L. Fei-fei, Visual relationship detection with language priors, 2016.

B. D. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, IJCAI, vol.29, p.118, 1981.

S. Ma, L. Sigal, and S. Sclaroff, Learning activity progression in LSTMs for activity detection and early detection, CVPR, vol.38, p.76, 2016.

M. Marszałek, C. Schmid, H. Harzallah, and J. van de Weijer, Learning object representations for visual object class recognition, 2007.

J. Mathe, , 2012.

M. Mathias, R. Benenson, M. Pedersoli, and L. Van-gool, Face detection without bells and whistles, 2014.

P. Mettes, C. G. Snoek, and S.-F. Chang, Localizing Actions from Video labels and Pseudo-Annotations, BMVC, vol.44, p.126, 2017.

P. Mettes, J. C. Van-gemert, and C. G. Snoek, Spot on: Action localization from pointly-supervised proposals, ECCV. 45, 99, vol.110, p.125, 2016.

A. Miech, J. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic, Learning from video and text via large-scale discriminative clustering, vol.113, p.118, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01569540

A. Miech, I. Laptev, and J. Sivic, Learning a text-video embedding from incomplete and heterogeneous data, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01975102

K. Mikolajczyk and C. Schmid, A performance evaluation of local descriptors, p.27, 2005.
URL : https://hal.archives-ouvertes.fr/inria-00548227

A. Newell, K. Yang, and J. Deng, Stacked hourglass networks for human pose estimation, ECCV. 33, vol.54, p.137, 2016.

J. Y. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, CVPR, vol.34, p.91, 2015.

B. Ni, V. R. Paramathayalan, and P. Moulin, Multiple granularity analysis for fine-grained action detection, 2014.

E. Nowak, F. Jurie, and B. Triggs, Sampling strategies for bag-of-features image classification, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00203752

D. Oneata, J. Revaud, J. Verbeek, and C. Schmid, Spatio-temporal object detection proposals, ECCV, vol.38, p.111, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01021902

D. Oneata, J. Verbeek, and C. Schmid, Action and event recognition with fisher vectors on a compact feature set, ICCV, vol.30, p.66, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873662

M. Oquab, L. Bottou, I. Laptev, and J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00911179

A. Osokin, J. Alayrac, I. Lukasewitz, P. K. Dokania, and S. Lacoste-julien, Minding the gaps for block Frank-Wolfe optimization of structured SVMs, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01323727

X. Peng and C. Schmid, Multi-region two-stream R-CNN for action detection, HAL. 38, 75, vol.76, p.111, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01349107

X. Peng, C. Zou, Y. Qiao, and Q. Peng, Action recognition with stacked fisher vectors, 2014.

F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, Large-scale image retrieval with compressed fisher vectors, 2010.

F. Perronnin, J. Sánchez, and T. Mensink, Improving the fisher kernel for large-scale image classification, ECCV, vol.30, p.61, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00548630

J. Peyre, I. Laptev, C. Schmid, and J. Sivic, Weakly-supervised learning of visual relations, vol.17, p.138, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01576035

L. Pigou, A. Van-den-oord, S. Dieleman, M. Van-herreweghe, and J. Dambre, Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video, IJCV, vol.34, p.76, 2017.

H. Pirsiavash and D. Ramanan, Parsing videos of actions with segmental grammars, 2014.

L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, Poselet conditioned pictorial structures, CVPR, vol.35, p.55, 2013.

A. Prest, V. Ferrari, and C. Schmid, Explicit modeling of human-object interactions in realistic videos, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00720847

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, vol.85, p.118, 2015.

A. Richard, H. Kuehne, and J. Gall, Weakly supervised action learning with rnn based fine-to-coarse modeling, CVPR. 43, vol.112, p.141, 2017.

M. D. Rodriguez, J. Ahmed, and M. Shah, Action MACH: a spatio-temporal maximum average correlation height filter for action recognition, 2008.

M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, A database for fine grained activity detection of cooking activities, CVPR. 13, vol.36, p.68, 2012.

M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka et al., Recognizing fine-grained and composite activities using hand-centric features and script data, 2016.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation, Parallel distributed processing, p.33, 1985.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet large scale visual recognition challenge, IJCV, vol.32, p.118, 2015.

S. Saha, G. Singh, and F. Cuzzolin, Amtnet: Action-micro-tube regression by end-to-end trainable deep architecture, ICCV, vol.39, p.111, 2017.

S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, Deep learning for detecting multiple space-time action tubes in videos, BMVC. 38, 39, 41, 75, vol.76, p.136, 2016.

B. Sapp and B. Taskar, Modec: Multimodal decomposable models for human pose estimation, CVPR, vol.35, p.59, 2013.

B. Sapp, D. Weiss, and B. Taskar, Parsing human motion with stretchable models, CVPR, vol.35, p.55, 2011.

C. Schmid and R. Mohr, Local grayvalue invariants for image retrieval, TPAMI, p.27, 1997.
URL : https://hal.archives-ouvertes.fr/inria-00548358

C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: A local SVM approach, ICPR, vol.27, p.55, 2004.

Z. Shou, D. Wang, and S.-F. Chang, Temporal action localization in untrimmed videos via multi-stage CNNs, CVPR, vol.38, p.139, 2016.

G. A. Sigurdsson, S. K. Divvala, A. Farhadi, and A. Gupta, Asynchronous temporal fields for action recognition, 2017.

K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman, Fisher vector faces in the wild, BMVC, vol.29, p.31, 2013.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS. 33, vol.34, p.73, 2014.

B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, A multi-stream bi-directional recurrent neural network for fine-grained action detection, CVPR, vol.34, p.76, 2016.

G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin, Online real time multiple spatiotemporal action localisation and prediction, vol.111, p.118, 2017.
DOI : 10.1109/iccv.2017.393
URL : http://arxiv.org/pdf/1611.08563

K. K. Singh and Y. J. Lee, Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization, ICCV, vol.43, p.112, 2017.

P. Siva and T. Xiang, Weakly Supervised Action Detection, BMVC, vol.44, p.112, 2011.
DOI : 10.5244/c.25.65
URL : http://www.bmva.org/bmvc/2011/proceedings/paper65/paper65.pdf

J. Sivic and A. Zisserman, Video google: A text retrieval approach to object matching in videos, ICCV, p.29, 2003.

K. Soomro and M. Shah, Unsupervised action discovery and localization in videos, ICCV, vol.44, p.112, 2017.

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, vol.17, p.137, 2012.

N. Srivastava, E. Mansimov, and R. Salakhutdinov, Unsupervised learning of video representations using LSTMs, 2016.

I. Sutskever, J. Martens, and G. E. Hinton, Generating text with recurrent neural networks, 2011.

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, Deepface: Closing the gap to human-level performance in face verification, 2014.

E. H. Taralova, F. De-la-torre, and M. Hebert, Motion words for videos, 2014.

G. W. Taylor, R. Fergus, Y. Lecun, and C. Bregler, Convolutional learning of spatio-temporal features, p.33, 2010.

P. Tokmakov, K. Alahari, and C. Schmid, Learning video object segmentation with visual memory, ICCV, vol.80, p.89, 2017.
DOI : 10.1109/iccv.2017.480
URL : https://hal.archives-ouvertes.fr/hal-01511145

J. J. Tompson, A. Jain, Y. Lecun, and C. Bregler, Joint training of a convolutional network and a graphical model for human pose estimation, NIPS, vol.54, p.55, 2014.

A. Toshev and C. Szegedy, DeepPose: Human pose estimation via deep neural networks, CVPR, vol.33, p.137, 2014.
DOI : 10.1109/cvpr.2014.214
URL : http://arxiv.org/pdf/1312.4659

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, p.33, 2015.
DOI : 10.1109/iccv.2015.510
URL : http://arxiv.org/pdf/1412.0767

J. R. Uijlings, K. E. Van-de-sande, T. Gevers, and A. W. Smeulders, Selective search for object recognition, vol.38, p.111, 2013.
DOI : 10.1007/s11263-013-0620-5
URL : https://pure.uva.nl/ws/files/19494140/UijlingsIJCV2013.pdf

J. C. Van-gemert, M. Jain, E. Gati, and C. G. Snoek, APT: Action localization proposals from dense trajectories, BMVC, vol.38, p.111, 2015.

G. Varol, I. Laptev, and C. Schmid, Long-term temporal convolutions for action recognition, TPAMI, vol.33, p.76, 2017.
DOI : 10.1109/tpami.2017.2712608
URL : https://hal.archives-ouvertes.fr/hal-01241518

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, 2015.
DOI : 10.1109/cvpr.2015.7298935
URL : http://arxiv.org/pdf/1411.4555

H. Wang, A. Kläser, C. Schmid, and C. Liu, Action recognition by dense trajectories, CVPR, vol.30, p.135, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00583818

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense trajectories and motion boundary descriptors for action recognition, IJCV, vol.29, p.61, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00725627

H. Wang and C. Schmid, Action recognition with improved trajectories, ICCV, vol.27, p.66, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873267

L. Wang, Y. Xiong, D. Lin, and L. V. Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, CVPR, vol.43, p.112, 2017.

S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, Convolutional pose machines, CVPR, vol.33, p.137, 2016.

P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Learning to track for spatiotemporal action localization, ICCV, vol.38, p.111, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01159941

P. Weinzaepfel, X. Martin, and C. Schmid, Human action localization with sparse spatial supervision, vol.127, p.136, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01317558

P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, Deepflow: Large displacement optical flow with deep matching, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00873592

C. Xu, S. Hsieh, C. Xiong, and J. J. Corso, Can humans fly? action understanding with multiple classes of actors, 2015.

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, Maximum margin clustering, NIPS. 43, vol.47, p.114, 2004.

J. Yang and J. Yuan, Common action discovery and localization in unconstrained videos, ICCV, vol.44, p.112, 2017.

Y. Yang and D. Ramanan, Articulated pose estimation with flexible mixtures-of-parts, CVPR, vol.35, p.65, 2011.

S. Yeung, O. Russakovsky, G. Mori, and L. Fei-fei, End-to-end learning of action detection from frame glimpses in videos, vol.74, p.76, 2016.

J. Yuan, B. Ni, X. Yang, and A. A. Kassim, Temporal action localization with pyramid of score distribution features, CVPR, vol.38, p.76, 2016.

N. J. Yue-hei, H. Matthew, V. Sudheendra, V. Oriol, M. Rajat et al., Beyond short snippets: Deep networks for video classification, CVPR, vol.33, p.55, 2015.

C. Zach, T. Pock, and H. Bischof, A duality based approach for realtime TV-L1 optical flow, Joint Pattern Recognition Symposium, p.29, 2007.

Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang et al., Temporal action detection with structured segment networks, 2017.

Y. Zhou, B. Ni, R. Hong, M. Wang, and Q. Tian, Interaction part mining: A mid-level approach for fine-grained action recognition, vol.67, p.68, 2015.

Y. Zhou, B. Ni, S. Yan, P. Moulin, and Q. Tian, Pipelining localized semantic features for fine-grained action recognition, 2014.

M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection, vol.99, p.111, 2017.