PoTion : Pose MoTion Representation for Action Recognition
Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, Cordelia Schmid
Inria, NAVER LABS Europe
Abstract
- State-of-the-art methods for action recognition rely on two-stream networks
- Try to process appearance and motion jointly
- Propose a fixed-size representation that encodes pose motion over the whole clip
1. Introduction
- Human pose can be added to multi-stream architecture
- 3D skeleton
- Limited to case where the data is available
- 2D poses
- Used when fully-visible
- Features (hand-crafted or CNN-based) extracted around the human joints
- Zolfaghari et al.
- Pose stream for semantic segmentation maps
- FC + spatio-temporal CNN
- Propose to focus on the movement of a few relevant keypoints
- Fixed-size representation that does not depend on the duration of the clip
- Overview of PoTion
- Obtain heatmaps for each human joint by running a pose estimator (Part Affinity Fields)
- Colorize the heatmaps depending on the relative time of the frame in the clip
- Obtain PoTion representation for clip.
- Train shallow CNN architecture for action classification.
- Contributions
- Clip-level representation that encodes human pose motion, PoTion
- Study PoTion representation and CNN for action recognition
- Combine PoTion with a two-stream architecture
2. Related Work
- CNNs for action recognition
- 2D + RNN
- 3D CNN
- Two-stream
- Motion representation
- Optical flow, etc.
- Pose representation
- 3D skeleton
- 2D pose
- Approach of Zolfaghari et al.
3. PoTion representation
3.1. Extracting joint heatmaps
- Obtain human joint heatmaps by Part Affinity Fields
- Part Affinity Fields
- Robust to occlusion and truncation
- Output : joint heatmaps & part affinity fields
- Use only the joint heatmaps
- The heatmap value at a pixel is the likelihood that this pixel contains the joint
3.2. Time-dependent heat map colorization
- ‘Colorize’ each joint heatmap according to the relative time of its frame in the clip
- For C = 2 channels
- Color goes linearly from the first channel at the first frame to the second channel at the last frame
- For multiple channels C
- Split the T frames into C − 1 regularly sampled intervals; within each interval, interpolate linearly between two adjacent channels
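The colorization step above can be sketched in numpy. This is a minimal illustration (function name and the interpolation details are our reading of the scheme, not code from the paper): each frame's heatmap is spread over two adjacent color channels with weights that depend on the frame's relative time.

```python
import numpy as np

def colorize(heatmaps, C=3):
    """Colorize per-frame joint heatmaps by relative time (illustrative sketch).

    heatmaps: array of shape (T, H, W) for one joint.
    Returns colorized maps of shape (T, H, W, C).
    """
    T, H, W = heatmaps.shape
    out = np.zeros((T, H, W, C))
    for t in range(T):
        s = t / (T - 1) if T > 1 else 0.0  # relative time in [0, 1]
        x = s * (C - 1)                    # position within the C-1 intervals
        c = min(int(x), C - 2)             # lower of the two adjacent channels
        u = x - c                          # linear interpolation weight
        out[t, :, :, c]     += (1 - u) * heatmaps[t]
        out[t, :, :, c + 1] += u * heatmaps[t]
    return out
```

Note that for every frame the channel weights sum to one, so no frame contributes more mass than another.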
3.3. Aggregation of colorized heatmaps
- Compute the sum of the colorized heatmaps over all frames
- Obtain a duration-invariant representation by normalizing each channel independently (divide by its maximum value)
- Compute the intensity image (sum over the color channels)
- Encodes how much time a joint stays at each location
- Normalized PoTion representation : divide each channel by the intensity image
- All locations of the motion trajectory are weighted equally
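The aggregation and the two normalizations can be sketched as follows (a numpy sketch under our reading of the notes; variable names U, I, N are ours):

```python
import numpy as np

def aggregate(colorized, eps=1.0):
    """Aggregate colorized heatmaps into PoTion variants (illustrative sketch).

    colorized: array of shape (T, H, W, C) for one joint.
    Returns (U, I, N): channel-normalized sum, intensity image,
    and intensity-normalized representation.
    """
    S = colorized.sum(axis=0)                 # (H, W, C): sum over all frames
    # divide each channel by its maximum -> invariant to clip duration
    ch_max = np.maximum(S.reshape(-1, S.shape[-1]).max(axis=0), 1e-8)
    U = S / ch_max
    I = U.sum(axis=-1, keepdims=True)         # intensity: time spent at location
    N = U / (I + eps)                         # weight whole trajectory equally
    return U, I.squeeze(-1), N
```

Dividing by the intensity removes the bias toward locations where a joint lingers, which is why all points on the motion trajectory end up weighted equally.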
4. CNN on PoTion representation
Network architecture
- 3 blocks with 2 convolutional layers in each block
- 3 × 3 kernels; first conv of each block with stride 2, second with stride 1
- Global average pooling + FC + softmax after all blocks
- Batch normalization, ReLU after each Conv.
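The architecture above can be written down in a few lines of PyTorch. This is a sketch, not the authors' code; the channel widths come from the notes (Section 5.3), and the input channel count assumed in the usage example (19 joints × 3 color channels) is illustrative.

```python
import torch
import torch.nn as nn

class PoTionNet(nn.Module):
    """Shallow CNN on PoTion maps (sketch; widths from the notes)."""
    def __init__(self, in_ch, num_classes, widths=(128, 256, 512)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:                        # 3 blocks
            # first conv downsamples (stride 2), second keeps resolution
            for stride in (2, 1):
                layers += [nn.Conv2d(prev, w, 3, stride=stride, padding=1),
                           nn.BatchNorm2d(w),   # BN + ReLU after each conv
                           nn.ReLU(inplace=True)]
                prev = w
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.fc = nn.Linear(prev, num_classes)  # softmax is applied in the loss

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)

# Usage (hypothetical input size): 57 = 19 joints x 3 color channels
net = PoTionNet(in_ch=57, num_classes=21)
logits = net(torch.zeros(2, 57, 64, 64))
```

Because the network only ever sees the fixed-size PoTion map, it stays shallow and trains from scratch in hours, in contrast to clip-length-dependent video CNNs.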
Implementation details
- Xavier initialization and train from scratch
- Dropout of 0.25
- Adam optimizer
- Batch size of 32
- 4 hours on single GPU (Titan X)
- Data augmentation
- Horizontal flipping (while swapping left/right joint channels) was effective
- Random cropping, heatmap smoothing, and shifting did not bring gains
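The flipping augmentation needs the joint-channel swap because a mirrored image turns a left wrist into a right wrist. A minimal numpy sketch (function name and joint indices are illustrative, not from the paper):

```python
import numpy as np

def flip_potion(potion, swap_pairs):
    """Horizontal flip with left/right joint swapping (illustrative sketch).

    potion: array of shape (J, H, W, C) -- one PoTion map per joint.
    swap_pairs: list of (left_joint, right_joint) index pairs to exchange,
                e.g. [(2, 5), (3, 6)] -- indices here are hypothetical.
    """
    flipped = potion[:, :, ::-1, :].copy()   # mirror the spatial x-axis
    for l, r in swap_pairs:
        flipped[[l, r]] = flipped[[r, l]]    # e.g. left wrist <-> right wrist
    return flipped
```

Without the swap, a flipped clip of "waving with the right hand" would be labeled with the wrong joint channels, which is presumably why naive flipping alone would not help.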
5. Experimental results
5.1. Datasets and metrics
- HMDB
- JHMDB
- UCF101
- Kinetics
- Report mean classification accuracy.
5.2. PoTion representation
- Number of channels
- Aggregation techniques
5.3. CNN on PoTion
- Data augmentation
- Network Architecture
- Number of convolution layers per block : 2
- Number of blocks : 3
- Number of convolution filters : (128, 256, 512)
5.4. Impact of pose estimation
- Groundtruth pose : 4% gain
- Crop frames centered on the actor (GT-JHMDB) : 6% gain
5.5. Comparison to the state of the art
- Multi-stream approach
- Up to +8% with TSN / up to +3% with I3D
- Comparison to Zolfaghari et al.
- Improvement due to improved pose motion representation
- Comparison to the state of the art
- Outperform all existing approaches
- Detailed analysis
- Clear improvement for actions with well-defined motion
- Low performance when an object matters more than the motion
- Results on Kinetics
- Accuracy decreased by 1~2%
- Reasons for the decrease
- Actors are only partially visible
- Erratic camera motion
- Multiple shots per clip
6. Conclusion
- PoTion encodes the motion over the entire clip into a fixed-size representation.
- Classification with PoTion representation and shallow CNN.
- Leads to state-of-the-art performance on JHMDB, HMDB, and UCF-101.
Our discussion
- Why not use 3D convolution for capturing temporal features?