PoTion : Pose MoTion Representation for Action Recognition

Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, Cordelia Schmid
Inria, NAVER LABS Europe



Abstract

  • State-of-the-art methods for action recognition rely on two-stream networks that process appearance and motion independently
  • PoTion instead processes appearance and motion jointly
  • Fixed-size representation that encodes the movement of pose keypoints over the whole clip

1. Introduction

  • Human pose can be added to multi-stream architecture
  • 3D skeleton
    • Limited to case where the data is available
  • 2D poses
    • Usable only when the person is fully visible
    • Features (hand-crafted or CNN) extracted around the human joints
  • Zolfaghari et al.
    • Pose stream for semantic segmentation maps
    • FC + spatio-temporal CNN
  • Propose to focus on the movement of a few relevant keypoints
    • Fixed-sized representation, does not depend on the duration of the clip.
  • Overview of PoTion
    • Obtain heat maps for human joint by running pose estimator (Part Affinity Fields).
    • Colorize heat maps depending on the relative time of the frame in the clip.
    • Obtain PoTion representation for clip.
    • Train shallow CNN architecture for action classification.
  • Contributions
    • Clip-level representation that encodes human pose motion, PoTion
    • Study PoTion representation and CNN for action recognition
    • Combine PoTion with two-stream architecture

2. Related work

  • CNNs for action recognition
    • 2D + RNN
    • 3D CNN
    • Two-stream
  • Motion representation
    • Optical flow, etc.
  • Pose representation
    • 3D skeleton
    • 2D pose
    • Approach of Zolfaghari et al.

3. PoTion representation

3.1. Extracting joint heatmaps

  • Obtain human joint heatmaps by Part Affinity Fields
  • Part Affinity Fields
    • Robust to occlusion and truncation
    • Output : joint heatmap & affinity map
    • Use only joint heatmap
  • Each heat map value is the likelihood that the pixel contains the joint

3.2. Time-dependent heat map colorization

  • ‘Colorize’ each heat map according to the relative time of its frame within the clip
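  • For C = 2: the first frame goes entirely to one channel, the last frame entirely to the other, and frames in between are linearly mixed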
  • For multiple channels C
    • Split T frames into C-1 regularly sampled intervals
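
The colorization step can be sketched with NumPy as follows. This is a minimal sketch, not the authors' code: it assumes each frame's weight is a piecewise-linear ("hat") function of its relative time, with channel c peaking at relative time c / (C - 1), so that every frame's weights sum to 1. The function names `color_weights` and `colorize` are hypothetical.

```python
import numpy as np


def color_weights(num_frames: int, num_channels: int) -> np.ndarray:
    """Return a (T, C) array of per-frame channel weights (assumed scheme)."""
    if num_channels < 2:
        raise ValueError("need at least 2 channels")
    if num_frames < 1:
        raise ValueError("need at least 1 frame")
    # Relative time in [0, 1]; a single-frame clip is mapped to time 0.
    if num_frames == 1:
        rel = np.zeros(1)
    else:
        rel = np.arange(num_frames) / (num_frames - 1)
    # Position of each frame on the channel axis, in [0, C - 1].
    pos = rel * (num_channels - 1)
    centers = np.arange(num_channels)
    # Triangular weights: 1 at a channel's center, falling linearly to 0
    # at the neighboring centers. This splits the clip into C - 1 intervals.
    return np.clip(1.0 - np.abs(pos[:, None] - centers[None, :]), 0.0, 1.0)


def colorize(heatmaps: np.ndarray, num_channels: int) -> np.ndarray:
    """Colorize one joint's heat maps: (T, H, W) -> (T, H, W, C)."""
    weights = color_weights(heatmaps.shape[0], num_channels)
    return heatmaps[:, :, :, None] * weights[:, None, None, :]
```

With C = 2 this reduces to a linear blend between two channels; with more channels, each of the C - 1 intervals blends one pair of adjacent channels.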

3.3. Aggregation of colorized heatmaps

  • Sum the colorized heat maps over all frames of the clip
  • Obtain a representation invariant to clip length by normalizing each channel independently (dividing by its maximum value)
  • Compute intensity image
    • Encodes how much time a joint stays at each location
  • Normalized PoTion representation : divide by intensity
    • All locations of the motion trajectory are weighted equally
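
The three aggregates above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the input is one joint's colorized heat maps of shape (T, H, W, C), the per-channel normalization divides by the channel maximum, and a stabilizing constant `eps` is added to the intensity before dividing. The name `aggregate_potion` and the default `eps = 1.0` are assumptions of this sketch, not values given in these notes.

```python
import numpy as np


def aggregate_potion(colorized: np.ndarray, eps: float = 1.0):
    """Aggregate one joint's colorized heat maps of shape (T, H, W, C).

    Returns (U, I, N): the channel-normalized sum, the intensity image,
    and the intensity-normalized representation.
    """
    # Sum over time: the result depends on the number of frames T.
    summed = colorized.sum(axis=0)                       # (H, W, C)
    # Normalize every channel independently by its maximum over all pixels.
    channel_max = summed.max(axis=(0, 1), keepdims=True)
    U = summed / np.maximum(channel_max, 1e-12)          # guard empty channels
    # Intensity: how long the joint stayed at each location.
    I = U.sum(axis=-1)                                   # (H, W)
    # Dividing by intensity weights trajectory locations more evenly.
    N = U / (eps + I[..., None])
    return U, I, N
```

Because each channel is divided by its own maximum, repeating every frame of a clip twice leaves `U` unchanged, which is the invariance the normalization is meant to provide.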

4. CNN on PoTion representation

  • Network architecture

    • 3 blocks with 2 convolutional layers in each block
    • 3×3 kernels; the first convolution of each block has stride 2, the second has stride 1
    • Global average pooling + FC + softmax after all blocks
    • Batch normalization, ReLU after each Conv.
  • Implementation details

    • Xavier initialization and train from scratch
    • Dropout of 0.25
    • Adam optimizer
    • Batch size of 32
    • 4 hours on single GPU (Titan X)
    • Data augmentation
      • Flipping by swapping channels was effective
      • Random cropping, smoothing heat maps, and shifting gave no gain
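
The flip augmentation can be sketched as below. This reading is an assumption: the representation is mirrored horizontally, and the channels of left and right body joints are swapped so that a mirrored left wrist ends up in the right-wrist slot. The array layout (J, H, W, K) and the joint index pairs are hypothetical and depend on the pose estimator's joint ordering.

```python
import numpy as np


def flip_potion(rep: np.ndarray, left_right_pairs) -> np.ndarray:
    """Mirror a per-joint representation of shape (J, H, W, K).

    left_right_pairs: iterable of (left_index, right_index) joint pairs
    (hypothetical indices; they depend on the joint ordering in use).
    """
    # Mirror along the width axis; copy so the input is left untouched.
    flipped = rep[:, :, ::-1, :].copy()
    # Swap the maps of each left/right joint pair.
    for left, right in left_right_pairs:
        flipped[[left, right]] = flipped[[right, left]]
    return flipped
```

Applying the function twice with the same pairs returns the original array, which is a quick sanity check for the joint pairing.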

5. Experimental results

5.1. Datasets and metrics

  • HMDB
  • JHMDB
  • UCF101
  • Kinetics
  • Report mean classification accuracy.
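
These notes do not say how the mean is taken. A minimal sketch, assuming the metric is plain classification accuracy averaged over a dataset's predefined train/test splits (that reading and both function names are assumptions):

```python
def accuracy(predictions, labels):
    """Fraction of clips whose predicted class equals the true class."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def mean_accuracy_over_splits(splits):
    """Average accuracy over splits.

    splits: list of (predictions, labels) pairs, one pair per split.
    """
    scores = [accuracy(preds, labels) for preds, labels in splits]
    return sum(scores) / len(scores)
```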

5.2. PoTion representation

  • Number of channels
  • Aggregation techniques

5.3. CNN on PoTion

  • Data augmentation
  • Network Architecture
    • Number of convolution layers per block : 2
    • Number of blocks : 3
    • Number of convolution filters : (128, 256, 512)

5.4. Impact of pose estimation

  • Groundtruth pose : 4% gain
  • Crop frames centered on the actor (GT-JHMDB) : 6% gain

5.5. Comparison to the state of the art

  • Multi-stream approach
    • Up to +8% when added to TSN / up to +3% when added to I3D
  • Comparison to Zolfaghari et al.
    • Improvement due to improved pose motion representation
  • Comparison to the state of the art
    • Outperform all existing approaches
  • Detailed analysis
    • Clear improvement for classes with well-defined motion
    • Low performance for classes where the object matters more than the motion
  • Results on Kinetics
    • Accuracy decreased by 1~2%
    • Reasons of decrease
      • Actors are partially visible
      • Erratic camera motion
      • Multiple shots per clip
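
For the multi-stream results in this section, one common way to combine a PoTion classifier with other streams is late fusion of class scores. The sketch below averages softmax probabilities across streams; whether the paper fuses scores in exactly this way is not stated in these notes, so the scheme, the optional weights, and the function names are assumptions.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def late_fusion(stream_logits, weights=None) -> np.ndarray:
    """Fuse per-stream logits, each of shape (N, K), into (N, K) probabilities."""
    probs = np.stack([softmax(np.asarray(l, dtype=float)) for l in stream_logits])
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # keep the output a distribution
    # Weighted average over the stream axis.
    return np.tensordot(weights, probs, axes=1)
```

The predicted class for each clip is then `late_fusion(...).argmax(axis=-1)`.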

6. Conclusion

  • PoTion encodes the motion over entire clip into fixed-size representation.
  • Classification with PoTion representation and shallow CNN.
  • Leads to state-of-the-art performance on JHMDB, HMDB, and UCF-101.

Our discussion

  • Why not use 3D convolution for capturing temporal features?