PoTion : Pose MoTion Representation for Action Recognition
Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, Cordelia Schmid
Inria, NAVER LABS Europe
Abstract
- State-of-the-art methods for action recognition rely on two-stream networks
- Try to process appearance and motion jointly
- Propose a fixed-size representation that encodes pose motion over the whole clip
1. Introduction
- Human pose can be added to multi-stream architecture
- 3D skeleton
- Limited to case where the data is available
- 2D poses
- Used when fully-visible
- Features (hand-crafted or CNN-based) extracted around the human joints
- Zolfaghari et al.
- Pose stream for semantic segmentation maps
- FC + spatio-temporal CNN
- Propose to focus on the movement of a few relevant keypoints
- Fixed-size representation that does not depend on the duration of the clip
- Overview of PoTion
- Obtain heatmaps for each human joint by running a pose estimator (Part Affinity Fields)
- Colorize the heatmaps depending on the relative time of the frame in the clip
- Obtain PoTion representation for clip.
- Train shallow CNN architecture for action classification.
- Contributions
- Clip-level representation that encodes human pose motion, PoTion
- Study PoTion representation and CNN for action recognition
- Combine PoTion with a two-stream architecture
2. Related Work
- CNNs for action recognition
- 2D + RNN
- 3D CNN
- Two-stream
- Motion representation
- Optical flow, etc.
- Pose representation
- 3D skeleton
- 2D pose
- Approach of Zolfaghari et al.
3. PoTion representation
3.1. Extracting joint heatmaps
- Obtain human joint heatmaps by Part Affinity Fields
- Part Affinity Fields
- Robust to occlusion and truncation
- Output : joint heatmaps & part affinity fields
- Use only the joint heatmaps
- The heatmap value at a pixel is the likelihood that this pixel contains the joint
3.2. Time-dependent heat map colorization
- ‘Colorize’ each joint heatmap according to the relative time of its frame in the clip
- For C = 2 channels
- Color goes linearly from the first channel at the first frame to the second channel at the last frame
- For multiple channels C
- Split the T frames into C − 1 regularly sampled intervals; within each interval, interpolate linearly between two adjacent channels
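The colorization step above can be sketched in numpy. This is a minimal illustration (function name and the interpolation details are our reading of the scheme, not code from the paper): each frame's heatmap is spread over two adjacent color channels with weights that depend on the frame's relative time.

```python
import numpy as np

def colorize(heatmaps, C=3):
    """Colorize per-frame joint heatmaps by relative time (illustrative sketch).

    heatmaps: array of shape (T, H, W) for one joint.
    Returns colorized maps of shape (T, H, W, C).
    """
    T, H, W = heatmaps.shape
    out = np.zeros((T, H, W, C))
    for t in range(T):
        s = t / (T - 1) if T > 1 else 0.0  # relative time in [0, 1]
        x = s * (C - 1)                    # position within the C-1 intervals
        c = min(int(x), C - 2)             # lower of the two adjacent channels
        u = x - c                          # linear interpolation weight
        out[t, :, :, c]     += (1 - u) * heatmaps[t]
        out[t, :, :, c + 1] += u * heatmaps[t]
    return out
```

Note that for every frame the channel weights sum to one, so no frame contributes more mass than another.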
3.3. Aggregation of colorized heatmaps
- Compute the sum of the colorized heatmaps over all frames
- Obtain a duration-invariant representation by normalizing each channel independently (divide by its maximum value)
- Compute the intensity image (sum over the color channels)
- Encodes how much time a joint stays at each location
- Normalized PoTion representation : divide each channel by the intensity image
- All locations of the motion trajectory are weighted equally
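The aggregation and the two normalizations can be sketched as follows (a numpy sketch under our reading of the notes; variable names U, I, N are ours):

```python
import numpy as np

def aggregate(colorized, eps=1.0):
    """Aggregate colorized heatmaps into PoTion variants (illustrative sketch).

    colorized: array of shape (T, H, W, C) for one joint.
    Returns (U, I, N): channel-normalized sum, intensity image,
    and intensity-normalized representation.
    """
    S = colorized.sum(axis=0)                 # (H, W, C): sum over all frames
    # divide each channel by its maximum -> invariant to clip duration
    ch_max = np.maximum(S.reshape(-1, S.shape[-1]).max(axis=0), 1e-8)
    U = S / ch_max
    I = U.sum(axis=-1, keepdims=True)         # intensity: time spent at location
    N = U / (I + eps)                         # weight whole trajectory equally
    return U, I.squeeze(-1), N
```

Dividing by the intensity removes the bias toward locations where a joint lingers, which is why all points on the motion trajectory end up weighted equally.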
4. CNN on PoTion representation
Network architecture
- 3 blocks with 2 convolutional layers in each block
- 3 × 3 kernels; first conv of each block with stride 2, second with stride 1
- Global average pooling + FC + softmax after all blocks
- Batch normalization, ReLU after each Conv.
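The architecture above can be written down in a few lines of PyTorch. This is a sketch, not the authors' code; the channel widths come from the notes (Section 5.3), and the input channel count assumed in the usage example (19 joints × 3 color channels) is illustrative.

```python
import torch
import torch.nn as nn

class PoTionNet(nn.Module):
    """Shallow CNN on PoTion maps (sketch; widths from the notes)."""
    def __init__(self, in_ch, num_classes, widths=(128, 256, 512)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:                        # 3 blocks
            # first conv downsamples (stride 2), second keeps resolution
            for stride in (2, 1):
                layers += [nn.Conv2d(prev, w, 3, stride=stride, padding=1),
                           nn.BatchNorm2d(w),   # BN + ReLU after each conv
                           nn.ReLU(inplace=True)]
                prev = w
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.fc = nn.Linear(prev, num_classes)  # softmax is applied in the loss

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)

# Usage (hypothetical input size): 57 = 19 joints x 3 color channels
net = PoTionNet(in_ch=57, num_classes=21)
logits = net(torch.zeros(2, 57, 64, 64))
```

Because the network only ever sees the fixed-size PoTion map, it stays shallow and trains from scratch in hours, in contrast to clip-length-dependent video CNNs.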
Implementation details
- Xavier initialization and train from scratch
- Dropout of 0.25
- Adam optimizer
- Batch size of 32
- 4 hours on single GPU (Titan X)
- Data augmentation
- Horizontal flipping (while swapping left/right joint channels) was effective
- Random cropping, heatmap smoothing, and shifting did not bring gains
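The flipping augmentation needs the joint-channel swap because a mirrored image turns a left wrist into a right wrist. A minimal numpy sketch (function name and joint indices are illustrative, not from the paper):

```python
import numpy as np

def flip_potion(potion, swap_pairs):
    """Horizontal flip with left/right joint swapping (illustrative sketch).

    potion: array of shape (J, H, W, C) -- one PoTion map per joint.
    swap_pairs: list of (left_joint, right_joint) index pairs to exchange,
                e.g. [(2, 5), (3, 6)] -- indices here are hypothetical.
    """
    flipped = potion[:, :, ::-1, :].copy()   # mirror the spatial x-axis
    for l, r in swap_pairs:
        flipped[[l, r]] = flipped[[r, l]]    # e.g. left wrist <-> right wrist
    return flipped
```

Without the swap, a flipped clip of "waving with the right hand" would be labeled with the wrong joint channels, which is presumably why naive flipping alone would not help.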
5. Experimental results
5.1. Datasets and metrics
- HMDB
- JHMDB
- UCF101
- Kinetics
- Report mean classification accuracy.
5.2. PoTion representation
- Number of channels
- Aggregation techniques
5.3. CNN on PoTion
- Data augmentation
- Network Architecture
- Number of convolution layers per block : 2
- Number of blocks : 3
- Number of convolution filters : (128, 256, 512)
5.4. Impact of pose estimation
- Groundtruth pose : 4% gain
- Crop frames centered on the actor (GT-JHMDB) : 6% gain
5.5. Comparison to the state of the art
- Multi-stream approach
- Up to +8% with TSN / up to +3% with I3D
- Comparison to Zolfaghari et al.
- Improvement due to improved pose motion representation
- Comparison to the state of the art
- Outperform all existing approaches
- Detailed analysis
- Clear improvement for actions with well-defined motion
- Low performance when an object matters more than the motion
- Results on Kinetics
- Accuracy decreased by 1~2%
- Reasons for the decrease
- Actors are only partially visible
- Erratic camera motion
- Multiple shots per clip
6. Conclusion
- PoTion encodes the motion over the entire clip into a fixed-size representation.
- Classification with PoTion representation and shallow CNN.
- Leads to state-of-the-art performance on JHMDB, HMDB, and UCF-101.
Our discussion
- Why not use 3D convolution for capturing temporal features?