Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition


Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, Wei Zhang
The University of Sydney, SenseTime Research, The Chinese University of Hong Kong



Abstract

  • A novel and compact motion representation, named Optical Flow guided Feature (OFF)
  • OFF can be embedded into any existing framework.


1. Introduction

  • Temporal information is key for video action recognition.
  • Optical flow is a useful motion representation, but it is expensive to compute.
  • 3D CNNs do not perform as well as two-stream networks that use optical flow.
  • OFF is a new feature-level representation derived from the orthogonal space of optical flow.
    • Spatial gradients of feature maps in horizontal, vertical directions
    • Temporal gradients
  • Hand-crafted features
  • Deep-features
    • Optical flow
    • 3D CNN
    • RNN
  • OFF
    • Well captures the motion patterns
    • Complementary to other motion representations

3. Optical Flow Guided Feature: OFF

  • Optical flow
    • I(x, y, t): pixel at location (x, y) of frame t
    • (Δx, Δy): spatial pixel displacement along each axis
    • Brightness constancy: I(x, y, t) = I(x + Δx, y + Δy, t + Δt)
  • Apply at the feature level
    • f(I; w): mapping function for extracting features from image I
    • w: parameters in f
  • According to the definition of optical flow
    • ∂f/∂x · v_x + ∂f/∂y · v_y + ∂f/∂t = 0
    • (v_x, v_y): feature-level optical flow
  • OFF: [∂f/∂x, ∂f/∂y, ∂f/∂t]
    • Orthogonal to the feature-level optical flow vector (v_x, v_y, 1) and changes as it changes.
    • Encodes spatio-temporal information orthogonally and complementarily to the feature-level optical flow.
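The argument can be written out as a short derivation (notation: f is the feature mapping, (v_x, v_y) the feature-level optical flow; this is my compact restatement of the paper's constraint, not a quote):

```latex
% Brightness constancy lifted to the feature level:
%   f(I; w)(x, y, t) = f(I; w)(x + \Delta x, y + \Delta y, t + \Delta t)
% A first-order Taylor expansion, divided by \Delta t, gives
\frac{\partial f}{\partial x}\, v_x
  + \frac{\partial f}{\partial y}\, v_y
  + \frac{\partial f}{\partial t} = 0,
\qquad\text{i.e.}\qquad
\underbrace{\left[\frac{\partial f}{\partial x},\;
                  \frac{\partial f}{\partial y},\;
                  \frac{\partial f}{\partial t}\right]}_{\text{OFF}}
\cdot \left[v_x,\; v_y,\; 1\right]^{\top} = 0 .
```

So the OFF vector is, by construction, orthogonal to the feature-level flow vector (v_x, v_y, 1).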

4. Using Optical Flow Guided Feature in CNN

4.1. Network Architecture

Feature Generation Sub-network


  • BN-Inception for extracting feature map

OFF Sub-network


  • 1x1 convolutional layer
  • Apply Sobel operator for spatial gradients
  • Element-wise subtraction for temporal gradients
  • Concatenate features from lower level.
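A minimal NumPy sketch of what one OFF unit computes, assuming a single-channel feature map after the 1×1 convolution (the helper `_conv3x3` and the function name `off_unit` are mine, not the paper's):

```python
import numpy as np

def _conv3x3(x, k):
    """3x3 cross-correlation with edge padding (helper for the Sobel filters)."""
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

# Sobel kernels for the horizontal and vertical spatial gradients
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def off_unit(f_t, f_t1):
    """Stack the spatial Sobel gradients of f_t with the temporal difference f_t1 - f_t."""
    return np.stack([_conv3x3(f_t, SOBEL_X), _conv3x3(f_t, SOBEL_Y), f_t1 - f_t])
```

In the paper these three maps are then concatenated with features from the lower level; here the sketch only shows the per-level computation.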

Classification Sub-network

  • One inner-product (fully connected) classifier for the features at each level
  • The classification scores are averaged

4.2. Network Training

  • F_{t, l}: feature of the t-th segment at level l
  • Classification score of level l: G_l = H(G_{1, l}, ..., G_{T, l})
    • H is average pooling, summarizing the per-segment scores
  • Cross-entropy loss computed for each level
    • C: number of categories
    • y: ground-truth class label
  • Two-stage training
    • Train the feature generation sub-network first.
    • Train the OFF and classification sub-networks with the feature generation network frozen.
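A toy sketch of the per-level loss described above, assuming plain average pooling over segment scores followed by softmax cross-entropy (the function name `level_loss` is my own):

```python
import numpy as np

def level_loss(scores, label):
    """scores: (T, C) per-segment class scores at one level; label: ground-truth index.
    Average the segment scores (the pooling H), then apply softmax cross-entropy."""
    g = scores.mean(axis=0)                       # summarize the T segment scores
    log_softmax = g - np.log(np.sum(np.exp(g)))   # numerically naive, fine for a sketch
    return -log_softmax[label]
```

With uniform (all-zero) scores over C classes this reduces to log C, the usual sanity check for a cross-entropy implementation.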

4.3. Network Testing

  • Test under the TSN framework
  • 25 segments are sampled from the RGB input
  • The t-th sampled segment is treated as frame I_t when computing the OFF

5. Experiments and Evaluations

5.1. Datasets and Implementation Details

  • UCF-101 / HMDB-51 datasets
  • 4 NVIDIA TITAN X GPUs
  • Caffe & OpenMPI
  • Train feature generation network by TSN method
  • Train OFF sub-networks from scratch with feature generation networks frozen.

5.2. Experimental Investigations of OFF

  • Efficiency
    • State-of-the-art accuracy among real-time methods


  • Effectiveness
    • Investigate the robustness of OFF when applied to different input modalities.


  • Comparison
    • 2.0%/5.7% gain compared with the baseline Two-Stream TSN


6. Conclusion

  • OFF is fast (~200 fps) and robust.
  • With only RGB input, the result is comparable to two-stream approaches.
  • OFF is complementary to other motion representations.


PoTion: Pose MoTion Representation for Action Recognition

Vasileios Choutas, Philippe Weinzaepfel, Jérôme Revaud, Cordelia Schmid
Inria, NAVER LABS Europe



Abstract

  • State-of-the-art methods for action recognition rely on two-stream networks.
  • PoTion processes appearance and motion jointly.
  • A fixed-size representation that encodes human pose motion over an entire clip.


1. Introduction

  • Human pose can be added as a stream to a multi-stream architecture
  • 3D skeletons
    • Limited to cases where such data is available
  • 2D poses
    • Used when the person is fully visible
    • Features (hand-crafted or CNN) extracted around the human joints
  • Zolfaghari et al.
    • Pose stream operating on semantic segmentation maps
    • FC + spatio-temporal CNN
  • Proposal: focus on the movement of a few relevant keypoints
    • A fixed-size representation that does not depend on the duration of the clip
  • Overview of PoTion
    • Obtain heatmaps for each human joint by running a pose estimator (Part Affinity Fields).
    • Colorize the heatmaps depending on the relative time of each frame.
    • Aggregate them into the PoTion representation for the clip.
    • Train a shallow CNN on it for action classification.
  • Contributions
    • PoTion, a clip-level representation that encodes the motion of human poses
    • A study of the PoTion representation and of CNNs on top of it for action recognition
    • Combining PoTion with two-stream architectures
  • CNNs for action recognition
    • 2D + RNN
    • 3D CNN
    • Two-stream
  • Motion representation
    • Optical flow, etc.
  • Pose representation
    • 3D skeleton
    • 2D pose
    • Approach of Zolfaghari et al.

3. PoTion representation

3.1. Extracting joint heatmaps

  • Obtain human joint heatmaps with the Part Affinity Fields pose estimator
  • Part Affinity Fields
    • Robust to occlusion and truncation
    • Outputs: joint heatmaps & part affinity fields
    • Only the joint heatmaps are used here
  • H_j^t[x, y]: likelihood of joint j being at pixel (x, y) in frame t

3.2. Time-dependent heat map colorization


  • ‘Colorize’ each heatmap according to the relative time of its frame in the clip
  • For C = 2 channels: one channel fades in linearly over time while the other fades out
  • For C > 2 channels: split the T frames into C − 1 regularly sampled intervals and interpolate between adjacent channels

3.3. Aggregation of colorized heatmaps


  • Compute the sum of the colorized heatmaps over all frames
  • Obtain an invariant representation by normalizing each channel independently
  • Compute an intensity image (sum over channels)
    • Encodes how much time a joint stays at each location
  • Normalized PoTion representation: divide by the intensity
    • All locations on the motion trajectory are weighted equally
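A small NumPy sketch of the colorization and aggregation steps for a single joint, assuming piecewise-linear ("tent") colorization weights, per-channel max-normalization, and a small `eps` to avoid division by zero (the function names and these exact choices are my assumptions):

```python
import numpy as np

def channel_weights(t, T, C):
    """Colorization weights o(t) for frame t (0-indexed) of T frames over C channels:
    tent functions that sum to 1 for every frame, with at most two nonzero entries."""
    s = t / (T - 1) * (C - 1)                      # position along the channel axis
    return np.maximum(0.0, 1.0 - np.abs(s - np.arange(C)))

def potion(heatmaps, C=3, eps=1e-8):
    """heatmaps: (T, H, W) heatmaps of one joint -> (S, U, N):
    colorized sum S (C, H, W), intensity image U (H, W), normalized PoTion N."""
    T = heatmaps.shape[0]
    S = sum(channel_weights(t, T, C)[:, None, None] * heatmaps[t] for t in range(T))
    S = S / (S.reshape(C, -1).max(axis=1)[:, None, None] + eps)  # per-channel normalization
    U = S.sum(axis=0)                   # how long the joint stayed at each location
    N = S / (U[None] + eps)             # weight every trajectory location equally
    return S, U, N
```

For C = 2 this reproduces the two-channel fade described above; larger C gives a finer temporal encoding at the cost of more input channels for the CNN.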

4. CNN on PoTion representation


  • Network architecture

    • 3 blocks with 2 convolutional layers each
    • 3 × 3 kernels; the first convolution of each block uses stride 2, the second stride 1
    • Global average pooling + FC + softmax after the last block
    • Batch normalization and ReLU after each convolution
  • Implementation details

    • Xavier initialization and train from scratch
    • Dropout of 0.25
    • Adam optimizer
    • Batch size of 32
    • 4 hours on single GPU (Titan X)
    • Data augmentation
      • Horizontal flipping (with the left/right joint channels swapped) was effective
      • Random cropping, smoothing the heatmaps, and shifting gave no gain

5. Experimental results


5.1. Datasets and metrics

  • HMDB
  • JHMDB
  • UCF101
  • Kinetics
  • Report mean classification accuracy.

5.2. PoTion representation

  • Number of channels
  • Aggregation techniques

5.3. CNN on PoTion

  • Data augmentation
  • Network Architecture
    • Number of convolution layers per block : 2
    • Number of blocks : 3
    • Number of convolution filters : (128, 256, 512)

5.4. Impact of pose estimation


  • Ground-truth pose: 4% gain
  • Cropping frames centered on the actor (GT-JHMDB): 6% gain

5.5. Comparison to the state of the art


  • Multi-stream approach
    • Up to +8% on TSN / up to +3% on I3D
  • Comparison to Zolfaghari et al.
    • Improvement due to the better pose motion representation
  • Comparison to the state of the art
    • Outperforms all existing approaches
  • Detailed analysis
    • Clear improvement on classes with well-defined motion
    • Lower performance when an object matters more than the motion
  • Results on Kinetics
    • Accuracy decreases by 1–2%
    • Reasons for the decrease
      • Actors are often only partially visible
      • Clips feature erratic camera motion
      • Multiple shots per clip

6. Conclusion

  • PoTion encodes the motion over an entire clip into a fixed-size representation.
  • Classification is done with a shallow CNN on the PoTion representation.
  • This leads to state-of-the-art performance on JHMDB, HMDB, and UCF-101.

Our discussion

  • Why not use 3D convolutions to capture temporal features?


After dinner, on my way home from school, I suddenly felt stifled, so I decided to go to Eungbongsan.

Eungbongsan is really close to Eungbong subway station.

The summit is quite high, yet you don't have to climb much to get there,

so it's a great place for a light walk up to enjoy the night view.

Come out of Exit 1 of Eungbong Station and just follow the marked route.

There are signs for Eungbongsan painted on the road along the way, so you can simply follow them.

There is a CU convenience store (Eungbongcho branch) on the way, so it's a good idea to pick up drinks or snacks.

There are quite a few benches at the summit of Eungbongsan, so it's a nice spot to have a drink.

If you follow the road up, there is a footbridge on the right.

Crossing the footbridge is fun in itself.

There are several ways to reach the summit of Eungbongsan.

Follow the main road until around the footbridge,

where a trail leads into the mountain; take it uphill and you will reach the summit.

At the summit there is a pavilion with its lights on.

The space up there is quite large.

You can enjoy the night view in every direction.

There is also a deck area where you can take in the night view.

Standing there, you can enjoy a view this pretty.

From the other side you can also see Namsan.

The lights are really beautiful.

Since it takes very little time from Eungbong Station (about 15 minutes?),

it's great for a quick outing.

Even though I don't hike much, the summit is high enough that you get a wonderful night view with very little effort.

The weather is great these days, so I highly recommend a walk up Eungbongsan!




