L18-Video
Video in NN
Introduction
Video: a 4D tensor, T x H x W x C, i.e. a sequence of frames (images).
Video Classification
- Input: T x H x W x C
- Output: K classes; labels are typically actions (verbs) rather than objects (nouns).
Problem: raw video is BIG, ~1.5 GB per minute uncompressed for SD (640x480x3)
Solution: train on short clips, e.g. T=16, H=W=112, at low fps
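A quick back-of-the-envelope check of the storage claim (assuming uncompressed frames at 30 fps, 1 byte per channel):

```python
# Uncompressed storage for one minute of SD video (assumes 30 fps, 1 byte per channel).
height, width, channels = 480, 640, 3
fps, seconds = 30, 60

bytes_per_frame = height * width * channels            # 921,600 bytes ~ 0.9 MB
bytes_per_minute = bytes_per_frame * fps * seconds     # ~1.66 GB

print(f"{bytes_per_minute / 1e9:.2f} GB per minute")   # ~1.66 GB, on the order of the ~1.5 GB figure
```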
Short-Term Models
Single-Frame CNN
simply run a 2D CNN on each frame independently and average the per-frame predictions :)
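A minimal PyTorch sketch of this baseline (the tiny backbone, class count, and shapes are illustrative, not the lecture's exact setup): fold time into the batch dimension, classify each frame, then average the scores.

```python
import torch
import torch.nn as nn

class SingleFrameCNN(nn.Module):
    """Run a 2D CNN on each frame independently, then average the per-frame class scores."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Tiny illustrative backbone; in practice use a pretrained 2D CNN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):                             # x: (N, T, C, H, W)
        N, T, C, H, W = x.shape
        x = x.reshape(N * T, C, H, W)                 # fold time into the batch dimension
        scores = self.backbone(x)                     # (N*T, K)
        return scores.reshape(N, T, -1).mean(dim=1)   # average over frames -> (N, K)

clip = torch.randn(2, 16, 3, 112, 112)                # N=2 clips, T=16 frames each
print(SingleFrameCNN()(clip).shape)                   # torch.Size([2, 10])
```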
Late Fusion
run a 2D CNN on each frame, then apply an MLP (after pooling/concat/flatten…) to the whole sequence of per-frame features
However, it is hard to compare low-level motion between frames :(
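A late-fusion sketch (shapes and backbone are illustrative): per-frame 2D CNN features are pooled over time only at the very end, then an MLP classifies the fused vector.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """2D CNN per frame, then fuse per-frame features (here: average over time) with an MLP."""
    def __init__(self, num_classes=10, feat_dim=64):
        super().__init__()
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # (N*T, feat_dim)
        )
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                                # x: (N, T, C, H, W)
        N, T, C, H, W = x.shape
        feats = self.frame_cnn(x.reshape(N * T, C, H, W))
        feats = feats.reshape(N, T, -1).mean(dim=1)      # fuse late: pool over time
        return self.mlp(feats)

print(LateFusion()(torch.randn(2, 16, 3, 112, 112)).shape)   # torch.Size([2, 10])
```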
Early Fusion
- Input: T x H x W x C
- Reshape: T x H x W x C -> 3T x H x W (C=3)
- First 2D conv: 3T x H x W -> D x H x W, then proceed as a standard 2D CNN
too extreme: all temporal information is fused in a single step at the very beginning
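An early-fusion sketch: reshape the clip so all T frames' color channels are stacked into one 3T-channel image, and let the very first 2D conv fuse them (sizes are illustrative).

```python
import torch
import torch.nn as nn

T, H, W, C, D = 16, 112, 112, 3, 64
clip = torch.randn(1, T, C, H, W)                  # (N, T, C, H, W)

x = clip.reshape(1, T * C, H, W)                   # stack time into channels: (N, 3T, H, W)
first_conv = nn.Conv2d(T * C, D, kernel_size=3, padding=1)
features = first_conv(x)                           # (N, D, H, W): temporal info fused in one shot
print(features.shape)                              # torch.Size([1, 64, 112, 112])
# From here on it is a standard 2D CNN; no later layer sees time explicitly.
```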
3D CNN / Slow Fusion
- Input: T x H x W x C
- Each layer: D x T x H x W, using 3D conv and 3D pooling
good: convolving over time makes the model shift-invariant in time, and temporal information is fused slowly across layers
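A slow-fusion sketch: 3D convolution and pooling keep a (D, T, H, W) activation at every layer, shrinking time gradually instead of collapsing it at once (layer sizes are illustrative).

```python
import torch
import torch.nn as nn

# Illustrative 3D CNN stem: each layer keeps a (D, T, H, W) activation.
net = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                   # halves T, H, W together
    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),
)

clip = torch.randn(1, 3, 16, 112, 112)             # (N, C, T, H, W)
print(net(clip).shape)                             # torch.Size([1, 64, 4, 28, 28])
```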
since 2014, 3D CNNs have improved a lot
C3D: the VGG of 3D CNNs (ICCV 2015)
but all of these models treat time and space symmetrically, which does not match reality 🤔; in fact, humans can recognize motion from just a few moving points
Two-Stream Networks
Classical computer vision already had methods to extract optical flow between adjacent frames
- Spatial CNN: extract features from a single frame
- Input: 3 x H x W
- Output: K classes
- Temporal CNN: extract features from a sequence of frames
    - Input: [2*(T-1)] x H x W, a stack of optical flow fields (horizontal and vertical displacement for each adjacent frame pair)
- Output: K classes
- Final output: weighted average of the two streams' class scores
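A two-stream sketch with hypothetical toy backbones (the real model uses full 2D CNNs and a learned or fixed fusion weight): the spatial stream sees one RGB frame, the temporal stream sees a stack of 2(T-1) flow channels, and the prediction is a weighted average of the two score vectors.

```python
import torch
import torch.nn as nn

def tiny_cnn(in_ch, num_classes=10):
    # Stand-in for a full 2D CNN backbone in each stream.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

T, H, W = 16, 112, 112
spatial_stream = tiny_cnn(3)                    # input: one RGB frame, (N, 3, H, W)
temporal_stream = tiny_cnn(2 * (T - 1))         # input: stacked flow, (N, 2(T-1), H, W)

frame = torch.randn(1, 3, H, W)
flow_stack = torch.randn(1, 2 * (T - 1), H, W)  # x/y displacement per adjacent frame pair

alpha = 0.5                                     # fusion weight (illustrative)
scores = alpha * spatial_stream(frame) + (1 - alpha) * temporal_stream(flow_stack)
print(scores.shape)                             # torch.Size([1, 10])
```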
Long-Term Models
Recurrent CNN
recall $$ h_t = f_W(h_{t-1}, x_t) $$
use an RNN (GRU/LSTM…) to process the sequence of per-frame features
a simple and beautiful idea: the CNN captures local structure within each frame, and the RNN models the temporal sequence
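A CNN+RNN sketch (toy backbone and dimensions are assumptions): a 2D CNN produces one feature vector per frame, an LSTM processes the sequence, and the last hidden state is classified.

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """2D CNN per frame for local spatial features, LSTM over time, classify the last hidden state."""
    def __init__(self, num_classes=10, feat_dim=64, hidden_dim=128):
        super().__init__()
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                        # x: (N, T, C, H, W)
        N, T, C, H, W = x.shape
        feats = self.frame_cnn(x.reshape(N * T, C, H, W)).reshape(N, T, -1)
        out, _ = self.rnn(feats)                 # h_t = f_W(h_{t-1}, x_t) over the feature sequence
        return self.fc(out[:, -1])               # use the final hidden state

print(CNNLSTM()(torch.randn(2, 16, 3, 112, 112)).shape)   # torch.Size([2, 10])
```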
Attention-based
Non-Local Neural Networks (CVPR 2018): spatio-temporal self-attention
trick: initialize the last conv weights of the non-local block to ZERO, so the residual block starts out as an identity and is easy to plug into a pretrained 3D CNN
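A simplified non-local (self-attention) block sketch for 3D features, assuming the embedded-Gaussian form; the zero-initialized final conv makes the block an identity at the start, as the trick above describes.

```python
import torch
import torch.nn as nn

class NonLocalBlock3d(nn.Module):
    """Simplified embedded-Gaussian non-local block for (N, C, T, H, W) features."""
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv3d(channels, inner, 1)   # queries
        self.phi = nn.Conv3d(channels, inner, 1)     # keys
        self.g = nn.Conv3d(channels, inner, 1)       # values
        self.out = nn.Conv3d(inner, channels, 1)
        # The trick: zero-init the final conv so the block is initially an identity mapping.
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, x):                            # x: (N, C, T, H, W)
        N, C, T, H, W = x.shape
        q = self.theta(x).reshape(N, -1, T * H * W)  # (N, C', THW)
        k = self.phi(x).reshape(N, -1, T * H * W)
        v = self.g(x).reshape(N, -1, T * H * W)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)        # (N, THW, THW)
        y = (v @ attn.transpose(1, 2)).reshape(N, -1, T, H, W)
        return x + self.out(y)                       # residual: output == input at init

x = torch.randn(1, 64, 4, 14, 14)
block = NonLocalBlock3d(64)
print(torch.allclose(block(x), x))                   # True, thanks to the zero init
```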
but what is the best 3D CNN?
- we do not want to reinvent those amazing 2D CNN architectures, so we can inflate them into 3D! (I3D, CVPR 2017)
- inflate both the conv ops and the weights (copy each 2D kernel from the pretrained CNN and repeat it T times along the time axis, rescaled by 1/T) into 3D: initialize from 2D, then fine-tune!
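An inflation sketch (the helper `inflate_conv2d` and its time kernel size are assumptions for illustration): a pretrained 2D kernel is repeated along a new time axis and rescaled by 1/T, so the inflated 3D conv reproduces the 2D network on a temporally constant "boring video".

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """Inflate a 2D conv into a 3D conv by repeating its weights along time and rescaling by 1/T."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w2d = conv2d.weight                                   # (out, in, kH, kW)
        w3d = w2d.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv3d = inflate_conv2d(conv2d)

# Sanity check on a "boring video" (the same frame repeated): the central time slice matches the 2D conv.
frame = torch.randn(1, 3, 32, 32)
video = frame.unsqueeze(2).repeat(1, 1, 5, 1, 1)              # (N, C, T, H, W)
print(torch.allclose(conv3d(video)[:, :, 2], conv2d(frame), atol=1e-5))  # True
```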
Without Optical Flow
SlowFast Networks (ICCV 2019): a slow, heavy pathway at low frame rate for spatial semantics plus a lightweight fast pathway at high frame rate for motion; no optical flow needed
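A sketch of the SlowFast sampling idea (frame counts are illustrative, roughly following the paper's alpha=8 frame-rate ratio): from the same raw clip, the slow pathway takes few frames and the fast pathway takes many.

```python
import torch

# From one raw clip, sample frames at two rates (alpha = fast/slow frame-rate ratio).
raw_clip = torch.randn(1, 3, 64, 224, 224)         # (N, C, T, H, W), 64 raw frames

alpha = 8
fast_input = raw_clip[:, :, ::2]                   # high frame rate: 32 frames for the fast pathway
slow_input = raw_clip[:, :, ::2 * alpha]           # low frame rate: 4 frames for the slow pathway

print(fast_input.shape, slow_input.shape)
# torch.Size([1, 3, 32, 224, 224]) torch.Size([1, 3, 4, 224, 224])
# The fast pathway uses far fewer channels, so it stays cheap despite seeing more frames;
# lateral connections fuse fast-pathway features into the slow pathway.
```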
Datasets
- Sports-1M: fine-grained sports action recognition from ~1M YouTube videos
- UCF101
- Kinetics-400
Other Tasks
- Temporal Action Localization
- Spatio-Temporal Action Detection