Overview¶
PyTorchVideo is an open source video understanding library that provides up to date builders for state of the art video understanding backbones, layers, heads, and losses addressing different tasks, including acoustic event detection, action recognition (video classification), action detection (video detection), multimodal understanding (acoustic visual classification), self-supervised learning.
The models subpackage contains definitions for the following model architectures and layers:
Acoustic Backbone
Acoustic ResNet
Visual Backbone
Self-Supervised Learning
Build standard models¶
PyTorchVideo provide default builders to construct state-of-the-art video understanding models, layers, heads, and losses.
Models¶
You can construct a model with random weights by calling its constructor:
import pytorchvideo.models as models
resnet = models.create_resnet()
acoustic_resnet = models.create_acoustic_resnet()
slowfast = models.create_slowfast()
x3d = models.create_x3d()
r2plus1d = models.create_r2plus1d()
csn = models.create_csn()
You can verify whether you have built the model successfully by:
import pytorchvideo.models as models
resnet = models.create_resnet()
B, C, T, H, W = 2, 3, 8, 224, 224
input_tensor = torch.zeros(B, C, T, H, W)
output = resnet(input_tensor)
Layers¶
You can construct a layer with random weights by calling its constructor:
import pytorchvideo.layers as layers
nonlocal = layers.create_nonlocal(dim_in=256, dim_inner=128)
swish = layers.Swish()
conv_2plus1d = layers.create_conv_2plus1d(in_channels=256, out_channels=512)
You can verify whether you have built the model successfully by:
import pytorchvideo.layers as layers
nonlocal = layers.create_nonlocal(dim_in=256, dim_inner=128)
B, C, T, H, W = 2, 256, 4, 14, 14
input_tensor = torch.zeros(B, C, T, H, W)
output = nonlocal(input_tensor)
swish = layers.Swish()
B, C, T, H, W = 2, 256, 4, 14, 14
input_tensor = torch.zeros(B, C, T, H, W)
output = swish(input_tensor)
conv_2plus1d = layers.create_conv_2plus1d(in_channels=256, out_channels=512)
B, C, T, H, W = 2, 256, 4, 14, 14
input_tensor = torch.zeros(B, C, T, H, W)
output = conv_2plus1d(input_tensor)
Heads¶
You can construct a head with random weights by calling its constructor:
import pytorchvideo.models as models
res_head = models.head.create_res_basic_head(in_features, out_features)
x3d_head = models.x3d.create_x3d_head(dim_in=1024, dim_inner=512, dim_out=2048, num_classes=400)
You can verify whether you have built the head successfully by:
import pytorchvideo.models as models
res_head = models.head.create_res_basic_head(in_features, out_features)
B, C, T, H, W = 2, 256, 4, 14, 14
input_tensor = torch.zeros(B, C, T, H, W)
output = res_head(input_tensor)
x3d_head = models.x3d.create_x3d_head(dim_in=1024, dim_inner=512, dim_out=2048, num_classes=400)
B, C, T, H, W = 2, 256, 4, 14, 14
input_tensor = torch.zeros(B, C, T, H, W)
output = x3d_head(input_tensor)
Losses¶
You can construct a loss by calling its constructor:
import pytorchvideo.models as models
simclr_loss = models.SimCLR()
You can verify whether you have built the loss successfully by:
import pytorchvideo.models as models
import pytorchvideo.layers as layers
resnet = models.create_resnet()
mlp = layers.make_multilayer_perceptron(fully_connected_dims=(2048, 1024, 2048))
simclr_loss = models.SimCLR(mlp=mlp, backbone=resnet)
B, C, T, H, W = 2, 256, 4, 14, 14
view1, view2 = torch.zeros(B, C, T, H, W), torch.zeros(B, C, T, H, W)
loss = simclr_loss(view1, view2)
Build customized models¶
PyTorchVideo also supports building models with customized components, which is an important feature for video understanding research. Here we take a standard stem model as an example, show how to build each resnet components (head, backbone, stem) separately, and how to use your customized components to replace standard components.
from pytorchvideo.models.stem import create_res_basic_stem
# Create standard stem layer.
stem = create_res_basic_stem(in_channels=3, out_channels=64)
# Create customized stem layer with YourFancyNorm
stem = create_res_basic_stem(
in_channels=3,
out_channels=64,
norm=YourFancyNorm, # GhostNorm for example
)
# Create customized stem layer with YourFancyConv
stem = create_res_basic_stem(
in_channels=3,
out_channels=64,
conv=YourFancyConv, # OctConv for example
)
# Create customized stem layer with YourFancyAct
stem = create_res_basic_stem(
in_channels=3,
out_channels=64,
activation=YourFancyAct, # Swish for example
)
# Create customized stem layer with YourFancyPool
stem = create_res_basic_stem(
in_channels=3,
out_channels=64,
pool=YourFancyPool, # MinPool for example
)