Abstract
Real-world videos often have complex dynamics, and methods for generating
open-domain video descriptions should be sensitive to temporal structure and
allow both input (sequence of frames) and output (sequence of words) of variable
length. To approach this problem, we propose a novel end-to-end
sequence-to-sequence model to generate captions for videos. For this we exploit
recurrent neural networks, specifically LSTMs, which have demonstrated
state-of-the-art performance in image caption generation.
Our LSTM model is trained on video-sentence pairs and learns to associate a
sequence of video frames with a sequence of words in order to generate a
description of the event in the video clip. Our model is naturally able to learn
the temporal structure of the sequence of frames as well as the sequence model
of the generated sentences, i.e. a language model.
We evaluate several variants of our model that exploit different visual features on a
standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
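To make the variable-length input and output concrete, here is a minimal toy sketch in pure Python. It is not the Caffe implementation linked below: it uses a single LSTM layer with random, untrained weights, and every name in it (`TinyLSTM`, `caption`, `VOCAB`, the feature and hidden sizes) is a hypothetical chosen for illustration. It only shows the control flow: the same recurrent cell first reads all frames, then emits words one at a time until an end-of-sentence token appears.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """A single LSTM cell with random (untrained) weights, for illustration only."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n_in = input_size + hidden_size
        # Rows are grouped by gate: input, forget, output, candidate.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(4 * hidden_size)]
        self.b = [0.0] * (4 * hidden_size)

    def step(self, x, h, c):
        z = x + h  # concatenate the current input with the previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(self.W, self.b)]
        n = self.hidden_size
        i = [sigmoid(p) for p in pre[0:n]]
        f = [sigmoid(p) for p in pre[n:2 * n]]
        o = [sigmoid(p) for p in pre[2 * n:3 * n]]
        g = [math.tanh(p) for p in pre[3 * n:4 * n]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


# Hypothetical toy vocabulary; the real vocabulary is distributed separately.
VOCAB = ["<BOS>", "<EOS>", "a", "man", "is", "playing", "guitar"]


def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v


def caption(frames, lstm, W_out, max_words=10):
    """Read every frame, then greedily emit words until <EOS> or max_words."""
    n = lstm.hidden_size
    h, c = [0.0] * n, [0.0] * n
    no_word = [0.0] * len(VOCAB)      # word slot is zero-padded while encoding
    no_frame = [0.0] * len(frames[0])  # frame slot is zero-padded while decoding

    # Encoding stage: one step per frame, however many frames there are.
    for frame in frames:
        h, c = lstm.step(frame + no_word, h, c)

    # Decoding stage: feed back the previously chosen word at each step.
    words = []
    prev = VOCAB.index("<BOS>")
    for _ in range(max_words):
        h, c = lstm.step(no_frame + one_hot(prev, len(VOCAB)), h, c)
        scores = [sum(w * v for w, v in zip(row, h)) for row in W_out]
        prev = max(range(len(VOCAB)), key=scores.__getitem__)
        if VOCAB[prev] == "<EOS>":
            break
        words.append(VOCAB[prev])
    return words


rng = random.Random(0)
FEAT_DIM, HIDDEN = 8, 16  # assumed toy sizes
lstm = TinyLSTM(FEAT_DIM + len(VOCAB), HIDDEN, rng)
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in VOCAB]

# Two clips of different lengths go through exactly the same code path.
short_clip = [[rng.random() for _ in range(FEAT_DIM)] for _ in range(5)]
long_clip = [[rng.random() for _ in range(FEAT_DIM)] for _ in range(30)]
print(caption(short_clip, lstm, W_out))
print(caption(long_clip, lstm, W_out))
```

Because the weights are untrained, the printed words are meaningless. The point is only that neither the number of frames nor the number of output words is fixed in advance.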
PDF
Slides
Poster
Overview
An overview of the S2VT video to text architecture.
ICCV 2015 Spotlight Video.
Code
The code to prepare data and train the model can be found in:
https://github.com/vsubhashini/caffe/tree/recurrent/examples/s2vt
Model information: GitHub_Gist
MSVD Data: Pre-processed MSVD Video Data
Download pre-trained model: S2VT_VGG_RGB_MODEL (333MB)
Vocabulary: S2VT_vocabulary
Evaluation Code:
https://github.com/vsubhashini/caption-eval
Notes:
- Caffe Compatibility
  The network is currently supported by the recurrent branch of the Caffe fork in my repository or Jeff's repository, but it is not yet compatible with the master branch of Caffe.
Datasets
The datasets used in the paper are available at these links:
Microsoft Video Description Dataset (YouTube videos):
Project Page - http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
[Raw Data Download Link] [PROCESSED_DATA] [VIDEO FRAME FEATURES]
MPII Movie Description (MPII-MD) Dataset:
http://www.mpi-inf.mpg.de/movie-description
Montreal Video Annotation Description (M-VAD) Dataset:
http://www.mila.umontreal.ca/Home/public-datasets/montreal-video-annotation-dataset
Reference
If you find this useful in your work, please consider citing:
@inproceedings{venugopalan15iccv,
title = {Sequence to Sequence -- Video to Text},
author = {Venugopalan, Subhashini and Rohrbach, Marcus and Donahue, Jeff
and Mooney, Raymond and Darrell, Trevor and Saenko, Kate},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
year = {2015}
}