Sequence to Sequence - Video to Text


Real-world videos often have complex dynamics, and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (a sequence of frames) and output (a sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to generate a description of the event in the video clip. Our model is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).

PDF Slides Poster



An overview of the S2VT video to text architecture.
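As a rough illustration of how a variable-length frame sequence can be paired with a variable-length word sequence, the sketch below lays out the per-time-step inputs for an encode-then-decode pass: frames are read first while the word input is padded, then the frame input is padded while words are predicted one at a time. The function name `s2vt_schedule`, the `<pad>`, `<BOS>` and `<EOS>` token strings, and the use of plain strings as stand-ins for frame features are assumptions made for this example; it is not the released Caffe implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical token strings chosen for this sketch.
PAD = "<pad>"
BOS = "<BOS>"
EOS = "<EOS>"

Step = Tuple[str, str, Optional[str]]


def s2vt_schedule(frames: List[str], words: List[str]) -> List[Step]:
    """Return (frame_input, word_input, target) for every time step.

    Encoding steps carry a frame, a padded word input, and no target
    (no loss is computed while the frames are being read).
    Decoding steps carry a padded frame input, the previous word,
    and the next word as the prediction target.
    """
    steps: List[Step] = []

    # Encoding stage: one step per frame, nothing to predict yet.
    for frame in frames:
        steps.append((frame, PAD, None))

    # Decoding stage: start from <BOS> and predict each word, then <EOS>.
    previous = BOS
    for word in words + [EOS]:
        steps.append((PAD, previous, word))
        previous = word

    return steps


if __name__ == "__main__":
    for step in s2vt_schedule(["frame1", "frame2", "frame3"], ["a", "man", "runs"]):
        print(step)
```

The total number of steps is the number of frames plus the number of words plus one (for the end-of-sentence target), so neither the input nor the output length has to be fixed in advance.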

ICCV 2015 Spotlight Video.


Sample clips from MPII-MD dataset.

Sample clips from M-VAD dataset.


The code to prepare data and train the model can be found in:

Model information: GitHub_Gist
MSVD Data: Pre-processed MSVD Video Data
Download pre-trained model: S2VT_VGG_RGB_MODEL (333MB)
Vocabulary: S2VT_vocabulary
Evaluation Code:


Caffe Compatibility
The network is currently supported by the recurrent branch of the Caffe fork in my repository or Jeff's repository, but it is not yet compatible with the master branch of Caffe.


The datasets used in the paper are available at these links:

Microsoft Video Description Dataset (YouTube videos):
Project Page -
MPII Movie Description (MPII-MD) Dataset:
Montreal Video Annotation Description (M-VAD) Dataset:


If you find this useful in your work please consider citing:

        @inproceedings{venugopalan15iccv,
          title = {Sequence to Sequence -- Video to Text},
          author = {Venugopalan, Subhashini and Rohrbach, Marcus and Donahue, Jeff
                    and Mooney, Raymond and Darrell, Trevor and Saenko, Kate},
          booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
          year = {2015}
        }