Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text


This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos. Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality.

PDF Poster


Our goal is to integrate external linguistic knowledge into existing CNN-RNN based video captioning models.

Integrating External Linguistic Knowledge.

We propose techniques for incorporating distributed word embeddings and monolingual language models, both trained on large external text corpora, to improve the grammar and descriptive quality of the captioning model.
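One way to picture the word-embedding part is as an initialization step: the captioning model's embedding table is filled from vectors learned on external text, and words without a pretrained vector fall back to small random values. The sketch below is only an illustration under that assumption. The function name `build_embedding_table`, the toy vectors, and the random range are hypothetical and are not taken from the released code.

```python
import random


def build_embedding_table(vocab, pretrained, dim, seed=0):
    """Return one embedding row per vocabulary word.

    Words found in `pretrained` reuse their external vector; all other
    words get a small random vector (an assumed fallback, not the
    authors' documented choice).
    """
    rng = random.Random(seed)
    table = []
    for word in vocab:
        vec = pretrained.get(word)
        if vec is None:
            # Out-of-vocabulary word: small random initialization.
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        elif len(vec) != dim:
            raise ValueError(f"vector for {word!r} has wrong dimension")
        table.append(list(vec))
    return table


# Toy 3-dimensional "pretrained" vectors (made-up numbers).
pretrained = {"man": [0.2, -0.1, 0.4], "guitar": [0.7, 0.3, -0.2]}
vocab = ["<unk>", "man", "guitar"]
table = build_embedding_table(vocab, pretrained, dim=3)
```

In a real setup the table would be loaded into the network's embedding layer before training on the video description data.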

Late and Deep Fusion.

Late fusion and deep fusion techniques for integrating a language model into the S2VT video description network.
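As a rough illustration of late fusion, the next-word distributions of the video captioning model and the language model can be combined at each decoding step by a weighted average. The sketch below assumes this simple interpolation form. The weight `alpha`, the toy vocabulary, and the scores are hypothetical and chosen only to show the effect. Deep fusion, which combines the two networks inside the model rather than at the output, is not shown here.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(p_video, p_lm, alpha=0.5):
    """Interpolate two next-word distributions over the same vocabulary.

    alpha weights the video model; (1 - alpha) weights the language model.
    """
    if len(p_video) != len(p_lm):
        raise ValueError("distributions must share one vocabulary")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * pv + (1.0 - alpha) * pl for pv, pl in zip(p_video, p_lm)]


# Toy example: made-up scores for the next word at one decoding step.
vocab = ["a", "man", "is", "playing", "guitar"]
p_video = softmax([0.1, 2.0, 0.3, 0.2, 1.9])  # video model slightly prefers "man"
p_lm = softmax([0.5, 0.2, 0.1, 0.3, 2.5])     # language model prefers "guitar"

fused = late_fusion(p_video, p_lm, alpha=0.6)
best_word = vocab[max(range(len(vocab)), key=fused.__getitem__)]
```

Because both inputs are probability distributions and the weights sum to one, the fused scores also sum to one. In this toy case the language model tips the choice from "man" to "guitar".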


Sample clips from YouTube with model output.

Movie Description Examples (Cherries).

Movie Description Examples (Lemons, i.e., cases where the model makes errors).


The code to prepare data and train the model can be found in:

Download pre-trained model: InDomain_DeepFusion_Model (741MB)
Vocabulary: language_fusion_vocabulary
Evaluation Code:


Caffe Compatibility
The network is currently supported by the recurrent branch of the Caffe fork in my repository or Jeff's repository, but it is not yet compatible with the master branch of Caffe.


The datasets used in the paper are available at these links:

Microsoft Video Description Dataset (YouTube videos):
Project Page -
[Raw Data Download Link] [PROCESSED_DATA]
MPII Movie Description (MPII-MD) Dataset:
Montreal Video Annotation Description (M-VAD) Dataset:


If you find this useful in your work, please consider citing:

          title = {Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text},
          author = {Venugopalan, Subhashini and Hendricks, Lisa Anne and Mooney, Raymond and Saenko, Kate},
          booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
          year = {2016}