This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos. Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality.
Our goal is to integrate external linguistic knowledge into existing CNN-RNN based video captioning models.
We propose techniques to incorporate distributed word embeddings and monolingual language models trained on large external text corpora to improve the grammar and descriptive quality of the captioning model.
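As a rough illustration of the word-embedding side of this idea, the Python sketch below seeds a captioner's word-embedding matrix with pretrained distributional vectors (e.g., GloVe). This is not the released code; the function name, toy vocabulary, and 4-dimensional vectors are illustrative assumptions.

```python
# A minimal sketch (assumptions, not the released code) of seeding a captioner's
# word-embedding matrix with pretrained distributional vectors such as GloVe.
# The toy vocabulary and 4-d vectors below are illustrative only.
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim):
    """Copy pretrained vectors into an embedding matrix; random-init the rest."""
    emb = np.random.uniform(-0.08, 0.08, (len(vocab), dim)).astype(np.float32)
    for word, vec in pretrained.items():
        if word in vocab:
            emb[vocab[word]] = np.asarray(vec, dtype=np.float32)
    return emb

# Toy usage: in practice `pretrained` would be parsed from a GloVe/word2vec file.
vocab = {"<unk>": 0, "a": 1, "man": 2, "is": 3, "playing": 4}
pretrained = {"man": [0.1, -0.2, 0.3, 0.0], "playing": [0.4, 0.1, -0.1, 0.2]}
W_embed = build_embedding_matrix(vocab, pretrained, dim=4)
# W_embed then replaces the random initialization of the captioner's
# word-embedding layer (and can optionally be kept fixed during training).
```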
Late fusion and deep fusion techniques to integrate a language model into the S2VT video description network.
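The sketch below gives a hedged, simplified view of the two fusion strategies in the figure above: late fusion mixes the captioner's and the language model's next-word distributions at each decoding step, while deep fusion combines their hidden states before the word softmax. The mixing weight, layer sizes, and random weights are placeholders, not values from the paper or the released model.

```python
# Hedged sketch of late vs. deep fusion of an LSTM language model with a
# video captioner (S2VT stands in as the captioner; states are NumPy toys).
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(p_caption, p_lm, alpha=0.5):
    """Late fusion: mix the two next-word distributions at each decode step."""
    return alpha * p_caption + (1.0 - alpha) * p_lm

def deep_fusion(h_caption, h_lm, W_joint, W_out):
    """Deep fusion: combine hidden states, then predict words from the joint state."""
    joint = np.tanh(np.concatenate([h_caption, h_lm], axis=-1) @ W_joint)
    return softmax(joint @ W_out)

# Toy usage with hypothetical sizes: 1000-word vocab, 512-d hidden states.
V, H = 1000, 512
p_cap, p_lm = softmax(np.random.randn(V)), softmax(np.random.randn(V))
p_late = late_fusion(p_cap, p_lm, alpha=0.7)

h_cap, h_lm = np.random.randn(H), np.random.randn(H)
W_joint, W_out = np.random.randn(2 * H, H) * 0.01, np.random.randn(H, V) * 0.01
p_deep = deep_fusion(h_cap, h_lm, W_joint, W_out)
```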
Sample clips from YouTube with model output.
Movie Description Examples (Cherries, i.e., the model does well).
Movie Description Examples (Lemons, i.e., the model makes errors).
The code to prepare the data and train the model can be found at:
https://github.com/vsubhashini/caffe/tree/recurrent/examples/language_fusion
Download pre-trained model: InDomain_DeepFusion_Model (741MB)
Vocabulary:
language_fusion_vocabulary
Evaluation Code:
https://github.com/vsubhashini/caption-eval
Note: The code and models use the recurrent branch of the Caffe fork in my repository or Jeff's repository, but are not yet compatible with the master branch of Caffe.
The datasets used in the paper are available at these links:
If you find this useful in your work, please consider citing:
@inproceedings{venugopalan16emnlp,
  title     = {Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text},
  author    = {Venugopalan, Subhashini and Hendricks, Lisa Anne and Mooney, Raymond and Saenko, Kate},
  booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2016}
}