Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
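To make the approach concrete, here is a minimal sketch, written in PyTorch rather than the released Caffe code, of the mean-pool-plus-LSTM idea: per-frame CNN features are averaged over time, and the pooled vector conditions an LSTM that emits one word per step. The layer sizes, the single LSTM layer, and the concatenation of the pooled feature with each word embedding are illustrative assumptions, not the exact released architecture.

```python
# Illustrative sketch only (PyTorch, not the released Caffe code): mean-pooled
# frame features condition an LSTM that predicts one word per time step.
import torch
import torch.nn as nn

class MeanPoolLSTMCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=500, hidden_dim=1000, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)        # project pooled video feature
        self.word_embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(2 * embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)           # scores over the vocabulary

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) CNN features, one per sampled frame
        # captions:    (batch, seq_len) word indices of the sentence so far
        video = self.feat_proj(frame_feats.mean(dim=1))        # temporal mean pooling
        words = self.word_embed(captions)                      # (batch, seq_len, embed_dim)
        video = video.unsqueeze(1).expand(-1, words.size(1), -1)
        hidden, _ = self.lstm(torch.cat([words, video], dim=2))
        return self.out(hidden)                                # (batch, seq_len, vocab_size)
```

At test time, the pooled video feature would be paired with a begin-of-sentence token and words sampled one at a time until an end-of-sentence token is produced.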
The code to prepare data and train the model can be found at:
https://github.com/vsubhashini/caffe/tree/recurrent/examples/youtube
Model information: GitHub_Gist
Download pre-trained model: NAACL15_VGG_MEAN_POOL_MODEL (220MB)
Pre-processed video data: NAACL15_PRE-PROCESSED_DATA
Note: The code lives in the recurrent branch of the Caffe fork in my repository or Jeff's repository, but is not yet compatible with the master branch of Caffe.
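To give a sense of what the pre-processed video data contains, the sketch below extracts per-frame VGG fc7 features with Caffe's Python interface and mean-pools them over the sampled frames. The prototxt and caffemodel file names, the frame paths, the mean values, and the 'fc7' blob name are placeholder assumptions, not the actual layout of the files linked above.

```python
# Hedged sketch: per-frame VGG features mean-pooled into one video vector.
# File names, frame paths, and the 'fc7' blob name are placeholders.
import caffe
import numpy as np

caffe.set_mode_cpu()
net = caffe.Net('vgg16_deploy.prototxt',   # hypothetical deploy definition
                'vgg16.caffemodel',        # hypothetical weights file
                caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)  # process one frame at a time

# Standard Caffe preprocessing: HWC -> CHW, RGB -> BGR, 0-255 scale, mean subtraction.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', np.array([103.939, 116.779, 123.68]))  # assumed VGG mean
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

# Frames sampled from one video (placeholder paths).
frame_paths = ['frames/vid001/frame_%03d.jpg' % i for i in range(0, 300, 10)]

feats = []
for path in frame_paths:
    image = caffe.io.load_image(path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    feats.append(net.blobs['fc7'].data[0].copy())  # assumes an 'fc7' blob

video_feat = np.mean(feats, axis=0)  # temporal mean pooling over the sampled frames
```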
If you find this useful in your work, please consider citing:
@inproceedings{venugopalan:naacl15,
title={Translating Videos to Natural Language Using Deep Recurrent Neural Networks},
author={Venugopalan, Subhashini and Xu, Huijuan and Donahue, Jeff and
Rohrbach, Marcus and Mooney, Raymond and Saenko, Kate},
booktitle={{NAACL} {HLT}},
year={2015}
}
Also consider citing "Long-term Recurrent Convolutional Networks for Visual Recognition and Description" (Donahue et al., CVPR 2015).