Video dialogue: Introduction

7 minute read


Historically, having a system which can discuss as well as interact with you about the football matches/ movies.. with its own knowledge has been considered a very ambitious goal. More than the current AI Visual Model nowsaday, that system must be able to infer video from the past, describe the present, and predict the future. In other words, our system’s capacity must be enough reproduce human intelligent level in video understanding.

Imagine that our system can play the role as 1 of these 2 analysts, after/ while watching the game, it can:

  • Create a conversation, give comments on its opinion, and then answer to our questions.
  • Analyze the match with us

Take the simple dialogue below as example

Person 1: “The last shot is too bad!”
Person2: “It seems like he was trying to pass to number 19.”

As human, the comment from Person 1 is very straight forward and we can easily give the comment like what Person 2 said without major inconveniences. However, as AI system, it must understand the content of the video then connect the information with the question to know which situation that he is talking about (which situation is referring for the last shot). With the rapid development of Deep Learning, we have a lot of high quality research about Video Question Answering topic, but now we are ready to take a step toward Video dialogue System, which I will introduce in this blog.

Video dialogue Definition

Given a input Video I, a dialogue D containing a sequence of t-1 turns, each of which is a pair of question Qi and answer Ai of turn from 1 to t-1. Our objective is produce the output answer At for the question of turn t Qt

Why video and dialogue?

Let me first try to motivate a little bit, why this intersection is interesting? I think one of the reason is videos are everywhere nowsaday, they describe our life, our motions in the most reality way, on the other hand, dialogue are how we communicate, how make people different. Therefore, if we have a way to connect these two that can lead to a lot of interesting applications. How do you think about chatting with a virtual companion having exactly the same amount of knowledge and life experience like you about the movie while you are watching it?

The main difference of Video dialogue problem is that the search and the reasoning part must be performed over the content of an video and the dialogue at the same time. So to answer the next question, first, our model should be able to remember the information from all previous questions and answers, and then depend on our what the type of current question is, it have to localize the event related to our question and finally do some Computer Vision task (classification, detection).

The table below compares the Video dialogue problem with other problem need model be capable of reasoning. You can see that to solve the video dialogue problem, we need to face with the challenge from understanding visual, temporal of video; memorization questions and answers from dialogue. It is clearly a Multi modality AI Research Problem involving CV, NLP and Knowledge Representation & Reasoning.

Available Dataset

When talking about AI research, the dataset is the key issue to define metrics as well as fair comparison between different approaches, or even to measure how much AI model perform better/worse than human.

Since Video dialogue model includes too much abilities, the benchmark dataset should have enough annotations label to help analyze dialogueue systems and understand their linguistic and visual reasoning capability and limitations in isolation. Further than downstream dataset for Computer Vision or NLP, it should be also explicitly designed to minimise biases that models can exploit without actual reasoning.

The dataset is used for this problem is AVSD dataset which was used in one of the challenge tracks in DSTC7 (The track is also called AVSD). Hereunder is the challenge website for dataset information

This video-dialogue dataset build on top of the Charades videos - a dataset for Multi - Action recognition problem. Hereunder is one video from Charades video

For collecting the dialogues about short videos of Charades (each dialogue consisted of a sequence of questions and answers about an image), they used human to annotated this dataset. Each video will have two parties discussing about events in that. One of the two parties played the role of an answerer who had already watched the video. The answerer answered questions asked by their counterpart – the questioner. The questioner was not allowed to watch the whole video but were able to see the first, middle and last frames of the video as single static images. The two had 10 rounds of QA, in which the questioner asked about the events that happened between the frames. At the end, the questioner summarized the events in the video as a description. The example of dialogue is decribed below.

Evaluation metrics

Regarding evaluation metrics, currently they use standard metrics of text generation problem such as BLEU, ROUGE score to measure the accuracy between generated answers and ground-truth answers by human.

Currently Approach

Currently efforts try to solve the video-dialogue problem as a multi-modality research, which needs to percive and process information from different source (visual, language, audio) to have the general understanding.

In this blog, I want to introduce a research from Salesforce Research Asia, in which, authors directly solve the multi modality problem with a MultiModal Transformer Network (MTN) to encode videos and incorporate information from different modalities. You can find this paper at:

Basically, this paper proposed using a Transformer Network to percive the feature from multi source. The overview of MTN is illustrated above. It has 3 main components, (1) the Encoder component, which extract the feature of visual and audio from video, and lingustic feature from text information. For the visual, they used features extracted from I3D (a pretrained model for activity recognition problem), while the audio features are got from VGGish pretrained model. To project target and attent on multiple inputs, (2) The Decoder layer is used for text feature, and (3) The Query-Aware Attetion Auto Enconder for non-text features.


When it comes to AI problems, there are some parts that we should carefully thinking about.

  1. Data representation: In this video-dialogue problem, our data is video and dialogue which need the representation knowledge from both Computer Vision and NLP field. The most headache problem from video is always memory and computational ability of our computer. We are now unable to compute a DL model on every frames of a video with length 10-15 minutes (13800 frames). Meanwhile, although there are serveral of summarization methods for a dialogue, which one is the most suitable for this problem is still a big concern.
  2. Inferencing: How we connect the information from different source after extracted them? This is an important problems which need reasoning method over data.
  3. Learning: Which should be better way for a video-dialogue model to learn? Recently, unsupervised methods received many interest in research community, however, its performance still can not reach the supervised one. But with the dialogue, do people learn to create their own dialogue by supervised learning?

Thanks for reading! Feel free to contact me if you have any suggestion/ comment for my blog.