MGTR-MISS Implementation

MGTR-MISS: More ground truth retrieving based multimodal interaction and semantic supervision for video description

Jiayu Zhang, Pengjie Tang, Yunlan Tan, Hanli Wang

Overview:

Describing a video with accurate words and appropriate sentence patterns is interesting and challenging. Recently, excellent models have been proposed to generate smooth and semantically rich video descriptions. However, language generally does not participate in training the encoder, and the different modalities, including vision and language, are not effectively interacted or accurately aligned. In this work, a novel model named MGTR-MISS, which consists of multimodal interaction and semantic supervision, is proposed to generate more accurate and semantically rich video descriptions with the help of more ground truth. In detail, more external language knowledge is first retrieved from the ground-truth corpus of the training set to capture richer linguistic semantics for the video. The visual features and the retrieved linguistic features are then fed into a multimodal interaction module to achieve effective interaction and accurate alignment between the modalities. The resulting multimodal representation is fed to a caption generator for language decoding with visual-textual attention and semantic supervision mechanisms. Experimental results on the popular MSVD, MSR-VTT and VATEX datasets show that our proposed MGTR-MISS outperforms not only the baseline model but also recent state-of-the-art methods. In particular, the CIDEr scores reach 111.1 and 55.0 on MSVD and MSR-VTT, respectively.
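To make the retrieval step concrete, the following is a minimal sketch of how the top-k reference sentences could be selected from the training corpus by cosine similarity between CLIP video and text embeddings. The function name, tensor shapes and mean-pooling choice are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of the "more ground truth" retrieval step: given CLIP
# features of a video and CLIP text features of all training captions, pick
# the top-k most similar captions as extra linguistic knowledge.
# Names and shapes are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def retrieve_topk_references(video_feats, corpus_text_feats, k=5):
    """
    video_feats:       (T, D) CLIP frame features of one video
    corpus_text_feats: (N, D) CLIP text features of all training captions
    returns the indices of the k most similar captions and their features
    """
    # mean-pool frames into a single video-level query vector
    query = F.normalize(video_feats.mean(dim=0, keepdim=True), dim=-1)  # (1, D)
    keys = F.normalize(corpus_text_feats, dim=-1)                       # (N, D)

    # cosine similarity between the video and every training caption
    sims = query @ keys.t()                                             # (1, N)
    topk = sims.topk(k, dim=-1).indices.squeeze(0)                      # (k,)
    return topk, corpus_text_feats[topk]                                # (k, D)
```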

Method:

As presented in Fig. 1, at the encoding stage, visual and linguistic features are first extracted from a given video and from all descriptions in the training set with a pre-trained model (e.g., CLIP), respectively. The visual features and the features of the top-k (k ≥ 1) retrieved references are integrated into a multimodal representation, which is fed into an interactive encoder to capture and align more comprehensive spatio-temporal information from vision and language. For decoding, a dual decoding architecture is constructed, which aims to establish stronger connections between vision and language, enabling a seamless conversion from visual to linguistic representations by jointly training the two modal branches. Furthermore, the retrieved language feature is utilized as prior knowledge to perform semantic correction at each decoding time step, which alleviates the problem of error accumulation in autoregressive decoding. In contrast to previous works where external corpora are employed merely as input sources, our approach focuses on systematic interaction between the different modalities and on integrating semantic knowledge throughout the entire training process rather than at isolated stages.
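The interactive encoder can be pictured as attention layers operating jointly over visual tokens and retrieved linguistic tokens. The sketch below, built on a standard Transformer encoder with learned modality embeddings, is an assumed approximation of that interaction module, not the exact MGTR-MISS configuration.

```python
# Illustrative sketch of the multimodal interaction encoder: visual frame
# features and retrieved reference features are concatenated and passed
# through self-attention layers so the two modalities can attend to each
# other. Layer sizes and the fusion strategy are assumptions for clarity.
import torch
import torch.nn as nn

class InteractiveEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # learned embeddings marking which tokens are visual vs. linguistic
        self.modality_emb = nn.Embedding(2, dim)

    def forward(self, vis_feats, lang_feats):
        # vis_feats: (B, T, D) frame features; lang_feats: (B, k, D) retrieved refs
        vis = vis_feats + self.modality_emb.weight[0]
        lang = lang_feats + self.modality_emb.weight[1]
        tokens = torch.cat([vis, lang], dim=1)   # (B, T+k, D)
        return self.encoder(tokens)              # fused multimodal representation
```

Concatenating the two token streams lets every visual token attend to every retrieved caption token and vice versa, which is one simple way to realize the cross-modal interaction and alignment described above.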

Fig. 1. The overview of the proposed MGTR-MISS architecture.
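To illustrate the semantic supervision idea at decoding time, the following sketch applies a gated correction that mixes the decoder state with a pooled feature of the retrieved references before predicting the next word. The gating form, dimensions and class name are hypothetical and only meant to convey the mechanism, not the paper's exact formulation.

```python
# Rough sketch of semantic correction during decoding: at each step, the
# decoder hidden state is nudged toward the retrieved reference semantics
# via a learned gate before the next-word prediction. All details here are
# assumptions made for illustration.
import torch
import torch.nn as nn

class SemanticGuidedStep(nn.Module):
    def __init__(self, dim=512, vocab_size=10000):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, hidden_t, sem_prior):
        # hidden_t:  (B, D) decoder state at the current time step
        # sem_prior: (B, D) pooled feature of the retrieved references
        g = torch.sigmoid(self.gate(torch.cat([hidden_t, sem_prior], dim=-1)))
        corrected = g * hidden_t + (1.0 - g) * sem_prior   # semantic correction
        return self.out(corrected)                          # next-word logits
```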

Results:

Extensive experiments are conducted on three widely used video captioning datasets, MSVD, MSR-VTT and VATEX, to rigorously evaluate the effectiveness and superiority of the proposed method. The popular metrics BLEU, METEOR, ROUGE-L and CIDEr are used to evaluate the proposed models. The performance of the proposed model and other popular works is shown in Table 1, Table 2 and Table 3.
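For reference, captioning scores such as BLEU, METEOR, ROUGE-L and CIDEr are commonly computed with the pycocoevalcap toolkit. The snippet below shows this standard evaluation recipe (not the authors' exact script), assuming captions are already tokenized; METEOR additionally requires a Java runtime.

```python
# Common captioning-evaluation recipe using the pycocoevalcap toolkit.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def evaluate(gts, res):
    """gts/res: dicts mapping video id -> list of tokenized caption strings."""
    scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
               (Meteor(), "METEOR"),
               (Rouge(), "ROUGE-L"),
               (Cider(), "CIDEr")]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)
        if isinstance(name, list):    # Bleu returns one score per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results

# toy usage with a single video
print(evaluate({"vid1": ["a man is playing a guitar"]},
               {"vid1": ["a man plays the guitar"]}))
```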

Table 1 Performance comparison with the state-of-the-art methods on the MSVD dataset. The best results are highlighted in bold and the second-best results are indicated with underscores.

Table 2 Performance comparison with the state-of-the-art methods on the MSR-VTT dataset.

Table 3 Performance comparison with the state-of-the-art methods on the VATEX dataset.

As shown in Fig. 2, examples including videos and corresponding references (GT) are selected from the MSR-VTT dataset to intuitively present and compare the sentences generated by MAN, the Baseline and our proposed MGTR-MISS, respectively.

Fig. 2 Examples from the MSR-VTT dataset. "GT", "MAN", "Baseline", and "MGTR-MISS" respectively denote the ground-truth sentences and the descriptions generated by the MAN model, the baseline model, and the proposed MGTR-MISS model.

Source Code:

MGTR-MISS1.0.zip

Citation:

Jiayu Zhang, Pengjie Tang, Yunlan Tan, and Hanli Wang, “MGTR-MISS: More ground truth retrieving based multimodal interaction and semantic supervision for video description,” Neural Networks, vol. 192, pp. 107817:1-13, Jul. 2025.

Multimedia and Intelligent Computing Lab (MIC)