
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval, Andrew Rouditchenko et al., ICASSP 2023

Multilingual text-video retrieval methods have improved significantly in recent years, but performance for languages other than English still lags behind. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross-entropy-based objective which forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also analyze the effectiveness of different multilingual text models as teachers.
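
The objective can be pictured in a few lines of PyTorch. The following is a minimal sketch of a cross-entropy distillation loss over in-batch text-video similarity scores, assuming square similarity matrices and an illustrative temperature; the function and argument names are placeholders, not the repository's exact code:

    import torch
    import torch.nn.functional as F

    def c2kd_distillation_loss(student_sim, teacher_sim, temperature=0.07):
        """student_sim / teacher_sim: [batch, batch] text-video similarity
        matrices (student: non-English text; teacher: English text)."""
        # Teacher's softmax over candidate videos acts as the soft target.
        targets = F.softmax(teacher_sim / temperature, dim=-1)
        # Student's log-distribution over the same candidate videos.
        log_probs = F.log_softmax(student_sim / temperature, dim=-1)
        # Cross-entropy between the two distributions, averaged over the batch.
        return -(targets * log_probs).sum(dim=-1).mean()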

Check out our ICASSP presentation on YouTube!

[Figure: model architecture]

Demos

We support two demos:

(1) Multilingual text-video retrieval: given a text query and a candidate set of videos, rank the videos according to the text-video similarity.

(2) Multilingual text-video moment detection: given a text query and clips from a single video, find the most relevant clips in the video according to the text-video similarity. The ranking step shared by both demos is sketched below.
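
Both demos boil down to the same step: embed the query and the candidates, then sort by similarity. Below is a minimal sketch assuming pre-computed, hypothetical text and video embeddings; the actual demo notebooks use the trained C2KD model:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def rank_by_similarity(text_emb, video_embs):
        """text_emb: [dim] query embedding; video_embs: [n, dim] candidate
        video (or clip) embeddings. Returns indices sorted best-first."""
        text_emb = F.normalize(text_emb, dim=-1)
        video_embs = F.normalize(video_embs, dim=-1)
        sims = video_embs @ text_emb           # cosine similarities, [n]
        order = sims.argsort(descending=True)  # most similar candidate first
        return order, sims[order]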

The model in the demos was trained on Multi-MSRVTT on text-video pairs in English, Dutch, French, Mandarin, Czech, Russian, Vietnamese, Swahili, and Spanish. However, thanks to LaBSE's pre-training on over 100 languages (https://aclanthology.org/2022.acl-long.62.pdf), text-video retrieval works in many more languages, such as Ukrainian and Igbo (shown in the demo). You can try it in whatever language you speak or write.

Multilingual text-video retrieval demo: Open In Colab

Multilingual video moment detection demo: Open In Colab

Get started

This repository contains:

  • code for the main experiments
  • model weights to obtain main results
  • data for fine-tuning and evaluation on the Multi-MSRVTT, Multi-YouCook2, VATEX, and RUDDER datasets
  1. Create an environment (tested on May 1st, 2023):

    conda create python=3.8 -y -n c2kd
    conda activate c2kd
    conda install -y pytorch==1.11.0 cudatoolkit=10.2 -c pytorch
    pip install numpy==1.19.2 transformers==4.16.2 librosa==0.8.1 timm==0.5.4 scipy==1.5.2 gensim==3.8.3 sacred==0.8.2 humanize==3.14.0 braceexpand typing-extensions psutil ipdb dominate
    # optional - for neptune.ai experiment logging
    pip install numpy==1.19.2 neptune-sacred
    
  2. Download the model weights here and the data here. Extract the tars: mkdir data && tar -xvf data.tar.gz -C data and mkdir weights && tar -xvf weights.tar.gz -C weights. The extracted files should end up in the data and weights directories, respectively.

  3. See ./scripts/ for the commands to train the models with our proposed C2KD knowledge distillation, as well as the baseline translate-train and zero-shot (English-only training) methods.

  • Note: the results in the paper are the average of 3 runs, so your results might be slightly different from ours.
  • Note: for YouCook2, the final results are reported with S3D features from MIL-NCE, as the performance was better than with the CLIP features. We include both S3D and CLIP features for YouCook2 and MSR-VTT.

Experiment Logging

This repository uses Sacred with neptune.ai for logging and tracking experiments. To activate this:

  1. Create a neptune.ai account.
  2. Create a project and copy your credentials (api_token, project_name) into train.py.
  3. Add the --neptune flag to the training command (e.g. python train.py --neptune ...).
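
For reference, the Sacred-to-Neptune wiring typically looks like the sketch below. This assumes a recent neptune client with the neptune-sacred integration (older client versions exposed the same observer under a different module path); the project name and token are placeholders, and in this repo train.py handles this behind the --neptune flag:

    import neptune
    from neptune.integrations.sacred import NeptuneObserver
    from sacred import Experiment

    ex = Experiment("c2kd")
    run = neptune.init_run(
        project="your-workspace/c2kd",  # placeholder: your project_name
        api_token="YOUR_API_TOKEN",     # placeholder: your api_token
    )
    # Attach the observer so Sacred logs metrics and artifacts to neptune.ai.
    ex.observers.append(NeptuneObserver(run=run))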

Cite

If you use this code in your research, please cite:

@inproceedings{rouditchenko2023c2kd,
  title={C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval},
  author={Rouditchenko, Andrew and Chuang, Yung-Sung and Shvetsova, Nina and Thomas, Samuel and Feris, Rogerio and Kingsbury, Brian and Karlinsky, Leonid and Harwath, David and Kuehne, Hilde and Glass, James},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Contact

If you have any problems with the code or have a question, please open an issue or send an email.

Acknowledgments and Licenses

The main structure of the code is based on everything-at-once https://github.com/ninatu/everything_at_once and frozen-in-time https://github.com/m-bain/frozen-in-time, which itself is based on the pytorch-template https://github.com/victoresque/pytorch-template.

The code in davenet.py, layers.py, avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).
