
Kosmos-2: Grounding Multimodal Large Language Models to the World

[paper] [dataset] [online demo hosted by HuggingFace]


Checkpoints

The model can be loaded with HuggingFace's transformers library.
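
For example, a minimal loading sketch with transformers (this assumes the converted microsoft/kosmos-2-patch14-224 checkpoint on the HuggingFace Hub, which is separate from the kosmos-2.pt file downloaded below; see that model card for the full grounding example):

from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Assumes the converted checkpoint published on the HuggingFace Hub.
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("example.jpg")  # any local image (placeholder path)
prompt = "<grounding>An image of"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# post_process_generation splits the raw output into a caption and grounded entities.
caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)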

The checkpoint can be downloaded via:

DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vcy0yL2tvc21vcy0yLnB0P3N2PTIwMjMtMDEtMDMmc3Q9MjAyNC0wNC0xMFQxMyUzQTExJTNBNDRaJnNlPTIwNTAtMDQtMTFUMTMlM0ExMSUzQTAwWiZzcj1jJnNwPXImc2lnPTRjWEpJalZSWkhJQldxSGpQZ0RuJTJGMDFvY3pwRFdYaXBtUENVazNaOHZiUSUzRA==" | base64 --decode)
wget -O kosmos-2.pt $DLINK

Setup

  1. Download the recommended docker image and launch it:
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} nvcr.io/nvidia/pytorch:22.10-py3 bash
  2. Clone the repo:
git clone https://github.com/microsoft/unilm.git
cd unilm/kosmos-2
  3. Install the packages:
bash vl_setup_xl.sh

(Refer to this comment for detailed package info.)

Alternatively, you can refer to this guide to set up a conda environment.

Demo

We acknowledge ydshieh at HuggingFace for implementing an online demo.

If you would like to host a local Gradio demo, run the following command after setup:

bash run_gradio.sh

GRIT: Large-Scale Training Corpus of Grounded Image-Text Pairs

We introduce GRIT, a large-scale dataset of Grounded Image-Text pairs, created from image-text pairs in a subset of COYO-700M and LAION-2B. We construct a pipeline to extract text spans (i.e., noun phrases and referring expressions) from captions and link them to their corresponding image regions. More details can be found in the paper.

Download Data

  • GrIT-20M: The split contains about 20M grounded image-caption pairs processed from COYO-700M.

The format of a data instance is:

{
  'clip_similarity_vitb32': 0.353271484375, 
  'clip_similarity_vitl14': 0.2958984375, 
  'id': 1795296605919, 
  'url': "https://www.thestrapsaver.com/wp-content/uploads/customerservice-1.jpg", 
  'caption': 'a wire hanger with a paper cover that reads we heart our customers', 
  'width': 1024, 
  'height': 693, 
  'noun_chunks': [[19, 32, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 13, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]], 
  'ref_exps': [[19, 66, 0.019644069503434333, 0.31054004033406574, 0.9622142865754519, 0.9603442351023356, 0.79298526], [0, 66, 0.019422357885505368, 0.027634161214033764, 0.9593302408854166, 0.969467560450236, 0.67520964]]
}
  • clip_similarity_vitb32: The cosine similarity between the text and image (ViT-B/32) embeddings computed by OpenAI CLIP, provided by COYO-700M.
  • clip_similarity_vitl14: The cosine similarity between the text and image (ViT-L/14) embeddings computed by OpenAI CLIP, provided by COYO-700M.
  • id: Unique 64-bit integer ID in COYO-700M.
  • url: The image URL.
  • caption: The corresponding caption.
  • width: The width of the image.
  • height: The height of the image.
  • noun_chunks: The noun chunks (extracted by spaCy) that have associated bounding boxes (predicted by GLIP). Each inner list gives the start and end character offsets of the noun chunk in the caption, followed by the normalized x_min, y_min, x_max, y_max of its bounding box and the confidence score.
  • ref_exps: The corresponding referring expressions, in the same format as noun_chunks. If a noun chunk cannot be expanded into a referring expression, the noun chunk itself is copied.
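
As an illustration, here is a minimal parsing sketch (the helper name is hypothetical; it assumes the grit_coyo.jsonl sample downloaded by the commands below). It reads one record, slices the caption with the character offsets, and scales the normalized coordinates back to pixel boxes:

import json

def boxes_from_record(record):
    # start/end are character offsets into the caption; box coordinates are
    # normalized to [0, 1], so scale them by the image width/height.
    w, h = record["width"], record["height"]
    for start, end, x_min, y_min, x_max, y_max, score in record["noun_chunks"]:
        phrase = record["caption"][int(start):int(end)]
        yield phrase, (x_min * w, y_min * h, x_max * w, y_max * h), score

with open("/tmp/grit_coyo.jsonl") as f:
    record = json.loads(next(f))

for phrase, box, score in boxes_from_record(record):
    print(phrase, box, score)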

Run the following commands to download a sample and visualize it:

DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL2tvc21vcy0yL2RhdGEvZ3JpdF9jb3lvLmpzb25sP3N2PTIwMjMtMDEtMDMmc3Q9MjAyNC0wNC0xMFQxMyUzQTExJTNBNDRaJnNlPTIwNTAtMDQtMTFUMTMlM0ExMSUzQTAwWiZzcj1jJnNwPXImc2lnPTRjWEpJalZSWkhJQldxSGpQZ0RuJTJGMDFvY3pwRFdYaXBtUENVazNaOHZiUSUzRA==" | base64 --decode)
wget -O /tmp/grit_coyo.jsonl $DLINK

python data/visualize_grit.py

We recommend using img2dataset to download images, as detailed here.
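
For reference, a sketch of a typical img2dataset call through its Python API (paths, shard settings, and the choice of saved columns are illustrative assumptions, not the authors' exact settings; see the linked guide for those):

from img2dataset import download

download(
    url_list="grit_parquet/",      # directory of downloaded GrIT parquet shards (example path)
    input_format="parquet",
    url_col="url",                 # column names follow the data instance format above
    caption_col="caption",
    save_additional_columns=["id", "noun_chunks", "ref_exps"],
    output_format="webdataset",
    output_folder="grit_images/",  # example output path
    processes_count=16,
    thread_count=64,
    image_size=256,
)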

Evaluation

1. Phrase grounding

We evaluate the phrase grounding task on Flickr30k Entities under the zero-shot setting:

Model      Recall@1 on val split   Recall@1 on test split
Kosmos-2   77.8                    78.7

More results and evaluation code can be found in evaluation/flickr/README.md.

2. Referring expression comprehension

We evaluate the referring expression comprehension task on RefCOCO, RefCOCO+, and RefCOCOg under the zero-shot setting. We report the accuracy metric here.
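
For reference, "accuracy" here is assumed to follow the standard referring expression comprehension convention (not spelled out in this README): a prediction is counted as correct when its box overlaps the ground-truth box with IoU >= 0.5. A minimal sketch:

def iou(a, b):
    # Boxes are (x_min, y_min, x_max, y_max) in the same coordinate system.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predicted_boxes, gt_boxes, threshold=0.5):
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)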

Model      RefCOCO val   RefCOCO testA   RefCOCO testB   RefCOCO+ val   RefCOCO+ testA   RefCOCO+ testB   RefCOCOg val   RefCOCOg test
Kosmos-2   52.32         57.42           47.26           45.48          50.73            42.24            60.57          61.65

More results and evaluation code can be found in evaluation/refcoco/README.md.

3. Referring expression generation

We evaluate the referring expression generation task on RefCOCOg under the zero-shot and few-shot settings. We report the Meteor and CIDEr metrics here.

Model      Setting          Meteor   CIDEr
Kosmos-2   zero-shot        12.2     60.3
Kosmos-2   few-shot (k=2)   13.8     62.2
Kosmos-2   few-shot (k=4)   14.1     62.2

We will release the evaluation code here.

4. Image captioning

We evaluate the image captioning task on the Flickr30K Karpathy-split test set under the zero-shot setting. We report the CIDEr metric here.

Model         CIDEr on Flickr30K
Flamingo-3B   60.6
Flamingo-9B   61.5
Kosmos-1      67.1
Kosmos-2      80.5

We will release the evaluation code here.

5. Visual question answering

We evaluate the visual question-answering task on the test-dev set of VQAv2 under the zero-shot setting. We report VQA scores obtained from the VQAv2 evaluation server.

Model         VQA score on VQAv2 test-dev
Flamingo-3B   49.2
Flamingo-9B   51.8
Kosmos-1      51.0
Kosmos-2      51.1

We will release the evaluation code here.

Training

Preparing dataset

GrIT

After downloading the data from HuggingFace using img2dataset, you will obtain a set of tar files. After extracting them, you get the images and their corresponding JSON files. Then, modify the file path in prepare_grit.py and run it to produce the corresponding tsv files. If a tsv file is too large, you can split it into multiple smaller files yourself (see the sketch below).
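
For the optional splitting step, a minimal sketch (the shard size and naming scheme are arbitrary examples, not values used by this repo):

def split_tsv(path, lines_per_shard=500_000):
    # Write consecutive chunks of the input tsv to <path>.part0000.tsv, <path>.part0001.tsv, ...
    out, shard, count = None, 0, 0
    with open(path) as f:
        for line in f:
            if out is None or count == lines_per_shard:
                if out is not None:
                    out.close()
                out = open(f"{path}.part{shard:04d}.tsv", "w")
                shard, count = shard + 1, 0
            out.write(line)
            count += 1
    if out is not None:
        out.close()

split_tsv("grit_00.tsv")  # example file name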

After processing all the tar files into tsv files, run generate_config.py to produce a config file that stores the paths of the tsv files. In train.sh, set --laion-data-dir to the directory containing this config file.

Interleaved data

Interleaved image-text data needs to be processed in a similar way. To be updated.

Text data

To be updated.

Train script

After preparing the data, run the following command to train the model.

bash train.sh

More training/instruction-tuning tasks will be updated.

Citation

If you find this repository useful, please consider citing our work:

@article{kosmos-2,
  title={Kosmos-2: Grounding Multimodal Large Language Models to the World},
  author={Zhiliang Peng and Wenhui Wang and Li Dong and Yaru Hao and Shaohan Huang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.14824}
}

@article{kosmos-1,
  title={Language Is Not All You Need: Aligning Perception with Language Models},
  author={Shaohan Huang and Li Dong and Wenhui Wang and Yaru Hao and Saksham Singhal and Shuming Ma and Tengchao Lv and Lei Cui and Owais Khan Mohammed and Qiang Liu and Kriti Aggarwal and Zewen Chi and Johan Bjorck and Vishrav Chaudhary and Subhojit Som and Xia Song and Furu Wei},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.14045}
}

@article{metalm,
  title={Language Models are General-Purpose Interfaces},
  author={Yaru Hao and Haoyu Song and Li Dong and Shaohan Huang and Zewen Chi and Wenhui Wang and Shuming Ma and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.06336}
}

Acknowledgement

This repository is built using torchscale, fairseq, and openclip. We would also like to acknowledge the examples provided by WHOOPS!. We thank ydshieh at HuggingFace for the online demo and HuggingFace's transformers implementation.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using models, please submit a GitHub issue.