soyoung97/ListT5

Official code repository for the paper: ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval (ACL 2024, main)

  • Please email soyoung.yoon@snu.ac.kr with any questions about this work!
  • 2024.3.20 Added minimal running examples!
  • 2024.5.16 Accepted to ACL 2024 main! :)
  • 2024.6.15 Started on the code release...
  • 2024.7.25 Uploaded instructions to download training data. Expected to update the training code before the conference starts! Stay tuned:)
  • 2024.10.15 I'm preparing to organize and open-source the training code, but until then, please refer to the comments here to get some information about training ListT5.

Model checkpoints (huggingface)

  1. RankT5-base: Soyoung97/RankT5-base
  2. RankT5-large: Soyoung97/RankT5-large
  3. RankT5-3b: Soyoung97/RankT5-3b
  4. ListT5-base: Soyoung97/ListT5-base
  5. ListT5-3b: Soyoung97/ListT5-3b

Evaluation Datasets (BEIR)

Please note that the license follows the original BEIR/MSMARCO license (for academic purposes only) and we are not responsible for any copyright issues.

  1. BEIR top-100 by BM25 Soyoung97/beir-eval-bm25-top100
  2. BEIR top-1000 by BM25 Soyoung97/beir-eval-bm25-top1000
  3. BEIR top-100 by COCO-DR Large Soyoung97/beir-eval-cocodr-large-top100

Tip: click on the individual file link, copy the download link, and use wget to download each file on your server. Example:

wget https://huggingface.co/datasets/Soyoung97/beir-eval-cocodr-large-top100/resolve/main/nfcorpus.jsonl
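
Alternatively, you can fetch a file programmatically with the huggingface_hub client. A minimal sketch, using the dataset repo and file names listed above and the default cache location:

# Sketch: download one evaluation file via huggingface_hub instead of wget.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id='Soyoung97/beir-eval-cocodr-large-top100',
    filename='nfcorpus.jsonl',
    repo_type='dataset',  # these repos are hosted as dataset repositories
)
print(path)  # local path to the cached nfcorpus.jsonl
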
train_data_sample - please refer to the comments [here](#2 (comment)) to get some information about the training code for ListT5.

Running environments

conda env create -f listt5_conda_env.yml

Note: use torch==2.1.0 and transformers==4.33.3 (other versions can be incompatible); a version mismatch can result in errors such as:

AttributeError: 'EncoderWrapper' object has no attribute 'embed_tokens'
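
If you prefer pip over the conda environment file, pinning the two packages directly should have the same effect (a sketch; pick the torch build matching your CUDA setup):

pip install torch==2.1.0 transformers==4.33.3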

How to use

Minimal Running example

The minimal running example can be run with the following code.

from FiDT5 import FiDT5 # Need to clone the repository for this code
from transformers import T5Tokenizer
model = FiDT5.from_pretrained('Soyoung97/ListT5-base')
# Each input concatenates the query with one candidate passage and its 1-based index.
texts = [
"Query: When did Thomas Edison invent the light bulb?, Index: 1, Context: Lightning strike at Seoul National University", 
"Query: When did Thomas Edison invent the light bulb?, Index: 2, Context: Thomas Edison tried to invent a device for car but failed",
"Query: When did Thomas Edison invent the light bulb?, Index: 3, Context: Coffee is good for diet",
"Query: When did Thomas Edison invent the light bulb?, Index: 4, Context: KEPCO fixes light problems",
"Query: When did Thomas Edison invent the light bulb?, Index: 5, Context: Thomas Edison invented the light bulb in 1879"]
tok = T5Tokenizer.from_pretrained('t5-base')
raw = tok(texts, return_tensors='pt', padding='max_length', max_length=128, truncation=True)
# Fusion-in-Decoder expects inputs of shape [batch, n_passages, seq_len], so add a batch dimension.
input_tensors = {'input_ids': raw['input_ids'].unsqueeze(0), 'attention_mask': raw['attention_mask'].unsqueeze(0)}
output = model.generate(**input_tensors, max_length=7, return_dict_in_generate=True, output_scores=True)
output_text = tok.batch_decode(output.sequences, skip_special_tokens=True) 
output_text
>>> ['3 1 4 2 5']
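
ListT5 generates the passage indices in increasing order of relevance, so the most relevant passage appears last in the decoded string. A small sketch of turning that string back into a reranked passage list (the parsing below is an illustrative assumption, not code from this repository):

pred = output_text[0]                               # e.g. '3 1 4 2 5'
order = [int(i) for i in pred.split()]              # 1-based indices, least to most relevant
reranked = [texts[i - 1] for i in reversed(order)]  # most relevant first
print(reranked[0])  # the passage stating Edison invented the light bulb in 1879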

Run tournament sort with ListT5

python3 run_listt5.py --input_path ./trec-covid.jsonl --output_path ./outputs/listt5-trec-covid.jsonl --bsize 20 

The sample command outputs ndcg@10: 0.78285 for ListT5-base with out_k=2.
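
run_listt5.py applies the tournament sort from the paper: candidates are split into groups of 5, each group is reranked by ListT5, and the top out_k passages from each group advance to the next round until only a final group remains. The snippet below is a rough conceptual sketch of that idea only, not the repository's implementation; rank_group is a hypothetical helper standing in for a ListT5 forward pass like the minimal example above.

# Rough, illustrative sketch of tournament-style reranking.
def rank_group(passages):
    # Hypothetical stand-in: rerank up to 5 passages with ListT5 and return
    # them ordered from most to least relevant.
    raise NotImplementedError

def tournament_round(candidates, out_k=2, group_size=5):
    winners = []
    for i in range(0, len(candidates), group_size):
        group = candidates[i:i + group_size]
        winners.extend(rank_group(group)[:out_k])  # top out_k advance
    return winners

def tournament_top(candidates, out_k=2, group_size=5):
    while len(candidates) > group_size:
        candidates = tournament_round(candidates, out_k, group_size)
    return rank_group(candidates)  # final ordering of the surviving candidates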

Run baseline models (MonoT5, RankT5)

python3 run_monot5_rankt5.py --input_path ./trec-covid.jsonl --output_path ./outputs/monot5-trec-covid.jsonl --bsize 20 --mode monot5

ndcg@10: 78.26 for monot5

python3 run_monot5_rankt5.py --input_path ./trec-covid.jsonl --output_path ./outputs/rankt5-trec-covid.jsonl --bsize 20 --model Soyoung97/rankt5-base --mode rankt5

ndcg@10: 77.731 for rankt5

Please download the evaluation datasets above to run evaluation on other datasets. You can also run evaluation on your own dataset by following the same data format as the BEIR files, as sketched below.
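
The exact jsonl schema expected by the evaluation scripts is not spelled out here, so inspect one of the downloaded BEIR files for the real field names. As a rough illustration of the kind of per-query record you would write (all field names below are hypothetical):

# Hypothetical per-query jsonl record; match the schema of the downloaded
# BEIR files rather than these placeholder field names.
import json

record = {
    'query': 'when did thomas edison invent the light bulb',
    'candidates': [  # hypothetical field: top-k passages from a first-stage retriever
        {'docid': 'D1', 'text': 'Thomas Edison invented the light bulb in 1879'},
    ],
}
with open('my_dataset.jsonl', 'a') as f:
    f.write(json.dumps(record) + '\n')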

Citation

If you find this paper & source code useful, please consider citing our paper:

@misc{yoon2024listt5listwisererankingfusionindecoder,
      title={ListT5: Listwise Reranking with Fusion-in-Decoder Improves Zero-shot Retrieval}, 
      author={Soyoung Yoon and Eunbi Choi and Jiyeon Kim and Hyeongu Yun and Yireun Kim and Seung-won Hwang},
      year={2024},
      eprint={2402.15838},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2402.15838}, 
}
