The MERIT Dataset 🎒📃🏆

Modelling and Efficiently Rendering Interpretable Transcripts

The MERIT Dataset is a multimodal dataset (image + text + layout) designed for training and benchmarking Large Language Models (LLMs) on Visually Rich Document Understanding (VrDU) tasks.

You can find more about our work here:

  • Paper: read all the details of our pipeline and dataset. Under revision 🛠️
  • The MERIT Dataset: explore the raw data.
  • Plug-and-play subsets: the subsets we used in our benchmark, available through the HuggingFace pipeline 🤗. Under development 🛠️
  • Our benchmark: all the details of our benchmark, including result metrics.
  • Our data generation pipeline: available in this repo.

Visual Abstract

Introduction ℹ️

AI faces dynamic and technical issues that push end users to create and gather their own data. In addition, multimodal LLMs are attracting growing attention, but the datasets used to train them could be more complex, more flexible, and easier to gather or generate.

In this research project, we identify school transcripts of records as a suitable niche for generating a challenging multimodal dataset (image + text + layout) for Token Classification. Thus, we introduce the MERIT Dataset and its generation pipeline.

By introducing this dataset, we expect to contribute to:

  • Benchmarking LLMs' generalization capabilities.
  • Improving explainability in multimodal LLMs.
  • Detecting biases in LLMs.
  • Pretraining LLMs with more challenging data.
  • Solving the Token Classification task in a challenging context.


Pipeline 🔄

Our generation pipeline consists of two modules: the first generates labeled digital samples, and the second transforms these samples in Blender to improve photorealism and recreate realistic scenarios.

Pipeline Overview

To generate the digital samples, we start by defining 14 templates (7 in English and 7 in Spanish) in a text editor. These templates are populated with keywords that our pipeline replaces with text from databases (names, subjects, and grades). We classify the textual content of our dataset into more than 400 labels.
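As a rough illustration of the keyword replacement (not the pipeline's actual code; the placeholder syntax and the databases below are hypothetical):

```python
import random

# Hypothetical databases; the real pipeline draws from its own collections
# of names, subjects, and grades.
STUDENT_NAMES = ["Jane Doe", "Luis García"]
SUBJECTS = ["Mathematics", "History", "Biology"]

def populate_template(template: str) -> str:
    """Replace placeholder keywords with entries drawn from the databases."""
    filled = template.replace("{{student_name}}", random.choice(STUDENT_NAMES))
    while "{{subject}}" in filled:
        filled = filled.replace("{{subject}}", random.choice(SUBJECTS), 1)
    while "{{grade}}" in filled:
        filled = filled.replace("{{grade}}", str(random.randint(0, 100)), 1)
    return filled

template = "Transcript for {{student_name}}\n{{subject}}: {{grade}}\n{{subject}}: {{grade}}"
print(populate_template(template))
```

In the actual pipeline, each replacement would also need to be recorded with its label and position so that every sample remains fully annotated for Token Classification.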

To transform our samples, we use a Blender module that can run cloth simulations to fold the document, apply textures over the document, or render with different lighting conditions.

Blender Cloth Simulation
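To give a flavor of what this module does, here is a minimal bpy sketch (illustrative only, not the project's actual Blender module): it drops a subdivided plane with a cloth modifier onto a collision plane and renders a still. Simulation parameters, object placement, and the render path are placeholders, and mapping the sample PNG onto the plane as a texture is omitted for brevity.

```python
import bpy

# Ground plane that the document can fall onto.
bpy.ops.mesh.primitive_plane_add(size=4.0, location=(0, 0, 0))
ground = bpy.context.active_object
ground.modifiers.new(name="Collision", type='COLLISION')

# Plane acting as the document, subdivided so the cloth can fold.
bpy.ops.mesh.primitive_plane_add(size=1.0, location=(0, 0, 0.2))
doc = bpy.context.active_object
bpy.ops.object.mode_set(mode='EDIT')
bpy.ops.mesh.subdivide(number_cuts=40)
bpy.ops.object.mode_set(mode='OBJECT')

# Cloth simulation makes the plane fall and wrinkle like paper.
cloth = doc.modifiers.new(name="Cloth", type='CLOTH')
cloth.settings.quality = 10

# Lighting conditions can be varied per render.
bpy.ops.object.light_add(type='SUN', location=(2, -2, 3))

# Camera pointing at the document.
bpy.ops.object.camera_add(location=(0, -1.5, 1.2), rotation=(1.0, 0, 0))
scene = bpy.context.scene
scene.camera = bpy.context.active_object

# Step the simulation forward, then render a still image.
for frame in range(1, 26):
    scene.frame_set(frame)
scene.render.filepath = "/tmp/sample_render.png"
bpy.ops.render.render(write_still=True)
```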

You can find more details about our generation pipeline in our paper.

Dataset 📑📷

Our dataset contains 33k multimodal samples and is fully labeled. In addition, it features a blueprint that gathers relevant data for every sample, including student name, average grade, rendering conditions, and more.

Each sample has two versions: one with Blender modifications and another without them. Additionally, each PNG is mapped to a JSON file where the labels are stored. The labels follow the FUNSD dataset format.

Dataset Structure
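For reference, a FUNSD-style annotation for one entity looks like the sketch below (the field names follow the FUNSD format; the concrete text, boxes, and the `math_grade` label are made up for illustration):

```python
import json

# Illustrative FUNSD-style annotation for a single labeled entity.
annotation = {
    "form": [
        {
            "id": 0,
            "text": "Mathematics: 87",
            "box": [82, 310, 244, 328],  # [x0, y0, x1, y1] in pixels
            "label": "math_grade",       # one of the 400+ labels (name made up)
            "words": [
                {"text": "Mathematics:", "box": [82, 310, 190, 328]},
                {"text": "87", "box": [196, 310, 244, 328]},
            ],
            "linking": [],               # links to related entities, if any
        }
    ]
}
print(json.dumps(annotation, indent=2, ensure_ascii=False))
```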

You can download the MERIT Dataset raw data (please note that the complete dataset weighs over 200 GB). You can also train with our plug-and-play subsets through the HuggingFace 🤗 pipeline (under development 🛠️).
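Once the subsets are published, loading them should look roughly like the hypothetical snippet below (the dataset ID is a placeholder, not a confirmed Hub path):

```python
from datasets import load_dataset

# Placeholder dataset ID; check the project pages for the actual Hub path.
dataset = load_dataset("merit-dataset/placeholder-subset")
print(dataset)
```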

Software 👩‍💻

We run the pipeline on Ubuntu 20.04. It is designed to run on Windows too, although it has not been tested yet.

Requirements file under development 🛠️

Hardware ⚙️

We ran the pipeline on an MSI Meg Infinite X 10SF-666EU with an Intel Core i9-10900KF and an Nvidia RTX 2080 GPU, running Ubuntu 20.04. In the table below, energy values refer to generating 1k samples, and time values refer to one sample.

| Task | Energy (kWh / 1k samples) | Time (s / sample) |
|------|---------------------------|-------------------|
| Generate digital samples | 0.016 | 2 |
| Modify samples in Blender | 0.366 | 34 |

Benchmark 💪

We train models from the LayoutLM family on token classification to demonstrate the suitability of our dataset. The MERIT Dataset poses a challenging scenario with more than 400 labels.

We benchmark on three scenarios with an increasing presence of Blender-modified samples.

  • Scenario 1: We train and test on digital samples.
  • Scenario 2: We train with digital samples and test with Blender-modified samples.
  • Scenario 3: We train and test with Blender-modified samples.
| Model | Scenario 1: Dig./Dig. (F1) | Scenario 2: Dig./Mod. (F1) | Scenario 3: Mod./Mod. (F1) | FUNSD/XFUND (F1) | Lang. (Tr./Val./Test) |
|-------|----------------------------|----------------------------|----------------------------|------------------|------------------------|
| LayoutLMv2 | 0.5536 | 0.3764 | 0.4984 | 0.8276 | Eng. (7324 / 1831 / 4349) |
| LayoutLMv3 | 0.3452 | 0.2681 | 0.6370 | 0.9029 | Eng. (7324 / 1831 / 4349) |
| LayoutXLM | 0.5977 | 0.3295 | 0.4489 | 0.7550 | Spa. (8115 / 2028 / 4426) |
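For context, a token-classification fine-tuning setup for a model like LayoutLMv3 with the HuggingFace 🤗 `transformers` library looks roughly like the sketch below. This is not our training script: the label count, the sample, and the file names are placeholders, and the training loop is omitted.

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

NUM_LABELS = 401  # placeholder standing in for the 400+ MERIT label set

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=NUM_LABELS
)

# One annotated sample: image, words, boxes on a 0-1000 scale, word labels.
image = Image.open("sample.png").convert("RGB")
words = ["Mathematics:", "87"]
boxes = [[82, 310, 190, 328], [196, 310, 244, 328]]
word_labels = [0, 1]  # integer ids into the label set

encoding = processor(image, words, boxes=boxes, word_labels=word_labels,
                     return_tensors="pt")
outputs = model(**encoding)  # encoding includes labels, so a loss is returned
print(outputs.loss, outputs.logits.shape)
```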

Access our paper to read our conclusions.

Biases 🕵️

Although our data is synthetic, we acknowledge that its generation involves sensitive topics. One way to avoid biases might be to randomize students' grades; however, that approach would be useless if we want to detect biases in LLMs. Therefore, we have decided to assign students' grades based on PISA reports. You can find more information in the paper.

| Educative System | Origin | Country | PISA Score | System Score |
|------------------|--------|---------|------------|--------------|
| USA Schools | English | USA | 490 | 68 |
| USA Schools | Spanish | El Salvador | 360 | 46 |
| USA Schools | Chinese | China | 528 | 76 |
| USA Schools | Indian | India | 336 | 44 |
| Spanish Schools | Spanish | Spain | 477 | 6.54 |
| Spanish Schools | Arabic | Morocco | 356 | 4.56 |
| Spanish Schools | Sub-Saharan | Senegal | 306 | 4.06 |

System scores are expressed on each system's own grading scale (0–100 for USA Schools, 0–10 for Spanish Schools).
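As a toy illustration of this idea (not the paper's actual method; the rescaling and spread below are invented), grades can be drawn from a distribution whose mean depends on the origin's PISA score:

```python
import random

# PISA scores from the table above; the scaling and standard deviation are
# illustrative choices, not the values used in the paper.
PISA = {"USA": 490, "El Salvador": 360, "China": 528, "India": 336}

def sample_grade(origin: str, max_grade: int = 100) -> int:
    """Draw a grade whose mean is tied to the origin's PISA score."""
    mean = PISA[origin] / 600 * max_grade  # illustrative rescaling
    grade = random.gauss(mean, 10)
    return max(0, min(max_grade, round(grade)))

print([sample_grade("China") for _ in range(5)])
```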

Team

We are researchers from Comillas Pontifical University.

  • Ignacio de Rodrigo @nachoDRT: PhD Student. Software Development, Dataset Curation, Benchmark Design.
  • Alberto Sánchez @ascuadrado: Research Assistant. Software Development.
  • Mauro Liz @mauroliz: Research Assistant. Software Assistance.
  • Jaime Boal @jboal: Supervisor.
  • Álvaro López @allopez: Supervisor.

Citation

If you find our research interesting, please cite our work. 📃✒️
