Benchmarking Dataset for SDG Classification

The SDG Classification Benchmarking Dataset is an open and public resource for evaluating and comparing SDG classification models. It consists of text snippets (2 - 3 sentences), which have been carefully labeled and verified by a team of human experts.

Note: The benchmarking dataset currently covers SDGs 3, 5, 6, 7 and 10. We will be expanding the benchmarking dataset to other SDGs in the coming months.

How to use

You can find the benchmarking dataset here: https://github.com/SDGClassification/benchmarking-dataset/blob/main/benchmark.csv

The dataset consists of four columns:

id: Unique identifier for every text (MD5 hash)
text: Text snippet (2 - 3 sentences)
sdg: The SDG that the text is checked against
label: True if sdg is present in text. False otherwise.

Below is an excerpt of the dataset.

     id                                               text  sdg  label
03e9759  Not only does this have potentially negative e...    7  False
04f6c7f  If too much water is stored behind the reservo...    7  False
b87a4f8  Energy efficiency targets are now in place at ...    7   True
12e3f54  Data over the last 30 years suggests that, had...    7   True
135ea60  Large areas of about 500 000 km2 between Mumba...    7  False

With Pandas (Python)

The snippet below shows example code for loading the benchmarking dataset as a Pandas dataframe and comparing the expected and predicted labels.

You can check out the full Pandas example, which includes code for printing out aggregated statistics for each SDG (such as accuracy, precision, recall, and F1 score).

import pandas as pd

# Load the benchmarking dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/SDGClassification/benchmarking-dataset/main/benchmark.csv"
)

# Take the text to classify, run it through your model and return the list of
# relevant SDGs in *numeric* format.
# Important: Adapt this method according to your model
def predict_sdgs(text: str) -> list[int]:
    return model.predict(text)


# Classify each text and get the predicted SDGs in *numeric* format
df["predicted_sdgs"] = df["text"].map(predict_sdgs)

# Determine the predicted label by checking whether the predicted SDGs contain
# the SDG from the benchmarking dataset. Predicted label is set to `True` if
# the benchmark's SDG is contained in the predictions.
df["predicted_label"] = df.apply(lambda row: row.sdg in row.predicted_sdgs, axis=1)

# Compare expected and predicted labels
df["is_correct"] = df["label"] == df["predicted_label"]

Background

Purpose

There exist a number of different models and approaches for classifying texts with respect to the Sustainable Development Goals (SDGs). Refer to this list for a (non-exhaustive) overview of different SDG classifiers and initiatives.

To further advance the field of SDG classification, it is important that we understand the strengths and weaknesses of the available tools.

This public and open benchmarking dataset allows us to evaluate and compare the accuracy of available SDG classifiers, so that we can better grasp their capabilities and limitations.

Our hope is that this benchmarking dataset can contribute to us building more accurate, more reliable and more trustworthy models and approaches in the future.

Approach

This benchmarking dataset has been carefully curated by a working group with expert knowledge in the domain of the SDGs. Unlike other datasets, this benchmarking dataset has been fully annotated and verified by humans.

To ensure a robust benchmarking dataset, the texts for each SDG were independently annotated by several human experts. For most texts, the annotations made by these experts were identical. For some texts, however, there initially were disagreements between the annotations. These disagreements were resolved through discussion between the experts until consensus was reached.

In some cases, consensus was reached by making slight modifications to a text in order to remove ambiguity.

Coverage

The dataset currently covers SDGs 3, 5, 6, 7 and 10.

SDG	Number of texts	Texts with SDG	Texts without SDG
SDG 3: Good health and well-being	76	28	48
SDG 5: Gender equality	69	35	34
SDG 6: Clean water and sanitation	85	48	37
SDG 7: Affordable and clean energy	100	50	50
SDG 10: Reduced inequalities	61	32	29

Note that the number of texts across the SDGs as well as the number of texts with and without SDG are currently not balanced. The working group is focusing first on expanding the benchmark to all SDGs with at least 50 texts each — the exact number of texts for each SDG will depend on how difficult it is to reach consensus within the working group. In the long-term, the group is planning to expand the number of texts for each SDG to one hundred.

Limitations

While we have invested a lot of time and care to ensure that this benchmarking dataset is as robust as possible, there are important limitations to be aware of when using this dataset to evaluate your model.

This benchmark serves as a starting point for evaluating the accuracy of a model. It is important to remember that models may have been developed for domain-specific use cases, languages, and/or other specializations that cannot be comprehensively evaluated with this benchmarking dataset.

Binary

We evaluated each text only with respect to whether it directly addresses a given SDG and its targets. This assessment was binary (true vs false), which greatly reduced the complexity.

We did not evaluate how strong the reference is, we did not evaluate whether other SDGs are present in the text and we did not evaluate what the primary SDG in each text is.

Non-exhaustive

We have tried to include a variety of texts in this benchmarking dataset in order to cover the breadth of each SDG. For example, for SDG 7, we have included texts related not just to renewable energy, but also texts related to energy access, affordable energy, energy efficiency, etc...

That said, due to the wide spectrum of issues covered by the SDGs, we cannot guarantee that this benchmarking dataset provides exhaustive coverage of all the relevant issues for each of the SDGs.

Ignores sentiment

For the sake of simplicity, we ignored valence and sentiment in the texts. A text was classified as addressing a given SDG, even if the content was discussed in a negative light. For example, "We will reduce renewable energy subsidies" would still be classified as true for SDG 7.

Non-interpretive

Texts were only assigned to a given SDG, if the text directly addressed that SDG and/or its targets. "We are subsidizing the costs of silicon" would not qualify for SDG 7, even though silicon is one of the materials for making photovoltaic cells and reduced cost of silicon might have indirect benefits for the expansion of solar energy.

We ignored indirect relevance in texts because correct assessments would require enormous thematic expertise, and even then such interpretations would often remain highly subjective and controversial.

Model evaluation

Disclaimer

Several models have been tested against this benchmark. The results can be used as one criterion when evaluating models. Importantly, there are a range of other important aspects to consider that are not tested by this benchmark, such as cost, speed, and coverage of SDG targets. In addition, many models have been developed and optimized for domain-specific use cases (e.g. analyzing policy documents, research articles or websites) and their performance on this benchmark may not be representative of their performance on their domain-specific use case.

Results

The table below shows the accuracy (in percent) of models tested against this benchmark. Clicking on one of the models provides additional details, including metadata about the model, F1 score, precision and recall.

Model	Average	SDG 3	SDG 5	SDG 6	SDG 7	SDG 10
AFD SDG Prospector	89	91	81	92	95	87
Aurora SDG	80	82	81	85	89	66
Global Goals Directory	84	89	74	87	91	80
JRC SDG Mapper	78	84	75	73	86	70
Meta Llama 2 70B	88	93	93	85	92	79
Meta Llama 3 70B	90	86	91	92	91	92
OpenAI GPT-3.5 Turbo	87	82	87	91	90	87
OpenAI GPT-4 Turbo	90	86	88	92	92	92

More models will be added in the future.

Have you benchmarked a model that is not yet in our list? Please open an issue and share the results with us, so that we can add the model to the table above.

Contributing

Join the working group

The core contributors meet regularly to improve and expand this benchmarking dataset. If you want to join us, please reach out to finn@globalgoals.directory.

Suggestions and feedback

This benchmarking dataset is a living document and we will continue to make adjustments and improvements based on feedback. Please open an issue to share your ideas, suggestions, and feedback.

Credits

Core contributors

The benchmarking dataset has been created by the Benchmarking Working Group of the SDG Classification Expert Group. Core contributors are Finn Woelm, Meike Morren, Steve Borchardt, Jean-Baptiste Jacouton and Gib Ravivanpong.

List of annotators

We especially thank our annotators for making this project possible.

SDG 3: Steve Borchardt, Meike Morren, Finn Woelm
SDG 5: Steve Borchardt, Meike Morren, Gib Ravivanpong, Finn Woelm
SDG 6: Steve Borchardt, Meike Morren, Gib Ravivanpong, Finn Woelm
SDG 7: Steve Borchardt, Jean-Baptiste Jacouton, Gib Ravivanpong, Finn Woelm
SDG 10: Steve Borchardt, Jean-Baptiste Jacouton, Meike Morren, Gib Ravivanpong, Finn Woelm

Text snippets

We used the fantastic OSDG Community Dataset for selecting texts to include in our benchmarking dataset. The OSDG Community Dataset has been created by 1,400+ volunteers and is available under a CC-BY 4.0 license.

Citation: OSDG, UNDP IICPSD SDG AI Lab, & PPMI. (2024). OSDG Community Dataset (OSDG-CD) (Version 2024.01) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10579179

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
evaluations		evaluations
examples		examples
scripts		scripts
.env.sample		.env.sample
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
benchmark.csv		benchmark.csv
mypy.ini		mypy.ini
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
references.csv		references.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Benchmarking Dataset for SDG Classification

Table of Contents

How to use

With Pandas (Python)

Background

Purpose

Approach

Coverage

Limitations

Binary

Non-exhaustive

Ignores sentiment

Non-interpretive

Model evaluation

Disclaimer

Results

Contributing

Join the working group

Suggestions and feedback

Credits

Core contributors

List of annotators

Text snippets

About

Releases

Packages

Languages

License

SDGClassification/benchmark

Folders and files

Latest commit

History

Repository files navigation

Benchmarking Dataset for SDG Classification

Table of Contents

How to use

With Pandas (Python)

Background

Purpose

Approach

Coverage

Limitations

Binary

Non-exhaustive

Ignores sentiment

Non-interpretive

Model evaluation

Disclaimer

Results

Contributing

Join the working group

Suggestions and feedback

Credits

Core contributors

List of annotators

Text snippets

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages