The SDG Classification Benchmark is an open and public benchmarking dataset for evaluating and comparing SDG classification models. It consists of 1,251 short text snippets (a few sentences each) covering all 17 SDGs. All text snippets were carefully labeled and verified by a team of experts.
You can find the benchmarking dataset here: https://github.com/SDGClassification/benchmark/blob/main/benchmark.csv
The dataset consists of four columns:
id
: Unique identifier for every text (MD5 hash)text
: Text snippet (2 - 3 sentences)sdg
: The SDG that the text is checked againstlabel
: True ifsdg
is present intext
. False otherwise.
Below is an excerpt of the dataset.
id text sdg label
03e9759 Not only does this have potentially negative e... 7 False
04f6c7f If too much water is stored behind the reservo... 7 False
b87a4f8 Energy efficiency targets are now in place at ... 7 True
12e3f54 Data over the last 30 years suggests that, had... 7 True
135ea60 Large areas of about 500 000 km2 between Mumba... 7 False
We have created the Python package sdgclassification-benchmark to make it easy to benchmark SDG classification models written in Python.
First, install the package: pip install sdgclassification-benchmark
Then, run the benchmark on your custom predict_sdgs
method:
from sdgclassification.benchmark import Benchmark
# Take the text to classify, run it through your model and return the list of
# relevant SDGs in *numeric* format.
# IMPORTANT: ADAPT THIS METHOD ACCORDING YOUR MODEL
def predict_sdgs(text: str) -> list[int]:
# ... your model code here ...
return [1, 7, 17]
benchmark = Benchmark(predict_sdgs)
benchmark.run()
You will see the following output:
################################################################################
Running benchmark
Benchmarking |################################| 473/473
Results:
+---------+------+----------------+-----------------+--------------+------------+------+------+------+------+
| SDG | n | Accuracy (%) | Precision (%) | Recall (%) | F1 Score | TP | FP | TN | FN |
|---------+------+----------------+-----------------+--------------+------------+------+------+------+------|
| Average | 78.8 | 83.3 | 93.5 | 71.3 | 0.80 | 28 | 1.8 | 38.2 | 10.8 |
| 3 | 76 | 89.5 | 91.7 | 78.6 | 0.85 | 22 | 2 | 46 | 6 |
| 4 | 82 | 78.0 | 87.9 | 67.4 | 0.76 | 29 | 4 | 35 | 14 |
| 5 | 69 | 73.9 | 100.0 | 48.6 | 0.65 | 17 | 0 | 34 | 18 |
| 6 | 85 | 87.1 | 100.0 | 77.1 | 0.87 | 37 | 0 | 37 | 11 |
| 7 | 100 | 91.0 | 97.7 | 84.0 | 0.90 | 42 | 1 | 49 | 8 |
| 10 | 61 | 80.3 | 84.0 | 72.4 | 0.78 | 21 | 4 | 28 | 8 |
+---------+------+----------------+-----------------+--------------+------------+------+------+------+------+
Benchmark completed
################################################################################
After running the benchmark, you can also access detailed results in your Python script, showing you where the model has made correct and incorrect predictions.
# Convert results into a Pandas dataframe
df = benchmark.results.to_dataframe()
# id text sdg ... predictions predicted_label is_correct
# 0 e0ab386 The Ministry of Education is charged with the ... 3 ... [4] False True
# 1 9177b53 Most of the local government institutions lack... 3 ... [8] False True
# 2 8dd268f Brisbane: School of Population Health, Univers... 3 ... [8, 3] True True
# 3 9b237c8 Authorship is usually collective, but principa... 3 ... [] False True
# 4 52f8fc8 This is a limiting structure, as real-world ev... 3 ... [8] False True
You can also check out some of the example code snippets in this repository:
Using your preferred CSV reader/parser, you can access the benchmarking dataset from here: https://raw.githubusercontent.com/SDGClassification/benchmark/main/benchmark.csv
For example, in R:
df <- read.csv("https://raw.githubusercontent.com/SDGClassification/benchmark/main/benchmark.csv")
You can then classify the text
in each row of the dataset and compare your model's predictions with the true label.
You can find short descriptions about each of the columns in the dataset here.
There exist a number of different models and approaches for classifying texts with respect to the Sustainable Development Goals (SDGs). Refer to this list for a (non-exhaustive) overview of different SDG classifiers and initiatives.
To further advance the field of SDG classification, it is important that we understand the strengths and weaknesses of the available tools.
This public and open benchmarking dataset allows us to evaluate and compare the accuracy of available SDG classifiers, so that we can better grasp their capabilities and limitations.
Our hope is that this benchmarking dataset can contribute to us building more accurate, more reliable and more trustworthy models and approaches in the future.
This benchmarking dataset has been carefully curated by a working group with expert knowledge in the domain of the SDGs. Unlike other datasets, this benchmarking dataset has been fully annotated and verified by humans.
To ensure a robust benchmarking dataset, the texts for each SDG were independently annotated by several human experts. For most texts, the annotations made by these experts were identical. For some texts, however, there initially were disagreements between the annotations. These disagreements were resolved through discussion between the experts until consensus was reached.
In some cases, consensus was reached by making slight modifications to a text in order to remove ambiguity.
The benchmark covers at least 50 texts for each SDG. Note that the number of texts across the SDGs as well as the number of texts with and without SDG are not balanced.
SDG | Number of texts | Texts with SDG | Texts without SDG |
---|---|---|---|
All texts | 1251 | 619 | 632 |
SDG 1: No poverty | 77 | 27 | 50 |
SDG 2: Zero hunger | 69 | 45 | 24 |
SDG 3: Good health and well-being | 76 | 28 | 48 |
SDG 4: Quality education | 82 | 43 | 39 |
SDG 5: Gender equality | 69 | 35 | 34 |
SDG 6: Clean water and sanitation | 85 | 48 | 37 |
SDG 7: Affordable and clean energy | 100 | 50 | 50 |
SDG 8: Decent work and economic growth | 74 | 38 | 36 |
SDG 9: Industry, innovation and infrastructure | 57 | 28 | 29 |
SDG 10: Reduced inequalities | 61 | 32 | 29 |
SDG 11: Sustainable cities and communities | 69 | 27 | 42 |
SDG 12: Responsible consumption and production | 80 | 43 | 37 |
SDG 13: Climate action | 65 | 33 | 32 |
SDG 14: Life below water | 84 | 35 | 49 |
SDG 15: Life on land | 71 | 41 | 30 |
SDG 16: Peace, justice and strong institutions | 68 | 35 | 33 |
SDG 17: Partnerships for the goals | 64 | 31 | 33 |
While we have invested a lot of time and care to ensure that this benchmarking dataset is as robust as possible, there are important limitations to be aware of when using this dataset to evaluate your model.
This benchmark serves as a starting point for evaluating the accuracy of a model. It is important to remember that models may have been developed for domain-specific use cases, languages, and/or other specializations that cannot be comprehensively evaluated with this benchmarking dataset.
We evaluated each text only with respect to whether it directly addresses a given SDG and its targets. This assessment was binary (true
vs false
), which greatly reduced the complexity.
We did not evaluate how strong the reference is, we did not evaluate whether other SDGs are present in the text and we did not evaluate what the primary SDG in each text is.
We have tried to include a variety of texts in this benchmarking dataset in order to cover the breadth of each SDG. For example, for SDG 7, we have included texts related not just to renewable energy, but also texts related to energy access, affordable energy, energy efficiency, etc...
That said, due to the wide spectrum of issues covered by the SDGs, we cannot guarantee that this benchmarking dataset provides exhaustive coverage of all the relevant issues for each of the SDGs.
For the sake of simplicity, we ignored valence and sentiment in the texts. A text was classified as addressing a given SDG, even if the content was discussed in a negative light. For example, "We will reduce renewable energy subsidies" would still be classified as true
for SDG 7.
Texts were only assigned to a given SDG, if the text directly addressed that SDG and/or its targets. "We are subsidizing the costs of silicon" would not qualify for SDG 7, even though silicon is one of the materials for making photovoltaic cells and reduced cost of silicon might have indirect benefits for the expansion of solar energy.
We ignored indirect relevance in texts because correct assessments would require enormous thematic expertise, and even then such interpretations would often remain highly subjective and controversial.
Several models have been tested against this benchmark. The results can be used as one criterion when evaluating models. Importantly, there are a range of other important aspects to consider that are not tested by this benchmark, such as cost, speed, and coverage of SDG targets. In addition, many models have been developed and optimized for domain-specific use cases (e.g. analyzing policy documents, research articles or websites) and their performance on this benchmark may not be representative of their performance on their domain-specific use case.
The table below shows the accuracy (in percent) of models tested against this benchmark. Clicking on one of the models provides additional details, including metadata about the model, F1 score, precision and recall.
Model | Average | SDG 1 | SDG 2 | SDG 3 | SDG 4 | SDG 5 | SDG 6 | SDG 7 | SDG 8 | SDG 9 | SDG 10 | SDG 11 | SDG 12 | SDG 13 | SDG 14 | SDG 15 | SDG 16 | SDG 17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AFD SDG Prospector | 88 | 95 | 87 | 91 | 95 | 81 | 92 | 95 | 89 | 86 | 87 | 90 | 84 | 89 | 83 | 82 | 91 | 77 |
Aurora SDG | 79 | 79 | 83 | 80 | 79 | 93 | 81 | 85 | 81 | 70 | 70 | 81 | 76 | 86 | 81 | 75 | 65 | 81 |
Global Goals Directory | 81 | 90 | 80 | 90 | 78 | 74 | 87 | 91 | 84 | 81 | 80 | 78 | 74 | 78 | 84 | 80 | 76 | 72 |
JRC SDG Mapper | 77 | 82 | 71 | 84 | 70 | 75 | 73 | 86 | 77 | 77 | 70 | 75 | 79 | 83 | 82 | 82 | 75 | 72 |
Meta Llama 2 70B | 83 | 90 | 83 | 93 | 90 | 93 | 87 | 93 | 78 | 74 | 79 | 68 | 80 | 91 | 84 | 84 | 69 | 75 |
Meta Llama 3 70B | 86 | 77 | 90 | 86 | 85 | 91 | 92 | 91 | 80 | 84 | 92 | 84 | 84 | 92 | 81 | 89 | 81 | 81 |
Mixtral 8x7B | 85 | 84 | 86 | 83 | 89 | 87 | 88 | 94 | 80 | 75 | 95 | 81 | 82 | 92 | 81 | 82 | 79 | 83 |
OSDG v1 | 70 | 87 | 56 | 70 | 80 | 64 | 84 | 82 | 77 | 56 | 72 | 71 | 71 | 66 | 77 | 66 | 52 | 53 |
OpenAI GPT-3.5 Turbo | 80 | 65 | 93 | 82 | 80 | 87 | 91 | 90 | 76 | 68 | 87 | 71 | 79 | 91 | 76 | 83 | 78 | 67 |
OpenAI GPT-4 Turbo | 85 | 75 | 93 | 86 | 84 | 88 | 92 | 92 | 77 | 79 | 92 | 77 | 85 | 94 | 80 | 92 | 84 | 83 |
OpenAI GPT-4o | 87 | 79 | 94 | 82 | 84 | 94 | 93 | 93 | 80 | 84 | 93 | 88 | 86 | 94 | 82 | 90 | 81 | 84 |
text2sdg | 83 | 91 | 74 | 82 | 88 | 93 | 85 | 91 | 85 | 82 | 79 | 80 | 75 | 91 | 82 | 80 | 79 | 75 |
Have you benchmarked a model that is not yet in our list? Please open an issue and share the results with us, so that we can add the model to the table above.
The core contributors meet regularly to improve and expand this benchmarking dataset. If you want to join us, please reach out to [email protected].
This benchmarking dataset is a living document and we will continue to make adjustments and improvements based on feedback. Please open an issue to share your ideas, suggestions, and feedback.
The benchmarking dataset has been created by the Benchmarking Working Group of the SDG Classification Expert Group. Core contributors are Finn Woelm, Meike Morren, Steve Borchardt, Ivan Smirnov, Gib Ravivanpong, and Jean-Baptiste Jacouton.
We especially thank our annotators for making this project possible.
- SDG 1: Steve Borchardt, Meike Morren, Finn Woelm
- SDG 2: Steve Borchardt, Meike Morren, Gib Ravivanpong, Finn Woelm
- SDG 3: Steve Borchardt, Meike Morren, Finn Woelm
- SDG 4: Steve Borchardt, Meike Morren, Finn Woelm
- SDG 5: Steve Borchardt, Meike Morren, Gib Ravivanpong, Finn Woelm
- SDG 6: Steve Borchardt, Meike Morren, Gib Ravivanpong, Finn Woelm
- SDG 7: Steve Borchardt, Jean-Baptiste Jacouton, Gib Ravivanpong, Finn Woelm
- SDG 8: Steve Borchardt, Meike Morren, Finn Woelm
- SDG 9: Steve Borchardt, Meike Morren, Finn Woelm
- SDG 10: Steve Borchardt, Jean-Baptiste Jacouton, Meike Morren, Gib Ravivanpong, Finn Woelm
- SDG 11: Steve Borchardt, Meike Morren, Finn Woelm
- SDG 12: Steve Borchardt, Meike Morren, Gib Ravivanpong, Finn Woelm
- SDG 13: Steve Borchardt, Meike Morren, Gib Ravivanpong, Finn Woelm
- SDG 14: Steve Borchardt, Meike Morren, Finn Woelm
- SDG 15: Steve Borchardt, Meike Morren, Ivan Smirnov, Finn Woelm
- SDG 16: Steve Borchardt, Meike Morren, Ivan Smirnov, Finn Woelm
- SDG 17: Steve Borchardt, Meike Morren, Gib Ravivanpong, Ivan Smirnov, Finn Woelm
We used the fantastic OSDG Community Dataset for selecting texts to include in our benchmarking dataset. The OSDG Community Dataset has been created by 1,400+ volunteers and is available under a CC-BY 4.0 license.
Citation: OSDG, UNDP IICPSD SDG AI Lab, & PPMI. (2024). OSDG Community Dataset (OSDG-CD) (Version 2024.01) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10579179