Structured data is at the heart of machine learning. LLMs offer a convenient way to generate structured data based on unstructured inputs. This book gives hands-on examples of the different steps in the extraction workflow using LLMs.
You can find more background on the topics covered in this book in our review article.
This book is built from Jupyter notebooks. That means that, beyond simply reading along, you can also run the notebooks yourself. There are several ways to do so.
You can start running most parts by clicking on this link. This will take you to the JupyterHub of Base4NFDI where the notebook can be run on a small CPU instance. We're working on making it possible to also run the GPU-intensive parts.
If you have a reasonably modern computer, you will be able to run many of the notebooks on your own hardware. Note, however, that certain notebooks need to be run on GPUs. Those notebooks carry a note about this at the top.
In addition to hardware, you will also need some software. All relevant dependencies can be installed via the package for this online book.
Overall, you will need to work through the following steps. Note that we currently only support Linux and macOS. If you want to run the notebooks on Windows, we recommend that you install WSL and then run the notebooks from the Linux environment.
- Use Python 3.11 (the code might also work on other versions, but we only tested 3.11).
- Clone the repository and go into the folder:

  ```bash
  git clone https://github.com/lamalab-org/matextract-book.git
  cd matextract-book
  ```

- (Optional, but recommended) Create a virtual environment and activate it:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install the dependencies:

  ```bash
  cd package && pip install .
  ```
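To confirm that the installation worked, you can run a small smoke test. This snippet is only an illustrative check, not part of the book's tooling; it simply asks Python whether the `matextract` package is importable in the current environment:

```python
import importlib.util

# find_spec returns None if the package cannot be found on the current
# Python path, and a ModuleSpec otherwise
spec = importlib.util.find_spec("matextract")
if spec is None:
    print("matextract not found - did you run 'pip install .' inside package/?")
else:
    print("matextract is installed")
```

If the package is reported as missing, double-check that you activated the virtual environment in which you ran `pip install .`.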
Running the commands above will install a package called `matextract`. We import it in all notebooks because it sets some plotting styles as well as other useful defaults:

- We turn on caching, a very effective way to save money when you use LLMs.
- We load some environment variables, such as API keys, which you can edit in the `.env` file. This `.env` file needs to be in the root directory of the repository, i.e., where the `.env.example` file is placed. If you want to know more about how and why to use environment variables and `.env` files, you can check this article.
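As a sketch, a `.env` file is a plain-text list of `KEY=value` pairs, one per line. The variable names below are illustrative examples only; check the `.env.example` file in the repository root for the exact keys the notebooks expect:

```shell
# .env — place this file in the repository root, next to .env.example
# (the key names below are hypothetical examples)
OPENAI_API_KEY=sk-...your-key-here...
ANTHROPIC_API_KEY=...your-key-here...
```

Keep this file out of version control, since it contains secrets; `.env` files are conventionally listed in `.gitignore`.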
This work was supported by:
- Carl-Zeiss Foundation (Mara Schilling-Wilhelmi, and Kevin Maik Jablonka)
- Intel and Merck (via AWASES programme, Mara Schilling-Wilhelmi, and Kevin Maik Jablonka)
- FAIRmat (Sherjeel Shabih, Christoph T. Koch, José A. Márquez, and Kevin Maik Jablonka)
- Spanish AEI (Martiño Ríos-García, and María Victoria Gil)
- CSIC (Martiño Ríos-García, and María Victoria Gil)
If you use this book in your research, please cite it as follows:
```bibtex
@article{Schilling_Wilhelmi_2025,
  title={From text to insight: large language models for chemical data extraction},
  ISSN={1460-4744},
  url={http://dx.doi.org/10.1039/D4CS00913D},
  DOI={10.1039/d4cs00913d},
  journal={Chemical Society Reviews},
  publisher={Royal Society of Chemistry (RSC)},
  author={Schilling-Wilhelmi, Mara and Ríos-García, Martiño and Shabih, Sherjeel and Gil, María Victoria and Miret, Santiago and Koch, Christoph T. and Márquez, José A. and Jablonka, Kevin Maik},
  year={2025}
}
```