Image-to-text translation of chemical molecule structures with deep learning (top-5% Kaggle solution)

arnabneogi86/BMS_Molecular_Translation

 
 

BMS Molecular Translation

Top-5% solution to the BMS Molecular Translation Kaggle competition on chemical image-to-text translation.

[Image: sample]

Summary

Organic chemists frequently draw molecules using structural graph notation. As a result, decades of scanned publications and medical documents contain drawings that are not annotated with chemical formulas. Reliably converting such images into machine-readable formulas requires time-consuming manual work by experts. Automated optical recognition of chemical structures with deep learning helps speed up research and development in the field.

The goal of this project is to develop a deep learning framework for chemical image captioning. In other words, the project aims to translate unlabeled chemical images into text formulas. To do that, I work with a large dataset of more than 4 million chemical images provided by Bristol-Myers Squibb.

My solution is an ensemble of seven CNN-LSTM Encoder-Decoder models implemented in PyTorch. The table below summarizes the architectures and main training parameters. The solution reaches a test score of 1.31 (mean Levenshtein distance) and places in the top 5% of the competition leaderboard. A detailed summary is provided in this writeup.

[Image: table of model architectures and main training parameters]
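
The score quoted above is an edit distance between predicted and true formula strings, averaged over images (lower is better). The following is a minimal sketch of that metric in plain Python, not the code used in this repo. The function names `levenshtein` and `mean_levenshtein` are illustrative assumptions.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def mean_levenshtein(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Average edit distance over aligned prediction/target pairs."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and aligned")
    return sum(levenshtein(p, t) for p, t in zip(preds, targets)) / len(preds)
```

A score of 1.31 therefore means that, on average, a predicted string differs from the true one by roughly one to two character edits.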

Project structure

The project has the following structure:

  • codes/: main .py code with data, model, training and inference modules
  • notebooks/: .ipynb Colab-friendly notebooks for data augmentation and model training
  • input/: input data (not included due to size constraints, can be downloaded here)
  • output/: model weights, configurations and figures exported from the notebooks

Working with the repo

Environment

To work with the repo, I recommend creating a Conda virtual environment from the environment.yml file:

conda env create --name bms --file environment.yml
conda activate bms

Reproducing solution

The solution can then be reproduced in the following steps:

  1. Download competition data and place it in the input/ folder.
  2. Run 01_preprocessing_v1.ipynb to preprocess the data and define a chemical tokenizer.
  3. Run 02_gen_extra_data.ipynb and 03_preprocessing_v2.ipynb to construct additional synthetic images.
  4. Run the training notebooks 04_model_v6.ipynb through 10_model_v33.ipynb to obtain the weights of the base models.
  5. Normalize the model predictions using 11_normalization.ipynb.
  6. Run the ensembling notebook 12_ensembling.ipynb to obtain the final predictions.
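
Step 2 mentions defining a chemical tokenizer. The competition targets are InChI strings, so one way to picture such a tokenizer is sketched below. This is an illustrative sketch only, not the tokenizer defined in 01_preprocessing_v1.ipynb. The regular expression, the special tokens and all function names are assumptions.

```python
import re

# Assumed token pattern: the InChI prefix, layer markers such as "/c" or "/h",
# a few two-letter element symbols, runs of digits, then any single character.
TOKEN_PATTERN = re.compile(r"InChI=1S|/[a-z]|Br|Cl|Si|\d+|.")
SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>"]  # hypothetical special tokens


def tokenize(inchi: str) -> list:
    """Split an InChI string into tokens; joining them restores the string."""
    return TOKEN_PATTERN.findall(inchi)


def build_vocab(inchis) -> dict:
    """Map every special token and observed token to an integer id."""
    tokens = sorted({tok for s in inchis for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + tokens)}


def encode(inchi: str, vocab: dict) -> list:
    """Convert a string to ids, wrapped in start and end markers."""
    body = [vocab[tok] for tok in tokenize(inchi)]
    return [vocab["<sos>"]] + body + [vocab["<eos>"]]


def decode(ids, vocab: dict) -> str:
    """Convert ids back to a string, dropping special tokens."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in SPECIAL_TOKENS)


example = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"  # ethanol
vocab = build_vocab([example])
ids = encode(example, vocab)
```

Tokenizing at this level, rather than character by character, keeps multi-digit numbers and layer markers as single decoder steps, which shortens the target sequences the LSTM decoder has to produce.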

All training notebooks have the same structure and differ only in model and data parameters. Different versions are included to ensure reproducibility. To understand the training process, it is sufficient to go through codes/ and inspect one of the modeling notebooks. The ensembling code is also provided in this Kaggle notebook.
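
To give a rough idea of what combining string predictions from several models can look like, here is a simple majority-vote sketch. It is not the method implemented in 12_ensembling.ipynb, which should be consulted for the actual logic. The function names and the tie-breaking rule (prefer the earliest model in the list) are assumptions made for illustration.

```python
from collections import Counter


def vote(candidates):
    """Return the most frequent candidate string.
    Ties go to the candidate that appears first (assumed model priority)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(candidates)
    best = max(counts.values())
    for candidate in candidates:
        if counts[candidate] == best:
            return candidate


def ensemble(predictions_per_model):
    """Combine per-model prediction lists that are aligned by image.

    predictions_per_model[m][i] is model m's predicted string for image i.
    """
    return [vote(list(row)) for row in zip(*predictions_per_model)]
```

With this rule, listing the strongest single model first makes its prediction the fallback whenever the models disagree completely.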

More details are provided in the documentation within the code and notebooks.
