Commit

Merge pull request #14 from microsoft/jjimenezluna/revpriv
feat: data for r1
josejimenezluna authored Jun 13, 2023
2 parents 7ebce7e + 3706b66 commit 525416e
Showing 10 changed files with 22,185 additions and 5 deletions.
5 changes: 0 additions & 5 deletions .pre-commit-config.yaml
@@ -11,11 +11,6 @@ repos:
     rev: 3.8.3
     hooks:
       - id: flake8
-  - repo: https://github.com/pycqa/isort
-    rev: 5.10.1
-    hooks:
-      - id: isort
-        args: ["--profile", "black", "--filter-files"]
   - repo: local
     hooks:
       - id: pyright
8 changes: 8 additions & 0 deletions data/README.md
@@ -0,0 +1,8 @@
This directory contains several datasets used throughout the study. In summary:

* `production_public.csv` describes the molecule pairs presented to study participants during the production runs (a bit over 5000 pairs). The binary label indicates whether the `smiles_j` compound was chosen (1) or not (0).
* `pre_r{1,2}.csv` contain participant responses for the first and second preliminary rounds of the study, respectively.
* `fragment_scores.csv` contains ~8000 fragments extracted with [BRICS](https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmdc.200800178) from the ChEMBL-derived training data, together with their MolSkill scores and their frequencies. These data were used in the analyses for Figure 5 in the main manuscript.
* `smilesrnnhc_{best|worst}_target9_coef1.0.csv` contain molecules generated de novo with the SMILES RNN algorithm and hill-climbing optimization, as made available in the [GuacaMol](https://github.com/BenevolentAI/guacamol_baselines) baselines Python package, using ZINC250k as the baseline library. Generated molecules were biased to have MolSkill scores close to either `-9` or `9` (for the `best` and `worst` files, respectively, as seen in Figure 6). The default RNN model and the `SquaredModifier` score modifier class with a coefficient of `1.0` were used for these analyses.
* `other_dbs/*.csv` contain the data for the analyses behind Figure 4 in the study. Specifically, the NIBR-filtered compound sets extracted from ChEMBL, the FDA-approved part of DrugBank, and GDB are present here.
* `assets/chembl_population_mean_std.csv` contains population-level statistics used to normalize the descriptors during default model training and evaluation.
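
As a rough illustration of how the binary label in `production_public.csv` can be read, the sketch below computes the fraction of pairs in which the `smiles_j` compound was chosen. It uses only the Python standard library. The column names `smiles_i` and `label`, the helper `chosen_fraction`, and the example rows are assumptions made for illustration; only `smiles_j` is named above, so check the actual CSV header before relying on them.

```python
import csv
import io


def chosen_fraction(csv_file, label_column="label"):
    """Return the fraction of pairs in which the `smiles_j` compound was chosen.

    `label_column` is a hypothetical column name; check the real CSV header.
    """
    reader = csv.DictReader(csv_file)
    # Each label is 1 if `smiles_j` was chosen and 0 otherwise.
    labels = [int(row[label_column]) for row in reader]
    if not labels:
        raise ValueError("no rows found")
    return sum(labels) / len(labels)


# Tiny in-memory stand-in for production_public.csv (made-up rows).
example = io.StringIO(
    "smiles_i,smiles_j,label\n"
    "CCO,CCN,1\n"
    "c1ccccc1,CCO,0\n"
    "CCN,CC(=O)O,1\n"
    "CCC,CCO,0\n"
)
fraction = chosen_fraction(example)
print(fraction)
```

To run it on the real file, open it with `open("data/production_public.csv", newline="")` and pass the file object to `chosen_fraction`, adjusting `label_column` to match the header.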
8,998 changes: 8,998 additions & 0 deletions data/fragment_scores.csv

Large diffs are not rendered by default.

2,387 changes: 2,387 additions & 0 deletions data/other_dbs/chembl_maxphase.csv

733 changes: 733 additions & 0 deletions data/other_dbs/dbfda-world-nibrfiltered.csv

8,617 changes: 8,617 additions & 0 deletions data/other_dbs/gdb13-17-nibrfiltered.csv

219 changes: 219 additions & 0 deletions data/pre_r1.csv

221 changes: 221 additions & 0 deletions data/pre_r2.csv

501 changes: 501 additions & 0 deletions data/smilesrnnhc_best_target9_coef1.0.csv

501 changes: 501 additions & 0 deletions data/smilesrnnhc_worst_target9_coef1.0.csv
