
vocab size of data/chembl/vocab.txt is 5623, but vocab.txt regenerated with get_vocab.py has 5625 lines #47

Open
bsaldivaremc2 opened this issue Jul 13, 2023 · 3 comments

Comments

@bsaldivaremc2

When following the instructions in the README.md, neither of the commands shown seems to work out of the box.
So far I have added py_modules=['hgraph'] to setup.py and added ",clearAromaticFlags=True)" in chemutils.py.
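For reference, a minimal sketch of the chemutils.py tweak mentioned above, assuming it is applied to the SMILES-to-Mol helper there (the exact call site is my guess; clearAromaticFlags is a standard keyword argument of RDKit's Chem.Kekulize):

from rdkit import Chem

def get_mol(smiles):
    # Hypothetical stand-in for the conversion helper in hgraph/chemutils.py.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # The tweak: also clear aromatic flags while kekulizing, so downstream
    # motif extraction does not trip over aromatic bond types.
    Chem.Kekulize(mol, clearAromaticFlags=True)
    return mol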

Sampling from the checkpoint does not work:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

So I tried to reproduce the vocab with:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > new_vocab.txt
That works, but new_vocab.txt has 5625 lines while data/chembl/vocab.txt has 5623, and there are multiple differences between them, not just two extra lines.
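To see where the two files diverge, a quick comparison of them as sets of lines (paths follow the commands above; this is just a checking aid, not part of the pipeline):

with open('data/chembl/vocab.txt') as f:
    original = set(line.strip() for line in f if line.strip())
with open('new_vocab.txt') as f:
    regenerated = set(line.strip() for line in f if line.strip())

print(len(original), 'entries in the shipped vocab')         # 5623
print(len(regenerated), 'entries in the regenerated vocab')  # 5625
print(len(regenerated - original), 'entries only in the regenerated vocab')
print(len(original - regenerated), 'entries only in the shipped vocab')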

Do you have any way to sample from the checkpoint without issues?
Also, why am I getting a different vocab result from the same data/chembl/all.txt file? Is there some random operation? I left all random seeds as they are in the scripts.

@FlexxofIvan

I have the same problem. Did you solve it?

@bsaldivaremc2
Author

I did not solve it, but I am skipping some functionality to make it work with the provided pre-trained model and vocabulary.
I noticed that there is an error whenever anchor_smiles in decoder.decode (decoder.py) holds more than one entry.
So I limit it to a single anchor by adding

if len(anchor_smiles) > 1: continue

in hgraph/decoder.py, right after the get_assm_cands call and before the existing empty-candidate check, so that block becomes:

inter_cands, anchor_smiles, attach_points = graph_batch.get_assm_cands(fa_cluster, fa_used, ismiles)
if len(anchor_smiles) > 1: continue
if len(inter_cands) == 0:

@bsaldivaremc2
Author

I probably solved the problem.
It works for the first 900 million times you generate.
Instead of the original vocab, use this one: https://github.com/bsaldivaremc2/hgraph2graph/blob/master/data/chembl/recovered_vocab_2000.txt
python generate.py --vocab data/chembl/recovered_vocab_2000.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000
I captured all the motifs that were causing the problem and included them in the original vocab list, replacing 27 of the less-used motif pairs.
Details of the files here: https://github.com/bsaldivaremc2/hgraph2graph/tree/master/data/chembl
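A rough sketch of the kind of merge described above, assuming you already have the problem motifs collected from failed generations, a usage count per vocab entry, and that the vocab must keep its original size so it still matches the pre-trained checkpoint (names and the counting source are hypothetical; the linked recovered_vocab_2000.txt is the actual result):

def merge_vocab(original_lines, problem_motifs, usage_counts):
    # Drop the least-used original entries to make room for the motifs that
    # caused generation to fail, keeping the total number of lines unchanged.
    existing = set(original_lines)
    missing = [m for m in problem_motifs if m not in existing]
    by_usage = sorted(original_lines, key=lambda line: usage_counts.get(line, 0))
    to_drop = set(by_usage[:len(missing)])  # e.g. the 27 least-used pairs
    kept = [line for line in original_lines if line not in to_drop]
    return kept + missing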

bsaldivaremc2 added a commit to bsaldivaremc2/hgraph2graph that referenced this issue Jul 26, 2023
Included generation step with pre-trained model that corrects the issue wengong-jin#47