
vocab size of data/chembl/vocab.txt is 5623, but vocab.txt regenerated with get_vocab.py has 5625 lines #47

Open
bsaldivaremc2 opened this issue Jul 13, 2023 · 3 comments

Comments

@bsaldivaremc2

When following the instructions in the README.md, neither of the commands shown seems to work out of the box.
So far I have added py_modules=['hgraph'] to setup.py and added ",clearAromaticFlags=True)" in chemutils.py.
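For reference, a minimal sketch of the chemutils.py tweak mentioned above, assuming it is applied to the SMILES-to-Mol helper there (the exact call site is my guess; clearAromaticFlags is a standard keyword argument of RDKit's Chem.Kekulize):

from rdkit import Chem

def get_mol(smiles):
    # Hypothetical stand-in for the conversion helper in hgraph/chemutils.py.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # The tweak: also clear aromatic flags while kekulizing, so downstream
    # motif extraction does not trip over aromatic bond types.
    Chem.Kekulize(mol, clearAromaticFlags=True)
    return mol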

Sampling from the checkpoint does not work:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

So I tried to reproduce the vocab with:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > new_vocab.txt
That works, but new_vocab.txt has 5625 lines while data/chembl/vocab.txt has 5623, and there are multiple differences between them, not just two extra lines.
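To see where the two files diverge, a quick comparison of them as sets of lines (paths follow the commands above; this is just a checking aid, not part of the pipeline):

with open('data/chembl/vocab.txt') as f:
    original = set(line.strip() for line in f if line.strip())
with open('new_vocab.txt') as f:
    regenerated = set(line.strip() for line in f if line.strip())

print(len(original), 'entries in the shipped vocab')         # 5623
print(len(regenerated), 'entries in the regenerated vocab')  # 5625
print(len(regenerated - original), 'entries only in the regenerated vocab')
print(len(original - regenerated), 'entries only in the shipped vocab')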

Do you have any way to sample from the checkpoint without issues?
Also, why am I getting a different vocab result from the same data/chembl/all.txt file? Is there some random operation? I left all random seeds as they are in the scripts.

@FlexxofIvan

I have the same problem. Did you solve it?

@bsaldivaremc2
Author

I did not solve it, but I am skipping some functionality to make it work with the provided pre-trained model and vocabulary.
I noticed that there is an error whenever anchor_smiles in decoder.decode (decoder.py) holds more than one entry.
So I limit it to a single anchor by adding

if len(anchor_smiles) > 1: continue

in hgraph/decoder.py, right after the get_assm_cands call and before the existing empty-candidate check, so that block becomes:

inter_cands, anchor_smiles, attach_points = graph_batch.get_assm_cands(fa_cluster, fa_used, ismiles)
if len(anchor_smiles) > 1: continue
if len(inter_cands) == 0:

@bsaldivaremc2
Author

I probably solved the problem.
It works for the first 900 million times you generate.
Instead of the original vocab, use this one: https://github.com/bsaldivaremc2/hgraph2graph/blob/master/data/chembl/recovered_vocab_2000.txt
python generate.py --vocab data/chembl/recovered_vocab_2000.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000
I captured all the motifs that were causing the problem and included them in the original vocab list, replacing 27 of the less-used motif pairs.
Details of the files here: https://github.com/bsaldivaremc2/hgraph2graph/tree/master/data/chembl
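A rough sketch of the kind of merge described above, assuming you already have the problem motifs collected from failed generations, a usage count per vocab entry, and that the vocab must keep its original size so it still matches the pre-trained checkpoint (names and the counting source are hypothetical; the linked recovered_vocab_2000.txt is the actual result):

def merge_vocab(original_lines, problem_motifs, usage_counts):
    # Drop the least-used original entries to make room for the motifs that
    # caused generation to fail, keeping the total number of lines unchanged.
    existing = set(original_lines)
    missing = [m for m in problem_motifs if m not in existing]
    by_usage = sorted(original_lines, key=lambda line: usage_counts.get(line, 0))
    to_drop = set(by_usage[:len(missing)])  # e.g. the 27 least-used pairs
    kept = [line for line in original_lines if line not in to_drop]
    return kept + missing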

bsaldivaremc2 added a commit to bsaldivaremc2/hgraph2graph that referenced this issue Jul 26, 2023
Included generation step with pre-trained model that corrects the issue wengong-jin#47