
finetuning with MLM task #70

Open
sj584 opened this issue Oct 30, 2024 · 5 comments


sj584 commented Oct 30, 2024

Awesome work :)

I am thinking about fine-tuning this model on a specific protein domain with the MLM task.
That is, starting from the SaProt model weights and further fine-tuning on an unlabeled dataset.

  1. When I run this command for MLM fine-tuning,
    python scripts/training.py -c config/Pretrain/saprot.yaml

do I only need to change load_pretrained: True?

I also want to ask your opinion on this:

  2. Given that the model does the MLM task on both structure tokens and sequence tokens,
    would it be possible to use this model as a kind of structure predictor?
    (sequence given -> structure token recovery -> 3D structure reconstruction)
    Or vice versa (i.e. sequence design given a structure)?

Thank you in advance for your comment!


sj584 commented Oct 30, 2024

One more thing:
would it also be possible to do PEFT (Parameter-Efficient Fine-Tuning) when fine-tuning this model?


LTEnjoy commented Oct 30, 2024

Hi, thank you for your interest in our work and for asking such intriguing questions!

When I run this command for MLM fine-tuning, python scripts/training.py -c config/Pretrain/saprot.yaml, do I only need to change load_pretrained: True?

Yes. Setting load_pretrained to True loads the pretrained SaProt weights as a starting point, from which you can further fine-tune your own model.
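For reference, the relevant part of the config might look like the excerpt below. Only load_pretrained is confirmed above; the surrounding key structure is illustrative, so check config/Pretrain/saprot.yaml in the repo for the actual nesting:

```yaml
# config/Pretrain/saprot.yaml (excerpt; nesting is illustrative)
model:
  # Start from the released SaProt checkpoint instead of random initialization
  load_pretrained: True
```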

Given that the model does the MLM task on both structure tokens and sequence tokens, would it be possible to use this model as a kind of structure predictor? (sequence given -> structure token recovery -> 3D structure reconstruction) Or vice versa (i.e. sequence design given a structure)?

Interesting question! I think it depends on what kind of MLM task the model was pre-trained on. For SaProt, we didn't train it to predict structure tokens; only amino acid tokens were predicted to compute the loss. We discuss this in Appendix F of our paper. Because of this, SaProt may not have been endowed with the capability to do structure prediction. On the other hand, it can indeed do protein sequence design given a structure backbone; see Fig. 1g of our latest SaprotHub paper.

Would it also be possible to do PEFT (Parameter-Efficient Fine-Tuning) when fine-tuning this model?

Yes. If you check our code at https://github.com/westlake-repl/SaProt/blob/main/model/saprot/base.py, you will find we have already implemented the LoRA technique for model fine-tuning. You can enable it simply by setting use_lora to True.
(screenshot: LoRA-related code in model/saprot/base.py)
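In config terms, enabling LoRA might then look like the fragment below. Only the use_lora switch is confirmed above; the key nesting is an assumption for illustration, so match it against the configs shipped in the repo:

```yaml
# Training config excerpt (nesting is illustrative)
model:
  kwargs:
    use_lora: True   # wrap the model with LoRA adapters (see model/saprot/base.py)
```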

Hope this resolves your questions. Let me know if I can help further :)


sj584 commented Oct 30, 2024

Thanks again for your amazingly helpful and quick reply!


sj584 commented Nov 5, 2024

Hi again :)

After studying your code,
I was planning to use my own dataset for further training.

However, the usage of the .mdb file is quite tricky for me.
I looked through #53 and #16 (https://www.cnblogs.com/sddai/p/10481869.html),
but I still find it hard to understand.

So my questions are:

  1. Is there a way to train/test without using .mdb files?
    I think solely using the Hugging Face modules for data processing in peft_finetuning might work.

  2. If not, could you share your data-processing code?
    It would be very helpful if you could provide the code you use to prepare the dataset,
    starting from pdb_path to generating the .mdb file.

Many thanks for your help!


LTEnjoy commented Nov 18, 2024

Hi,

Apologies for the late reply! I've been busy working on another project and didn't see the new comments :(

Is there a way which does not use .mdb for training/test?

I don't think that's advisable. Our code is built around the .mdb format and we process all data into it. It will run smoothly once you convert your data into an .mdb file.

If not, could you share your data processing code?

Sure! Please refer to this reply: #72 (comment). Following it, you can generate your own .mdb dataset.

Hope this resolves your problem :)
