
adding support for multimers via residue_index hack #25

Open · wants to merge 3 commits into base: main
Conversation

sokrypton

No description provided.

@sokrypton (Author) commented Aug 14, 2022

I'm not sure if I implemented the residue-index offset for RoPE() correctly.

Setting the residue-index offset for the relative positional embedding works, but if you touch RoPE, it crashes!

Here is what I tried for RoPE():

# default: contiguous positions over the full (concatenated) sequence
position = torch.arange(total_length).cuda()

# custom residue index (per-chain numbering with a 200 offset between chains)
position = residue_index

# repeating residue index (each subunit reuses positions 0..subunit_length-1)
position = torch.arange(subunit_length).repeat(subunits).cuda()

[attached screenshot]
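
For reference, the three variants above could be built like this for a hypothetical two-chain homo-dimer. The sizes are illustrative, and the 200-residue gap construction is just one common way to do it, not necessarily what this branch does:

import torch

# illustrative sizes for a homo-dimer (two identical chains)
subunits = 2
subunit_length = 59            # length of one chain (illustrative)
total_length = subunits * subunit_length

# default: contiguous positions over the concatenated sequence
position_default = torch.arange(total_length)

# custom residue index: per-chain numbering with a 200-residue gap between chains
residue_index = torch.cat([
    torch.arange(subunit_length) + i * (subunit_length + 200)
    for i in range(subunits)
])

# repeating residue index: every chain reuses positions 0..subunit_length-1
position_repeat = torch.arange(subunit_length).repeat(subunits)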

@sokrypton (Author)

# constant offset applied to every position
position = 500 + torch.arange(total_length)

Also works!
So it's not an issue of absolute position... but it seems you need to preserve the relative encoding for OmegaFold to work.
[attached screenshot]
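
This is consistent with rotary embeddings in general: the attention logit between positions i and j depends only on i - j, so adding the same constant to every position leaves the scores unchanged, while gaps or repeats do change the relative offsets the model sees. A minimal generic RoPE sketch (not OmegaFold's implementation) that checks the constant-offset case:

import torch

def rope(x, position, base=10000.0):
    # rotate each 2D feature pair of x by a position-dependent angle
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = position[:, None].float() * inv_freq[None, :]   # (length, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
L, D = 8, 16
q, k = torch.randn(L, D), torch.randn(L, D)
pos = torch.arange(L)

logits_a = rope(q, pos) @ rope(k, pos).T
logits_b = rope(q, pos + 500) @ rope(k, pos + 500).T  # constant offset
print(torch.allclose(logits_a, logits_b, atol=1e-3))  # True: scores depend only on i - j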

I'm beginning to suspect that the reason why homo-oligomers work is that there are "fused" examples in the language model training set.

@RuiWang1998 (Contributor)

Hi @sokrypton,

Thanks for this info, we'll look into it.

Best

@RuiWang1998 (Contributor)

Hi Dr. @sokrypton,

Would you care to give us an example input for this PR?

@sokrypton (Author) commented Aug 15, 2022

For a hetero-dimer:
https://www.rcsb.org/structure/7M5F

>H1065
AKNSLTTKSLFKEMTIQGIKFTPENVVGAAKDNSGKIIFLEKGNSKSGLQHIVEEHGDQFAQIGVSEARIPDVVMKAVTDGKIVGYQGAGAGRPIYETMIDGKKYNIAVTVGSNGYVVGANLRGSVK:MKEIKLMADYHCYPLWGTTPDDFGDISPDELPISLGLKNSLEAWAKRYDAILNTDDPALSGFKSVEEEKLFIDDGYKLAELLQEELGSAYKVIYHADY

For a homo-oligomer:

>tmp
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK
>tmp
PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK

@RuiWang1998 (Contributor)

Hi Dr. @sokrypton,

We have reproduced your results and agree that it is really peculiar. We are going to take a deeper look into the reason for this.

As for the training set of the language model: we take the sequences directly from UniRef50.

@sokrypton (Author) commented Aug 17, 2022

I guess, since the LM was trained on single chains, there is no reason to expect it to generalize to proteins it hasn't seen before, especially protein multimers. I suspect it sometimes works for multimers when the multi-chain protein looks like a multi-domain protein in the LM training set.

@RuiWang1998 (Contributor)

Hi,

We have been looking into this issue and found one problem with RoPE that is not necessarily related to it: the encoding is symmetric, which may not be what we want, so we are phasing it out soon.

Still, this problem persists, and we do not really have an idea yet.

@JackMaguire

@sokrypton @RuiWang1998 As a hungry user, I have to ask: are the problems with the current branch technical or scientific?
