How is the 3D structure captured in the ESM-3 model? #91

Open
anonimoustt opened this issue Aug 28, 2024 · 5 comments

Comments

@anonimoustt

Hi,

I was comparing ESM-3 structure embeddings with ESM-3 sequence embeddings and found that the distance between them is very small (0.0001). I am curious how the ESM-3 model is pre-trained on the 3D structures of protein sequences. Is there a paper or documentation on ESM-3 that explains how it captures 3D structure?

@ebetica
Contributor

ebetica commented Aug 28, 2024

Have you seen our biorxiv paper? https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1

@anonimoustt
Author

Hi,

Thanks for your reply. I will check it. Could you say specifically whether ESM-3 was trained on AlphaFold 3D structures (full-sequence structures) of human protein sequences?

@ebetica
Contributor

ebetica commented Aug 29, 2024

Yes, data from the AlphaFoldDB was used to train ESM3, and that includes human proteins.

@anonimoustt
Author

anonimoustt commented Aug 29, 2024

Thanks for your reply. It is really interesting. I generated embeddings separately using ESM-3 sequence and ESM-3 structure, and applied Agglomerative Clustering to each set. The clusters produced from the ESM-3 sequence embeddings differ from the clusters produced from the ESM-3 structure embeddings. If ESM-3 captures both sequence and structure, why are the clusters different?

To investigate further, I looked at two proteins, Q6P3R8 and Q9BYP7, that appear together in the ESM-3 sequence-based clustering but not in the ESM-3 structure-based clustering. Using the ESM-3 structure embeddings, the cosine similarity between Q6P3R8 and Q9BYP7 is 0.96962. Using the ESM-3 sequence embeddings, it is 0.9922. The two cosine similarities are very close, yet Q6P3R8 and Q9BYP7 fall in the same cluster with the sequence embeddings and in separate clusters with the structure embeddings. Shouldn't the clusters be similar for the ESM-3 sequence and ESM-3 structure embeddings? Why am I getting different clusters or trees?
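The comparison described above can be sketched on toy data. The block below is only an illustration under stated assumptions: the 2-D vectors are synthetic stand-ins, not ESM-3 embeddings, and the choice of cosine distance with average linkage is an assumption, since the comment does not say which metric or linkage was used. The helper names (`unit`, `agglomerative`, `same_cluster`) are hypothetical. It shows that a pair can have a cosine similarity of about 0.97 and still be split by agglomerative clustering when each member has an even closer neighbour.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering on cosine distance.

    Assumption: the linkage and metric used in the thread are unknown;
    this is one common configuration, implemented naively.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        dists = [1.0 - cosine_similarity(points[i], points[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def same_cluster(clusters, a, b):
    """True if items a and b were assigned to the same cluster."""
    return any(a in c and b in c for c in clusters)


def unit(angle_deg):
    """Toy 2-D 'embedding' pointing at the given angle (synthetic data)."""
    t = math.radians(angle_deg)
    return [math.cos(t), math.sin(t)]


# Items 0 and 1 play the role of the two proteins; 2 and 3 are neighbours.
seq_space = [unit(0), unit(2), unit(30), unit(32)]     # 0 and 1 adjacent
struct_space = [unit(0), unit(14), unit(2), unit(17)]  # 0 near 2, 1 near 3

seq_clusters = agglomerative(seq_space, n_clusters=2)
struct_clusters = agglomerative(struct_space, n_clusters=2)

# The pair similarity is high in both spaces (about 0.970 in the second),
# but only the first clustering keeps items 0 and 1 together.
print(cosine_similarity(seq_space[0], seq_space[1]))
print(cosine_similarity(struct_space[0], struct_space[1]))
print(same_cluster(seq_clusters, 0, 1), same_cluster(struct_clusters, 0, 1))
```

In this toy setting, cluster membership depends on how the pair's distance compares with the distances to every other item, not on the absolute similarity value. Whether that explains the Q6P3R8 and Q9BYP7 result would have to be checked against the actual embeddings.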

@anonimoustt
Author

Hi, is it possible to re-train the ESM-3 model with both structure and sequence?
