Add SCOPe dataset to our pipeline #67

sfluegel05 · 2024-12-12T14:02:03Z

Our goal is to reproduce the ontology pretraining on a protein-related task. For this, we have already implemented a GO dataset (see #36), which serves as the finetuning task. The next step is to add an ontology pretraining task that precedes it. This would give us the following alignment:

stage	chemistry	proteins
unsupervised pretraining	mask pretraining (ELECTRA)	mask pretraining (ESM2, optional)
ontology pretraining	ChEBI	SCOPe
finetuning task	Toxicity, Solubility, ...	GO (MF, BP, CC branches)

SCOPe is a good fit since it is mostly structure-based (unlike GO, which has more complex functional classes). It also has a manageable size (~140,000 entries, similar to ChEBI).

Goal

Add a SCOPe dataset to our pipeline. The data should be processed so that it can be used in the same way as, e.g., the GO data (just with different labels).
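
As a rough starting point, here is a minimal sketch of how SCOPe classification strings could be turned into hierarchical labels. It assumes the tab-separated layout of the parseable `dir.cla` file (domain sid, PDB id, region, sccs string, sunid, sunid list) and should be checked against the file documentation linked below. The names `ScopeEntry`, `sccs_to_labels` and `parse_cla_line` are hypothetical and not part of our pipeline, and the example line is only illustrative of the assumed layout.

```python
from typing import List, NamedTuple, Optional


class ScopeEntry(NamedTuple):
    """One domain record (hypothetical container, not an existing pipeline class)."""
    sid: str           # SCOPe domain identifier
    pdb_id: str        # PDB entry the domain comes from
    region: str        # chain / residue range within the PDB entry
    sccs: str          # classification string, e.g. "a.1.1.1"
    labels: List[str]  # the sccs string plus all of its ancestors


def sccs_to_labels(sccs: str) -> List[str]:
    """Expand an sccs string into class, fold, superfamily and family labels."""
    parts = sccs.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def parse_cla_line(line: str) -> Optional[ScopeEntry]:
    """Parse one line of the classification file; skip comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumption: the first four tab-separated columns are sid, PDB id, region, sccs.
    sid, pdb_id, region, sccs = line.split("\t")[:4]
    return ScopeEntry(sid, pdb_id, region, sccs, sccs_to_labels(sccs))


# Illustrative line in the assumed layout (verify against the real file).
example = "d1ux8a_\t1ux8\tA:\ta.1.1.1\t113449\tcl=46456,cf=46457,sf=46458,fa=46459,dm=46460,sp=116748,px=113449"
entry = parse_cla_line(example)
```

Each level of the hierarchy becomes one label, so a domain carries its family together with the enclosing superfamily, fold and class. That is probably the closest analogue to how ontology ancestors are used as labels for ChEBI and GO. This file does not contain the amino acid sequences, so they would have to be joined in from another source.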

Links

SCOPe website: https://scop.berkeley.edu/ (documentation for files: https://scop.berkeley.edu/help/ver=2.08#parseablefiles-2.08)
Latest publication: https://academic.oup.com/nar/article/47/D1/D475/5219094?login=true
Initial publication: https://scop.berkeley.edu/references/murzin-1995-jmb.pdf
PDB (source of their proteins): https://www.rcsb.org/

aditya0by0 self-assigned this Dec 12, 2024
