From 9c1aa84979939ba9be383505a0706f6e7ce10bab Mon Sep 17 00:00:00 2001
From: cody-mar10
Date: Wed, 9 Oct 2024 15:35:31 -0500
Subject: [PATCH] removed usage details to refer to wiki

- also added citation info
---
 README.md | 71 +++++++++++++++++--------------------------------------
 1 file changed, 22 insertions(+), 49 deletions(-)

diff --git a/README.md b/README.md
index 1430688..df633f5 100644
--- a/README.md
+++ b/README.md
@@ -125,56 +125,9 @@ Here is a summary of each model:
 | `pst-small` | 5 | 4 | 5.4M | 400 |
 | `pst-large` | 20 | 32 | 177.9M | 1280 |
 
-## Embedding new genomes with the pretrained models
+## Usage, Finetuning, and Model API
 
-### 1. ESM2 protein embeddings
-
-You will first need to generate ESM2 protein embeddings.
-
-The source repository for ESM can be found here: [https://github.com/facebookresearch/esm](https://github.com/facebookresearch/esm). At the beginning of this project, there was not a user friendly way provided by the ESM team to get protein embeddings from a protein FASTA file, so we have provided a repository to do this: [https://github.com/cody-mar10/esm_embed](https://github.com/cody-mar10/esm_embed). We plan to integrate this into the single `pst` executable for simpler usage, including creating an end-to-end pipeline to go from protein FASTA files to PST outputs.
-
-The ESM team has now provided `esm-extract` utility to do this, but we have not yet integrated protein embeddings generated from this route.
-
-Here is what ESM2 models are used for each vPST model:
-
-| vPST | ESM2 |
-| :---------- | :-------------------- |
-| `pst-small` | `esm2_t30_150M_UR50D` |
-| `pst-large` | `esm2_t6_8M_UR50D` |
-
-#### FASTA File requirements
-
-The `esm_embed` tool we provide produces protein language model embeddings for each protein in an input FASTA file **IN THE SAME ORDER** as the sequences in the file. We plan to integrate this tool into the `pst` executable.
-
-Thus, the following are **required** of the input FASTA file:
-
-1. The file must be sorted to group all proteins from the same genome together
-2. For the block of proteins from each genome, the proteins must be in order of their appearance in the genome.
-3. The FASTA headers must look like this: `scaffold_#`, where `scaffold` is the genome scaffold name and `#` is the protein numerical ID relative to each scaffold.
- - In the event that you have multi-scaffold viruses (vMAGs, etc.), you can either manually orient the scaffolds and renumber the proteins to contiguously count from the first scaffold to the last. This is what was done with the test dataset in the manuscript.
- - We provided a utility script `pst graphify` to do this if an input mapping from scaffolds to genomes is provided. See next section.
- - TODO: We will explore a more native solution for multi-scaffold viruses that does not require an arbitrary arrangement of scaffolds that should not require changes to the model.
-
-### 2. Convert protein embeddings to graph format
-
-Use the `pst graphify` command to convert the ESM2 protein embeddings into graph format. You will need to protein FASTA file used to generate the embeddings, since the embeddings should be in the same order as the FASTA file. The FASTA file should be in prodigal format:
-`>scaffold_ptnid # start # stop # strand ....`
-
-If you did not keep the extra metadata on the headers, you can alternatively provide a simple tab-delimited mapping file that maps each protein name to its strand (-1 or 1 only).
-
-Further, if you have multi-scaffold viruses, you can provide a tab-delimited file that maps the scaffold name to the genome name to count all proteins from the entire genome instead of each scaffold.
-
-### 3. Use PST for genome embeddings and contextualized protein embeddings
-
-Use the `pst predict` command with the input graph-formatted protein embeddings and trained model checkpoint. The test run above shows the minimum flags needed. You can also use `pst predict -h` to see what options are available, but the most important ones will be:
-
-| Argument | Description |
-| :-------------- | :------------------------------------------------------------------------------------ |
-| `--file` | Input graph-formatted .h5 file |
-| `--outdir` | Output directory name |
-| `--accelerator` | Device accelerator. Defaults to "gpu", so you may need to change this to "cpu" |
-| `--devices` | Either the number of GPUs or the number of CPU threads (depending on `--accelerator`) |
-| `--checkpoint` | Which trained model checkpoint to use. See data availability above. |
+Please read the [wiki](https://github.com/AnantharamanLab/protein_set_transformer/wiki) for more information about how to use these models, extend them for finetuning and transfer learning, and the specific model API to integrate new models into your own workflows.
 
 ## Manuscript
 
@@ -191,3 +144,23 @@ There are several other repositories associated with the model code and the manu
 | [genslm_embed](https://github.com/cody-mar10/genslm_embed) | Code to generate [GenSLM](https://github.com/ramanathanlab/genslm) ORF and genome embeddings |
 | [hyena-dna-embed](https://github.com/cody-mar10/hyena-dna-embed) | Code to generate [Hyena-DNA](https://github.com/HazyResearch/hyena-dna) genome embeddings |
 | [PST_host_prediction](https://github.com/cody-mar10/PST_host_prediction) | Model and evaluation code for our host prediction proof of concept analysis |
+
+### Citation
+
+Please cite our preprint if you find our work useful:
+
+Martin C, Gitter A, Anantharaman K. (2024) "[Protein Set Transformer: A protein-based genome language model to power high diversity viromics.](https://doi.org/10.1101/2024.07.26.605391)"
+
+```bibtex
+@article {
+ author = {Cody Martin and Anthony Gitter and Karthik Anantharaman},
+ title = {Protein Set Transformer: A protein-based genome language model to power high diversity viromics},
+ elocation-id = {2024.07.26.605391},
+ year = {2024},
+ doi = {10.1101/2024.07.26.605391},
+ publisher = {Cold Spring Harbor Laboratory},
+ URL = {https://www.biorxiv.org/content/10.1101/2024.07.26.605391v1},
+ eprint = {https://www.biorxiv.org/content/10.1101/2024.07.26.605391v1.full.pdf}
+ journal = {bioRxiv},
+}
+```