Skip to content

Installation

georg.kempf edited this page Apr 2, 2024 · 2 revisions

GUIFold

GUI for running jobs with a local installation of AlphaFold2. Supports submission to queuing systems.

Important note: This is an ALPHA version and development is ongoing. Feedback and bug reports are very welcome.

Main Features:

  • Organization of jobs into projects
  • Tracking and restoring jobs
  • Queue submission with memory estimation and submission script template
  • Custom Protein Template
  • Disabling MSA and/or template search
  • Splitting into feature generation (CPU) and prediction (GPU) steps
  • Support of MMseqs2/colabfold MSA pipeline (in addition to original MSA pipeline)
  • Automated pairwise interaction screening (with parallelisation over multiple GPUs)
  • Evaluation pipeline (PAE/pLDDT tables, PAE plots)

See Wiki for more detailed documentation.

Demo-Image

Installation

Installation of the repository and dependencies

The installation requires conda.

mkdir guifold
cd guifold
git clone --recurse-submodules https://github.com/fmi-basel/GUIFold

It is important to clone with the --recurse-submodules option to include the modified alphafold module.

Before proceeding with the installation it is recommended to setup the global configuration file and cluster submission script (if needed).

If you have a separate initialization script for conda (if the initialization is not in your .bashrc) add

source path/to/conda_init.sh

to the beginning of GUIFold/install.sh.

Running the install.sh file will

  • create a conda environment in the same folder
  • install required packages (the packages are listed in the _environment.yml files
  • install the modified alphafold package
  • install GUIFold

To run the setup:

bash GUIFold/install.sh

conda_env will be the default name of the conda repository. The conda env will be installed with an absolute path (which is also needed for activation).

(Optional) Install MMseqs2 to use the colabfold protocols for MSA generation.

Test if the GUI opens (see Troubleshooting):

conda activate /path/to/conda_env
afgui.py

Troubleshooting

Issue: When trying to run afgui.py: ImportError: libXss.so.1: cannot open shared object file: No such file or directory Solution: Add the following path to the LD_LIBRARY_PATH in the command prompt: export LD_LIBRARY_PATH=/path/to/conda_env/x86_64-conda-linux-gnu/sysroot/usr/lib64/:$LD_LIBRARY_PATH

Download of genetic databases and params

Follow instructions in the AlphaFold readme.

Setup of global configuration file

When GUIFold is installed in a shared location it is recommended to create a global configuration file so that the users don't have to configure the paths on their own. When a user starts the app for the first time and a configuration file exists in the expected location, the parameters will be automatically transferred to the database of the user (stored in the home directory of the user). The user can change settings in the GUI later on. Open the file GUIFold/guifold/config/template.conf and adapt it to your local environment. Further explanations of the different parameters are given as comments in the file. After editing, save the file to GUIFold/guifold/config/guifold.conf. It is important to use this specific name and location otherwise it will not be loaded. When the global configuration needs to be changed later on, the users can re-load it in the Settings dialog of the GUI.

Re-install the package if it has been installed before:

(conda activate /path/to/conda_env)
cd GUIFold
python setup.py clean --all install clean --all

Setup of MMseqs2 and colabfold databases (OPTIONAL)

MMSeqs2 installation

MMSeqs is not automatically installed. It can be easily added to the conda environment with

(conda activate /path/to/conda_env)
conda install -c conda-forge -c bioconda mmseqs2

or installed in diffrent ways as described in MMseqs2 documentation

Database setup

If you want to use the colabfold protocol you also need to download "uniref30_2202" and "colabfold_envdb_202108" (available at Link). It is required to convert these databses to expandable profile databases and generate database indices (see MMseqs2 documentation).

mmseqs createindex is used to create database indices. The --split 0 flag will automatically determine the number of splits based on the available RAM. Therefore this step should be run on the machine where Alphafold is later run (or adjusted to the minimal available RAM with --split-memory-limit in addition to the --split 0 flag). --threads can be used for parallelisation. More details in MMseqs2 documentation.

  1. Go to the uniref90 database directory (which contains uniref90.fasta) and run
mmseqs createindex uniref90 tmp --split 0

In the global configuration file (see below) the uniref90_mmseqs path needs to point to your_directory_with_databases/uniref90/uniref90

  1. Go to the uniprot database directory (which contains uniprot.fasta) and run
mmseqs createindex uniprot tmp --split 0

In the global configuration file the uniref90_mmseqs path needs to point to your_directory_with_databases/uniprot/uniprot

  1. Go to the colabfold_envdb directory and run
mmseqs tsv2exprofiledb colabfold_envdb_202108 colabfold_envdb_202108_db
mmseqs createindex colabfold_envdb_202108_db tmp --split 0

In the global configuration file the colabfold_envdb path needs to point to your_directory_with_databases/colabfold_envdb_202108_db/colabfold_envdb_202108_db

  1. Go to the uniref30_2202 directory and run
mmseqs tsv2exprofiledb uniref30_2202 uniref30_2202_db
mmseqs createindex uniref30_2202_db tmp --split 0

In the global configuration file the uniref30_mmseqs path needs to point to your_directory_with_databases/uniref30_2202/uniref30_2202

Species database

Create database for accession to species identifier mapping:

To use the standard Alphafold protocol for MSA pairing, GUIFold currently needs to create a database which maps accession to species identifiers. This database will be automatically created when the colabfold_local or colabfold_web protocols are used for the first time. If installation is done for a multi-user environment it is recommended to run this as part of the installation.

  1. Make sure the global configuraiton file is properly setup (esp. the uniprot database path needs to be defined)
  2. Open the GUI (see Usage)
  3. Paste any random sequence in FASTA format in the sequence input
  4. Click Read sequence button
  5. Select colabfold_local from the Feature pipeline dropdown menu
  6. Click Run button
  7. In the Log tab after some initialization, you should see the lines Creating database.... This step can take up to a few hours depending on hardware.

Setup of cluster submission template

The Jinja2 package is used to render the submission script template. See Jinja2 documentation for further information. The variables listed below can be used to create a template. See also examples below.

The template needs to be saved to GUIFold/guifold/templates/submit_script.j2. In the same folder you can find an example for a SLURM cluster.

After saving the template to the above location, re-install the package if it has been installed before:

(conda activate /path/to/conda_env)
cd GUIFold
python setup.py clean --all install clean --all

Automatically splitting a job into CPU and GPU parts

If the queueing system supports dependencies (i.e. a job waits in the queue until another job has finished), the "split job feature" can be activated in the GUI settings if needed. Since the feature generation step does not require GPU, this step can be run on CPU-only resources. Two jobs will be submitted, the first job will request CPU (use_gpu=False) and the second job (use_gpu=True) will wait for the first job to finish (if the dependency is configured). An example how to add a dependency for SLURM and how to create a conditional to request CPU or GPU resources is provided below. Alternatively, the job can be manually devided into CPU and GPU steps by choosing Only Features in the GUI and, after this job has finished, re-starting the job with Only Features deactivated.

Variables that can be accessed in the submission template

GUIFold supports the following variables that can be used in the submission template. The parameters are determined based on the input and settings (configuration file or Settings dialog in the app):

{{logfile}} (required) Path to the log file
{{account}} (optional) When a specific account is needed to run jobs on the cluster
{{use_gpu}} (optional) This can be used to build a conditional (example below) to select CPU or GPU nodes/queues
{{mem}} (optional) How much RAM (in GB) should be reserved. The RAM will be automatically increased with the GPU memory for unified memory.
{{num_cpu}} (optional) Number of CPUs to request {{num_cpu}} (optional) Number of GPUs to request (only relevant for FastFold) {{total_sequence_length}} (optional) Total sequence length (not accounting for identical sequences)
{{gpu_mem}} (optional) Useful when the queuing system supports selection of GPU by memory requirement. Value in GB.
{{split_mem}} (optional) If the required memory exceeds the available GPU memory, the job can be run with unified memory. The split_mem variable holds None or the memory split fraction and can be used for a conditional to set the FLAGS required to enable unified memory use (see SLURM example below).
{{add_dependency}} (required) When the job is started with "split job setting", this variable will be True for the second job (prediction step) and allows adding a dependency on the first job (feature step).
{{commnad}} (required) The command to run the AlphaFold job

To cancel jobs from the GUI, the script also needs to write the Job ID to the logfile. The pattern needs to be as follows:
echo "QUEUE_JOB_ID=$JOB_ID_VARIABLE_FROM_QUEUING_SYSTEM"
In case of SLURM it would be:
echo "QUEUE_JOB_ID=$SLURM_JOB_ID"

The number of CPUs should be set to 16 (at maximum 1 jackhmmer and 1 hhblits jobs are run in parallel, each set to use 8 CPUs in the respective alphafold.data.tools classes).

Example of a template for a SLURM cluster.

The cluster in the example below has two types of GPUs, V100 (32 GB) and A100 (80 GB). The variable gpu_mem can be used to build conditionals for choosing the appropriate GPU.

#!/bin/bash
#SBATCH --account={{account}}
#SBATCH --job-name=alphafold
#SBATCH --cpus-per-task={{num_cpu}}
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --output={{logfile}}
#SBATCH --error={{logfile}}
#Append to logfile
#SBATCH --open-mode=append
#SBATCH --mem={{mem}}G


{% if use_gpu %}
#SBATCH --gres=gpu:1
{% if add_dependency %}
#If "Split Job" is selected in the GUI, add_dependency will be True for the second job and create a dependency on the first job (CPU-only)
#SBATCH --dependency=afterok:{{queue_job_id}}
#SBATCH --kill-on-invalid-dep=yes
{% endif %}
{% if gpu_mem|int <= 31 %}
#Select appropriate GPUs by e.g. constraint, nodename or gpu_name
#SBATCH --constraint=
#SBATCH --partition=
{% elif gpu_mem|int > 31 %}
#Select GPUs with > 31 GB memory by e.g. constraint, nodename or gpu_name
#SBATCH --constraint=
#SBATCH --partition=
{% endif %}

{% else %}
#If job only needs CPU
{% if total_sequence_length|int > 2000 %}
#SBATCH --partition=
{% else %}
#SBATCH --partition=
{% endif %}
{% endif %}


{% if split_mem %}
#If job needs to run with unified memory
export TF_FORCE_UNIFIED_MEMORY=True
export XLA_PYTHON_CLIENT_MEM_FRACTION={{split_mem}}
{% endif %}

echo "QUEUE_JOB_ID=$SLURM_JOB_ID"

module load ... (or conda activate ...)
{{ command }}

Setup of a modulefile

Instead of activating the conda env you can also create an environment modulefile for production use.

Minimal example:

#%Module1.0

setenv       ALPHAFOLD_CONDA            /path/to/guifold/af-conda
prepend-path PATH                       $env(ALPHAFOLD_CONDA)/bin
prepend-path LD_LIBRARY_PATH            $env(ALPHAFOLD_CONDA)/lib
prepend-path LD_LIBRARY_PATH            $env(ALPHAFOLD_CONDA)/x86_64-conda-linux-gnu/sysroot/usr/lib64/
prepend-path PYTHONPATH                 $env(ALPHAFOLD_CONDA)/lib/python3.8/site-packages
prepend-path PYTHONPATH                 $env(ALPHAFOLD_CONDA)/lib/python3.8

Usage

When the conda env is activated (conda activate /path/to/af-conda) or added to PATH, LD_LIBRARY_PATH and PYTHONPATH (see Setup of a module file you can start GUIFold by typing:
afgui.py

To re-run an evaluation go to the job folder (where the FASTA sequence is stored) and type
afeval.py --fasta_path name_of_sequence.fasta

See Wiki for more detailed documentation.

Licenses

GUIFold is licensed under the Apache License, Version 2.0.

Icons are from the GTK framework, licensed under GPL.

The modified AlphaFold code retains its original license. See (https://github.com/deepmind/alphafold)

Third-party software and libraries may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.

Citations

Some features were inspired by other projects and implemented from scratch if not indicated in the code.

AlphaPulldown Colabfold Alphafold