CANDIA is a GPU-powered, unsupervised multiway factor analysis framework that deconvolves multispectral scans into individual analyte spectra, chromatographic profiles, and sample abundances, using the PARAFAC (canonical polyadic decomposition) method. The deconvolved spectra can be annotated with traditional database search engines or used as high-quality input for de novo sequencing methods.
Parallel Factor Analysis Enables Quantification and Identification of Highly Convolved Data-Independent Acquired Protein Spectra
Filip Buric, Jan Zrimec, Aleksej Zelezniak. Patterns (Cell Press, 2020). DOI: https://doi.org/10.1016/j.patter.2020.100137
This repository holds the CANDIA scripts that produced the results in the submitted article
(in the revision tagged "submission").
Restructuring the pipeline and making it easy to use is a work in progress:
the repository will continue to be updated, but the "submission"
revision remains available as a snapshot of the source code
at the time of manuscript submission.
For the decomposition, an NVIDIA GPU is required. Models known to perform well: K80, GP100, and V100. Minimum GPU RAM: 8 GB. Recommended: >= 16 GB.
The preprocessing and downstream stages should perform acceptably with as few as 8 CPUs at 2.5 GHz and 16 GB RAM. Recommended: >= 16 CPUs at >= 3 GHz, and >= 32 GB RAM.
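To check that a suitable GPU is visible on your system (assuming the NVIDIA drivers are installed), you can run:

```bash
# Report the GPU model and total memory (expect >= 8 GB)
nvidia-smi --query-gpu=name,memory.total --format=csv
```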
The current CANDIA distribution is split into two components:
- the scripts that make up the pipeline (clone this repo to fetch them)
- a Singularity container which includes all dependencies for running CANDIA (Python libraries and other necessary third-party software)
We recommend using the Singularity container provided on Singularity Hub (built with Singularity version 3.4.2). The container should be placed inside the cloned CANDIA repo. To use the container:
- Install Singularity 3.x on your system
- Pull the container with the command:

  ```bash
  singularity pull candia.sif shub://fburic/candia:def
  ```
This will download the container `candia.sif`
(4.4 GB) into the current working directory. The main `candia`
script assumes `candia.sif` is located in the CANDIA repo.
Alternatively, the container may be built from the supplied `candia.def` file
(requires root permissions, see instructions here).
The bulk of the dependencies is managed through the conda packaging system.
The container simply encapsulates such an environment to ensure portability, but
a conda environment can also be built
from scratch on the user's system with the provided `candia_env.yaml`
specification file.
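For example, assuming conda is installed (the environment name is defined inside the YAML file):

```bash
# Build the CANDIA dependency environment from the provided specification
conda env create -f candia_env.yaml
# Activate it (replace 'candia' with the name set in candia_env.yaml)
conda activate candia
```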
Here, commands will be shown using the Singularity container.
CANDIA was developed and used on POSIX systems: Ubuntu 18.04 and CentOS 7.8 Linux, as well as macOS 10.13 (High Sierra), 10.14 (Mojave), 10.15 (Catalina). By using the Singularity container, CANDIA should be runnable on any OS that supports Singularity.
The Singularity container (and associated conda environment) currently includes:

- Crux 3.2 (package `crux-toolkit`, `bioconda` channel)
- TPP 5.0.0 (package `tpp`, `bioconda` channel)
- msproteomicstools 0.11.0 (`bioconda` channel)
The software below is either not distributed through package management archives, or the packaged versions did not work with this setup.
MS-GF+ comes in the form of a Java JAR package and only needs to be unzipped into
a directory (e.g. `/home/$USER/software`).
Installation instructions: https://github.com/MSGFPlus/msgfplus
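For example (the archive filename below is a placeholder; use the actual release file you downloaded):

```bash
mkdir -p "$HOME/software/MSGFPlus"
unzip MSGFPlus_vXXXXXXXX.zip -d "$HOME/software/MSGFPlus"
```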
The environment variable `MSGF_JAR_PATH`
needs to be set to inform the pipeline
of the `.jar` location. Add it to your `.bashrc`
or `.profile`, e.g.:

```bash
export MSGF_JAR_PATH="$HOME/software/MSGFPlus/MSGFPlus.jar"
```
To perform quantification with the generated CANDIA library, we provide a wrapper script
for the Linux version of DIA-NN. The user may install `diann-linux`
from https://github.com/vdemichev/DiaNN
(included in the "source code" package of each release).
CANDIA is known to work with DIA-NN 1.7.4.

The wrapper `scripts/quantification/diann.Snakefile`
is simply a convenience script that supplies all necessary parameters
and ensures resuming on error.
Once the library is created, the user may run DIA-NN independently of CANDIA or the Singularity workflow.
Note: The `tpp` version `5.0.0-0`
bioconda package that also includes Mayu may give missing-library errors.
The user can install a stand-alone version as described here.

Mayu comes in the form of Perl scripts, so it only needs to be unzipped to be used: http://proteomics.ethz.ch/muellelu/web/LukasReiter/Mayu/

Mayu was designed to be executed from its installation directory
(e.g. `/home/$USER/software/Mayu`), so the environment variable `MAYU_STANDALONE_PATH`
needs to be set to point at this location, allowing the pipeline to run it from anywhere.
Add it to your `.bashrc` or `.profile`, e.g.:

```bash
export MAYU_STANDALONE_PATH="$HOME/software/Mayu"
```
The pipeline currently consists of a collection of scripts for the different stages of processing.
The shell script `candia`
is provided to run all pipeline steps from preprocessing
up to and including quantification library creation.
However, this script is still a work in progress and is currently mainly used to test that the pipeline
can execute on the user's system.
This can be done by running the following from CANDIA's top-level directory:

```bash
./candia test/test_experiment/config/candia.yaml
```

The general syntax of this script is `candia EXPERIMENT_CONFIG_YAML`.
Note: The test experiment will only run properly up to and including the decomposition step. As this is the most complex part of the pipeline, a successful test run should guarantee that your installation works for real data. A more complete test suite (more realistic data and unit tests) is planned.
Currently, it is recommended to run each stage one at a time, using the corresponding scripts. Step-by-step instructions are listed here using the toy data files provided in this repo. These data files are primarily meant to test that the pipeline is working properly.
Expected running times are given as very rough estimates, measured on 9 real DIA scan files, to indicate the time scale of each step. These times will of course vary depending on the available resources. Actual running times for the test experiment should be exceedingly short.
We recommend running the pipeline on a workstation with multiple cores or
in a high-performance computing (HPC) environment. Currently, the HPC manager `slurm`
is supported. A conda environment is also assumed to be present on the cluster.
A Singularity recipe to encapsulate the environment is provided.
An NVIDIA GPU card is required for the decomposition. Partial support for CPU-only execution is implemented, but the running times become infeasible.
The execution of the pipeline is configured through a YAML file
(e.g. `test/test_experiment/config/candia.yaml`).
This configuration file specifies the location of input and intermediate files,
as well as algorithm parameters. Paths are interpreted as relative to the
experiment directory `root_dir`.
It is recommended to create separate configuration files for each parameter variation to be included in final results, rather than changing a single configuration file. Different configuration files may also be created for each stage of the pipeline. The important thing is to supply the relevant parameters to each script.
Please read through the next steps to see which parameters are relevant at each step. A full configuration file is provided in the test experiment. More information may be found in the CANDIA article Supplemental Information.
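For illustration, an abridged configuration sketch might look as follows (paths and values are placeholders; see the test experiment's `candia.yaml` for the full set of keys):

```yaml
root_dir: test/test_experiment      # experiment root; other paths are relative to it
samples_mzml: samples/mzml          # input mzML scan files
samples_csv: samples/csv            # converted CSV scan files
swath_windows: swath_windows.tsv    # precursor isolation window list
min_scan_intensity: 300             # intensities below this are dropped
window_size_sec: 60                 # RT window width for slicing
parafac_min_comp: 2                 # smallest number of components F
parafac_max_comp: 10                # largest number of components F
```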
The scripts are expected to be run from inside the CANDIA repo
(thus, first `cd candia` after cloning).
Expected running time: 3-5 minutes per file
The basename of the input mzML files will be kept downstream, with only the extension changed.
Relevant pipeline config values:
- `samples_mzml` - location of input mzML files
- `samples_csv` - location of output CSV files
- `swath_windows` - file listing precursor isolation windows
- `min_scan_intensity` - all values below this are dropped
Note: To set Snakemake to use `N` cores, pass it the `--jobs N` argument.
Otherwise, it should automatically use all available cores.
Commands:
```bash
configfile='test/test_experiment/config/candia.yaml'

singularity exec candia.sif \
    snakemake -s scripts/util/mzml2csv.Snakefile --configfile ${configfile}
```
Overlapping isolation windows are not supported, so the window bounds are adjusted.
Expected running time: 2 min per file
Relevant pipeline config values:
- `samples_csv` - location of CSV scan files
- `samples_adjusted_swaths` - location of output adjusted CSV files
- `swath_windows_adjusted` - file listing adjusted precursor isolation windows
Command:
```bash
singularity exec candia.sif \
    snakemake --forceall -p -s scripts/util/adjust_swaths.Snakefile --configfile ${configfile}
```
Save the adjusted intervals for downstream tasks:
```bash
EXP_DIR=$(grep "root_dir:" ${configfile} | cut -d " " -f 2 | tr -d \")

cp $(find ${EXP_DIR} -name "*.intervals" | head -n 1) \
    ${EXP_DIR}/$(grep "swath_windows_adjusted:" ${configfile} | cut -d " " -f 2 | tr -d \")
```
The input scans are partitioned into swaths and RT windows of width `window_size_sec`.
Note that this will create many small CSV files in `slices_location/swath=value/rt_window=value`
subdirectories.
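For example, the slice directory might be laid out as follows (hypothetical swath and window values):

```
slices/
    swath=400.5/
        rt_window=0/
            sample_A.csv
            sample_B.csv
        rt_window=1/
    swath=425.5/
```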
Expected running time: 30 min
Relevant pipeline config values:
- `samples_adjusted_swaths` - location of the adjusted CSV files (input to this step)
- `window_size_sec` - the width of the RT windows
- `slices_location` - location of output slices
```bash
singularity exec candia.sif \
    python scripts/util/split_csv_maps_to_slices.py --config ${configfile}
```
The slices are converted into NumPy (swath, rt, sample) tensors stored as `.npy`
files.
The Snakefile will start the conversions for each slice in parallel, and each such job may
be submitted independently as an HPC cluster job.
Expected running time: 10 - 40 min (depending on available resources)
Relevant pipeline config values:
- `mass_tol_ppm` - the mass tolerance (in ppm) of the scan acquisition
The m/z precision is 10 decimals, and m/z partitions with fewer than 5 time points in any sample are filtered out.
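As a rough illustration of this filtering rule (not the pipeline's actual code; assumes a long-format table with `mz`, `rt`, `sample`, and `intensity` columns):

```python
import pandas as pd

# Hypothetical long-format slice: one row per (m/z, RT, sample) intensity reading
slice_df = pd.read_csv("slice.csv")

# Round m/z to a fixed precision so readings align across samples and time points
slice_df["mz"] = slice_df["mz"].round(10)

# Keep only m/z partitions observed at >= 5 time points in every sample
counts = slice_df.groupby(["mz", "sample"])["rt"].nunique().unstack(fill_value=0)
valid_mz = counts.index[(counts >= 5).all(axis=1)]
slice_df = slice_df[slice_df["mz"].isin(valid_mz)]
```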
```bash
singularity exec candia.sif \
    snakemake --jobs 4 -s scripts/util/generate_slice_tensors.Snakefile \
    -R generate_slice_tensor --forceall --rerun-incomplete --configfile ${configfile}
```
On a slurm-managed cluster, this would be:
```bash
NJOBS=200
ACCOUNT_NUMBER=ABC123

singularity exec candia.sif \
    snakemake --jobs ${NJOBS} -s scripts/util/generate_slice_tensors.Snakefile \
    -R generate_slice_tensor --forceall --rerun-incomplete --configfile ${configfile} \
    --cluster "sbatch -A ${ACCOUNT_NUMBER} -t 06:00:00 --ntasks 6"
```
A decomposition is performed on each slice tensor, for all numbers of components `F`
in the configured range. Multiple tensors are decomposed in parallel on each available GPU card.
If multiple GPUs are available, the set of input tensors is partitioned evenly between the cards.

Note that while the scripts below require a GPU, the decomposition script itself
(`scripts/parafac/decompose_parafac.py`)
may be run on CPUs if executed without the `--use_gpu` flag.
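For reference, the core operation on each tensor is a standard PARAFAC (CP) decomposition. A toy sketch using the TensorLy library (illustrative only, not the pipeline's decomposition code; the filename is hypothetical, and the mode order follows the tensor construction above):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Load one slice tensor produced by the tensor-generation step
tensor = tl.tensor(np.load("slice_tensor.npy"))

# Decompose into F = 5 rank-one components
weights, factors = parafac(tensor, rank=5, init="random")

# One factor matrix per tensor mode:
# mass spectra, elution (time) profiles, and sample abundances
spectra, time_modes, sample_modes = factors
```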
Relevant pipeline config values:
- `parafac_*` - decomposition parameters
- `parafac_min_comp` - lower bound of the number of components `F` to decompose for
- `parafac_max_comp` - upper bound of the number of components `F` to decompose for
Expected running time: 6 - 12 hours (depending on the GPU model)
The workstation command has the syntax:

```bash
scripts/parafac/decompose_workstation.sh EXPERIMENT_CONFIG_FILE N_PARALLEL_DECOMP_PER_GPU [SNAKEMAKE_OPTIONS]
```

Example running with Singularity (with 2 parallel decompositions per GPU):

```bash
scripts/parafac/decompose_workstation.sh ${configfile} 2
```
The cluster command has the syntax:

```bash
scripts/parafac/decompose_cluster.sh ${snake_configfile} N_PARALLEL_DECOMP_PER_GPU
```

It is designed to be run as an array job on slurm. Example (with 6 parallel decompositions per GPU):

```bash
sbatch --array=0-1 --gres=gpu:1 --time 24:00:00 \
    scripts/parafac/decompose_cluster.sh "${configfile}" 6
```
If Singularity is not available, you can use the
`decomopse_cluster_no_singularity.sh` script.
All PARAFAC models and components are indexed with unique IDs, to support downstream tasks. These IDs, along with model filenames, are saved in two database ("index") files, in Apache Feather format.
Expected running time: a few sec
Relevant pipeline config values:
- `model_index` - model ID index filename (Apache Feather format)
- `spectrum_index` - PARAFAC component ID index filename (Apache Feather format)
```bash
singularity exec candia.sif \
    python scripts/parafac/models.py -c ${configfile}
```
Measure the unimodality of all PARAFAC components (for all models), then create a list of the best PARAFAC models, according to the unimodality criterion.
Expected running time: 10 sec
Relevant pipeline config values:
- `avg_peak_fwhm_sec` - the expected peak FWHM, to inform the peak-finding procedure
- `window_size_sec` - width of the RT windows (in seconds)
- `time_modes_values` - tabular file in Feather format that collects peak counts for all time modes (in all components), for all models generated by CANDIA across all slices
- `best_models` - CSV file where the best model parameter values are stored
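As a minimal sketch of the unimodality criterion (illustrative, not the pipeline's actual implementation; it uses SciPy peak finding, with the expected FWHM converted to a number of time points):

```python
import numpy as np
from scipy.signal import find_peaks

def is_unimodal(time_mode: np.ndarray, fwhm_points: float) -> bool:
    """True if a component's time mode contains exactly one peak.

    fwhm_points: the expected peak FWHM (avg_peak_fwhm_sec) expressed
    as a number of time points.
    """
    peaks, _ = find_peaks(time_mode, width=fwhm_points / 2)
    return len(peaks) == 1
```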
First, collect the time modes and measure unimodality fractions:

```bash
singularity exec candia.sif \
    python scripts/parafac/collect_time_mode_values.py -c ${configfile}
```

Then, select models:

```bash
singularity exec candia.sif \
    Rscript scripts/parafac/select_best_models.R -c ${configfile}
```
Should it be relevant for downstream analyses, the sample mode values (relative abundances) of each of the decomposed spectra can be collected with the script below.
This will write the following files to the experiment directory:

- `sample_modes_best_models.feather`: the sample modes of the best models, as a tabular file
- A CSV file, specified in the pipeline config file as `spectra_with_sample_abundance_file`, containing the spectrum ID - sample abundance correspondence. The spectrum ID is the same as the `<scan>` attribute ID in the CANDIA output mzXML file, so it can be used to match against output from downstream programs.
```bash
singularity exec candia.sif \
    python scripts/quantification/collect_sample_modes.py --config ${configfile}
```
Developer note: The `feather` format (a table) can be read with the Python `pyarrow` library or the R `arrow` library.
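For example, to inspect one of these tables with pandas (the filename is whatever your config specifies):

```python
import pandas as pd

# Read a Feather-format index table (pandas uses pyarrow under the hood)
model_index = pd.read_feather("model_index.feather")
print(model_index.head())
```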
Configure which tool to use by setting the `analysis_pipeline`
variable to either `"crux"` or `"msgf+"`
in the configuration file.
For example, in the provided test experiment, this is set to `"crux"`.
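In the experiment configuration YAML this is a single entry:

```yaml
analysis_pipeline: "crux"   # or "msgf+"
```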
A separate decoy sequence database must be provided in addition to the targets.
Expected running time: 30 min with Crux, 1 h with MS-GF+ (depends on the set number of modifications)
Relevant pipeline config values:
- `best_models_mzxml` - mzXML file where the spectra from the best models are concatenated
- `best_models_crux_out_dir` - directory for Crux results (Crux outputs multiple files)
- `database` - FASTA protein sequence database
- `decoy_database` - FASTA decoy sequence database
- `decoy_prefix` - decoy sequence ID prefix, which should include the separator (e.g. `decoy_`)
- `msgf_modifications` - MS-GF+ modifications configuration file, if any
- `msgf_threads` - number of MS-GF+ computation threads to use
Crux version 3.2 is included in the Singularity container (and associated conda environment):

```bash
singularity exec candia.sif \
    python scripts/identification/id_models_concat.py -c ${configfile}
```
If running CANDIA with MS-GF+, the container may be executed with the environment variable specifying the location of the MS-GF+ JAR supplied on the command line:

```bash
MSGF_JAR_PATH="$HOME/software/MSGFPlus/MSGFPlus.jar" singularity exec candia.sif \
    python scripts/identification/id_models_concat.py -c ${configfile}
```
Expected running time: 2-5 min
The Schubert et al. (2015) protocol has been implemented and can be run as shown below. For a more detailed description, please see the Supplemental Information of the CANDIA paper; the relevant pipeline config values are also described there.
A mixed target-decoy database is required. This is specified through the `mixed_database`
parameter.
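If separate target and decoy FASTA files are already at hand, one simple way to produce such a mixed database is plain concatenation (filenames are placeholders):

```bash
cat targets.fasta decoys.fasta > mixed_target_decoy.fasta
```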
```bash
singularity exec candia.sif \
    snakemake -s scripts/quantification/build_library.Snakefile --configfile ${configfile}
```
Note: This protocol will fail for the supplied toy test data.
To quantify proteins using the generated CANDIA library, a wrapper script is provided for running DIA-NN.
The path to the `diann-linux`
binary should be supplied to the Singularity container (see the command below).
Relevant pipeline config values:
- `samples_mzml` - the DIA scan files
- `quant_library` - the library file created by CANDIA
- `database` - the target protein sequence FASTA database
- `diann_out_dir` - directory for DIA-NN output
- `diann_report` - filename of the TSV report output by DIA-NN
Expected running time: 2-5 min per scan file
```bash
SINGULARITYENV_PREPEND_PATH=$HOME/software/diann singularity exec candia.sif \
    snakemake -p -s scripts/quantification/diann.Snakefile --configfile ${configfile}
```
Configure which tool to use through the configuration file.

Expected running time: 25 min with Novor, 10 min with DeepNovo (depends on GPU parameters)

TODO: expand, clarify

```bash
snakemake -s scripts/denovo/sequence_best_models.Snakefile --configfile ${configfile}
```