When an assay library is not available, choose Diamond's library-free mode (the blue dashed box in the figure above), in which an assay library is generated first. When an assay library is available, choose Diamond's library-based mode (the green dotted box in the figure above), in which the library-building step is skipped.
First, please execute the following command in your terminal (PowerShell, if your machine runs Windows) to clone the Diamond repository from my GitHub to your own machine.
git clone https://github.com/xmuyulab/Diamond.git
Then, download the example MS data. Provided here are three profile-mode mzXML files from the SWATH-MS Gold Standard (SGS) yeast dataset, which are available from the PeptideAtlas raw data repository under accession number PASS00289, and the corresponding three centroid-mode mzXML files, which can be obtained by preprocessing the profile data with ProteoWizard.
(1) Three profile data files: please visit PASS00289, click on the link "ftp://PASS00289:[email protected]/" at the bottom of the page, select the three files `napedro_L120228_00{1,2,3}_SW.mzXML.gz` under the `/SGS/mzxml` folder, then download and store them in the `/Diamond/data/profile` folder (a scripted alternative is sketched after step (3)). Note that the profile files are in a compressed format, so execute the following commands to decompress them.
cd /path/to/Diamond/data/profile
gunzip ./napedro_L120228_00{1,2,3}_SW.mzXML.gz
(2) Three centroid data files: please visit cMS01, cMS02, and cMS03 respectively, then download and store the files in the `/Diamond/data/centroid` folder. Note that the centroid files are in a compressed format, so execute the following commands to decompress them.
cd /path/to/Diamond/data/centroid
gunzip ./napedro_L120228_00{1,2,3}_SW.mzXML.gz
(3) The library file, iRT file, windows file, and database file have already been stored in the `/Diamond/data/` folder. Note that the library file and the iRT file are in a compressed format, so execute the following commands to decompress them.
cd /path/to/Diamond/data
gunzip ./library.TraML.gz
gunzip ./irt.TraML.gz
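If you prefer to script the profile download from step (1), something like the following should work, assuming the FTP path quoted above (`/SGS/mzxml` on `ftp.peptideatlas.org`) is still current:

```bash
cd /path/to/Diamond/data/profile
# Fetch the three compressed profile files over FTP; the credentials are
# part of the public PASS00289 download link quoted above.
wget ftp://PASS00289:[email protected]/SGS/mzxml/napedro_L120228_00{1,2,3}_SW.mzXML.gz
gunzip ./napedro_L120228_00{1,2,3}_SW.mzXML.gz
```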
After all the data is ready, an example tree structure of the `/Diamond/data` folder is as follows:
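The sketch below is reconstructed from the files described in the steps above; no files beyond those are assumed:

```
data
├── centroid
│   ├── napedro_L120228_001_SW.mzXML
│   ├── napedro_L120228_002_SW.mzXML
│   └── napedro_L120228_003_SW.mzXML
├── profile
│   ├── napedro_L120228_001_SW.mzXML
│   ├── napedro_L120228_002_SW.mzXML
│   └── napedro_L120228_003_SW.mzXML
├── irt.TraML
├── library.TraML
├── sgs_yeast_decoy.fasta
└── win.tsv.32
```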
Diamond is containerized by Docker into an image. The installation of Docker is described in the Docker documentation (for both Linux and Windows). On your machine, please start a terminal (PowerShell) session and then execute the following command within the console:
docker pull zeroli/diamond:1.0
This will take a few minutes to pull the Diamond image from Docker Hub to your machine. You can check whether the image `zeroli/diamond:1.0` was pulled successfully by executing `docker images`; if it was, it will appear in the image list.
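If the pull succeeded, the output of `docker images` should contain a line similar to the following (the image ID, creation date, and size will vary):

```
REPOSITORY       TAG   IMAGE ID     CREATED     SIZE
zeroli/diamond   1.0   <image id>   <created>   <size>
```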
Create a container (named `diamond_test`) based on the image `zeroli/diamond:1.0` and simultaneously mount the local folder `/path/to/Diamond` to the folder `/mnt/Diamond` in the container by running the following command in your terminal:
docker run -it --name diamond_test -v /path/to/Diamond/:/mnt/Diamond/ zeroli/diamond:1.0 bash
After the above command is executed, you will enter the container. Please switch to the folder `/mnt/Diamond` by executing `cd /mnt/Diamond` in your terminal.
Note: Type `exit` and press Enter, or hit Ctrl+D, to exit the container. To re-enter the container after exiting, please follow the commands below:
docker start diamond_test
docker exec -it diamond_test bash
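You can check the container's status at any time; for example:

```bash
# List the container whether it is running or stopped; the STATUS column
# shows whether diamond_test needs to be started again.
docker ps -a --filter "name=diamond_test"
```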
The Nextflow script is saved as the `pipeline.nf` file in the `Diamond` folder. The execution commands for Diamond's two modes, library-free and library-based, are as follows.
Execute the following command in your terminal to start the library-free analysis of the MS data, in which an assay library is built first:
nextflow run /mnt/Diamond/pipeline.nf --workdir "/mnt/Diamond" --centroid "/mnt/Diamond/data/centroid/*.mzXML" --profile "/mnt/Diamond/data/profile/*.mzXML" --fasta "/mnt/Diamond/data/sgs_yeast_decoy.fasta" --windows "/mnt/Diamond/data/win.tsv.32" --windowsNumber "32"
Note: This step will take about two hours. The intermediate results of MS data processing will be stored in the folder named `/mnt/Diamond/results` by default. The final peptide identification results are saved in the file named `aligned.tsv`. Please refer to the Help Message section or execute `nextflow run /mnt/Diamond/pipeline.nf --help` in the container to view detailed information on parameter passing.
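When the run finishes, a quick sanity check of the output might look like this (a sketch; it assumes `aligned.tsv` has a single header row):

```bash
# Inside the container: list the intermediate results.
ls /mnt/Diamond/results
# Peek at the column headers of the final identification table.
head -n 1 /mnt/Diamond/results/aligned.tsv
# Count the identified rows (excluding the header).
tail -n +2 /mnt/Diamond/results/aligned.tsv | wc -l
```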
Execute the following command in your terminal to start the library-based analysis of the MS data, providing a ready-made assay library:
nextflow run /mnt/Diamond/pipeline.nf --skipLibGeneration --workdir "/mnt/Diamond" --profile "/mnt/Diamond/data/profile/*.mzXML" --lib "/mnt/Diamond/data/library.TraML" --irt "/mnt/Diamond/data/irt.TraML" --windows "/mnt/Diamond/data/win.tsv.32" --outdir "/mnt/Diamond/results_library_based"
Note: This step will take about ten minutes. The `--skipLibGeneration` flag means the assay-library-building step will be skipped. The `--outdir` parameter specifies the storage location of the intermediate data processing results (Default: `/mnt/Diamond/results`). The final peptide identification results are saved in the outdir folder in the file named `aligned.tsv`. Please refer to the Help Message section or execute `nextflow run /mnt/Diamond/pipeline.nf --help` in the container to view detailed information on parameter passing.
The two modes of Diamond use two different execution commands, shown below. This help message can also be obtained by executing the following command in the container:
nextflow run /mnt/Diamond/pipeline.nf --help
nextflow run /mnt/Diamond/pipeline.nf --workdir "" --centroid "" --profile "" --fasta "" --windows "" --windowsNumber "" <Options_library_free> <Functions>
nextflow run /mnt/Diamond/pipeline.nf --skipLibGeneration --workdir "" --profile "" --lib "" --irt "" --windows "" <Options_library_based> <Functions>
parameters | descriptions |
---|---|
--workdir | Specify the location of the Diamond folder. For example: --workdir "/path/to/Diamond" (Do not include a slash at the end!) |
--centroid | Pass the centroided data. For example: --centroid "/path/to/Diamond/data/centroid/*.mzXML" |
--profile | Pass the profile data. For example: --profile "/path/to/Diamond/data/profile/*.mzXML" |
--fasta | Pass the database file. For example: --fasta "/path/to/Diamond/data/sgs_yeast_decoy.fasta" |
--windows | Pass the windows file. For example: --windows "/path/to/Diamond/data/win.tsv.32" |
--windowsNumber | Pass the number of windows, used to select a suitable parameter file for DIA-Umpire. For example: --windowsNumber "32" |
--irt | Pass a transition file containing RT normalization coordinates. For example: --irt "/path/to/Diamond/data/irt.TraML" |
--lib | Pass a ready-made assay library. For example: --lib "/path/to/Diamond/data/library.TraML" |
--skipLibGeneration | Skip the assay-library-building step and choose Diamond's library-based mode. No value needs to be given. |
parameters | descriptions |
---|---|
--outdir | Specify a results folder. For example: --outdir "/path/to/Diamond/outputs" (Do not include a slash at the end! Default: the folder named results under the workdir) |
--diau_paraNumber | Specify the maximum number of parallel processes for DIA-Umpire (Default: "4"). |
--mgf_mzML_paraNumber | Specify the maximum number of parallel processes for file format conversion (Default: "4"). |
--mzML_part_paraNumber | Specify the maximum number of parallel processes for dividing mzML files (Default: "4"). |
--comet_paraNumber | Specify the maximum number of parallel processes for Comet searching (Default: "4"). |
--tandem_paraNumber | Specify the maximum number of parallel processes for X!Tandem searching (Default: "20"). |
--merge_paraNumber | Specify the maximum number of parallel processes for merging search results (Default: "9"). |
--xinteract_paraNumber | Specify the maximum number of parallel processes for xinteract (Default: "30"). |
--min_decoy_fraction | Specify the minimum fraction of decoy vs. target peptides and proteins for OpenSwathDecoyGenerator (Default: "0.8"). |
--openSWATH_paraNumber | Specify the maximum number of parallel processes for OpenSWATH (Default: "4"). |
--min_rsq | Specify the minimum r-squared of the RT peptides regression for OpenSwathWorkflow (Default: "0.95"). |
--pp_paraNumber | Specify the maximum number of parallel processes for PyProphet (Default: "9"). |
--fdr | The threshold for FDR control (Default: "0.01"). |
--pp_score_statistics_mode | The statistics-mode option of PyProphet (Default: "global"). You can change it to "local" or "local-global". |
--pp_score_lambda | The lambda value for Storey's method (Default: "0.4"). |
Note: We processed the MS data on a machine with a 64-core CPU and 256 GB of memory. The higher the degree of parallelism, the more memory and CPU resources are consumed. If memory is insufficient, you can reduce the parallel process counts accordingly.
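For example, on a machine with less memory, the heaviest parallel steps of a library-free run could be throttled like this (the values are illustrative, not tuned recommendations):

```bash
nextflow run /mnt/Diamond/pipeline.nf --workdir "/mnt/Diamond" \
  --centroid "/mnt/Diamond/data/centroid/*.mzXML" \
  --profile "/mnt/Diamond/data/profile/*.mzXML" \
  --fasta "/mnt/Diamond/data/sgs_yeast_decoy.fasta" \
  --windows "/mnt/Diamond/data/win.tsv.32" --windowsNumber "32" \
  --diau_paraNumber "2" --comet_paraNumber "2" \
  --tandem_paraNumber "8" --xinteract_paraNumber "8"
```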
parameters | descriptions |
---|---|
--outdir | Specify a results folder. For example: --outdir "/path/to/Diamond/outputs" (Do not include a slash at the end! Default: the folder named results under the workdir) |
--openSWATH_paraNumber | Specify the maximum number of parallel processes for OpenSWATH (Default: "4"). |
--min_rsq | Specify the minimum r-squared of the RT peptides regression for OpenSwathWorkflow (Default: "0.95"). |
--pp_paraNumber | Specify the maximum number of parallel processes for PyProphet (Default: "9"). |
--fdr | The threshold for FDR control (Default: "0.01"). |
--pp_score_statistics_mode | The statistics-mode option of PyProphet (Default: "global"). You can change it to "local" or "local-global". |
--pp_score_lambda | The lambda value for Storey's method (Default: "0.4"). |
These options are built-in functions of Nextflow. They generate visual reports that show either the total time consumption of the pipeline or the time consumption, memory occupation, and CPU usage of each process. Interested users can add these options to observe that information.
parameters | descriptions |
---|---|
-with-timeline | Renders a timeline.html file that records the execution time and memory consumption of each process. |
-with-report | Generates a report.html file that records the single-core CPU usage, execution time, memory occupation, and disk read/write information of each process. |
-with-trace | Creates an execution trace file that contains useful information about each process executed in your pipeline script, including submission time, start time, completion time, and the CPU and memory used. |
-with-dag | Outputs the pipeline execution DAG. It creates a file named dag.dot containing a textual representation of the pipeline execution graph in the DOT format. |
-resume | Only the processes that have actually changed will be re-executed; the execution of unchanged processes is skipped and the cached results are used instead. The pipeline can also be restarted with this option after a network or server disconnection. |
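For example, to record a report, timeline, and trace for a library-based run, and to resume it after an interruption, these built-in options can simply be appended to the command:

```bash
nextflow run /mnt/Diamond/pipeline.nf --skipLibGeneration --workdir "/mnt/Diamond" \
  --profile "/mnt/Diamond/data/profile/*.mzXML" \
  --lib "/mnt/Diamond/data/library.TraML" --irt "/mnt/Diamond/data/irt.TraML" \
  --windows "/mnt/Diamond/data/win.tsv.32" \
  -with-report -with-timeline -with-trace -resume
```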
Please cite this article.