Skip to content

johnhalloran321/datum

Repository files navigation

DATUM - DBN Analysis Toolkit to Understand Mass spectra
===================================================
DATUM is a toolkit consisting of several dynamic Bayesian networks (DBNs) 
and associated software tools to analyze tandem mass (MS/MS) spectra.  The toolkit
supports effective parameter learning, database search, and post-search feature 
extraction for Percolator analysis.  Written in Python and ported from the DRIP
Toolkit (DTK), DATUM adds sveral new modules for Didea, a DBN whose posterior-based
scoring function may be thought of as a probabilistic analogue to the popular XCorr 
scoring function found in SEQUEST/Comet/CRUX.  Training and searching with Didea is 
both faster and more accurate than using DRIP, while also offering training 
convergence guarantees not possible with DRIP (i.e., Didea learning converges to a 
global optimum while DRIP learning only converged to a local optima).  If you use
the Didea features offered by DATUM in your research, please cite:

John T. Halloran and David M. Rocke.
"Learning Concave Conditional Likelihood Models for Improved Analysis of 
Tandem Mass Spectra."  Advances in Neural Information Processing Systems (NIPS) 2018.

---------------------------------------------------
----------------- DATUM Installation
---------------------------------------------------
If you are only using the Didea modules (i.e., digest.py, dideaTrain.py, 
dideaSearch.py), no extra software aside from Python 2.7 is required and the modules
will work after downloading.  For information on using DRIP, see below.

---------------------------------------------------
----------------- Note
---------------------------------------------------
The Didea modules of DATUM are currently under rapid development and thus in their 
early stages.  DATUM will be updated shortly as new Didea features are added (i.e., 
support for high-res MS2 search, parameter learning with peptide modifications, 
feature extraction, documentation, etc.).

===================================================
==========  Info for DRIP Tools
===================================================
The DRIP Toolkit utilizes a dynamic Bayesian network (DBN) 
for Rapid Identification of Peptides (DRIP) in tandem mass spectra.  
Given an observed spectrum, DRIP scores a peptide by aligning the 
peptide's theoretical spectrum and the observed spectrum, i.e., 
computing the most probable sequence of insertions (spurious 
observed peaks) and deletions (missing theoretical peaks).  
DBN inference is efficiently performed utilizing the Graphical Models 
Toolkit (GMTK), which allows easy alterations to the model. If you use 
the DRIP toolkit in your research, please cite:

John T. Halloran, Jeff A. Bilmes, and William S. Noble.
"Dynamic bayesian network for accurate detection of peptides from tandem mass spectra." 
Journal of proteome research 15.8 (2016): 2749-2759.

Written by John T. Halloran ([email protected]).

---------------------------------------------------
----------------- Installation
---------------------------------------------------
The toolkit requires the following be installed:
Cygwin (if using Windows)
g++
the Graphical Models Toolkit (GMTK) - https://melodi.ee.washington.edu/gmtk/
Python 2.7
argparse and numpy python packages
SWIG

After installing the above, perform the following in the unzipped toolkit directory:
cd pyFiles/pfile
swig -c++ -python libpfile.i
CC=g++ python setup.py build_ext -i

Assuming no errors were output, the DRIP Toolkit 
is now ready for use!  To test that the above was
compiled and linked correctly, run:
./test.py

---------------------------------------------------
----------------- Searching ms2 files
---------------------------------------------------
For convenience, sample data is included in directory data and
example DRIP Toolkit search commands are provided in test.sh.

To search an MS2 dataset data/test.ms2 given FASTA file data/yeast.fasta, perform the following steps:
1.) Digest the FASTA file using dripDigest.py.  Example:
python dripDigest.py \
    --digest-dir dripDigest-output \
    --min-length 6 \
    --fasta data/yeast.fasta \
    --enzyme 'trypsin/p' \
    --monoisotopic-precursor true \
    --missed-cleavages 0 \
    --digestion 'full-digest'

The digested peptide database will be written to directory dripDigest-output

2.) Search using dripSearch.py.  Example:
python dripSearch.py \
    --digest-dir 'dripDigest-output' \
    --spectra data/test.ms2 \
    --precursor-window 3.0 \
    --learned-means dripLearned.means \
    --learned-covars dripLearned.covars \
    --num-threads 8 \
    --top-match 1 \
    --high-res-ms2 F \
    --output dripSearch-test-output

If the data was collected utilizing high-resolution fragment ions, set --high-res-ms2 T.  The output PSMs
will be written to file dripSearch-test-output.txt.

For a detailed explanation of using the toolkit, including training DRIP, 
preparing data for cluster usage, and speeding up a search using approximate 
inference, please consult:
https://jthalloran.bitbucket.io/dripToolkit/documentation.html

For a full list of allowable toolkit options, please consult:
https://jthalloran.bitbucket.io/dripToolkit/dripDigest.html
https://jthalloran.bitbucket.io/dripToolkit/dripSearch.html
https://jthalloran.bitbucket.io/dripToolkit/dripTrain.html

---------------------------------------------------
----------------- Contaact
---------------------------------------------------
Please send all questions and bug reports to:
[email protected]

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages