Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detection
Ehsaneddin Asgari, Philipp C Münch, Till R Lesker, Alice C McHardy, Mohammad R K Mofrad; DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection, Bioinformatics, bty954, https://doi.org/10.1093/bioinformatics/bty954
Developer: Ehsaneddin Asgari (asgari [at] berkeley [dot] edu)
Please feel free to report any technical issue by sending an email or reporting an issue here.
Project page: http://llp.berkeley.edu/ditaxa
PIs: Prof. Alice McHardy* and Prof. Mohammad Mofrad*
Summary: Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard OTU clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis, and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer-based state-of-the-art approach in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers, in both resolution and coverage, evaluated over known links from the literature and synthetic benchmark datasets.
Please cite the Bioinformatics paper
@article{10.1093/bioinformatics/bty954,
author = {Asgari, Ehsaneddin and Münch, Philipp C and Lesker, Till R and McHardy, Alice C and Mofrad, Mohammad R K},
title = "{DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection}",
year = {2018},
month = {11},
doi = {10.1093/bioinformatics/bty954},
url = {https://dx.doi.org/10.1093/bioinformatics/bty954},
eprint = {http://oup.prod.sis.lan/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/bty954/27452903/bty954.pdf},
}
For detailed installation instructions using a conda virtual environment, and for testing the working example, please refer to the installation guideline.
An example periodontal disease dataset (Jorth et al, 2015) is provided in the repo. To see how DiTaxa runs, you may run the following command after installation.
python3 ditaxa.py --indir dataset/periodontal/ \
  --fast2label dataset/periodontal/mapping.txt \
  --ext fastq \
  --outdir results_dental/ \
  --dbname periodontal \
  --cores 20 \
  --phenomap diseased:1,healthy:0 \
  --heatmap PeriodontalSamples:HealthySamples \
  --phenoname DvsH \
  --override 1
Optionally, append: --blastn BLASTN_PATH
Alternatively you can run:
bash ./run_test.sh
The "indir", e.g. «dataset/periodontal/», contains one fastq file for each 16S rRNA sample.
The "fast2label", e.g. «dataset/periodontal/mapping.txt», is a file that maps fastq files to their labels in a tabular format:
d1.fastq diseased
d2.fastq diseased
d3.fastq diseased
d4.fastq diseased
d5.fastq diseased
d6.fastq diseased
d7.fastq diseased
d8.fastq diseased
d9.fastq diseased
d10.fastq diseased
h1.fastq healthy
h2.fastq healthy
h3.fastq healthy
h4.fastq healthy
h5.fastq healthy
h6.fastq healthy
h7.fastq healthy
h8.fastq healthy
h9.fastq healthy
h10.fastq healthy
The "phenomap", e.g. «diseased:1,healthy:0», determines which labels are treated as the positive class (1) and which as the negative class (0). It is given as a string with no spaces, in the following format:
diseased:1,healthy:0
The "override": 1 overrides files that already exist in the output directory.
The "heatmap", e.g. «PeriodontalSamples:HealthySamples», determines the names used for the positive and negative phenotypes on the heatmap.
The "blastn" (optional): you only need to specify this if you did not run build.sh. It is the path to the "bin" directory of a BLAST installation on your system. In this case, you may get the latest version of BLAST for your operating system from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
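To make the "fast2label" and "phenomap" formats described above concrete, here is a minimal Python sketch that reads both and derives a binary phenotype per file. It is an illustration only, not DiTaxa's own parser: the function names are hypothetical, and it assumes the mapping file is whitespace-separated with one file name and one label per row. How DiTaxa treats labels that are missing from the phenomap is not stated here; the sketch simply leaves such files out.

```python
def parse_phenomap(phenomap):
    """Turn 'diseased:1,healthy:0' into {'diseased': 1, 'healthy': 0}."""
    mapping = {}
    for item in phenomap.split(","):
        label, value = item.split(":")
        if value not in ("0", "1"):
            raise ValueError(f"phenotype value must be 0 or 1, got {value!r}")
        mapping[label] = int(value)
    return mapping


def read_fast2label(rows):
    """Read 'file_name<whitespace>label' rows into {file_name: label}."""
    labels = {}
    for row in rows:
        if not row.strip():
            continue  # skip blank lines
        file_name, label = row.split()
        labels[file_name] = label
    return labels


def binary_phenotypes(file_labels, phenomap):
    """Map each file to 1 or 0; files whose label is not in phenomap are left out."""
    return {
        file_name: phenomap[label]
        for file_name, label in file_labels.items()
        if label in phenomap
    }


# Inline rows stand in for the content of dataset/periodontal/mapping.txt.
rows = ["d1.fastq\tdiseased", "d2.fastq\tdiseased", "h1.fastq\thealthy"]
file_labels = read_fast2label(rows)
phenomap = parse_phenomap("diseased:1,healthy:0")
print(binary_phenotypes(file_labels, phenomap))
# → {'d1.fastq': 1, 'd2.fastq': 1, 'h1.fastq': 0}
```

To check a real mapping file, the rows can be read with `open(path).read().splitlines()` and passed to `read_fast2label` in the same way.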
After running this command, the output files are generated in 'results_dental' as described below. Example output files are provided in the './output_example/' directory.
DiTaxa provides a taxonomic tree of the significant discriminative biomarkers, in which the taxa identified for the positive and negative classes are colored according to their phenotype (red for the positive class and blue for the negative class). The DiTaxa implementation for taxonomic tree generation uses a Phylophlan-based backend. In addition to PDF files, DiTaxa provides the raw graphlan outputs to facilitate further annotation. STDP here denotes the state-of-the-art OTU-based approach that DiTaxa is compared against.
DiTaxa also generates a heatmap of the occurrences of the top biomarkers in the samples, where rows denote markers and columns denote samples. Such a heatmap gives biologists a detailed overview of marker occurrences across samples. It shows the number of distinctive sequences hit by each biomarker in each sample; stars in the heatmap denote hits to unique sequences, which cannot be analyzed by OTU clustering approaches.
In addition, DiTaxa provides a detailed Excel file of biomarker sequences with their taxonomy annotations and p-values. t-SNE visualizations of the data, using all NPEs and using the selected markers, are also generated by default.
After installing DiTaxa by following the installation guideline, you can run it with the following parameters:
python3 ditaxa.py --indir address_of_samples \
  --ext extension_of_the_files \
  --outdir output_directory \
  --dbname database_name \
  --cores 20 \
  --fast2label mapping_file_from_name_to_phenotype \
  --phenomap mapping_labels_to_binary_1_or_0_phenotype \
  --blastn /mounts/data/proj/asgari/dissertation/deepbio/taxonomy/ncbi-blast-2.5.0+/bin/
Using the above mentioned command all the steps will be done sequentially and output will be organized in subdirectories.
--indir: The input directory containing all fasta or fastq files. (e.g.: dataset/periodontal/)
--ext: Sequence file extensions (fasta or fastq) (e.g.: fastq)
--outdir: The output directory (e.g.: /mounts/data/ditaxa/results/test_dental_out/)
--cores: Number of cores (e.g.: 40)
--fast2label: tabular mapping file between file names and the labels
--phenomap: mapping from label to binary phenotypes
--phenoname: name of the phenotype mapping. If it is not given, the labels and their values are used for identification: label1@1#label2@1...#label3@0. Please note that a single project may have several phenotype mapping schemes (e.g., untreated diseased versus all, or untreated versus healthy).
--override: 1 to override the existing files, 0 to only generate the missing files
--heatmap: generates occurrence heatmap for the top 100 markers (e.g: positive_title:negative_title).
--excel: 1 or 0, the default is 1 to generate a detailed list of markers, their taxonomic assignment, and their p-values
--blastn: If you have already run './build.sh', you do not need to specify this parameter, because that script downloads BLAST and sets up the NCBI BLASTN /bin/ path on your system. Otherwise, if you already have BLAST on your system, you can specify the path to its /bin/ directory here.
You can also download blast+ from below and specify the path:
Linux
http://ftp.ncbi.nlm.nih.gov/blast/executables/blast%2B/2.7.1/ncbi-blast-2.7.1%2B-x64-linux.tar.gz
MacOSx
http://ftp.ncbi.nlm.nih.gov/blast/executables/blast%2B/2.7.1/ncbi-blast-2.7.1%2B-x64-macosx.tar.gz
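Since a single project may have several phenotype mapping schemes (see --phenoname above), it can be convenient to assemble one command per scheme. The following is a minimal Python sketch of that idea, not part of DiTaxa. The helper name, the project paths, and the labels "untreated", "treated" and "healthy" are all hypothetical; only the flag names come from the list above. Mapping several labels to the same class is inferred from the default phenoname format (label1@1#label2@1...#label3@0) and should be verified on your data.

```python
import shlex


def build_ditaxa_command(options):
    """Assemble the argument list for one ditaxa.py run from a flag -> value dict."""
    command = ["python3", "ditaxa.py"]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


# Settings shared by all runs (hypothetical project layout).
shared = {
    "indir": "my_project/samples/",
    "ext": "fastq",
    "outdir": "my_project/results/",
    "dbname": "my_project",
    "cores": 20,
    "fast2label": "my_project/mapping.txt",
    "override": 0,  # 0: only generate the missing files
}

# Hypothetical phenotype schemes: phenoname -> phenomap string.
schemes = {
    "untreated_vs_all": "untreated:1,treated:0,healthy:0",
    "untreated_vs_healthy": "untreated:1,healthy:0",
}

for phenoname, phenomap in schemes.items():
    command = build_ditaxa_command({**shared, "phenomap": phenomap, "phenoname": phenoname})
    # Print a copy-pasteable shell line; use subprocess.run(command, check=True) to execute it.
    print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.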
For the phenotype classification functionality, which is evaluated in a 10-fold cross-validation framework:
--classify: which predictive model to use: choices=[False: default, 'RF': random forest, 'SVM': support vector machines, 'DNN': deep multi-layer perceptron, 'LR': logistic regression]
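To illustrate what a 10-fold cross-validation split looks like for a dataset such as the periodontal example (10 diseased and 10 healthy samples), here is a minimal, standard-library-only Python sketch. It is not DiTaxa's implementation. In particular, balancing the classes across folds (stratification) and the fixed random seed are assumptions of this sketch, and both function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample index to one of n_folds folds, balancing the classes."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def cross_validation_splits(labels, n_folds=10, seed=0):
    """Yield one (train_indices, test_indices) pair per fold."""
    for test_fold in stratified_folds(labels, n_folds, seed):
        test = sorted(test_fold)
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, test


# 10 positive and 10 negative samples, as in the periodontal example.
labels = [1] * 10 + [0] * 10
for train, test in cross_validation_splits(labels):
    # Each fold trains on 18 samples and tests on one sample per class.
    print(len(train), [labels[i] for i in test])
```

Each sample is held out exactly once, so the ten test sets together cover the whole dataset.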
Deep neural network parameters
Although a full script is provided, we have commented out the deep neural network classifier and its dependencies in order to simplify the core installation of DiTaxa for biomarker detection/analysis. If you are interested in using neural network prediction of the phenotype, you only need to install some further dependencies (keras/tensorflow) and uncomment "import DNN" in main/DiTaxa.py.
--arch: The comma-separated definition of the neural network layers connected to each other. You do not need to specify the input and output layers; values between 0 and 1 are treated as dropouts, e.g., 1024,0.2,512
--batchsize
--gpu_id: which GPU to use
--epochs: Number of epochs
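The --arch format described above can be illustrated with a short Python sketch that turns the string into a list of layer descriptions. This is not DiTaxa's parser: the function name and the ("dense", ...) / ("dropout", ...) tuples are hypothetical, and treating only values strictly between 0 and 1 as dropouts is an assumption about the boundary cases.

```python
def parse_arch(arch):
    """Turn '1024,0.2,512' into a list of (kind, value) layer descriptions."""
    layers = []
    for token in arch.split(","):
        value = float(token)
        if 0 < value < 1:
            # A fraction is read as a dropout rate.
            layers.append(("dropout", value))
        elif value >= 1 and value.is_integer():
            # A whole number is read as the size of a hidden layer.
            layers.append(("dense", int(value)))
        else:
            raise ValueError(f"cannot interpret layer definition {token!r}")
    return layers


print(parse_arch("1024,0.2,512"))
# → [('dense', 1024), ('dropout', 0.2), ('dense', 512)]
```

Read this way, the example 1024,0.2,512 describes a hidden layer of 1024 units, a dropout of 0.2, and a hidden layer of 512 units, with the input and output layers added by the tool.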