-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathMicrobiomeWorkflowII.Rmd
2051 lines (1669 loc) · 88.2 KB
/
MicrobiomeWorkflowII.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Workflow for Microbiome Data Analysis: from raw reads to community analyses."
author:
- name: Benjamin J Callahan
affiliation: Department of Population Health and Pathobiology, NC State University, Raleigh, NC 27606
- name: Kris Sankaran
affiliation: Statistics Department, Stanford University, CA 94305
- name: Julia A Fukuyama
affiliation: Statistics Department, Stanford University, CA 94305
- name: Paul Joey McMurdie
affiliation: Whole Biome Inc, San Francisco, CA 94107
- name: Susan P Holmes
affiliation: Statistics Department, Stanford University, CA 94305
keywords: Microbiome, Denoising, 16S rRNA, phylogenetics, microbial ecology, multiway data, sparse CCA.
bibliography: F1000Work.bib
date: 25 July 2017
output:
BiocStyle::html_document2:
toc_float: true
vignette: >
%\VignetteIndexEntry{Microbiome workflow from raw reads}
%\VignetteEngine{knitr::rmarkdown}
---
```{r style, echo=FALSE, message=FALSE, warning=FALSE, results="asis"}
library("BiocStyle")
library("knitr")
library("rmarkdown")
options(width = 98)
opts_chunk$set(message = FALSE, error = FALSE, warning = FALSE,
cache = TRUE, fig.width = 8, fig.height = 7)
```
Abstract {#abstract .unnumbered}
===========
High-throughput sequencing of PCR-amplified taxonomic markers
(like the 16S rRNA gene) has enabled a new level of analysis of
complex bacterial communities known as microbiomes. Many tools exist
to quantify and compare abundance levels or OTU composition of
communities in different conditions. The sequencing reads have to be
denoised and assigned to the closest taxa from a reference
database. Common approaches use a notion of 97% similarity and
normalize the data by subsampling to equalize library sizes. In this
paper, we show that statistical models allow more accurate abundance
estimates. By providing a complete workflow in R, we enable the user
to do sophisticated downstream statistical analyses, whether
parametric or nonparametric. We provide examples of using the R
packages `dada2, phyloseq, DESeq2, ggplot2`, `structSSI` and `vegan` to filter, visualize and
test microbiome data. We also provide examples of supervised analyses using random
forests and nonparametric testing using community networks and the `ggnetwork`
package.
Introduction {#introduction .unnumbered}
============
The **microbiome** is formed of the ecological communities of
microorganisms that dominate the living world. Bacteria can now be
identified through the use of next generation sequencing applied at
several levels. Shotgun sequencing of all bacteria in a sample delivers
knowledge of all the genes present. Here we will only be interested in
the identification and quantification of individual taxa (or species)
through a ‘fingerprint gene’ called 16s rRNA which is present in all
bacteria. This gene presents several variable regions which can be used
to identify the different taxa.
Previous standard workflows depended on clustering all 16s rRNA
sequences (generated by next generation amplicon sequencing) that occur
within a 97% radius of similarity and then assigning these to ‘OTUs’
from reference trees [@caporaso2010qiime; @mothur]. These approaches do
not incorporate all the data, in particular sequence quality information
and statistical information available on the reads were not incorporated
into the assignments.
In contrast, the **de novo** read counts used here will be
constructed through the incorporation of both the quality scores and
sequence frequencies in a probabilistic noise model for nucleotide
transitions. For more details on the algorithmic implementation of this
step see [@dada2].
After filtering the sequences and removing the chimeræ, the data are
compared to a standard database of bacteria and labeled. In this
workflow, we have used the labeled sequences to build a de novo
phylogenetic with the .
The key step in the sequence analysis is the manner in which reads are
denoised and assembled into groups we have chosen to call **ASV**s
(Amplicon Sequence Variants)[@callahan2017exact] instead of the traditional OTUs(Operational Taxonomic Units).
A published (but essentially similar) version of this workflow, including reviewer reports and comments
is available [@callahan2016F1000], see
[F1000Research From Raw reads](https://f1000research.com/articles/5-1492/v2).
There are extensive documentation and tutorial pages available for [dada2](https://benjjneb.github.io/dada2/tutorial.html)
and [phyloseq](https://joey711.github.io/phyloseq/tutorials-index.html).
If you have questions about this workflow, please start
by consulting the relevant github issues sites for [dada2](https://github.com/benjjneb/dada2/issues), [phyloseq](https://github.com/joey711/phyloseq/issues), if the answers
are not available, please post to the issues pages or
[Bioconductor forum](https://support.bioconductor.org/).
Note the
[posting guide](http://www.bioconductor.org/help/support/posting-guide/)
for crafting an optimal question for the support site.
This is a workflow
for denoising, filtering, performing data transformations, visualization, supervised
learning analyses, community network tests, hierarchical testing and
linear models. We provide all the code and give several examples of
different types of analyses and use-cases. There are often many
different objectives in experiments involving microbiome data and we
will only give a flavor for what could be possible once the data has
been imported into R.
In addition, the code can be easily adapted to accommodate batch
effects, covariates and multiple experimental factors.
This workflow is based on software packages from the open-source
Bioconductor project [@Huber:2015] and some CRAN packages.
We provide all steps necessary from
the denoising and identification of the reads input as raw sequences as
[fastq](https://en.wikipedia.org/wiki/FASTQ_format) files to the comparative testing and multivariate analyses of
the samples and analyses of the abundances according to multiple
available covariates.
Methods {#methods .unnumbered}
=======
Amplicon bioinformatics: from raw reads to tables {#amplicon-bioinformatics-from-raw-reads-to-tables .unnumbered}
-------------------------------------------------
This section demonstrates the “full stack” of amplicon bioinformatics:
construction of the sample-by-sequence feature table from the raw reads,
assignment of taxonomy, and creation of a phylogenetic tree relating the
sample sequences.
First we load the necessary packages.
```{r, message=FALSE, warning=FALSE}
library("knitr")
library("BiocStyle")
.cran_packages <- c("ggplot2", "gridExtra")
.bioc_packages <- c("dada2", "phyloseq", "DECIPHER", "phangorn")
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
install.packages(.cran_packages[!.inst])
}
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
source("http://bioconductor.org/biocLite.R")
biocLite(.bioc_packages[!.inst], ask = F)
}
# Load packages into session, and print package version
sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE)
set.seed(100)
```
The data we will analyze here are highly-overlapping Illumina Miseq
2x250 amplicon sequences from the V4 region of the 16S gene
[@Kozich2013]. These 360 fecal samples were collected from 12 mice
longitudinally over the first year of life one mock community control.
These were collected to investigate the
development and stabilization of the murine microbiome
[@schloss2012stabilization].
These data are downloaded from the
following [data location](http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip) and unzip.
For now just consider them paired-end fastq files to be processed.
Define the following path variable so that it points to the extracted directory on **your** machine:
```{r path}
miseq_path <- "./MiSeq_SOP" # CHANGE to the directory containing the fastq files after unzipping.
list.files(miseq_path)
```
Filter and Trim {#trim-and-filter .unnumbered}
------------------
We begin by filtering out low-quality sequencing reads and trimming the
reads to a consistent length. While generally recommended filtering and
trimming parameters serve as a starting point, no two datasets are
identical and therefore it is always worth inspecting the quality of the
data before proceeding.
First we read in the names of the fastq files, and perform some string manipulation to
get lists of the forward and reverse fastq files in matched order:
```{r filenames}
# Sort ensures forward/reverse reads are in same order
fnFs <- sort(list.files(miseq_path, pattern="_R1_001.fastq"))
fnRs <- sort(list.files(miseq_path, pattern="_R2_001.fastq"))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sampleNames <- sapply(strsplit(fnFs, "_"), `[`, 1)
# Specify the full path to the fnFs and fnRs
fnFs <- file.path(miseq_path, fnFs)
fnRs <- file.path(miseq_path, fnRs)
fnFs[1:3]
fnRs[1:3]
```
Most Illumina sequencing data shows a trend of decreasing average
quality towards the end of sequencing reads.
The first two forward reads:
```{r see-quality-F}
plotQualityProfile(fnFs[1:2])
```
The first two reverse reads:
```{r see-quality-R}
plotQualityProfile(fnRs[1:2])
```
Here, the forward reads maintain high quality throughout, while the
quality of the reverse reads drops significantly at about position 160.
Therefore, we choose to truncate the forward reads at position 245, and
the reverse reads at position 160. We also choose to trim the first 10
nucleotides of each read based on empirical observations across many
Illumina datasets that these base positions are particularly likely to
contain pathological errors.
We define the filenames for the filtered fastq.gz files:
```{r filt-names}
filt_path <- file.path(miseq_path, "filtered") # Place filtered files in filtered/ subdirectory
if(!file_test("-d", filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(sampleNames, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sampleNames, "_R_filt.fastq.gz"))
```
We combine these trimming parameters with standard filtering parameters,
the most important being the enforcement of a maximum of 2 expected
errors per-read [@edgar2015unoise]. Trimming and filtering is performed
on paired reads jointly, i.e. both reads must pass the filter for the
pair to pass.
**Filter the forward and reverse reads**:
```{r filter, message=FALSE, warning=FALSE}
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
head(out)
```
Infer sequence variants {#infer-sequence-variants .unnumbered}
-----------------------
After filtering, the typical amplicon bioinformatics workflow clusters
sequencing reads into operational taxonomic units (OTUs): groups of
sequencing reads that differ by less than a fixed dissimilarity
threshhold. Here we instead use the high-resolution DADA2 method to to
infer amplicon sequence variants (ASVs) exactly, without imposing any
arbitrary threshhold, and thereby resolving variants that differ by as
little as one nucleotide [@dada2].
The sequence data is imported into R from demultiplexed fastq files
(i.e. one fastq for each sample) and simultaneously dereplicated to
remove redundancy. We name the resulting `derep-class` objects by
their sample name.
### Dereplication {.unnumbered}
Dereplication combines all identical sequencing reads into into "unique sequences" with a corresponding "abundance": the number of reads with that unique sequence. Dereplication substantially reduces computation time by eliminating redundant comparisons.
```{r dereplicate, message=FALSE}
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sampleNames
names(derepRs) <- sampleNames
```
The DADA2 method relies on a parameterized model of substitution errors
to distinguish sequencing errors from real biological variation. Because
error rates can (and often do) vary substantially between sequencing
runs and PCR protocols, the model parameters can be discovered from the
data itself using a form of unsupervised learning in which sample
inference is alternated with parameter estimation until both are jointly
consistent.
Parameter learning is computationally intensive, as it requires multiple
iterations of the sequence inference algorithm, and therefore it is
often useful to estimate the error rates from a (sufficiently large)
subset of the data.
```{r learnErrorRates}
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
```
```{r plotrates, fig.show= "hold", fig.height = 8, fig.cap= "Estimated Error rates (both forward and reverse)"}
plotErrors(errF)
plotErrors(errR)
```
In order to verify that the error rates have been reasonably
well-estimated, we inspect the fit between the observed error rates
(black points) and the fitted error rates (black lines) in
Figure \@ref(fig:plotrates). These figures show the frequencies of
each type of transition as a function of the quality.
The DADA2 sequence inference method can run in two different modes:
Independent inference by sample (`pool=FALSE`), and inference from the
pooled sequencing reads from all samples (`pool=TRUE`). Independent
inference has the advantage that computation time is linear in the
number of samples, and memory requirements are flat with the number of
samples. This allows scaling out to datasets of almost unlimited size.
Pooled inference is more computationally taxing, and can become
intractable for datasets of tens of millions of reads. However, pooling
improves the detection of rare variants that were seen just once or
twice in an individual sample but many times across all samples. As this
dataset is not particularly large, we perform pooled inference. As of
version 1.2, multithreading can now be activated with the arguments
`multithread = TRUE`, which substantially speeds this step.
```{r dadaStep}
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)
```
Inspecting the dada-class object returned by dada:
```{r see-dada}
dadaFs[[1]]
```
The DADA2 algorithm inferred `r length(dadaFs[[1]]$sequence)` real sequence variants from the `r length(dadaFs[[1]]$map)` unique sequences in the first sample. The `dada-class` object
contains multiple diagnostics about the quality of each inferred sequence variant(see `help("dada-class")` for some info).
The DADA2 sequence inference step removed (nearly) all substitution and
indel errors from the data [@dada2]. We now merge together the inferred
forward and reverse sequences, removing paired sequences that do not
perfectly overlap as a final control against residual errors.
Construct sequence table and remove chimeras {#construct-sequence-table-and-remove-chimeras .unnumbered}
--------------------------------------------
The DADA2 method produces a sequence table that is a higher-resolution
analogue of the common “OTU table”, i.e. a sample by sequence feature
table valued by the number of times each sequence was observed in each
sample.
```{r mergers}
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
```
```{r seqtab}
seqtabAll <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
table(nchar(getSequences(seqtabAll)))
```
Notably, chimeras have not yet been removed. The error model in the sequence inference algorithm does not include a chimera component, and therefore we expect this sequence table to include many chimeric sequences. We now remove chimeric sequences by comparing each inferred sequence to the others in the table, and removing those that can be reproduced by stitching together two more abundant sequences.
```{r chimeras}
seqtabNoC <- removeBimeraDenovo(seqtabAll)
```
Although exact numbers vary substantially by experimental condition, it is typical that chimeras comprise a substantial fraction of inferred sequence variants, but only a small fraction of all reads.
That is what is observed here chimeras make up about `r round(100*(ncol(seqtabAll)-ncol(seqtabNoC))/ncol(seqtabAll))`\% of the inferred sequence variants, but those variants account for only about `r round(100*(sum(seqtabAll)-sum(seqtabNoC))/sum(seqtabAll))`\% of the total sequence reads.
Assign taxonomy {#assign-taxonomy .unnumbered}
---------------
One of the benefits of using well-classified marker loci like the 16S
rRNA gene is the ability to taxonomically classify the sequence
variants. The `r Biocpkg("dada2")`
package
implements the naive Bayesian classifier method for this purpose
[@wang2007naive]. This classifier compares sequence variants to a
training set of classified sequences, and here we use the RDP v16
training set [@cole2009rdp].
The dada2 tutorial website contains [formatted training fastas for the RDP training set, GreenGenes clustered at 97\% identity, and the Silva reference database](https://benjjneb.github.io/dada2/training.html) available.
For fungal taxonomy, the General Fasta release files from the [UNITE ITS database](https://unite.ut.ee/repository.php) can be used as is. To follow this workflow, download the [`rdp_train_set_16.fa.gz`](https://doi.org/10.5281/zenodo.801827) file, and place it in the directory with the fastq files.
```{r tax}
fastaRef <- "./rdp_train_set_16.fa.gz"
taxTab <- assignTaxonomy(seqtabNoC, refFasta = fastaRef, multithread=TRUE)
unname(head(taxTab))
```
Construct phylogenetic tree {#construct-phylogenetic-tree .unnumbered}
---------------------------
Phylogenetic relatedness is commonly used to inform downstream analyses,
especially the calculation of phylogeny-aware distances between
microbial communities. The DADA2 sequence inference method is
reference-free, so we must construct the phylogenetic tree relating the
inferred sequence variants de novo. We begin by performing a
multiple-alignment using the `r Biocpkg("DECIPHER")` R package
[@wright2015decipher].
```{r msa, output=FALSE,message=FALSE}
seqs <- getSequences(seqtabNoC)
names(seqs) <- seqs # This propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA,verbose=FALSE)
```
The `r CRANpkg("phangorn")` R package is then used to construct a phylogenetic tree.
Here we first construct a neighbor-joining tree, and then fit a GTR+G+I
(Generalized time-reversible with Gamma rate variation)
maximum likelihood tree using the neighbor-joining tree as a starting point.
```{r tree}
phangAlign <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phangAlign)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit = pml(treeNJ, data=phangAlign)
fitGTR <- update(fit, k=4, inv=0.2)
fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
rearrangement = "stochastic", control = pml.control(trace = 0))
detach("package:phangorn", unload=TRUE)
```
Combine data into a phyloseq object {#combine-data-into-a-phyloseq-object .unnumbered}
-----------------------------------
The package `phyloseq` organizes and synthesizes the different data types from a
typical amplicon sequencing experiment into a single data object that
can be easily manipulated. The last bit of information needed is the
sample data contained in a `.csv` file.
This can be downloaded from github:
```{r samdat}
samdf <- read.csv("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/MIMARKS_Data_combined.csv",header=TRUE)
samdf$SampleID <- paste0(gsub("00", "", samdf$host_subject_id), "D", samdf$age-21)
samdf <- samdf[!duplicated(samdf$SampleID),] # Remove dupicate entries for reverse reads
rownames(seqtabAll) <- gsub("124", "125", rownames(seqtabAll)) # Fix discrepancy
all(rownames(seqtabAll) %in% samdf$SampleID) # TRUE
rownames(samdf) <- samdf$SampleID
keep.cols <- c("collection_date", "biome", "target_gene", "target_subfragment",
"host_common_name", "host_subject_id", "age", "sex", "body_product", "tot_mass",
"diet", "family_relationship", "genotype", "SampleID")
samdf <- samdf[rownames(seqtabAll), keep.cols]
```
The full suite of data for this study -- the sample-by-sequence
feature table, the sample metadata, the sequence taxonomies, and the
phylogenetic tree -- can now be combined into a single object.
```{r phyloseqObj}
ps <- phyloseq(otu_table(seqtabNoC, taxa_are_rows=FALSE),
sample_data(samdf),
tax_table(taxTab),phy_tree(fitGTR$tree))
ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample
ps
```
# Using phyloseq {.unnumbered}
`phyloseq`[@phyloseqplosone] is an R package to import, store, analyze, and
graphically display complex phylogenetic sequencing data that has
already been clustered into Operational Taxonomic Units (OTUs) or more
appropriately denoised, and it is most useful when there is also
associated sample data, phylogeny, and/or taxonomic assignment of each
taxa. leverages and builds upon many of the tools available in R for
ecology and phylogenetic analysis (ape, vegan, ade4), while also using
advanced/flexible graphic systems (ggplot2) to easily produce
publication-quality graphics of complex phylogenetic data. The
`r Biocpkg("phyloseq")` package uses a specialized system of S4 data classes to
store all related phylogenetic sequencing data as a single,
self-consistent, self-describing experiment-level object, making it
easier to share data and reproduce analyses. In general, phyloseq seeks
to facilitate the use of R for efficient interactive and reproducible
analysis of amplicon count data jointly with important sample
covariates.
This tutorial shows a useful example workflow,
but many more analyses are available to you in phyloseq, and R in general,
than can fit in a single workflow.
The [phyloseq home page](http://joey711.github.io/phyloseq/)
is a good place to begin browsing additional phyloseq documentation,
as are the three vignettes included within the package,
and linked directly at [the phyloseq release page on Bioconductor](http://bioconductor.org/packages/release/bioc/html/phyloseq.html).
## Loading the data {.unnumbered}
Many use cases result in the need to import and combine different
data into a phyloseq class object,
this can be done using th `import_biom`
function to read recent QIIME format files, older files
can still be imported with `import_qiime`.
More complete details can be found on the
[phyloseq FAQ page](https://www.bioconductor.org/packages/release/bioc/vignettes/phyloseq/inst/doc/phyloseq-FAQ.html).
In the previous section the results of `r Biocpkg("dada2")` sequence processing
were organized into a phyloseq object. We have actually
run dada2 on a larger set of samples from the same data source.
This object was also saved in R-native serialized RDS format.
We will re-load this here for completeness as the initial object `ps`.
If you have not downloaded the whole repository you can access the `ps`
file though github:
```{r non-local}
ps_connect <-url("https://raw.githubusercontent.com/spholmes/F1000_workflow/master/data/ps.rds")
ps = readRDS(ps_connect)
ps
```
## Shiny-phyloseq {.unnumbered}
It can be beneficial to start the data exploration
process interactively, this often saves time
in detecting outliers and specific features of the data.
[Shiny-phyloseq](http://joey711.github.io/shiny-phyloseq/)
[@mcmurdie2015] is an interactive web application that provides a graphical user interface to
the `r Biocpkg("phyloseq")` package.
The object just loaded into the R session in this workflow
is suitable for graphical exploration with Shiny-phyloseq.
Filtering {#filtering .unnumbered}
---------
`r Biocpkg("phyloseq")` provides useful tools for filtering, subsetting, and agglomerating taxa
– a task that is often appropriate or even necessary for effective
analysis of microbiome count data. In this subsection, we graphically
explore the prevalence of taxa in the example dataset, and demonstrate
how this can be used as a filtering criteria. One of the reasons to
filter in this way is to avoid spending much time analyzing taxa that
were seen only rarely among samples. This also turns out to be a useful
filter of noise (taxa that are actually just artifacts of the data
collection process), a step that should probably be considered essential
for datasets constructed via heuristic OTU-clustering methods, which are
notoriously prone to generating spurious taxa.
### Taxonomic Filtering {#taxonomic-filtering .unnumbered}
In many biological settings, the set of all organisms from all samples
are well-represented in the available taxonomic reference database. When
(and only when) this is the case, it is reasonable or even advisable to
filter taxonomic features for which a high-rank taxonomy could not be
assigned. Such ambiguous features in this setting are almost always
sequence artifacts that don’t exist in nature. It should be obvious that
such a filter is not appropriate for samples from poorly characterized
or novel specimens, at least until the possibility of taxonomic novelty
can be satisfactorily rejected. Phylum is a useful taxonomic rank to
consider using for this purpose, but others may work effectively for
your data.
To begin, create a table of read counts for each Phylum present in the
dataset.
```{r taxfilter0}
# Show available ranks in the dataset
rank_names(ps)
# Create table, number of features for each phyla
table(tax_table(ps)[, "Phylum"], exclude = NULL)
```
This shows a few phyla for which only one feature was observed. Those
may be worth filtering, and we’ll check that next. First, notice that in
this case, six features were annotated with a Phylum of NA. These
features are probably artifacts in a dataset like this, and should be
removed.
The following ensures that features with ambiguous phylum annotation are
also removed. Note the flexibility in defining strings that should be
considered ambiguous annotation.
```{r removeNAphyla}
ps <- subset_taxa(ps, !is.na(Phylum) & !Phylum %in% c("", "uncharacterized"))
```
A useful next step is to explore feature *prevalence* in the dataset,
which we will define here as the number of samples in which a taxon
appears at least once.
```{r prevfilter0}
# Compute prevalence of each feature, store as data.frame
prevdf = apply(X = otu_table(ps),
MARGIN = ifelse(taxa_are_rows(ps), yes = 1, no = 2),
FUN = function(x){sum(x > 0)})
# Add taxonomy and total read counts to this data.frame
prevdf = data.frame(Prevalence = prevdf,
TotalAbundance = taxa_sums(ps),
tax_table(ps))
```
Are there phyla that are comprised of mostly low-prevalence features? Compute the total and average prevalences of the features in each phylum.
```{r lowprev}
plyr::ddply(prevdf, "Phylum", function(df1){cbind(mean(df1$Prevalence),sum(df1$Prevalence))})
```
_Deinococcus-Thermus_ appeared in just over one percent of samples, and Fusobacteria appeared in just 2 samples total. In some cases it might be worthwhile to explore these two phyla in more detail despite this (though probably not Fusobacteria’s two samples). For the purposes of this example, though, they will be filtered from the dataset.
```{r taxfilter}
# Define phyla to filter
filterPhyla = c("Fusobacteria", "Deinococcus-Thermus")
# Filter entries with unidentified Phylum.
ps1 = subset_taxa(ps, !Phylum %in% filterPhyla)
ps1
```
### Prevalence Filtering {#prevalence-filtering .unnumbered}
The previous filtering steps are considered *supervised*, because they
relied on prior information that is external to this experiment (a
taxonomic reference database). This next filtering step is completely
*unsupervised*, relying only on the data in this experiment, and a
parameter that we will choose after exploring the data. Thus, this
filtering step can be applied even in settings where taxonomic
annotation is unavailable or unreliable.
First, explore the relationship of prevalence and total read count for
each feature. Sometimes this reveals outliers that should probably be
removed, and also provides insight into the ranges of either feature
that might be useful. This aspect depends quite a lot on the
experimental design and goals of the downstream inference, so keep these
in mind. It may even be the case that different types of downstream
inference require different choices here. There is no reason to expect
ahead of time that a single filtering workflow is appropriate for all
analysis.
```{r plotprevalence, fig.width=9, fig.height=5, fig.cap="Taxa prevalence versus total counts."}
# Subset to the remaining phyla
prevdf1 = subset(prevdf, Phylum %in% get_taxa_unique(ps1, "Phylum"))
ggplot(prevdf1, aes(TotalAbundance, Prevalence / nsamples(ps),color=Phylum)) +
# Include a guess for parameter
geom_hline(yintercept = 0.05, alpha = 0.5, linetype = 2) + geom_point(size = 2, alpha = 0.7) +
scale_x_log10() + xlab("Total Abundance") + ylab("Prevalence [Frac. Samples]") +
facet_wrap(~Phylum) + theme(legend.position="none")
```
Each point in Figure \@ref(fig:plotprevalence) is a different taxa.
Exploration of the data in this way is often useful for selecting
filtering parameters, like the minimum prevalence criteria we will used
to filter the data
above.
Sometimes a natural separation in the dataset reveals itself, or at
least, a conservative choice that is in a stable region for which small
changes to the choice would have minor or no effect on the biological
interpreation (stability). Here no natural separation is immediately
evident, but it looks like we might reasonably define a prevalence
threshold in a range of zero to ten percent or so. Take care that this
choice does not introduce bias into a downstream analysis of association
of differential abundance.
The following uses five percent of all samples as the prevalence
threshold.
```{r prevalencefilter}
# Define prevalence threshold as 5% of total samples
prevalenceThreshold = 0.05 * nsamples(ps)
prevalenceThreshold
# Execute prevalence filter, using `prune_taxa()` function
keepTaxa = rownames(prevdf1)[(prevdf1$Prevalence >= prevalenceThreshold)]
ps2 = prune_taxa(keepTaxa, ps)
```
Agglomerate taxa {#agglomerate-taxa .unnumbered}
----------------
When there is known to be a lot of species or sub-species functional
redundancy in a microbial community, it might be useful to agglomerate
the data features corresponding to closely related taxa. Ideally we
would know the functional redundancies perfectly ahead of time, in which
case we would agglomerate taxa using those defined relationships and the
function in phyloseq. That kind of exquisite functional data is usually
not available, and different pairs of microbes will have different sets
of overlapping functions, complicating the matter of defining
appropriate grouping criteria.
While not necessarily the most useful or functionally-accurate criteria
for grouping microbial features (sometimes far from accurate), taxonomic
agglomeration has the advantage of being much easier to define ahead of
time. This is because taxonomies are usually defined with a
comparatively simple tree-like graph structure that has a fixed number
of internal nodes, called “ranks”. This structure is simple enough for
the phyloseq package to represent taxonomies as table of taxonomy
labels. Taxonomic agglomeration groups all the “leaves” in the hierarchy
that descend from the user-prescribed agglomerating rank, this is
sometimes called ‘glomming’.
The following example code shows how one would combine all features that
descend from the same genus.
```{r taxglom}
# How many genera would be present after filtering?
length(get_taxa_unique(ps2, taxonomic.rank = "Genus"))
ps3 = tax_glom(ps2, "Genus", NArm = TRUE)
```
If taxonomy is not available or not reliable, tree-based agglomeration
is a “taxonomy-free” alternative to combine data features corresponding
to closely-related taxa. In this case, rather than taxonomic rank, the
user specifies a tree height corresponding to the phylogenetic distance
between features that should define their grouping. This is very similar to “OTU Clustering”, except that in many OTU Clustering algorithms the
sequence distance being used does not have the same (or any)
evolutionary definition.
```{r tipglom}
h1 = 0.4
ps4 = tip_glom(ps2, h = h1)
```
Here phyloseq's `plot_tree()` function
compare the original unfiltered data,
the tree after taxonoic agglomeration,
and the tree after phylogenetic agglomeration.
These are stored as separate plot objects,
then rendered together in one combined graphic
using `gridExtra::grid.arrange`.
```{r plotglomprep}
multiPlotTitleTextSize = 15
p2tree = plot_tree(ps2, method = "treeonly",
ladderize = "left",
title = "Before Agglomeration") +
theme(plot.title = element_text(size = multiPlotTitleTextSize))
p3tree = plot_tree(ps3, method = "treeonly",
ladderize = "left", title = "By Genus") +
theme(plot.title = element_text(size = multiPlotTitleTextSize))
p4tree = plot_tree(ps4, method = "treeonly",
ladderize = "left", title = "By Height") +
theme(plot.title = element_text(size = multiPlotTitleTextSize))
```
```{r plotglomtree, fig.width=14, fig.cap="Different types of agglomeration"}
# group plots together
grid.arrange(nrow = 1, p2tree, p3tree, p4tree)
```
Figure \@ref(fig:plotglomtree) shows
the original tree on the left, taxonomic agglomeration at Genus rank
in the middle and phylogenetic agglomeration at a fixed distance of 0.4
on the right.
Abundance value transformation {#abundance-value-transformation .unnumbered}
------------------------------
It is usually necessary to transform microbiome count data to account
for differences in library size, variance, scale, etc. The `phyloseq`
package provides a flexible interface for defining new functions to
accomplish these transformations of the abundance values via the
function `transform_sample_counts()`.
The first argument to this function is the phyloseq object you
want to transform, and the second argument is an R function that defines
the transformation. The R function is applied sample-wise, expecting
that the first unnamed argument is a vector of taxa counts in the same
order as the phyloseq object. Additional arguments are passed on to the
function specified in the second argument, providing an explicit means
to include pre-computed values, previously defined
parameters/thresholds, or any other object that might be appropriate for
computing the transformed values of interest.
This example begins by defining a custom plot function,
`plot_abundance()`, that uses
phyloseq’s function to define a relative abundance graphic.
We will use
this to compare more easily differences in scale and distribution of the
abundance values in our phyloseq object before and after transformation.
```{r abundancetransformation}
plot_abundance = function(physeq,title = "",
Facet = "Order", Color = "Phylum"){
# Arbitrary subset, based on Phylum, for plotting
p1f = subset_taxa(physeq, Phylum %in% c("Firmicutes"))
mphyseq = psmelt(p1f)
mphyseq <- subset(mphyseq, Abundance > 0)
ggplot(data = mphyseq, mapping = aes_string(x = "sex",y = "Abundance",
color = Color, fill = Color)) +
geom_violin(fill = NA) +
geom_point(size = 1, alpha = 0.3,
position = position_jitter(width = 0.3)) +
facet_wrap(facets = Facet) + scale_y_log10()+
theme(legend.position="none")
}
```
The transformation in this case converts the counts
from each sample into their frequencies,
often referred to as *proportions*
or *relative abundances*.
This function is so simple that it is easiest
to define it within the function call to
`transform_sample_counts()`.
```{r abundancetransformation2}
# Transform to relative abundance. Save as new object.
ps3ra = transform_sample_counts(ps3, function(x){x / sum(x)})
```
Now we plot the abundance values before and after transformation.
```{r abundancetransformation3, fig.height=12, fig.width=10.5,fig.cap="Comparison of original abundances with transformed data"}
plotBefore = plot_abundance(ps3,"")
plotAfter = plot_abundance(ps3ra,"")
# Combine each plot into one graphic.
grid.arrange(nrow = 2, plotBefore, plotAfter)
```
Figure \@ref(fig:abundancetransformation3) shows the
comparison of original abundances (top panel)
and relative abundances (lower).
Subset by taxonomy {#subset-by-taxonomy .unnumbered}
------------------
Notice on the previous plot that *Lactobacillales* appears to be a
taxonomic Order with bimodal abundance profile in the data. We can check
for a taxonomic explanation of this pattern by plotting just that
taxonomic subset of the data. For this, we subset with the function, and
then specify a more precise taxonomic rank to the argument of the
function that we defined above.
```{r subsettaxa, fig.cap= "Violin plot of relative abundances of Lactobacillales"}
psOrd = subset_taxa(ps3ra, Order == "Lactobacillales")
plot_abundance(psOrd, Facet = "Genus", Color = NULL)
```
Figure \@ref(fig:subsettaxa) shows
the relative abundances of Lactobacillales taxonomic
Order, grouped by host sex and genera. Here it is clear that the
apparent biomodal distribution of Lactobacillales on the previous plot
was the result of a mixture of two different genera, with the typical
*Lactobacillus* relative abundance much larger than
*Streptococcus*.
At this stage in the workflow, after converting raw reads to
interpretable species abundances, and after filtering and transforming
these abundances to focus attention on scientifically meaningful
quantities, we are in a position to consider more careful statistical
analysis. R is an ideal environment for performing these analyses, as
it has an active community of package developers building simple
interfaces to sophisticated techniques. As a variety of methods are
available, there is no need to commit to any rigid analysis strategy a
priori. Further, the ability to easily call packages without
reimplementing methods frees researchers to iterate rapidly through
alternative analysis ideas. The advantage of performing this full
workflow in R is that this transition from bioinformatics to
statistics is effortless.
Let's start by installing a few packages that are available for these complementary analyses:
```{r init-analysis}
.cran_packages <- c( "shiny","miniUI", "caret", "pls", "e1071", "ggplot2", "randomForest", "dplyr", "ggrepel", "nlme", "devtools",
"reshape2", "PMA", "structSSI", "ade4",
"ggnetwork", "intergraph", "scales")
.github_packages <- c("jfukuyama/phyloseqGraphTest")
.bioc_packages <- c("genefilter", "impute")
# Install CRAN packages (if not already installed)
.inst <- .cran_packages %in% installed.packages()
if (any(!.inst)){
install.packages(.cran_packages[!.inst],repos = "http://cran.rstudio.com/")
}
.inst <- .github_packages %in% installed.packages()
if (any(!.inst)){
devtools::install_github(.github_packages[!.inst])
}
.inst <- .bioc_packages %in% installed.packages()
if(any(!.inst)){
source("http://bioconductor.org/biocLite.R")
biocLite(.bioc_packages[!.inst])
}
```
We back these claims by illustrating several analyses on the mouse
data prepared above. We experiment with several flavors of exploratory
ordination before shifting to more formal testing and modeling,
explaining the settings in which the different points of view are most
appropriate. Finally, we provide example analyses of multitable data,
using a study in which both metabolomic and microbial abundance
measurements were collected on the same samples, to demonstrate that
the general workflow presented here can be adapted to the multitable
setting.
### Preprocessing {#preprocessing .unnumbered}
Before doing the multivariate projections, we will add a few columns to
our sample data, which can then be used to annotate plots. From Figure
\@ref(fig:preprocessing-setup), we see that the ages of the mice come in a
couple of groups, and so we make a categorical variable corresponding to
young, middle-aged, and old mice. We also record the total number of
counts seen in each sample and log-transform the data as an approximate
variance stabilizing transformation.
```{r preprocessing-setup, fig.cap="Histogram of age groupings",fig.show="hold"}
qplot(sample_data(ps)$age, geom = "histogram",binwidth=20) + xlab("age")
```
Figure \@ref(fig:preprocessing-setup) shows that the age covariate
belongs to three separate clusters.
```{r preprocessing2, fig.cap="Histograms comparing raw and log transformed read depths",fig.show="hold"}
qplot(log10(rowSums(otu_table(ps))),binwidth=0.2) +
xlab("Logged counts-per-sample")
```
These preliminary plots suggest certain preprocessing steps.
The
histogram in Figure \@ref(fig:preprocessing-setup)
motivates the creation of a new categorical
variable, binning age into one of the three peaks.
The histogram in Figure \@ref(fig:preprocessing2)
suggests that a $\log\left(1 + x\right)$ transformation
might be
sufficient for `normalizing` the abundance data for the exploratory analyses.
In fact this transformation is not sufficient for
testing purposes and when performing differential abundances
we recommend the variance stabilizing transformations
available in DESeq2 through the `phyloseq_to_deseq2` function,
see the [phyloseq_to_deseq2 tutorial here](https://bioconductor.org/packages/devel/bioc/vignettes/phyloseq/inst/doc/phyloseq-mixture-models.html).
As our first step, we look at principal coordinates analysis (PCoA) with
either the Bray-Curtis dissimilarity on the weighted Unifrac
distance.
```{r outlier-detect, fig.cap="Exploratory ordination analysis with log abundances.",fig.wide= TRUE}
sample_data(ps)$age_binned <- cut(sample_data(ps)$age,
breaks = c(0, 100, 200, 400))
levels(sample_data(ps)$age_binned) <- list(Young100="(0,100]", Mid100to200="(100,200]", Old200="(200,400]")
sample_data(ps)$family_relationship=gsub(" ","",sample_data(ps)$family_relationship)
pslog <- transform_sample_counts(ps, function(x) log(1 + x))
out.wuf.log <- ordinate(pslog, method = "MDS", distance = "wunifrac")
evals <- out.wuf.log$values$Eigenvalues
plot_ordination(pslog, out.wuf.log, color = "age_binned") +
labs(col = "Binned Age") +
coord_fixed(sqrt(evals[2] / evals[1]))
```
Figure \@ref(fig:outlier-detect) showing the ordination on the logged abundance data reveals a few outliers.
These turn
out to be the samples from females 5 and 6 on day 165 and the samples
from males 3, 4, 5, and 6 on day 175. We will take them out, since we
are mainly interested in the relationships between the non-outlier
points.
Before we continue, we should check the two female
outliers -- they have been taken over by the same OTU/ASV, which has a
relative abundance of over 90\% in each of them. This is the only time
in the entire data set that this ASV has such a high relative
abundance -- the rest of the time it is below 20\%. In particular, its
diversity is by far the lowest of all the samples.
```{r outlier-analyze, fig.width=9, fig.height=5, fig.cap="The outlier samples are dominated by a single ASV."}
rel_abund <- t(apply(otu_table(ps), 1, function(x) x / sum(x)))
qplot(rel_abund[, 12], geom = "histogram",binwidth=0.05) +
xlab("Relative abundance")
```
# Different Ordination Projections {#different-ordination-projections .unnumbered}
As we have seen, an important first step in analyzing microbiome data is
to do unsupervised, exploratory analysis. This is simple to do in
`phyloseq`,
which provides many distances and ordination methods.
After documenting the outliers, we are going to compute ordinations with
these outliers removed and more carefully study the output.
```{r remove-outliers}
outliers <- c("F5D165", "F6D165", "M3D175", "M4D175", "M5D175", "M6D175")
ps <- prune_samples(!(sample_names(ps) %in% outliers), ps)
```
We are also going to remove samples with fewer than 1000 reads: