forked from coevoeco/GeneNet
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Walkthrough.txt
380 lines (285 loc) · 32.2 KB
/
Walkthrough.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
This walkthrough is intended as a quick guide for recreating analyses in the manuscript, "Gene networks provide a high-resolution view of bacteriophage ecology" by Jason W Shapiro and Catherine Putonti. It can also be adapted to analyze additional datasets. It generally assumes that the reader is comfortable with R and creating/navigating their own directory structures. (Note: this walkthrough was updated on 12/21/2017 with a new section for comparing phage host relatedeness to gene network topology. See Part Two: Comparing Gene Network Topology to Host Relatedenss.)
----------------------------------------------------------------------------------------
Step One: Gather data and prepare the concatenated FAA file
Note: this step is already completed and available as "allphagefaa.fa" in "Input Data."
a) Download all FAA files associated with the accessions in phagehostdatasummary.txt. Sequences can also be downloaded in gbk format and converted to FAA format using the R function gbk2faa() (provided in GeneNetworkRscripts.r).
b) Relabel each gene identifier using the genome accession and an index, for downstream convenience. E.g. Gene i of phage T4 would be labeled NC\_000866\_i. This can be done quickly in R if each FAA file is saved in the same directory and if each is named in the format "accession.faa," where the filename matches the genome accession. First create a new output directory to contain the relabeled files ('FAArelab' below). Then, move to the directory with the original FAA files and use the following (or equivalent):
filelist=list.files()
for(i in 1:length(filelist))
{
faaseq=readLines(filelist[i]) #reads in file i
genacc=strsplit(filelist[i],'.faa')[[1]] #extracts the genome accession from the filename
genepos=grep(">",faaseq) #finds the start position of each gene in the FAA file
faaseq[genepos]=paste(">",genacc,'_',c(1:length(genepos)),sep='') #relabels each gene identifier in the format ">accession_n" where n stands in for the order of the genes.
write.table(faaseq,paste("../FAArelab/",filelist[i],sep=''),row.names=F,col.names=F,quote=F) #writes file using original name to FAArelab directory
}
c) concatenate the relabeled FAA files into one FASTA file. To avoid any escape character issues from using 'cat *.faa' in the command line, I prefer to do this in R. While in the FAArelab directory:
filelist=list.files()
allseq=character()
for(i in 1:length(filelist))
{
allseq=c(allseq,readLines(filelist[i]))
}
write.table(allseq,'../allseq.fa',row.names=F,col.names=F,quote=F) #writes the new file to the parent directory containing FAArelab
#Note: allseq can also be built as you go during step (b)
Step Two: Cluster the protein sequences using usearch with the function cluster_fast.
The main parameter to specify here is the ID threshold used by usearch. Because protein sequences are being clustered, there is no need to specify how usearch should treat the strand direction. We ran this step using IDs from 0.20 (20%) to 0.95 (95%) and ultimately report results from 0.35, shown in the example below. A sample line of code would be:
'usearch -cluster_fast allseq.fa -id 0.35 -clusters ./phageuclust35/pc_'
The option '-clusters' tells usearch to output a separate fasta file for each of the clusters it infers. By default, the first sequence in each file is the centroid of that cluster. We used 'phageuclust35' as the output directory and 'pc_' provides a prefix that will label each cluster, starting with pc_0.
Sample output "phageuclust35.tar.gz" is available via figshare at https://figshare.com/s/cba533ddfd55e9cf75a8.
Step Three: Convert the set of protein clusters into a genome-gene presence-absence matrix
Because each gene is relabeled to contain the accession of its source genome, it is straightforward to build a matrix of genome-gene associations by reading in each cluster output file. The function clust2pres() (provided in clust2pres.r) takes in the cluster directory and a list of the original genome accessions and outputs this matrix. (Note: clust2pres() relies on sparse matrices and requires the package 'Matrix'.) By default, it excludes any clusters with only 1 or 2 sequences (the option 'trim = T' in the function arguments), but these can be included by adding 'trim = F' when running the function. There are a number of ways to prepare the list of phage accessions. One quick way is to save them as a single column text file and then read this in separately. We did this using a file called 'phagelist.txt'. The function is then run while in the cluster directory as:
phagelist=readLines('phagelist.txt')
pres35=clust2pres(getwd(),phagelist)
Step Four: Build network output files using igraph
a) Create adjacency matrices:
The matrix pres35 has rows corresponding to genomes, and columns corresponding to the clusters of homologous genes identified using usearch. Each entry {i,j} is either a 1 if gene j is in genome i, or a 0 otherwise. It therefore represents the adjacency matrix of a bipartite graph between genomes and genes. This is easily converted to an adjacency matrix connecting genomes if they share a gene or genes if they are members of the same genome by using matrix multiplication as follows:
padj=sign(pres35%*%t(pres35)
gadj=sign(t(pres35)%*%pres35)
where padj is the genome-level adjacency matrix and gadj is the gene-level adjacency matrix. The binary operator '%*%' performs matrix multiplication, and t() creates the matrix transpose. Without using the sign() function, these products would instead provide weighted adjacency matrices where the entries for padj, for instance, would be the number of genes shared between two genomes. While there may be value in using a weighted approach in future work, we elected to use the sign() function to create unweighted adjacency matrices to simplify the analyses.
b) create graphs using igraph:
The following step requires the package igraph. Note: there are multiple synonymous functions provided in igraph that can carry out this step:
pnet=graph_from_adjacency_matrix(padj)
gnet=graph_from_adjacency_matrix(gadj)
where pnet and gnet are the igraph graph objects for the genome and gene-level networks.
c) create and export edgelists:
pedge=as_edgelist(pnet)
gedge=as_edgelist(gnet)
write.table(pedge,'pedge.abc',row.names=F,col.names=F,quote=F,sep='\t') #the '.abc' file extension is used to match the syntax found in the MCL protocol on micans.org in later steps.
write.table(gedge,'gedge.abc',row.names=F,col.names=F,quote=F,sep='\t')
Note: these edgelists will contain both the edge from "node A" to "node B" and from "node B" to "node A," which is necessary so MCL can interpret the graph as undirected, but it roughly doubles the memory required to draw the network using Cytoscape. Smaller, undirected edgelists can be prepared by repeating 4b above but changing "padj" to "padj*upper.tri(padj)" and similarly for gadj. The function upper.tri() creates an upper-triangular matrix of 1s and 0s of dimensions matching the input matrix, and including this step removes the reciprocal edges from the edgelist.
Step Five: Find graphical clusters using MCL
We followed the first Protocol from micans.org for using MCL. One example is provided below using an inflation parameter of 2 and the genome level network edgelist, pedge.abc. This was repeated as described in the text for different inflation parameters and for the gene network. (This code is run from a UNIX command line, not within R.):
mcxload -abc pedge.abc --stream-mirror -write-tab pedge.tab -o pedge.mci
mcl pedge.mci -I 2 -o pedge.I20
mcxdump -icl pedge.I20 -tabr pedge.tab -o pedge.I20.dump
The output file "pedge.I20.dump" will then be used in remaining steps for analysis using R.
Step Six: Calculate and Maximize Mutual Information between MCL clusters and Host associations (all functions contained in MIfunctions.r)
Phage-host associations are provided at the species and genus levels in the files, phostspec.txt and phostgen.txt. These are two-column text files without headers, where the first column is the genome accession of a phage, and the second column is the species (with additional strain level information when available) of the lab host or the genus of that host. In order to calculate mutual information, these host associations are connected to MCL clusters in a matrix format using either the function chmake() or chmakeg() for the genome and gene-level networks, respectively. In the genome network, this matrix has rows corresponding to MCL clusters and columns corresponding to hosts. Each entry is the sum of the number of times each host appears within that MCL cluster. In the gene network, the matrix is similarly-structured, but the sum includes the host associated with each individual homolog associated with each node within each MCL cluster.
a) create the cluster-host matrices (example for gene network provided). Note: the function chmakeg() assumes the MCL output has already been read into R and formatted appropriately. The path to the cluster file can also be used as input, in which case the second line of the function, beginning with "#gmcl=strsplit(..." should be uncommented, and the 3rd line should be commented instead. This example leaves the function commented as in the file provided.
cluster20=strsplit(readLines('gedge35.I20.dump'),'\t') #the original "dump" file provides 1 line for each cluster, with tabs separating each gene in the cluster.
#Using strsplit here convert this to a list object where each entry is a character vector of gene identifiers.
phostgen=as.matrix(read.table('phostgen.txt',header=F,sep='\t')) #reads in the desired phage-host association matrix. Converts to matrix format, rather than keeping as a data.frame object.
chmatg20=chmakeg(cluster=cluster20, hostmat=phostgen, presmat=pres35) #Note that this function uses the same genome-gene presence-absence matrix as in step 4a above.
b) Calculate the mutual information between clusters and hosts using the cluster-host matrix as input.
The function micalc() will output the mutual information between clusters and hosts by running micalc(chmatg20). This function will return an error, however, if any of the rows or columns contain only 0s. This can be corrected (or prevented) by removing any all-zero rows and columns with the function matclean().
c) Maximize mutual information between clusters and hosts using mimax()
The mimax() function takes as input the cluster-host matrix, the original genome-gene presence-absence matrix, the MCL cluster output (after processing in step 6a), and the number of iterations to run the algorithm. E.g.
mimat20=mimax(chmat=chmatg20, origpres=pres35, gmcl=cluster20, nsteps=7000)
In this example, we would run mimax for 7000 iterations, because that was observed to be sufficient (for our data) to achieve a stable mutual information. mimax() returns results as a list object, with:
mat = the reduced cluster-host matrix following all mimax iterations
val = the time series of how mutual information changed over the course of running mimax
pres = the reduced genome-gene presence-absence matrix, containing only those genes present in the reduced set of MCL clusters following mimax
genelist = a character vector of the reduced set of genes following mimax (redundant with the column names of the output from 'pres' above).
Step Seven: Generate host predictions from a cluster-host matrix (when testing phage already in the network)
This step can be carried out using any cluster-host matrix as input. In the accompanying manuscript, we did this with the original cluster-host matrix and also following mimax.
The function hostpred() (found in hostpred.r) identifies the MCL clusters corresponding to the genes in a single phage's genome and calculates
a score for each host based on how often that host is associated with the associated MCL clusters in the rest of the network. When testing predictions
for phage already contained in the network, it takes a given phage's accession as input, along with a matrix called "mclmat" where the rows are numbered ids
for each MCL cluster and the columns are the individual genes. 1s indicate that gene j is found inside MCL cluster i. hostpred() also requires the original
presence-absence matrix, the desired cluster-host matrix, the phage-host association matrix (from what is known a priori). The option "corr = T"
is used when testing a phage already contained within the network. This uses the function hostselfcorr() to adjust the resulting host score vector to subtract away
the contribution of a phage to its own prediction.
a) Create the MCL-gene matrix, mclmat
The following syntax will generate an appropriate MCL-gene matrix, mclmat (continuing with the same variables as used in previous steps):
genelist=unique(unlist(cluster20)) #extracts all gene identifiers from the dataset
mclmat=array(0,dim=c(length(cluster20),length(genelist))) #sets up an empty array
colnames(mclmat)=genelist #sets up column and row names
rownames(mclmat)=paste("mcl",c(1:length(cluster20)),sep="_")
for(i in 1:length(cluster20)){mclmat[i,match(cluster20[[i]],genelist)]=1} #fills in the matrix, iterating through each MCL cluster
b) Generating a single host prediction (for the first phage, T4, in the "phagelist") at the genus level:
pred=hostpred(acc=phagelist[1],presmat=mimat$pres,mimat=mimat$mat,mclmat=mclmat,phosts=phostgen,corr=T)
#this will return the vector of scores. To get the host with the highest score, do:
bestpred=names(which.max(pred))
(To generate predictions for every phage in phagelist, we did a simple for loop and stored the host with highest scores for each in a character vector.)
Step 8: Generating predictions for a novel phage genome
We restricted our test of novel phage to phages outside our original dataset that were already annotated as infecting one of the hosts infected
by a phage in the network. That said, any phage could presumably be tested with hostpred. In order to test a new phage, its genome should first
be downloaded in FAA format, if available. If unavailable, gbk2faa() can be used to convert the genbank flatfile into FAA format. This FAA file
is then blasted using command line blastp against a blast database of the centroid sequence for each cluster in the network.
a) Create the centroid blast database
#First build a file with all concatenated centroids. From within the directory containing each of the usearch gene clusters (here 'phageuclust35'):
filelist=list.files()
allcentroids=character()
for(i in 1:length(filelist))
{
tempseq=readLines(filelist[i])
stop=grep(">",tempseq)[2]-1 #grep finds all of the gene positions in the file. The centroid is the first sequence, so this determines where it ends.
allcentroids=c(allcentroids,paste('>',filelist[i],sep='')) #names the centroid sequence by the usearch cluster ID, not by the original gene ID
allcentroids=c(allcentroids,tempseq[2:stop]) #fills in the centroid sequence
}
write.table(allcentroids,'../allcentroids35.fa',row.names=F,col.names=F,quote=F)
#Use the concatenated centroids to make a local blast database:
makeblastdb -in allcentroids35.fa -dbtype prot -out ./centroidDB/centroidDB -parse_seqids
b) Blast each novel phage FAA file against the centroidDB using blastp. If all FAA files are contained in the same folder, this is readily done with a unix for loop
#Create a directory to contain blast output. Then navigate to directory containing each of the FAA files. The following will return csv output for each blast
for file in $( ls )
do
blastp -query $file -db ../centroidDB/centroidDB -out ../phagesamp500blast/$file.blast -outfmt 11
done
c) Use the unique blast hits for each phage to create a row vector to append to the genome-gene presence-absence matrix. This matrix is then
in a format that can be run with the original hostpred() function. The following is a block of R code that will return predictions for every novel
phage in our sample of 500 that could be tested (182 of the original 500 downloaded had hosts that were present in our existing network):
#Starting from within the directory of blast output:
blastlist=list.files()
blastaccs=unlist(strsplit(blastlist,'.faa.blast')) #recover original accession IDs for the phage genome from the blast output filename
genelist=colnames(mimat$pres) #restricts analysis to the genes still present after running mimax
predvec=character()
for(i in 1:length(blastlist))
{
tempblast=as.matrix(read.csv(blastlist[i],header=F)) #reads in the blast results
unihits=unique(tempblast[which(tempblast[,1]]<.00001&tempblast[,12]>=50),2]) #returns the unique MCL clusters corresponding to hits meeting maximum E-value and minimum bit score thresholds
if(length(unihits)==0){predvec[i]="NA"} #if there are no hits, the phage contains no genes present in the centroidDB (all of the blast results were insignificant)
if(length(unihits)>0) #If there are hits, then the following is run:
{
temppres=rbind(mimat$pres,rep(0,length(mimat$pres[1,]))) #a new empty row is added to the presence-absence matrix
rownames(temppres)[914]=blastaccs[i] #it is named according to the new phage's accession
gmatch=match(unihits,genelist) #the unique hits are matched to the list of genes (using usearch ids)
if(length(gmatch[! is.na(gmatch)])>0) #if at least one unique hit corresponds to a gene in the mimax-reduced network, then:
{
temppres[914,gmatch[! is.na(gmatch)]]=1 #matching genes are indicated with a 1 in the new row
predvec[i]=names(which.max(hostpred(blastaccs[i],temppres,mimat$pres,mclmat,phostgen,corr=F)$host)) #hostpred() is run as normal, using the new phage accession as input and not correcting for self-contribution (since there is none)
}
if(length(gmatch[! is.na(gmatch)])==0){predvec[i]="NA"} #if none of the unique hits matches a gene in the mimax-reduced set, then return 'NA'
}
}
##Additional Code used in the Manuscript##
#Determining the ICCC values for a given network and set of MCL clusters.
The function findICCC() determines the "intracluster clustering coefficient" as described in Lima-Mendez 2008. It is available in the GeneNetworkRscripts.r file.
#Creating a regression plot with confidence intervals with CIplot() (in CIplot.)
#CIplot() takes as input a two-column matrix without headers, where column 1 corresponds to the x axis, and 2 to the y axis. It then runs and plots
#a linear regression of y~x including confidence bands set at a given level (using the level argument). x and y labels can be specified with xlabel and ylabel options.
#by default, it just creates a plot, not an output file, but if out=T, it will create a pdf output with filename set by outname.
----------------------------------------------------------------------------------------
Part Two: Comparing Gene Network Topology to Host Relatedenss
In community ecology, MPD stands for "mean phylogenetic distance" between all pairs of taxa observed in a community and is a measure of diversity that accounts for the relatedness among species and can also be adjusted to account for relative abundances. In the example (and associated manuscript), we calculate MPD using the function mpd() from the R package 'picante' with the option abundance.weighted=TRUE. As inputs, we provide a genus-level tree for the bacterial hosts associated with the phage genomes, and a data matrix where the rows are either phage genes or genomes (depending on the application) and the columns correspond to the unique host genera associated with the phage genomes. In this part of the walkthrough, we show: 1) How to generate a species-level tree for representative hosts associated with the phage genomes using a set of universal single-copy bacterial genes; 2) how to collapse the nodes of the species-level tree to create a genus-level tree; 3) how to create host association vectors for each phage gene (or genome by summing across genes); and 4) how to combine steps (2) and (3) as inputs for mpd() to estimate the diversity of hosts affiliated with each phage gene using mean phylogenetic distance.
Step One: Building a species-level tree of bacteria
a) Generate a table of phage-host associations. We previously did this as part of building a phage gene network, but now require representative host sequences to use for building a tree. We chose a representative sequence for each host species, using the exact genome used to isolate phage wherever possible. The file phagehostspecacc.txt (in Input Data) contains the phage accession, host species name, and representative host accession.
b) Create a FASTA file for a set of marker genes and build a BLAST database. We relied on the set of 24 single-copy universal genes previously used in Lang et al (2013). We used the corresponding sequence for each from E. coli (NC_000913) and created a BLAST database using makeblastdb from a FASTA file with each representative sequence. (Available as markerdb.tar.gz in the Input Data directory.)
c) BLAST all host genomes against the marker database to find the genes of interest. We used tblastx and retained the top hit from each genome to each marker gene.
d) Align each marker gene from across hosts. We made a FASTA file for each marker gene containing the associated sequence from each bacterial genome. Each file was then aligned using MUSCLE. (Other aligners such as MAFFT are also suitable.)
e) Concatenate the individual marker gene alignments into a single file. In the example, bacterial accessions are relabeled to remove any marker gene info (e.g. NC_000913_infB --> NC_000913) in each alignment. We then use a script, alignpaste(), to concatenate each alignment (available in R Scripts directory). alignpaste() takes in two FASTA files with the same headers (need not be in the same order) and concatenates the sequences within each header. Used here, one would first navigate to a folder containing all of the individual MUSCLE results (relabeled as above, if necessary, to only contain the genome accession). Then:
filelist=list.files() #create a list of each MUSCLE alignment file
file1=readLines(filelist[1]) #read in the first alignment file
file2=readLines(filelist[2]) #read in the second alignment file
alignfile=alignpaste(file1,file2) #use alignpaste() to concatenate them
for(i in 3:length(filelist)) #iterate through remaining alignments and use alignpaste to update the concatenated file
{
file2=readLines(filelist[i])
alignfile=alignpaste(alignfile,file2)
}
Then use write.table(alignfile, ... ) to save the concatenated alignment. Ours can be found as aligncat.fa in the Input Data folder.
f) Build a tree from the concatenated alignment. Any preferred software for generating a phylogeny can be used here. We used FastTree, but RAxML or other tools are also suitable.
The end result after step F will be a species-level tree of the representative host sequences, labeled according to the GenBank accession for these bacteria.
Step Two: Build a genus-level tree by collapsing nodes in the species tree that are from the same bacterial genus.
a) Relabel species tree by species name. E.g.:
require(ape)
hostmatch=as.matrix(read.csv("HostIDmatch.csv")) #read in a file that matches up the host names to the host accessions
tree=read.tree("hostspectree.tree") #read in the tree file
tipmatch=match(tree$tip.label,hostmatch[,2]) #match tree tips to the host accessions
treerelab=tree #create a new duplicate tree
treerelab$tip.label=hostmatch[tipmatch,1] #relabel the tips of the duplicate tree by species name, instead of accession
b) Extract the genus from the species name (note: in the example data, we assigned Chlamydophila to Chlamydia for purpose of collapsing nodes)
tiplabels=treerelab$tip.label
labelsplit=strsplit(tiplabels, " ") #Separates the 'words' in the labels, assuming a single space. Some outputs from earlier steps may use underscores and require "_" to split.
tipgenus=character()
for(i in 1:length(labelsplit)) #iterate through each tip and record just the first word (i.e. the genus)
{
tipgenus[i]=labelsplit[[i]][1]
}
c) Create genus tree by collapsing nodes with the same genus using the function drop.tip() from the package ape.
gentree=treerelab #create a copy of the tree relabeling each tip by the genus
#To correct for Chlamydophila in the example, run: gentree$tip.label[grep("Chlamydophila",tipgenus)]="Chlamydia"
genlist=unique(gentree$tip.label) #returns unique genus names present
for(i in 1:length(genlist)) #iterate through each unique genus, find the species-level labels that match in the original tree, and replace with the genus
{
gentree$tip.label[grep(genlist[i],gentree$tip.label)]=genlist[i]
}
#It is a good idea to plot the tree at this stage to make note of any misassigned nodes. For instance, one of the Burkholderia's in our sample gets misassigned to Mycobacterium and has an unusually long branch. We manually drop this tip (which corresponds to Burkholderia cepacia, associated with a single phage in the original dataset).
#to drop this particular tip, we find it first in the species tree and then drop since tip order is preserved
#gentree=drop.tip(gentree,grep("Burkholderia cepacia",treerelab$tip.label))
#Second, find all of the non-unique tips (there is no need to collapse tips that are unique)
temptree=gentree #create a new temporary variable to modify, rather than modifying the original
tipcount=numeric()
for(i in 1:length(genlist)){tipcount[i]=length(which(temptree$tip.label==genlist[i]))} #iterate through each genus, and record how many times it appears in the species tree.
tiplist=genlist[tipcount>1] #record the genera that appear more than once
#Third, iterate through each of the non-unique tip labels, collapse each, but saving the shared internal node, and relabel this node by the shared genus name
for(i in 1:length(tiplist))
{
tips=grep(tiplist[i],temptree$tip.label)
while(length(tips)>1) #because of tree topology, some genera may need to be collapsed multiple times. The while loop iterates until each is reduced to a single tip. The option "collapse.singles=FALSE" prevents loss of clades.
{
temptree=drop.tip(temptree,tips,trim.internal=FALSE,collapse.singles=FALSE)
numlabel=which(!is.na(as.numeric(temptree$tip.label))) #finds the new unlabeled tip by finding the one(s) only labeled by branch support
temptree$tip.label[numlabel]=tiplist[i]
tips=grep(tiplist[i],temptree$tip.label)
}
}
genustree=temptree #when complete, save the final genus tree in its own variable, if desired.
Export the final tree using write.tree(genustree, ... ). For visualization, we used iTOL (itol.embl.de).
Step Three: Creating host-association vectors
a) for genes
First, navigate to the directory containing each of the gene clusters. Creating an n x k matrix, where n is the number of phage genes and k the number of unique host genera, requires the phage-host association matrix (phostgen elsewhere in this walkthrough) and the list of phage genes. If the presence-absence matrix (e.g. pres35) is already loaded, the column names give the list of genes.
genelist=colnames(pres35) #list of genes in the network (can be recovered in a number of other ways, as well.)
unihosts=unique(phostgen[,2]) #list of unique hosts; can also use the tip labels from the tree if loaded.
ghostmat=array(0,dim=c(length(genelist),length(unihosts))) #initialize an empty array
colnames(ghostmat)=unihosts #set the column and row names
rownames(ghostmat)=genelist
for(i in 1:length(genelist)) #iterate through each gene, and record the values output by the function hostvecreturn() (available in R Scripts directory as part of hostassign.r) for each gene
{
temp=hostvecreturn(genelist[i],phostgen)
ghostmat[genelist[i],names(temp)]=as.numeric(temp)
}
b) for genomes
For each genome, the host association vector is the sum of the host association vectors for each of the genes within the genome. This step requires both the presence-absence matrix and the ghostmat array from the prior step.
phagelist=rownames(pres35) #set up the empty array to contain the genome host assocation vectors
phostmat=array(0,dim=c(length(phagelist),length(unihosts)))
colnames(phostmat)=unihosts
rownames(phostmat)=phagelist
for(i in 1:length(phagelist)) #iterate through each phage genome, find the genes it contains, and sum across their entires in the gene host association array
{
tempgenes=which(pres35[i,]==1)
if(length(tempgenes)==1){phostmat[i,]=ghostmat[tempgenes,]}
if(length(tempgenes)>1){phostmat[i,]=apply(ghostmat[tempgenes,],2,sum)}
}
Step Four: Calculate the MPD for genes and genomes
Calculating MPD is identical and requires either the ghostmat or phostmat arrays (which stand in for the taxon abundance data arrays required by the mpd() function in picante).
Gene-level example:
gmpdvec=mpd(ghostmat, genustree, abundance.weighted=TRUE) #the option abundance.weighted is required to account for the different relative representation of each host.
Genome-level example:
pmpdvec=mpd(phostmat, genustree, abundance.weighted=TRUE)
Step Five: Comparing network topology to host phylogenetic relatedness
a) Average shortest path and network similarity between all pairs of genes in two phage genomes
1) Find all shortest paths between any two genes
gdist=distances(gnet) #using distances() function in igraph and the gnet object calculated previously
2) Find all pairwise similarity values for genes in the gene network
gsim=similarity(gnet) #using similarity() function in igraph and the gnet object calculated previously
3) Find the average shortest path distance or similarity between two genomes
pdist=array(0,dim=c(length(phagelist),length(phagelist))) #set up a symmetric matrix for average distance between genes of any two phage genomes in the dataset
colnames(pdist)=rownames(pdist)=phagelist
psim=pdist #set up identical matrix for mean similarity values
for(i in 2:length(phagelist)){for(j in 1:(i-1)) #iterate over all unique pairs; no need to double-calculate
{
genes1=which(pres35[i,]==1)
genes2=which(pres35[j,]==1)
pdist[i,j]=mean(gdist[genes1,genes2])
psim[i,j]=mean(gsim[genes1,genes2])
}}
pdist=pdist+t(pdist) #make the matrices symmetric
psim=psim+t(psim)
Note: Some nodes are unreachable and will have values equal to Inf in gdist, whereas all values in gsim will be in the closed interval [0,1]. Importantly, it is not possible for only some genes in a genome to be unreachable in the gene network by others in another genome. This is a direct result of constructing the network such that two genes are connected if they're found in the same genome. As a simple proof, suppose to the contrary that the distance between gene 1 in genome A to gene 2 in genome B is Inf. Now suppose that gene 3 in genome A is only 2 edges away in the network from gene 4 in genome B. By definition, an edge must connect gene 3 to gene 1 since they are in the same genome, and similarly an edge must connect gene 2 to gene 4. Thus, if gene 3 is 2 edges from gene 4, then gene 1 and gene 2 must be separated by at most 3 edges, not Inf, as assumed. This fact is important, because it means that if the average distance between two genomes is Inf, it is not simply an artifact resulting from one pair of genes being Inf apart. Rather it reflects that none of their genes are in the same connected component of the overall network.
b) Find the phylogenetic distance between all pairs of hosts, and create matrix where each {i, j} is the distance between the host of phage i and the host of phage j.
1) calculate all host pairwise phylogenetic distances using the function cophenetic() from the package 'ape'
treedist=cophenetic(genustree)
2) record distances between each of pair of phages' hosts by using the above and phostgen
#a syntactical trick is to assign the phage accessions to rownames for phostgen
#rownames(phostgen)=phostgen[,1]
#the following could also be merged into the for loop in Step 4, part 3 above to save some time if desired.
phosttreedist=array(0,dim=c(length(phagelist),length(phagelist)))
colnames(phosttreedist)=rownames(phosttreedist)=phagelist
for(i in 2:length(phagelist)){for(j in 1:(i-1)) #iterate through each unique phage pair
{
phosttreedist[i,j]=treedist[phostgen[phagelist[i],2],phostgen[phagelist[j],2]]
}}
phosttreedist=phosttreedist+t(phosttreedist) #make symmetric
The resulting phosttreedist matrix has the same dimensions as psim and pdist and can be used for direct comparison, such as through regression, of phage host relatedness to either measure of phage relatedness based on gene network topology. (Note: if comparing to pdist, it's necessary to ignore any values equal to Inf.)