-
Notifications
You must be signed in to change notification settings - Fork 1
/
shell.Rmd
206 lines (144 loc) · 10.3 KB
/
shell.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
# Running TreeSAPP on a server
```{r setup-shell, echo = FALSE, warning = FALSE, message = FALSE}
knitr::opts_chunk$set(eval = FALSE)
```
These instructions will help you to complete the script template `treesapp_analysis.sh` to run your TreeSAPP analysis of the Saanich Inlet data of your assigned depth on the server and copy the output files to your local machine. To adjust the code examples below and in the script template, replace any angle brackets, e.g. `<SOME TEXT>`, with the missing code.
## Resources for TreeSAPP analysis
- A shell script template called `treesapp_analysis.sh` on Canvas.
- Access to a server and `<server address>`.
- The Saanich Inlet data located in `/mnt/datasets`
## Checklist to write and run shell script
- [ ] Edit the `treesapp_analysis.sh` script following [the instructions below](#shell-writing).
- [ ] Using the terminal on your computer, copy the script to your group's server using:
```{bash}
scp <path to file on your computer> <user>@<server address>:<destination>
```
- [ ] Make the script executable on the server (otherwise you won't be able to run it). In the CLI and while in the directory containing your script, run the following command:
```{bash}
chmod u+x treesapp_analysis.sh
```
- [ ] Still in the CLI, start a new session with `screen -S <session name>`.
- [ ] In the new session, start the script with `./treesapp_analysis.sh`.
- [ ] After the script has run (will take at most 8 hours), locate the `final_output/marker_contig_map.tsv` files in the TreeSAPP output directories and copy each one to your local computer using `scp`.
After these steps, you will continue with the analysis of your results in R on your local computer using R and the provided `treesapp_analysis.R` script template.
## Writing the shell script {#shell-writing}
In the `treesapp_analysis.sh` script, we will define some variables to make our code more readable when running commands, give some feedback to the user as the script runs, and finally run the TreeSAPP workflow. Edit the script either on your local computer or after you have copied it to the server using the text editor of your choice (e.g. RStudio on your local machine or `nano` on the server).
### Defining the shell
The script template starts with:
```{bash}
#! bin/bash
```
and defines the shell used to interpret this script and where the shell is located (i.e. the `bash` shell in the `/bin` directory). You do not need to modify anything about that step.
### Setting a variable for your depth
Let's define our first variable with the name "depth". Replace `<your depth>` with the depth your group has been assigned to (e.g. `100`).
```{bash}
depth="<your depth>"
```
### Providing user feedback
It is good practice to provide feedback to the user while a script is running by printing messages to the terminal. Let's print a welcome message to the terminal using the `echo` command to informing the user what the script is going to do. This is a great opportunity to call the `depth` variable that we have just defined.
*Hint*: Remember, to call a variable, you have to prepend its name with a `$`, e.g. if the variable's name is `some_variable` then you call it with `$some_variable`.
```{bash}
echo "TreeSAPP analysis of Saanich Inlet Data at Depth <call depth variable> m"
```
### Defining variable for input directory
Define a variable for path to the input data directory. Let's call the variable `input_dir`.
```{bash}
<variable_name>="<path to data directory>"
```
### Creating output directory
Define a variable for the TreeSAPP output directory and choose a name for it yourself. The directory should be located in your home directory (which is also shared between your host and container). To refer to your home directory, use the environmental variable `HOME` which is automatically defined by the shell (i.e. you do not need to first define it yourself). Finally, create the output directory.
```{bash}
<set output directory variable>
<create the ouput directory>
```
### Reporting variables to user
Let's provide the user some feedback where TreeSAPP is going to save its output and what bind paths have been set.
```{bash}
<print command> "<Some text explaining what variables have been set>"
```
### Looping over multiple files
For-loops minimize the amount of redundant code and ensure all data is processed similarly.
Below is an example for-loop that could be used to classify metagenome sequences with treesapp assign.
```{bash}
for f in <path/to/assembly_files>/SI072_*m_contig.fa
do sample=$( basename $f | sed 's/_contig.fa//g')
treesapp assign -i $f \
--refpkg_dir <path_to_refpkgs/> \
--output <path/to/outputs>/${sample}_assign \
--trim_align -n 8
done
```
This loop iterates over the seven metagenome assembly files that match the pattern 'SI072_*m_contig.fa', where the asterisk represents any characters one or more times.
Effectively, it replaces seven nearly identical `treesapp assign` commands.
A variable called 'sample' is created by calling `basename` on the file path to return the filename, and using the `sed` (stream editor) to remove the assembly-specific suffix '_contig.fa'.
The name of the output directory is based on this 'sample' variable to ensure the directories are unique across metagenomes.
This for-loop can be modified to loop over `treesapp abundance` commands as well (assuming you've followed the above naming convention for the `treesapp assign` output directory):
```{bash}
for f in <path/to/outputs>/SI072_*assign
do sample=$( basename $f | sed 's/_assign//g')
treesapp abundance \
--treesapp_output $f \
--reads <path/to>/MetaG_Trim_QC_Reads/${sample}_pe.1.fq.gz \
--reverse <path/to>/MetaG_Trim_QC_Reads/${sample}_pe.2.fq.gz \
-n 8 \
--report update
done
```
### Listing TreeSAPP version
Let's print the information about the configuration of TreeSAPP.
```{bash}
treesapp info
```
*Hint*: When we have very long commands in a script, we can introduce line breaks using `\` which are best used right after a space in the command (see example above after the word "following").
### Executing TreeSAPP analysis
In the terminal (i.e. not as part of this script), check the documentation for `treesapp assign`, the command you will use for the automated reconstruction of a nutrient cycle (visit its [wikipage](https://github.com/hallamlab/TreeSAPP/wiki/Classifying-sequences-with-treesapp-assign) to learn more).
```{bash}
treesapp assign -h
```
Looking at the TreeSAPP documentation, determine the flags you need in addition to the ones already included below.
Time to run our analysis! You will have to identify the input file(s) that you need to process. You will run the analysis once for metagenomic (two files, forward reads are in the file that contains `.1` in its name, and reverse reads in file the with `.2`) and once for metatranscriptomic reads (a single file that contains both forward and reverse reads) and align them to the assembled contigs. All reads are pair-end.
You will need to provide the following settings using flags:
- Which assembly file to use.
- To use 8 CPUs for the analysis.
- To do the analysis for all marker genes provided in TreeSAPP.
- To output a verbose runtime log.
- Where to save the results.
- To calculate the rpkm values for detected sequences.
- Which reads to use and what type they are (i.e. pair-end or single-end).
- To delete any intermediate files generated by the analysis.
- To use position masking of multiple sequence alignment.
*Hints*:
- Will you ouput the data for metagenomic and metatranscriptomic analysis to the same folder? Does the folder already exist?
- If you want the default value of a flag, then you do not need to provide it.
- Where is the input data located in the singularity container?
- Metatrascriptomic reads are **NOT** rRNA.
- In scripts you often refer to variables while combined with additional text. Let's say you have defined the variable `file_name="some_file"` to refer to the file `some_file`. Now you want to save a modified version of the file as `some_file_modified`. If you would write `$file_name_modified`, then the shell does not know that you just want to call a variable called `$file_name` and instead will assume you are calling a variable called `$file_name_modified` and will return an empty string. To indicate the beginning and end of a variable name, use curly braces. In the above example, you would use `${file_name}_modified`.
Optional bonus challenge:\
Earlier we defined the `depth` variable. Could you use it in some way when you define which assembly and reads you are using in the analysis?
```{bash}
# Metagenomic analysis
treesapp assign \
-i <input data dir><file path to your assembly> \
-m <choose the correct option> \
<number of CPU threads, set to 8>
<your output directory> \
<your remaining settings>
# Metatranscriptomic analysis
<you are on your own now :)>
```
### Concatenating classification tables
Once the `treesapp assign`, `treesapp abundance` and `treesapp layer` commands have completed successfully it is convenient to store classification data from all samples in a single table.
This one table can then be downloaded from the server and loaded into R.
You can combine all tables with two lines of code in bash, using `cat`, `head`, and `tail`.
The name of the combined table for this example is 'SI072_classifications.tsv'.
This first line writes *just* the header from one of the classification tables to the combined table while the second writes the rows (i.e. not including the column names) to the combined table.
This ensures the final table is properly formatted, i.e., one row containing the column names and the remaining rows containing data.
```{bash}
cat /data/SI072_*m_assign/final_outputs/classifications.tsv | head -n 1 >SI072_classifications.tsv
tail -q -n +2 /data/SI072_*m_assign/final_outputs/classifications.tsv >>SI072_classifications.tsv
```
The above command will work for both the classifications.tsv file created by `treesapp assign` or the layered_classifications.tsv file created by `treesapp layer` if the file name is changed from 'classifications.tsv' to 'layered_classifications.tsv' in the commands above.
```{r eval = TRUE, echo = FALSE, warning = FALSE, message = FALSE}
# Reset code chunk settings to `default_chunk_options` (defined in `index.Rmd`)
knitr::opts_chunk$set(default_chunk_options)
```