---
title: "Marine Genomics"
author:
date:
output:
bookdown::html_book:
toc: yes
css: toc.css
---
```{r setupq1, include=FALSE}
knitr::opts_chunk$set(comment = "#>", echo = TRUE, fig.width=6)
```
# Week 1- Welcome!
Welcome to Marine Genomics Spring 2022 at UC Davis!
You will find the lecture for week one [here](https://github.com/BayLab/MarineGenomics/blob/main/ppt/MarineGenomics_Lecture1.pdf)
## Introduction to shell computing via the Data Carpentry tutorial
We will be following the Data Carpentry tutorial (Copyright 2016 @ Software Carpentry) "Introduction to the command line for genomics". We have made some modifications to the Data Carpentry tutorial to fit our course.
What is a shell and why should I care?
A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination.
There are many reasons to learn about the shell:
+ Many bioinformatics tools can only be used through a command line interface, or have extra capabilities in the command line version that are not available in the GUI. This is true, for example, of BLAST, which offers many advanced functions only accessible to users who know how to use a shell.
+ The shell makes your work less boring. In bioinformatics you often need to do the same set of tasks with a large number of files. Learning the shell will allow you to automate those repetitive tasks and leave you free to do more exciting things.
+ The shell makes your work less error-prone. When humans do the same thing a hundred different times (or even ten times), they’re likely to make a mistake. Your computer can do the same thing a thousand times with no mistakes.
+ The shell makes your work more reproducible. When you carry out your work in the command-line (rather than a GUI), your computer keeps a record of every step that you’ve carried out, which you can use to re-do your work when you need to. It also gives you a way to communicate unambiguously what you’ve done, so that others can check your work or apply your process to new data.
+ Many bioinformatic tasks require large amounts of computing power and can’t realistically be run on your own machine. These tasks are best performed using remote computers or cloud computing, which can only be accessed through a shell.
In this lesson you will learn how to use the command line interface to move around in your file system.
## How to access the shell
For this course we will be using the shell in our Jetstream allocation through XSEDE. Jetstream is a cloud computing resource on which we have been allocated time for the purposes of this course. Below is a guide for accessing and using Jetstream.
In Jetstream we launch what is called an "instance", a small allocation that specifies how much memory you need and reflects how much computing you might do (we'll guide you through this).
You'll find the Jetstream login [here](https://use.jetstream-cloud.org/application).
Navigate to the login with XSEDE tab in the upper right.
```{r one, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig1.png")
```
```{r two, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig2.png")
```
This will redirect you to the XSEDE login page. Your organization should say XSEDE.
Click continue.
```{r three, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig3.png")
```
The username for our course is **margeno**. The password will be given out in class. If you missed it, please contact Serena Caplins ([email protected]) or Maddie Armstrong ([email protected]).
```{r four, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig4.png")
```
Once you've logged in you should be redirected to Jetstream.
You now need to create your own project folder. This is where you will carry out all of your analyses for the course.
Everyone gets one project folder, so please make only one. If you make a mistake you can delete it and start again.
```{r five, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig5.png")
```
```{r six, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig6.png")
```
```{r seven, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig7.png")
```
Once you have a project folder we can create our first instance. Go to the "New" tab and select "instance".
```{r eight, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig8.png")
```
Select the RStudio Desktop Shiny Server instance
```{r nine, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig9.png")
```
Select a small instance size if it isn't already selected (should be the default).
```{r ten, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig10.png")
```
It will take several minutes (5-10) to build our instance. Once it says active it will take a few more minutes to deploy.
```{r eleven, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig11.png")
```
```{r twelve, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig12.png")
```
Once it's ready to go you should see several actions on the right side of the screen including report, suspend, stop, etc.
Select open web desktop.
```{r thirteen, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig13.png")
```
You should see something like the little desktop above. It's not pretty but this is where we'll be spending a lot of time.
Select the black box on the bottom menu bar to access the command line. It will open a new window that has the `$` prompt.
```{r fourteen, echo=FALSE, out.width = '100%'}
knitr::include_graphics("./figs/jetstream/Fig14.png")
```
## Week 1 Objectives
Questions to Answer:
+ How can I perform operations on files outside of my working directory?
+ What are some navigational shortcuts I can use to make my work more efficient?
Main Tasks:
+ Use a single command to navigate multiple steps in your directory structure, including moving backwards (one level up).
+ Perform operations on files in directories outside your working directory.
+ Work with hidden directories and hidden files.
+ Interconvert between absolute and relative paths.
+ Employ navigational shortcuts to move around your file system.
## Navigating your file system
The part of the operating system responsible for managing files and directories is called the file system. It organizes our data into files, which hold information, and directories (also called “folders”), which hold files or other directories.
Several commands are frequently used to create, inspect, rename, and delete files and directories.
```html
$
```
The dollar sign is a prompt, which shows us that the shell is waiting for input; your shell may use a different character as a prompt and may add information before the prompt. When typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.
Let’s find out where we are by running a command called `pwd` (which stands for “print working directory”). At any moment, our current working directory is our current default directory, i.e., the directory that the computer assumes we want to run commands in unless we explicitly specify something else. Here, the computer’s response is `/home/margeno`, which is our home directory on the cloud system:
```html
$ pwd
```
```html
/home/margeno
```
Let’s look at how our file system is organized. We can see what files and subdirectories are in this directory by running `ls`, which stands for “listing”:
```html
$ ls
```
```html
Desktop Documents Downloads Music Pictures Public Templates Videos
```
`ls` prints the names of the files and directories in the current directory in alphabetical order, arranged neatly into columns. We’ll make a new subdirectory called `MarineGenomics`, inside which we will create further subdirectories throughout this workshop.
To make a new directory, type the command `mkdir` followed by the name of the directory, in this case `MarineGenomics`.
```html
$ mkdir MarineGenomics
```
Check that it's there with `ls`:
```html
$ ls
```
```html
Desktop Documents Downloads MarineGenomics Music Pictures Public Templates Videos
```
The command to change locations in our file system is cd, followed by a directory name to change our working directory. cd stands for “change directory”.
Let’s say we want to navigate to the MarineGenomics directory we saw above. We can use the following command to get there:
```html
$ cd MarineGenomics
```
```html
$ pwd
```
```html
/home/margeno/MarineGenomics
```
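The `mkdir`, `ls`, `cd`, and `pwd` steps above can be rehearsed safely in a throwaway location. The following is a minimal sketch, written as a script (so no `$` prompt is shown) and assuming the `mktemp` utility is available; the `scratch` and `current` variable names are ours, not part of the course material.

```bash
# Work in a throwaway directory so nothing in your home directory changes.
scratch=$(mktemp -d)
cd "$scratch"

mkdir MarineGenomics        # create the new directory
ls                          # lists: MarineGenomics
cd MarineGenomics           # move into it
pwd                         # prints a path ending in /MarineGenomics

# Keep only the last component of the path so the result is easy to read.
current=$(basename "$(pwd)")
echo "Now inside: $current"
```

The full path printed by `pwd` will differ from machine to machine, but its last component should always be `MarineGenomics`.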
Use `ls` to see what is inside MarineGenomics
```html
$ ls
```
```html
```
It should be empty because we just created it and haven't put anything in it yet. Let's download some data to work with. We'll put it in our MarineGenomics directory.
There are two main ways to do this. We can use the command `wget`, which needs a link to the file that we want to download. If there's a file saved on a website somewhere (anywhere on the internet), `wget` will download it for you. If our data is on GitHub, which is where most of our data will be stored, we can instead use the command `git clone`. In this example we're going to download all the material in the MarineGenomicsData folder and save it in a folder called data.
```html
$ wget https://raw.githubusercontent.com/BayLab/MarineGenomicsData/main/week1_quarter.tar.gz
# use the tar command to uncompress the file. This will also automatically make a week1 folder
$ tar -xzvf week1_quarter.tar.gz
```
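If the `tar` flags feel opaque, it can help to try them on an archive you built yourself. The sketch below downloads nothing: it creates a tiny stand-in archive (the names `demo` and `notes.txt` are hypothetical, not the course data), lists its contents with `-t`, and extracts it with `-x`. In each command `-z` handles gzip compression and `-f` names the archive file.

```bash
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

# Build a tiny stand-in archive (hypothetical names, not the course data).
mkdir demo
echo "hello" > demo/notes.txt
tar -czf demo.tar.gz demo   # -c create, -z gzip, -f archive file name
rm -r demo                  # remove the original so extraction is visible

# Peek inside without extracting (-t), then extract (-x).
tar -tzf demo.tar.gz
tar -xzf demo.tar.gz

extracted=$(cat demo/notes.txt)
echo "Recovered: $extracted"
```

Listing with `-t` before extracting is a useful habit, since it shows which directories an archive will create.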
Now that we have something in our MarineGenomics directory we can use the `ls` command a bit more.
We can make the ls output more comprehensible by using the flag -F, which tells ls to add a trailing / to the names of directories:
```html
$ ls -F
```
```html
data/
```
Great, it's there! Let's cd into the data directory and then use ls to see what's in that directory.
```html
$ cd data
$ ls -F
```
```html
README.md Week1/ Week2/
```
Anything with a “/” after it is a directory. Things with a “*” after them are programs. If there are no decorations, it’s a file.
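To see all three cases side by side, here is a small sketch that creates one directory, one plain file, and one executable file in a scratch directory, then lists them with `ls -F`. The file names are made up for illustration.

```bash
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

mkdir reads                 # a directory
touch notes.txt             # a plain file
touch run.sh
chmod +x run.sh             # mark run.sh as executable, i.e. a program

# Capture the listing; when captured, ls prints one entry per line.
listing=$(ls -F)
echo "$listing"
```

The listing should show `reads/` with a trailing slash, `run.sh*` with a trailing asterisk, and `notes.txt` with no decoration.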
`ls` has lots of other options. To find out what they are, we can type:
```html
$ man ls
```
`man` (short for manual) displays detailed documentation (also referred to as a man page or man file) for `bash` commands. It is a powerful resource for exploring bash commands and understanding their usage and flags. Some manual files are very long. You can scroll through the file using your keyboard’s down arrow or use the Space key to go forward one page and the `b` key to go backwards one page. When you are done reading, hit `q` to quit.
Use the `-l` option for the `ls` command to display more information for each item in the directory. What is one piece of additional information this long format gives you that you don’t see with the bare `ls` command?
```html
$ ls -l
```
No one can possibly learn all of these arguments; that’s what the manual page is for. You can (and should) refer to the manual page or other help files as needed.
Let’s go into the Week1 directory and see what is in there.
```html
$ cd Week1
$ ls -F
```
```html
SRR6805880_1.fastq SRR6805880_2.fastq
```
This directory contains two files with .fastq extensions. FASTQ is a format for storing information about sequencing reads and their quality. We will be learning more about FASTQ files in a later lesson.
## Shortcut: Tab Completion
Typing out file or directory names can waste a lot of time and it’s easy to make typing mistakes. Instead we can use tab complete as a shortcut. When you start typing out the name of a directory or file, then hit the Tab key, the shell will try to fill in the rest of the directory or file name.
Return to your home directory:
```html
$ cd
```
then enter
```html
$ cd Mar<tab>
```
The shell will fill in the rest of the directory name for `MarineGenomics`.
Now change directories to `Week1` in `data` in `MarineGenomics`
```html
$ cd MarineGenomics
$ cd data
$ cd Week1
```
Using tab complete can be very helpful. However, it will only autocomplete a file or directory name if you’ve typed enough characters to provide a unique identifier for the file or directory you are trying to access.
For example, if we now try to list the files whose names start with SR by using tab complete:
```html
$ ls SR<tab>
```
The shell auto-completes your command to `SRR6805880_`, because all file names in the directory begin with this prefix. When you hit Tab again, the shell will list the possible choices.
```html
$ ls SRR68<tab><tab>
```
```html
SRR6805880_1.fastq SRR6805880_2.fastq
```
Tab completion can also fill in the names of programs, which can be useful if you remember the beginning of a program name.
```html
$ pw<tab><tab>
```
```html
pwck pwconv pwd pwdx pwunconv
```
This displays the name of every program that starts with `pw`.
## Summary & Key Points
We now know how to move around our file system using the command line. This gives us an advantage over interacting with the file system through a GUI as it allows us to work on a remote server, carry out the same set of operations on a large number of files quickly, and opens up many opportunities for using bioinformatic software that is only available in command line versions.
In the next few episodes, we’ll be expanding on these skills and seeing how using the command line shell enables us to make our workflow more efficient and reproducible.
+ The shell gives you the ability to work more efficiently by using keyboard commands rather than a GUI.
+ Useful commands for navigating your file system include: ls, pwd, and cd.
+ Most commands take options (flags) which begin with a -.
+ Tab completion can reduce errors from mistyping and make work more efficient in the shell.
\newpage
## Navigating Files and Directories
This continues the shell module from Data Carpentry's introduction to the shell, which can be found at https://datacarpentry.org/shell-genomics/02-the-filesystem/index.html
## Moving around the file system
We’ve learned how to use `pwd` to find our current location within our file system. We’ve also learned how to use cd to change locations and ls to list the contents of a directory. Now we’re going to learn some additional commands for moving around within our file system.
Use the commands we’ve learned so far to navigate to the `MarineGenomics/data/Week1` directory, if you’re not already there.
```html
$ cd
$ cd MarineGenomics
$ cd data
$ cd Week1
```
What if we want to move back up and out of this directory and to our top level directory? Can we type cd MarineGenomics? Try it and see what happens.
```html
$ cd MarineGenomics
```
```html
-bash: cd: MarineGenomics: No such file or directory
```
Your computer looked for a directory or file called MarineGenomics within the directory you were already in. It didn’t know you wanted to look at a directory level above the one you were located in.
We have a special command to tell the computer to move us back or up one directory level.
```html
$ cd ..
```
Now we can use `pwd` to make sure that we are in the directory we intended to navigate to, and ls to check that the contents of the directory are correct.
```html
$ pwd
/home/margeno/MarineGenomics/data
```
```html
$ ls
README.md Week1 Week2
```
From this output, we can see that `..` did indeed take us back one level in our file system.
You can chain these together like so:
```html
$ ls ../../
```
This prints the contents of `/home/margeno`, the directory two levels above `data`.
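The effect of `..` is easiest to see on a small directory tree of your own. This sketch rebuilds the `MarineGenomics/data/Week1` layout (empty, inside a scratch directory) and moves up through it; the variable names are ours.

```bash
# Recreate the course layout, empty, inside a throwaway directory.
scratch=$(mktemp -d)
mkdir -p "$scratch/MarineGenomics/data/Week1"
cd "$scratch/MarineGenomics/data/Week1"

cd ..                       # up one level: now in .../MarineGenomics/data
one_up=$(basename "$(pwd)")

cd ../..                    # up two more levels: now in the scratch directory
two_more=$(ls)              # the only entry here is MarineGenomics

echo "After 'cd ..':    $one_up"
echo "After 'cd ../..': directory contains $two_more"
```

Each `..` climbs exactly one level, so `../..` climbs two, whether it is given to `cd` or to `ls`.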
Let's explore a few more properties of the `ls` command.
Go back to your home directory.
```html
$ cd
```
Let's look at some of the options for the `ls` command using the `man` command (note this will print out several lines of text):
```html
$ man ls
```
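The manual page says that the `-a` option (short for "all") causes `ls` to “not ignore entries starting with .”. Files and directories whose names start with `.` are hidden from a plain `ls`, so this is the option we want for viewing them.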
```html
$ ls -a
. .ICEauthority .Rhistory .ansible .bash_logout .cache .dbus .fontconfig .local .r .ssh .wget-hsts Desktop Downloads Pictures Templates
.. .Renviron .Xauthority .bash_history .bashrc .config .emacs.d .gnupg .profile .rstudio-desktop .vnc .xsession-errors Documents Music Public Videos
```
You'll see there are many more files shown now that we can look at the hidden ones.
For most commands, flags can be combined in any order to obtain the desired output.
```html
$ ls -Fa
$ ls -laF
```
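A quick way to convince yourself of both points (hidden files and flag order) is to try them on files you created. The following is a sketch in a scratch directory, assuming a Linux system like our course instance; the file names are invented for the example.

```bash
# Work in a throwaway directory containing one visible and one hidden file.
scratch=$(mktemp -d)
cd "$scratch"
touch visible.txt .hidden.txt

plain=$(ls)                 # hidden entries are skipped
all=$(ls -a)                # includes ., .., and .hidden.txt

# Flag order does not matter: -aF and -Fa produce the same listing.
order_one=$(ls -aF)
order_two=$(ls -Fa)

printf 'ls shows:\n%s\n' "$plain"
printf 'ls -a shows:\n%s\n' "$all"
```

Note that `ls -a` also lists `.` (the current directory) and `..` (its parent), which exist in every directory.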
## Examining the contents of other directories
By default, the `ls` command lists the contents of the working directory (i.e. the directory you are in). You can always find the directory you are in using the `pwd` command. However, you can also give `ls` the names of other directories to view. Navigate to your home directory if you are not already there.
```html
$ cd
```
Then enter the command:
```html
$ ls MarineGenomics
data
```
This will list the contents of the MarineGenomics directory without you needing to navigate there.
The `cd` command works in a similar way.
Try entering:
```html
$ cd
$ cd MarineGenomics/data
```
This will take you to the `data` directory without having to stop in the intermediate `MarineGenomics` directory.
Navigating practice
Navigate to your home directory. From there, list the contents of the Week1 directory.
```html
$ cd
$ ls MarineGenomics/data/Week1
SRR6805880_1.fastq SRR6805880_2.fastq
```
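One detail worth checking for yourself is that listing another directory does not move you there. This sketch uses a scratch copy of the directory layout with a made-up file name (`sample_1.fastq`) and records the working directory before and after the `ls`.

```bash
# Build a small stand-in tree with one hypothetical file in Week1.
scratch=$(mktemp -d)
mkdir -p "$scratch/MarineGenomics/data/Week1"
touch "$scratch/MarineGenomics/data/Week1/sample_1.fastq"
cd "$scratch"

before=$(pwd)
contents=$(ls MarineGenomics/data/Week1)   # look inside without moving
after=$(pwd)

echo "Week1 contains: $contents"
echo "Working directory unchanged: $after"
```

Only `cd` changes the working directory; `ls` with a path simply reports what is at that path.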
## Full vs Relative Paths
The `cd` command takes an argument which is a directory name. Directories can be specified using either a relative path or a full absolute path. The directories on the computer are arranged into a hierarchy. The full path tells you where a directory is in that hierarchy.
Navigate to the home directory, then enter the `pwd` command.
```html
$ cd
$ pwd
```
You should see:
```html
/home/margeno
```
This is the full name of your home directory. It tells you that you are in a directory called `margeno`, which sits inside a directory called `home`, which in turn sits inside the very top directory in the hierarchy. The very top of the hierarchy is a directory called `/`, which is usually referred to as the `root` directory. So, to summarize: `margeno` is a directory in `home`, which is a directory in `/`. More on `root` and `home` in the next section.
Now enter the following command:
```html
$ cd /home/margeno/MarineGenomics/data/Week1
```
This jumps several levels to the `Week1` directory. Now go back to the home directory.
```html
$ cd
```
You can also navigate to the `Week1` directory using:
```html
$ cd MarineGenomics/data/Week1
```
These two commands have the same effect: they both take us to the `Week1` directory. The first uses the absolute path, giving the full address from the root directory. The second uses a relative path, giving only the address from the working directory. A full path always starts with a `/`. A relative path does not.
A relative path is like getting directions from someone on the street. They tell you to “go right at the stop sign, and then turn left on Main Street”. That works great if you’re standing there together, but not so well if you’re trying to tell someone how to get there from another country. A full path is like GPS coordinates. It tells you exactly where something is no matter where you are right now.
You can usually use either a full path or a relative path depending on what is most convenient. When the target is inside or near the working directory, the relative path is usually more convenient since it involves less typing. When the target is somewhere else entirely, the full path is often the safer choice.
Over time, it will become easier for you to keep a mental note of the structure of the directories that you are using and how to quickly navigate amongst them.
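As a check that the two kinds of path really lead to the same place, the sketch below reaches the same directory once by absolute path and once by relative path, then compares the results. It runs in a scratch directory, so the absolute path starts with wherever `mktemp` puts it rather than with `/home/margeno`.

```bash
# Recreate the course layout, empty, inside a throwaway directory.
scratch=$(mktemp -d)
mkdir -p "$scratch/MarineGenomics/data/Week1"

# Absolute path: starts with / and works from anywhere.
cd "$scratch/MarineGenomics/data/Week1"
via_absolute=$(pwd)

# Relative path: only works because we first move to the right starting point.
cd "$scratch"
cd MarineGenomics/data/Week1
via_relative=$(pwd)

echo "Absolute route: $via_absolute"
echo "Relative route: $via_relative"
```

Both lines should print the same directory. The relative route would fail with "No such file or directory" if we started anywhere other than the scratch directory.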
## Navigational shortcuts
The root directory is the highest-level directory in your file system and contains files that are important for your computer to perform its daily work. While you will be using the root (`/`) at the beginning of your absolute paths, it is important that you avoid working with data in these higher-level directories, as your commands can permanently alter files that the operating system needs to function. In many cases, trying to run commands in root directories will require special permissions, which will be discussed later, so it’s best to avoid them and work within your home directory.
Dealing with the home directory is very common. The tilde character, `~`, is a shortcut for your home directory. In our case, the root directory is two levels above our home directory, so `cd` or `cd ~` will take you to `/home/margeno` and `cd /` will take you to `/`.
Navigate to the MarineGenomics directory:
```html
$ cd
$ cd MarineGenomics
```
Then enter the command:
```html
$ ls ~
Desktop Documents Downloads MarineGenomics Music Pictures Public Templates Videos
```
This prints the contents of your home directory, without you needing to type the full path.
The commands `cd` and `cd ~` are very useful for quickly navigating back to your home directory. We will be using the `~` character in later lessons to specify our home directory.
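Under the hood, the shell replaces `~` with the value of the `HOME` variable. The sketch below makes that visible by temporarily pointing `HOME` at a scratch directory. This is for demonstration only, so that the example never touches your real home directory, and it is not something you would normally do.

```bash
# Demo only: pretend a scratch directory is our home directory.
HOME=$(mktemp -d)
export HOME
mkdir "$HOME/MarineGenomics"

cd "$HOME/MarineGenomics"
from_tilde=$(ls ~)          # ~ expands to $HOME, wherever we currently are

cd                          # a bare cd returns to $HOME
after_cd=$(pwd)

cd /
cd ~                        # cd ~ does the same thing
after_tilde=$(pwd)

echo "ls ~ shows:    $from_tilde"
echo "cd lands in:   $after_cd"
echo "cd ~ lands in: $after_tilde"
```

Both `cd` and `cd ~` should land in the same directory, and `ls ~` lists that directory no matter where it is run from.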
## Key Points
+ The /, ~, and .. characters represent important navigational shortcuts.
+ Hidden files and directories start with . and can be viewed using ls -a.
+ Relative paths specify a location starting from the current location, while absolute paths specify a location from the root of the file system.
## Creature of the Week!
![Spanish Shawl spotted in San Diego](./figs/creatures/seaslug.png)