-
Notifications
You must be signed in to change notification settings - Fork 1
/
04.01-about_statistics.Rmd
138 lines (100 loc) · 9.79 KB
/
04.01-about_statistics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
# (PART) STATISTICAL PROCEDURES {-}
# About statistics {#about-statistics}
Statistics is probably the most challenging step of holo-omic studies, due to two main factors: the extreme complexity of the data, often containing thousands of features, and the limited sample size, often in the realm of the dozens of sampling units. This combination renders many holo-omic datasets rather statistics unfriendly.
### A step-by-step approach {-}
In this workbook we strongly encourage researchers to proceed step-by-step when dealing with holo-omics data and biological questions.
#### Initial quantitative exploration of omic layers {-}
The analysis of any multi-omic data should begin with independent analysis of each omic layer to learn about its structure and variability before jumping to multi-omic data integration.
* **[Data transformations](#data-transformations):** multivariate datasets consist of different data types (e.g., presence-absence of taxa, counts of genes, community-level metabolic capacity index of a function, concentrations of metabolites across samples) that may require specific transformation before applying statistical techniques.
* **[Unsupervised exploration of omic layers](#unsupervised-exploration):** include exploratory techniques, such as cluster analysis and ordination-based visualisation methods, which reveal the structure and main patterns of the omic datasets without prior information about experimental design. These procedures might reveal that the observations are structured into meaningful groups or that variables can be reduced to fewer dimensions.
* **[Supervised analysis of omic layers](#supervised-analysis):** this type of analyses incorporate information of experimental design and aim at testing and estimating the effects of the experimental factors (e.g., dietary treatment, drug administration) or variables of interest (e.g., age of the experimental subjects, geographic location of studied populations) on different omic layers.
#### Multi-omic data integration {-}
When it comes to multi-omic data integration, the approaches can be broadly categorised into two types: multi-staged analysis and meta-dimensional or simultaneous analysis.
* **[Multi-staged integration](#multi-staged-integration):** leverages the central dogma of molecular biology to assume that the variation in omic datasets is hierarchical, such that variation in DNA leads to variation in RNA and so on to determine the phenotype
* **[Meta-dimensional integration](#meta-dimensional-integration):** considers the possibility that the phenotype is the product of the combination of variation across all omic layers, with the presence of complex inter-omic interactions.
:::: {.graybox}
All statistical analyses included in the **Holo-omics workbook** are conducted in R environment. You can find the details to set-up your R environment in the section [Prepare your R environment](#prepare-r).
:::
## Prepare your R environment {#prepare-r}
All statistical analyses included in the **Holo-omics workbook** are conducted in [R environment](https://en.wikipedia.org/wiki/R_(programming_language)) [@R_Development_Core_Team2008-gd]. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS, and in order to use it, [R](https://cran.r-project.org/) or [RStudio](https://posit.co/downloads/) must be installed in your local computer or remote server.
### Required packages {-}
In order to reproduce the analyses shown in the workbook, a rather long list of R packages must be installed. Packages are the fundamental units of reproducible R code, which include reusable R functions, the documentation that describes how to use them, and sample data.
* ape
* DESeq2
* distillR
* ggplot2
* tidyverse
* vegan
* (...)
### Package installation {-}
Packages are installed programatically using three main ways: through [CRAN](https://cran.r-project.org/web/packages/available_packages_by_name.html), [Bioconductor](https://www.bioconductor.org/packages/release/bioc/) or [Github](https://github.com/).
#### Install package from CRAN {-}
CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R. Packages stored in CRAN can be installed using the following code:
\small
```{r eval=FALSE}
install.packages("package_name")
#e.g.
install.packages("vegan")
```
\normalsize
#### Install package from Bioconductor {-}
Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology. Packages included in Bioconductor can be installed using the following code:
\small
```{r eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("package_name")
#e.g.
BiocManager::install("DESeq2")
```
\normalsize
#### Install package from Github {-}
GitHub is a code hosting platform for version control and collaboration. Packages stored in R can be installed using the following code after installing the package devtools:
\small
```{r eval=FALSE}
library(devtools)
install_github("github_repository_name_of_the_package")
#e.g.
install_github("anttonalberdi/distillR")
```
\normalsize
## Create / clone a Github repository {#github-r}
RStudio provides the possibility to work with version-control projects that enable multiple users (e.g., collaborators or supervisor-student) to access and contribute to a project. In order to do so, one needs to first create a code repository in Github, or access an already existing repository, and create a version-control project in RStudio.
### Install git in you local computer {-}
The connection between RStudio and Github is done through the version control system **[Git](https://git-scm.com/)**. Instructions to download and install the software are available here: https://git-scm.com/downloads.
### Create a Github repository {-}
If you want to work on a new project, create a new repository in your Github dashboard of repositories or in any of the organisational accounts you belong to. When creating the repository, accept the option for generating a README.md document to serve as index.
### Create a version-control project {-}
Once your repository is created, you need to create a version-control R project linked to that repository in your local computer. Follow the following instructions to do so:
1. Open RStudio.
2. Click ***New Project***
3. Select the ***Version Control*** option in the first pop-up window, and the option ***Git*** in the second page.
4. Go to your repository page at Github and copy the git url after clicking the ***< > Code*** green button. The link should look something like this: https://github.com/holo-omics/example.git
5. Paste the repository URL you just copied in the ***Repository URL*** field. Leave the ***Project directory name*** empty, and select the local directory (folder) where you want to create a local copy of the Github repository.
6. Click ***Create Project***.
This procedure will generate an R project in the a folder with the name of the Github repository within the desired directory. The main difference of such a version-control project and a regular R project is the connection with Github through Git. If you open the project in RStudio, you will find a new tab called ***Git***, which you can use to communicate with Github.
### Set-up RStudio-Github connection {-}
In order to be able to commit and push changes to Github, you need to configure git and RStudio.
#### Set-up git config {-}
First, set-up git config using the following commands. Make sure you modify 'Your Name' and '[email protected]' before running the scripts below in your terminal.
```
git config --global user.name 'Your Name'
git config --global user.email '[email protected]'
```
#### Generate a Github access token {-}
Second, you need to generate a personal access token you will need to use when first pushing changes to Github. To generate the token, access your personal Github account and navigate through **Settings** (you will find it by clicking your avatar) > **Developer settings** (you will find it at the very bottom of the left menu) > **Personal access tokens** > **Generate new token**.
#### Connect RStudio to Github {-}
Finally, you will need to modify something in your project to push the first change to Github and configure the connection between RStudio and Github.
1. In RStudio open the README.md file, make any change you wish and save it.
2. Any file within the project that has been edited will appear in the ***Git*** tab with a **M - Modified** status.
3. Click the checkboxes of the files you want to commit (in this case the README.md file) to stage the file and click ***Commit***.
4. In the pop-up window, add a descriptive short message to the commit, and click ***Commit***.
5. Click Push to upload the changes to Github. The first time you push, RStudio will request your Github username and password. Note that instead of your Github access password, **you need to use the access token** instead.
These actions should enable you to connect RStudio to Github and to push changes from your local version of the repository to Github.
#### Work with version control {-}
Once the initial commit is pushed, then you are ready to work with your version-control R project. Keep in mind that in a version-control project, the same project might contain multiple versions, including the one in Github and the ones in the local computers of the collaborators.
- To access the latest version available in Github use the ***Pull*** button.
- To update the Github version, use ***Commit*** followed by ***Push***.
To learn more about Git and version control, check the many great tutorials available on the internet, for example:
- https://ourcodingclub.github.io/tutorials/git/
- https://www.freecodecamp.org/news/git-and-github-the-basics/