Nurher #2

Open
wants to merge 688 commits into
base: master
776a0d8
upd
nuriaher Feb 1, 2021
757f39f
upd
nuriaher Feb 1, 2021
2421804
upd
nuriaher Feb 1, 2021
a4fd3ef
upd
nuriaher Feb 2, 2021
9e19b93
upd
nuriaher Feb 2, 2021
0801fae
Update README.md
nuriaher Feb 2, 2021
94e70d9
Update README.md
nuriaher Feb 2, 2021
e917e9f
Update README.md
nuriaher Feb 2, 2021
b978540
Update README.md
nuriaher Feb 2, 2021
940a7d1
Update README.md
nuriaher Feb 2, 2021
ccd63c1
upd
nuriaher Feb 2, 2021
d250654
upd
nuriaher Feb 2, 2021
6313451
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 2, 2021
1e52bdb
Update README.md
nuriaher Feb 2, 2021
76e8dc3
upd
nuriaher Feb 2, 2021
9ebeac5
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 2, 2021
c207e26
Update README.md
nuriaher Feb 2, 2021
3941633
upd
nuriaher Feb 2, 2021
37a1e0c
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 2, 2021
3da97c4
upd
nuriaher Feb 2, 2021
ada6613
Update README.md
nuriaher Feb 2, 2021
0adf6cb
Update README.md
nuriaher Feb 2, 2021
885a082
upd
nuriaher Feb 3, 2021
10f7e2e
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 3, 2021
fd8d5d1
upd
nuriaher Feb 5, 2021
d6a8201
upd
nuriaher Feb 5, 2021
b7c8013
upd
nuriaher Feb 5, 2021
ed96ae8
upd
nuriaher Feb 5, 2021
944ed61
upd
nuriaher Feb 5, 2021
09bddb9
upd
nuriaher Feb 5, 2021
39da815
upd
nuriaher Feb 8, 2021
2325ae1
upd
nuriaher Feb 8, 2021
5fa5f21
upd
nuriaher Feb 9, 2021
1c8e064
upd
nuriaher Feb 10, 2021
11092b9
upd
nuriaher Feb 10, 2021
626844d
upd
nuriaher Feb 10, 2021
cb64061
Update README.md
nuriaher Feb 11, 2021
dda9fda
upd
nuriaher Feb 12, 2021
d054572
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 12, 2021
97a7feb
upd
nuriaher Feb 12, 2021
c03010e
upd
nuriaher Feb 12, 2021
f30d04b
upd
nuriaher Feb 15, 2021
423486b
upd
nuriaher Feb 15, 2021
7c48ecc
upd
nuriaher Feb 15, 2021
671dbb5
upd
nuriaher Feb 15, 2021
2bf40ac
Update README.md
nuriaher Feb 15, 2021
465c802
Update README.md
nuriaher Feb 15, 2021
c24c761
upd
nuriaher Feb 15, 2021
a64c3f5
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 15, 2021
ef9d5b1
upd
nuriaher Feb 15, 2021
bd35c22
upd
nuriaher Feb 15, 2021
ef3a668
upd
nuriaher Feb 15, 2021
28784e2
upd
nuriaher Feb 15, 2021
92687e2
upd
nuriaher Feb 15, 2021
27c194a
upd
nuriaher Feb 16, 2021
41ca815
upd
nuriaher Feb 16, 2021
a4839ae
upd
nuriaher Feb 16, 2021
4c64a54
upd
nuriaher Feb 16, 2021
7afb2ab
upd
nuriaher Feb 16, 2021
0552e06
upd
nuriaher Feb 16, 2021
d91246f
upd
nuriaher Feb 16, 2021
4bb1127
upd
nuriaher Feb 16, 2021
2b67392
upd
nuriaher Feb 18, 2021
2a009f8
upd
nuriaher Feb 18, 2021
d11c258
upd
nuriaher Feb 18, 2021
113f3ed
upd
nuriaher Feb 18, 2021
bd4633a
upd
nuriaher Feb 18, 2021
a64560a
upd
nuriaher Feb 18, 2021
f5b34e9
upd
nuriaher Feb 18, 2021
d468f13
Update README.md
nuriaher Feb 18, 2021
a54d2bf
+efficient
nuriaher Feb 19, 2021
812ce9e
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 19, 2021
70817c8
upd
nuriaher Feb 19, 2021
3bfcc63
upd
nuriaher Feb 23, 2021
75728a7
upd
nuriaher Feb 24, 2021
bfca777
upd
nuriaher Feb 25, 2021
6358052
Update README.md
nuriaher Feb 25, 2021
2b8528e
upd
nuriaher Feb 26, 2021
21ae429
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Feb 26, 2021
4d5d002
upd
nuriaher Feb 26, 2021
0ee276c
upd
nuriaher Feb 26, 2021
ff911b4
upd
nuriaher Feb 26, 2021
e283ddb
upd
nuriaher Mar 3, 2021
6256cc6
upd
nuriaher Mar 5, 2021
e9af841
upd
nuriaher Mar 5, 2021
8b439e5
upd
nuriaher Mar 5, 2021
fb29f37
upd
nuriaher Mar 5, 2021
55813f2
upd
nuriaher Mar 10, 2021
771f7ff
upd
nuriaher Mar 10, 2021
2faa4be
upd
nuriaher Mar 10, 2021
86703b8
upd
nuriaher Mar 10, 2021
7d8ccc3
upd
nuriaher Mar 12, 2021
f04d7d1
upd
nuriaher Mar 15, 2021
d44dd6c
upd
nuriaher Mar 15, 2021
01f15b8
upd
nuriaher Mar 15, 2021
8d2cccf
upd
nuriaher Mar 16, 2021
3a4fbb7
upd
nuriaher Mar 17, 2021
22075bb
upd
nuriaher Mar 17, 2021
00dbc56
upd
nuriaher Mar 19, 2021
34aff6c
Update README.md
nuriaher Mar 19, 2021
3287deb
Update README.md
nuriaher Mar 19, 2021
3e0af66
upd
nuriaher Mar 19, 2021
32f54d1
upd
nuriaher Mar 19, 2021
44be6b3
upd
nuriaher Mar 22, 2021
67a3527
upd
nuriaher Mar 22, 2021
f336fb9
upd
nuriaher Mar 22, 2021
8b3e4ad
upd
nuriaher Mar 22, 2021
c3d3f54
upd
nuriaher Mar 22, 2021
9eb868f
upd
nuriaher Mar 22, 2021
5cc1093
upd
nuriaher Mar 23, 2021
7ceb4fa
upd
nuriaher Mar 23, 2021
2c4a9d2
upd
nuriaher Mar 23, 2021
5ff7ff0
upd
nuriaher Mar 23, 2021
bc0d7e0
upd
nuriaher Mar 23, 2021
6cf5aec
upd
nuriaher Mar 23, 2021
1acc32c
upd
nuriaher Mar 24, 2021
d1e813a
upd
nuriaher Mar 24, 2021
9f42f14
upd
nuriaher Mar 24, 2021
bb9186c
upd
nuriaher Mar 25, 2021
00547f1
upd
nuriaher Mar 29, 2021
1f31f65
upd
nuriaher Apr 1, 2021
5ad8d78
upd
nuriaher Apr 1, 2021
5fa9859
upd
nuriaher Apr 1, 2021
f671f38
upd
nuriaher Apr 1, 2021
4b94913
upd
nuriaher Apr 1, 2021
395039c
upd
nuriaher Apr 1, 2021
e92ca3c
upd
nuriaher Apr 1, 2021
44fd3f7
Update README.md
nuriaher Apr 1, 2021
fc5a599
Update README.md
nuriaher Apr 1, 2021
fbfed3a
Update README.md
nuriaher Apr 1, 2021
ae04c70
Update README.md
nuriaher Apr 1, 2021
1cfa71e
upd
nuriaher Apr 2, 2021
55828b3
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Apr 2, 2021
958d493
upd
nuriaher Apr 2, 2021
2160fbf
upd
nuriaher Apr 2, 2021
96cfda8
upd
nuriaher Apr 12, 2021
2196e83
Update README.md
nuriaher Apr 15, 2021
8464754
upd
nuriaher Apr 15, 2021
f104214
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Apr 15, 2021
dfd16c5
Update README.md
nuriaher Apr 16, 2021
444518b
upd
nuriaher Apr 16, 2021
f1dad65
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Apr 16, 2021
2dbbadd
upd
nuriaher Apr 16, 2021
6a69662
upd
nuriaher Apr 16, 2021
39c69ed
upd
nuriaher Apr 16, 2021
d80cbca
upd
nuriaher Apr 16, 2021
14e134f
upd
nuriaher Apr 16, 2021
1c3b8a3
upd
nuriaher Apr 16, 2021
9c6cc60
upd
nuriaher Apr 22, 2021
d3175b4
upd
nuriaher Apr 22, 2021
2962591
upd
nuriaher Apr 22, 2021
4cca256
upd
nuriaher Apr 22, 2021
b934343
upd
nuriaher Apr 22, 2021
f3e50b4
Update README.md
nuriaher Apr 22, 2021
530b540
upd
nuriaher Apr 22, 2021
08250d9
upd
nuriaher Apr 22, 2021
b086631
upd
nuriaher Apr 22, 2021
ed161f6
upd
nuriaher Apr 22, 2021
462b14f
upd
nuriaher Apr 22, 2021
c5ec05b
upd
nuriaher Apr 23, 2021
280da42
upd
nuriaher Apr 23, 2021
4c4d4d4
upd
nuriaher Apr 23, 2021
b4c89e6
upd
nuriaher Apr 29, 2021
9460cca
upd
nuriaher Apr 29, 2021
bb47a87
upd
nuriaher Apr 29, 2021
1c5b80c
upd
nuriaher Apr 29, 2021
d45f535
upd
nuriaher Apr 29, 2021
6a1f341
upd
nuriaher Apr 29, 2021
3c87ac7
Update README.md
nuriaher Apr 29, 2021
cca9219
upd
nuriaher May 5, 2021
103b0be
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher May 5, 2021
eff682a
upd
nuriaher May 6, 2021
63cef2c
upd
nuriaher May 6, 2021
9f29882
upd
nuriaher May 6, 2021
1b66add
Update README.md
nuriaher May 6, 2021
639510a
upd
nuriaher May 6, 2021
132a376
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher May 6, 2021
e93ce8a
upd
nuriaher May 6, 2021
b4528aa
upd
nuriaher May 6, 2021
92b651d
upd
nuriaher May 6, 2021
fbef81c
upd
nuriaher May 6, 2021
52a334a
upd
nuriaher May 7, 2021
94a0f8c
upd
nuriaher May 7, 2021
1ae4865
upd
nuriaher May 11, 2021
0190a1c
upd
nuriaher May 11, 2021
620b0bd
upd
nuriaher May 11, 2021
c7cc648
upd
nuriaher May 11, 2021
f442430
upd
nuriaher May 11, 2021
4279272
upd
nuriaher May 11, 2021
d8a0727
upd
nuriaher May 11, 2021
7decbf9
upd
nuriaher Jun 1, 2021
7fb876d
upd
nuriaher Jun 1, 2021
8811d16
upd
nuriaher Jun 1, 2021
998a6ea
Update README.md
nuriaher Jun 1, 2021
66338b1
upd
nuriaher Jun 1, 2021
a1da238
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Jun 1, 2021
e135378
upd
nuriaher Jun 2, 2021
b40b819
upd
nuriaher Jun 2, 2021
7b6df29
upd
nuriaher Jun 2, 2021
acc3fe0
upd
nuriaher Jun 2, 2021
b7024ac
upd
nuriaher Jun 3, 2021
7bc762f
upd
nuriaher Jun 3, 2021
2654fa5
upd
nuriaher Jun 3, 2021
1c4c820
upd
nuriaher Jun 3, 2021
d9aea18
upd
nuriaher Jun 3, 2021
f11cd67
upd
nuriaher Jun 3, 2021
0601ccd
upd
nuriaher Jun 3, 2021
db2b1aa
upd
nuriaher Jun 4, 2021
c44227c
upd
nuriaher Jun 4, 2021
4f51fdf
upd
nuriaher Jun 7, 2021
7f6ca10
upd
nuriaher Jun 7, 2021
9c70fcf
upd
nuriaher Jun 7, 2021
75c6343
upd
nuriaher Jun 7, 2021
90dfce1
upd
nuriaher Jun 7, 2021
da075d9
upd
nuriaher Jun 7, 2021
01b76a5
upd
nuriaher Jun 8, 2021
e4f0ad2
upd
nuriaher Jun 8, 2021
88068c3
upd
nuriaher Jun 8, 2021
493838c
upd
nuriaher Jun 8, 2021
398e14f
upd
nuriaher Jun 10, 2021
939f3ee
Update README.md
nuriaher Jun 17, 2021
1cb1d9d
upd
nuriaher Jun 17, 2021
eccaf32
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Jun 17, 2021
073e03b
upd
nuriaher Jun 17, 2021
cefe315
upd
nuriaher Jun 17, 2021
5b72708
upd
nuriaher Jun 17, 2021
8bf2e5c
upd
nuriaher Jun 17, 2021
e76ca75
upd
nuriaher Jun 17, 2021
1885833
upd
nuriaher Jun 17, 2021
44cf525
upd
nuriaher Jun 17, 2021
931557f
upd
nuriaher Jun 17, 2021
1a03bd8
upd
nuriaher Jun 18, 2021
d0a404f
upd
nuriaher Jun 18, 2021
4387571
upd
nuriaher Jun 18, 2021
93cd4b7
upd
nuriaher Jun 18, 2021
8dfaa25
upd
nuriaher Jun 18, 2021
20f2344
upd
nuriaher Jun 22, 2021
f599e96
upd
nuriaher Jun 23, 2021
85f5750
Update README.md
nuriaher Jun 23, 2021
db1084b
Update README.md
nuriaher Jun 23, 2021
25cd1f1
upd
nuriaher Jun 23, 2021
4b7491d
Merge branch 'nurher' of https://github.com/anttonalberdi/holoflow in…
nuriaher Jun 23, 2021
105f711
upd
nuriaher Jun 23, 2021
a965fe0
upd
nuriaher Jun 24, 2021
686499c
upd
nuriaher Jun 24, 2021
16f5140
upd
nuriaher Jun 25, 2021
524e797
upd
nuriaher Jun 28, 2021
178d950
upd
nuriaher Jun 29, 2021
c42d76e
upd
nuriaher Jun 30, 2021
986afbe
upd
nuriaher Jun 30, 2021
Binary file modified .DS_Store
Binary file not shown.
316 changes: 313 additions & 3 deletions README.md
@@ -1,7 +1,317 @@
# holoflow
Bioinformatics pipeline for hologenomics data generation and analysis

module unload gcc/5.1.0
module load anaconda3/4.4.0
Snakemake is a workflow management system which requires a *Snakefile* and a *config* file. This is a bioinformatics pipeline implemented with Snakemake.

## Files and directories
### Main directory

The main *holoflow* directory contains several Python scripts which work as launchers for the different **workflow programs** in the pipeline:

- ***preparegenomes.py*** - Merge all potential reference genomes for the samples into a single *.fna* file to be used in *preprocessing.py*.
- ***preprocessing.py*** - Data preprocessing, from quality filtering to duplicate sequence removal, for further downstream analysis.
- ***metagenomics_IB.py*** - Individual assembly-based analysis and metagenomics binning.
- ***metagenomics_CB.py*** - Coassembly-based analysis and metagenomics binning.
- ***metagenomics_AB.py*** - Functional annotation of (co-)assembly file with DRAM.
- ***metagenomics_DR.py*** - Dereplication and Annotation of metagenomic bins produced by either *metagenomics_IB* or *metagenomics_CB*.
- ***metagenomics_FS.py*** - Final statistical report of dereplicated bins obtained with *metagenomics_DR.py*.
- ***metagenomics_DI.py*** - Diet analysis from reads not mapped to MAG catalogue obtained in *metagenomics_FS.py*. ######### NOT FULLY FUNCTIONAL YET
- ***genomics.py*** - Variant calling, phasing (for HD) and imputation (for LD).



These are designed to be called from the command line and require the following arguments:
```bash
REQUIRED ARGUMENTS:
-f INPUT File containing input information.
-d WORK_DIR Main output directory.
-t THREADS Maximum number of threads to be used by Snakemake.
-W REWRITE Re-run the workflow from scratch: remove all directories from previous runs. - NOT IN PREPAREGENOMES.
-g REF_GENOME Reference genome(s) file path to be used in read mapping. Unzipped for genomics. - only in PREPROCESSING, GENOMICS.
-adapter1 ADAPTER1 Adapter sequence 1 for removal. - only in PREPROCESSING.
-adapter2 ADAPTER2 Adapter sequence 2 for removal. - only in PREPROCESSING.
-Q DATA QUALITY Low depth (LD) or High depth (HD) data set. - only in GENOMICS.
-vc VAR CALLER Variant caller to choose: 1 {bcftools/samtools}, 2 {GATK}, 3 {ANGSD}. - only in GENOMICS.
-N JOB ID ID of the sent job, so another different-N-job can be run simultaneously. - only in GENOMICS, METAGENOMICS IB, AB.

OPTIONAL ARGUMENTS:
-r REF_PANEL Reference panel necessary for likelihoods update and imputation of LD variants. - only in GENOMICS.
-k KEEP_TMP If present, keep temporary directories - NOT IN PREPAREGENOMES.
-l LOG Desired pipeline log file path.
-c CONFIG Configuration file full path.

```
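As a rough illustration of how such a command-line interface can be declared, the sketch below uses Python's standard `argparse` module for the flags shared by the launchers (`-f`, `-d`, `-t`, `-k`, `-l`, `-c`). The `dest` names and the `build_parser` helper are hypothetical; this is not taken from the Holoflow source and the real launchers may be organised differently.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags documented above;
    # not Holoflow's actual implementation.
    parser = argparse.ArgumentParser(description="Sketch of a holoflow-style launcher")
    parser.add_argument("-f", dest="input_txt", required=True,
                        help="File containing input information.")
    parser.add_argument("-d", dest="work_dir", required=True,
                        help="Main output directory.")
    parser.add_argument("-t", dest="threads", required=True, type=int,
                        help="Maximum number of threads to be used by Snakemake.")
    parser.add_argument("-k", dest="keep_tmp", action="store_true",
                        help="If present, keep temporary directories.")
    parser.add_argument("-l", dest="log", help="Desired pipeline log file path.")
    parser.add_argument("-c", dest="config", help="Configuration file full path.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(["-f", "input.txt", "-d", "workdir", "-t", "40"])
print(args.input_txt, args.work_dir, args.threads, args.keep_tmp)
```

Workflow-specific flags such as `-g`, `-Q` or `-vc` could be added in the same way by the launchers that need them.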


### Config files description
A template *config.yaml* file can be found in every workflow directory.

### Input files description
A template *input.txt* file can be found in every workflow directory.
See the *input.txt* file description for every workflow below.
In all cases, columns must be delimited by a single space and **there must be no blank lines at the end of the file**.
Lines starting with # are ignored.
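The format rules above can be sketched as a small reader. This is only an illustration under stated assumptions: `read_input_rows` is a hypothetical helper, not a Holoflow function, and it skips stray blank lines defensively even though the README asks for none.

```python
def read_input_rows(text):
    """Return one list of columns per data line of an input.txt file."""
    rows = []
    for line in text.splitlines():
        # Skip comment lines, and tolerate blank lines (an assumption of this sketch)
        if not line.strip() or line.startswith("#"):
            continue
        # Columns are delimited by a single space
        rows.append(line.strip().split(" "))
    return rows


example = (
    "#sample forward reverse\n"
    "Sample1 /home/Sample1_1.fq /home/Sample1_2.fq\n"
)
rows = read_input_rows(example)
print(rows)
```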

##### *preparegenomes.py*

1. Reference genome ID. **No spaces or underscores** between different words in the identifier.
2. Reference genome full path/name.
3. Name of the desired output database containing all genomes. **No spaces**, underscores or other separators allowed. *All reference genomes which should be in the same DB must have the same ID in this field*.

**Fields 1 and 3 must be different.**

- Example:

*Heads-up*: you can generate more than one DB at a time for different projects, but be aware that preprocessing only takes ONE DB at a time, containing all reference genomes to be mapped to the set of samples in a given project.

| | | |
| --- | --- | --- |
| Genomeone | /home/Genomeone.fq | DBone |
| Genometwo | /home/Genometwo.fq.gz | DBtwo |
| Genomethree | /home/Genomethree.fq | DBone |
| Genomen | /home/Genomen.fq | DBn |
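To make the rules for fields 1 and 3 concrete, here is a minimal sketch that checks them and groups the genomes by output database. The function name `group_genomes_by_db` is hypothetical and the check only covers the underscore and "fields must differ" rules stated above; it is not how *preparegenomes.py* necessarily works.

```python
from collections import defaultdict


def group_genomes_by_db(rows):
    """Map each DB name to the (genome ID, path) pairs assigned to it."""
    databases = defaultdict(list)
    for genome_id, path, db_name in rows:
        # Underscores are not allowed in the genome ID or the DB name
        if "_" in genome_id or "_" in db_name:
            raise ValueError(f"Underscores are not allowed: {genome_id} {db_name}")
        # Fields 1 and 3 must be different
        if genome_id == db_name:
            raise ValueError(f"Fields 1 and 3 must differ: {genome_id}")
        databases[db_name].append((genome_id, path))
    return dict(databases)


rows = [
    ["Genomeone", "/home/Genomeone.fq", "DBone"],
    ["Genometwo", "/home/Genometwo.fq.gz", "DBtwo"],
    ["Genomethree", "/home/Genomethree.fq", "DBone"],
]
databases = group_genomes_by_db(rows)
print(sorted(databases))
```

In this example, Genomeone and Genomethree end up in the same database because they share the value in the third column.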


##### *preprocessing.py* & *metagenomics_IB.py*

1. Sample name.
2. Original full path/name of **FORWARD** input file. This can be either *.gz* compressed or uncompressed.
3. Original full path/name of **REVERSE** input file. This can be either *.gz* compressed or uncompressed.

- Example:

| | | |
| --- | --- | --- |
| Sample1 | /home/Sample1_1.fq | /home/Sample1_2.fq |
| Sample2 | /home/Sample2_1.fq | /home/Sample2_2.fq |
| Samplen | /home/Samplen_1.fq | /home/Samplen_2.fq |


##### *metagenomics_CB.py*

1. Sample name.
2. Coassembly group: **assumed to be the same as in preprocessing -N job if preprocessing has been run (PPR_03-MappedToReference job directory ID)**.
3. Original full path/name of **FORWARD** input file.
4. Original full path/name of **REVERSE** input file.
Optimally the metagenomic .fastq files would come from PPR_03-MappedToReference, the last preprocessing step.

- Example:

| | | | |
| --- | --- | --- | --- |
| Sample1 | CoassemblyGroup1 | /home/Sample1_1.fq | /home/Sample1_2.fq |
| Sample2 | CoassemblyGroup2 | /home/Sample2_1.fq | /home/Sample2_2.fq |
| Samplen | CoassemblyGroup3 | /home/Samplen_1.fq | /home/Samplen_2.fq |


##### *metagenomics_AB.py*

1. (Co-)Assembly or group ID.
2. Path to assembly file.

- Example:

| | |
| --- | --- |
| GroupA | /home/dir/assembly_A.fa |
| GroupB | /home/second/dir/assembly_B.fna.gz |


##### *metagenomics_DR.py*

1. Coassembly group or sample group name.
2. Input directory path where all *.fa* bins to dereplicate and the respective *ID*_DASTool_summary.txt files are.

- Example:

| | |
| --- | --- |
| GroupA | /home/directory_samplesA |
| GroupB | /home/directory_samplesB |


##### *metagenomics_FS.py*

1. Coassembly group or sample group name.
2. Input directory path containing the original metagenomic *_1.fastq* & *_2.fastq* files of the group, or of the samples in the group.
3. Input directory path where all dereplicated *.fa* bins are.
4. Input directory path where the .gff annotation files corresponding to each dereplicated bin are found.

- Example:

| | | | |
| --- | --- | --- | --- |
| DrepGroup1 | /home/PPR_03-MappedToReference/DrepGroup1 | /home/MDR_01-BinDereplication/DrepGroup1/dereplicated_genomes | /home/MDR_02-BinAnnotation/DrepGroup1/bin_funct_annotations |
| DrepGroup2 | /home/PPR_03-MappedToReference/Sample1 | /home/MDR_01-BinDereplication/Sample1/dereplicated_genomes | /home/MDR_02-BinAnnotation/DrepGroup2/bin_funct_annotations |
| DrepGroup2 | /home/PPR_03-MappedToReference/Sample2 | /home/MDR_01-BinDereplication/Sample2/dereplicated_genomes | /home/MDR_02-BinAnnotation/DrepGroup3/bin_funct_annotations |


##### *metagenomics_DI.py* ######### NOT FULLY FUNCTIONAL YET

1. Group ID.
2. Path to assembly file.
3. Path to .fastq files which contain reads not mapped to MAG catalogue.

- Example:

| | | |
| --- | --- | --- |
| GroupA | /home/dir/assembly_A.fa | /home/dir/MFS_01-MAGUnMapped/GroupA |
| GroupB | /home/second/dir/assembly_B.fna.gz | /home/dir/MFS_01-MAGUnMapped/GroupB |


##### *genomics.py*

1. Sample group name to analyse.
2. Path to directory containing host reads BAM alignment sorted files - If *preprocessing.py* was used, these are the resulting *ref* BAMs path.
3. Chromosome list. This should be a text file with a single column depicting chromosome IDs. Note that **the given chromosome IDs should be in accordance with the provided reference genome**, otherwise these won't be detected by Holoflow. Relevantly, if the used **reference genome does not have chromosomes**, the user can choose to analyse her dataset as one single chromosome, by only writing **ALL** in the chromosome list.

- Example:

| | | |
| --- | --- | --- |
| Group1 | /home/path/to/group1/bams | /home/path/to/group1_chrlist.txt |
| Group2 | /home/path/to/group2/PPR_03-MappedToReference | /home/path/to/group2_chrlist.txt |
| Groupn | /home/path/to/groupn/bams | /home/path/to/groupn_chrlist.txt |
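The chromosome list convention described in field 3 can be sketched as follows. This is an illustrative reader only: `read_chromosome_list` is a hypothetical name, and returning `None` to signal the **ALL** case is a convention chosen for this sketch, not something documented by Holoflow.

```python
def read_chromosome_list(text):
    """Return chromosome IDs from a single-column list, or None if it only says ALL."""
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    if ids == ["ALL"]:
        # Hypothetical convention: None means "analyse the dataset as one single chromosome"
        return None
    return ids


print(read_chromosome_list("chr1\nchr2\nchrX\n"))
print(read_chromosome_list("ALL\n"))
```

The IDs returned here would still need to match the sequence names in the reference genome, as noted above.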



### Workflows - Specific directories

#### Preparegenomes
- *Snakefile* - Continues the job of *preparegenomes.py*, which takes as input the full paths of the given reference genomes, reformats their read IDs and merges them into a single *data_base.fna* file. The *Snakefile* contains rules for:
1. Indexing the resulting DB using **bwa** and **samtools**
2. Compressing the full set of DB-related files into a *data_base.tar.gz* file.


#### Preprocessing
- *Snakefile* - which contains rules for:
1. Quality filtering using **AdapterRemoval**
2. Duplicate read removal using **seqkit rmdup**
3. Mapping reads against reference genome(s) using **bwa mem**

- Config file *config.yaml*, in which the user may be interested in customising:
1. Quality filtering - specific adapter sequences, minimum quality, character separating the mate read number.


#### Metagenomics - Individual Assembly & Coassembly
- *Snakefile* - which contains rules for:
1. Metagenomic assembly using **megahit**. In Individual Assembly, **metaSpades** is also available.
2. Read mapping to assembly using **bwa mem**
3. Contig binning using **Metabat** and **MaxBin**. In Coassembly, binning by **Concoct** is also available.
4. Binner result integration using **DasTool**

- Config file *config.yaml*, in which the user may be interested in customising:
1. Assembler - choose between the mentioned options by writing *megahit* or *spades*
2. Minimum contig length - minimum bp per contig in final assembly file.


#### Metagenomics - Assembly Based
- *Snakefile* - which contains rules for:
1. DRAM functional annotation and distilling of an assembly file.


#### Metagenomics - Dereplication
- *Snakefile* - which contains rules for:
1. Bin Dereplication using **dRep**.
2. Bin Gene Annotation with **Prokka**.
3. Bin Taxonomic Classification with **GTDB-Tk**.
4. Obtain GTDB phylogenetic subtree of MAGs.


#### Metagenomics - Final Statistics
- *Snakefile* - which contains rules for:
1. Mapping metagenomic reads to dereplicated MAGs - number and % of mapped reads.
2. Obtaining coverage statistics of contigs and MAGs in used samples.
3. Retrieve quality statistics (CheckM) and summary plot of the MAGs.
4. Get coverage of KEGG KO single-copy core genes in MAGs.


#### Metagenomics - Dietary Analysis ######### NOT FULLY FUNCTIONAL YET
- *Snakefile* - which contains rules for:
1. ORF prediction.
2. Annotation based on reference diet protein DB - so far Invertebrates and/or Plants.
3. Map the reads not mapped to the MAG catalogue to the gene catalogue obtained in step 1.
4. Extract gene abundances and merge output with annotations.

- Config file *config.yaml*, in which the user may be interested in customising:
1. Reference DB used for annotation {Plants, Invertebrates, Invertebrates_Plants/Plants_Invertebrates}


#### Genomics
- *Snakefile* - which contains rules for:
a. Variant calling with **BCFtools**, **GATK** or **ANGSD** (## Latter UNDER CONSTRUCTION ##)

-> *High depth samples*
b. Filtering with **BCFtools** or **GATK**
c. Phasing with **shapeit4**

-> *Low depth samples*
b. Likelihoods update with **Beagle** using a high-depth reference panel
c. Genotype imputation with **Beagle**

- Config file *config.yaml*, in which the user may be interested in customising:
1. Choose between HD - for high depth seqs OR LD - for low depth seqs.
2. Variant calling - BCFtools
- mpileup
* Coefficient for downgrading mapping quality for reads containing excessive mismatches - *degr_mapp_qual*. Default 50.
* Minimum mapping quality - *min_mapp_qual*. Default to 0.
* Minimum base quality - *min_base_qual*. Default to 13.
* Specific chromosome region. Default False.
- call
* Multicaller mode: alternative model for multiallelic and rare-variant calling designed to overcome known limitations.
* Keep only variants and not indels.

3. Variant calling - GATK
* Parameters to obtain more aggressive variants: *min_pruning* and *min_dangling*.

4. Variant calling - ANGSD
* Choose model (1/2) between samtools or GATK.
* Output log genotype likelihoods to a file or not.
* How to estimate minor and major alleles (1/2): 1 = from likelihood data ; 2 = from count data.
* Estimate posterior genotype probability based on the allele frequency as a prior (True/False).
5. HD Filtering - BCFtools
* Minimum quality of the SNPs to be kept. Default to 30.
6. HD Filtering - GATK
* Minimum quality of the SNPs to be kept. Default to 30.
* QD: Quality by depth. Find more information [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants).
* FS: Fisher strand. Find more information [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants).

7. HD Phasing
* --geno filters out all variants with missing call rates exceeding the provided value. Default to 0.
* Provide a Genetic map. Default to False, else provide path.
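As an illustration of how the documented defaults interact with user choices, the sketch below merges user overrides into the three BCFtools mpileup defaults listed above (*degr_mapp_qual* 50, *min_mapp_qual* 0, *min_base_qual* 13). The dictionary layout and the `effective_mpileup_settings` helper are hypothetical; the actual structure of *config.yaml* should be taken from the template in the workflow directory.

```python
# Defaults as stated in this README; the dictionary layout itself is hypothetical
MPILEUP_DEFAULTS = {
    "degr_mapp_qual": 50,
    "min_mapp_qual": 0,
    "min_base_qual": 13,
}


def effective_mpileup_settings(user_config):
    """Overlay user-provided values on the documented defaults."""
    unknown = set(user_config) - set(MPILEUP_DEFAULTS)
    if unknown:
        # Reject misspelled keys instead of silently ignoring them
        raise KeyError(f"Unknown mpileup option(s): {sorted(unknown)}")
    settings = dict(MPILEUP_DEFAULTS)
    settings.update(user_config)
    return settings


print(effective_mpileup_settings({"min_mapp_qual": 20}))
```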


## Usage in Computerome

### Get started: download Holoflow repository
Clone the repository by running the following command on your command line:

```bash
git clone -b nurher --single-branch https://github.com/anttonalberdi/holoflow.git
```

### Execute Holoflow *.py* workflow launchers
These should be **executed as jobs**, therefore a *.sh* script should be generated which will call the desired Holoflow workflow:

- *.sh* example script for *preprocessing.py* called ***first_job_preprocessing.sh***:
```bash
#Declare full path to the project directory (the .sh file will be stored here as well)
projectpath=/full/path/project1
#Declare full path to holoflow
holoflowpath=/full/path/holoflow
#Run holoflow
python ${holoflowpath}/preprocessing.py -f ${projectpath}/input.txt -d ${projectpath}/workdir -g ${projectpath}/reference_genomes.fna -adapter1 'ATGCT' -adapter2 'CTTGATG' -c ${projectpath}/config.yaml -l ${projectpath}/log_file.log -t 40 -N First_job
```

- *job execution* in Computerome2 example:
```bash
qsub -V -A ku-cbd -W group_list=ku-cbd -d `pwd` -e ${projectpath}/job_error_file.err -o ${projectpath}/job_out_file.out -l nodes=1:ppn=40,mem=180gb,walltime=5:00:00:00 -N JOB_ID ${projectpath}/first_job_preprocessing.sh

```
Note that the job parameters (*ppn*, *nodes*, *memory*, *wall time*, ...) can and should be tuned for every job type.
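When many jobs have to be prepared, the *.sh* script can also be generated programmatically. The sketch below rebuilds the *preprocessing.py* call from the example above as an argument list and renders it as a script body. The `preprocessing_command` helper and the `#!/bin/bash` first line are assumptions of this sketch; paths, adapters and the job ID are the placeholder values from the example.

```python
import shlex


def preprocessing_command(holoflowpath, projectpath, threads, job_id):
    """Assemble the launcher call from the example above as an argument list."""
    return [
        "python", f"{holoflowpath}/preprocessing.py",
        "-f", f"{projectpath}/input.txt",
        "-d", f"{projectpath}/workdir",
        "-g", f"{projectpath}/reference_genomes.fna",
        "-adapter1", "ATGCT",
        "-adapter2", "CTTGATG",
        "-c", f"{projectpath}/config.yaml",
        "-l", f"{projectpath}/log_file.log",
        "-t", str(threads),
        "-N", job_id,
    ]


command = preprocessing_command("/full/path/holoflow", "/full/path/project1", 40, "First_job")
# shlex.join quotes arguments where needed, so the line is safe to paste into a shell script
script = "#!/bin/bash\n" + shlex.join(command) + "\n"
print(script)
```

The resulting text can be written to a file such as *first_job_preprocessing.sh* and submitted with `qsub` as shown above.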





snakemake -s Snakefile -n -r ${workdir}/02-DuplicatesRemoved/H2A_1.fastq ${workdir}/02-DuplicatesRemoved/H2A_2.fastq