9. Changelog
Verena Kutschera edited this page Nov 27, 2024 · 39 revisions
Bug fixes
- The slurm profile configuration file for the PDC/KTH cluster Dardel (`config/slurm/profile/config_plugin_dardel.yaml`) has been fixed so that containers bind the correct directory
- Documentation on how to run GenErode on Dardel has been updated (`config/slurm/README.md`)
- FastP did not merge reads shorter than 30 bp with default settings. The parameters `--overlap_len_require 15 --overlap_diff_limit 1` have been implemented to ensure proper merging of shorter reads
- `gerp_derived_alleles` did not process the last position of each chromosome/scaffold/contig, which is now fixed
- `bam2fasta` coded the first and last base of a mapped read as "N" when producing the fasta files for the outgroups used in GERP++, which is now fixed
- Add the flag `-quick` to RepeatModeler (using sample sizes from before version 2.0.4, with the same sensitivity as versions >= 2.0.4 but faster) and lower the number of threads while keeping memory the same in the Dardel slurm profile. This fixes errors when running RepeatModeler2 of different versions with the White rhino and Sumatran rhino test data (`round-2/tmptmpSample.fa` and `*/sampleDB-round4.fa` missing)
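The FastP merging fix above can be illustrated with a minimal sketch that assembles a fastp command line containing the two overlap parameters. The helper `fastp_merge_command` and the file names are hypothetical and not taken from the GenErode source. Only `--overlap_len_require 15 --overlap_diff_limit 1` come from the changelog entry; the remaining flags are standard fastp options for paired-end merging, and the actual rule may pass additional options.

```python
import shlex


def fastp_merge_command(r1, r2, merged_out,
                        overlap_len_require=15, overlap_diff_limit=1):
    """Build a fastp argument list for merging paired-end reads.

    Hypothetical helper for illustration only. Lowering
    --overlap_len_require below fastp's default of 30 lets read pairs
    with a shorter overlap be merged.
    """
    return [
        "fastp",
        "--in1", r1,
        "--in2", r2,
        "--merge",
        "--merged_out", merged_out,
        "--overlap_len_require", str(overlap_len_require),
        "--overlap_diff_limit", str(overlap_diff_limit),
    ]


# Example with made-up file names; print the command instead of running it.
cmd = fastp_merge_command("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                          "sample_merged.fastq.gz")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it could be handed to `subprocess.run` without shell quoting issues.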
Software updates
- Update bedtools version to 2.31.1 and htslib to 1.20 (container from Seqera)
- Update samtools version to 1.20 (container from Seqera; except for bam2pro and mlRho rules)
- Update RepeatModeler to version 2.0.5
Software versions
- python 3.12.3
- snakemake 8.14.0
- biopython 1.83
- matplotlib 3.8.4
- pandas 2.2.2
- numpy 1.26.4
- snakemake-executor-plugin-slurm 0.6.0
- bwa 0.7.17
- samtools 1.20 (mlRho rules are run in a container with samtools 1.9 and mlRho)
- picard 2.26.6
- repeatmodeler 2.0.5
- repeatmasker 4.1.5
- bedtools 2.31.1
- fastqc 0.12.1
- multiqc 1.9
- fastp 0.22.0
- qualimap 2.3
- gatk 3.7
- mapdamage 2.0.9
- bcftools 1.20
- mlrho 2.9
- plink 1.9
- vcftools v0.1.16
- snpeff 4.3.1
- seqtk 1.4
- gerp 2.1
New features
- Under `utilities/mutational_load_snpeff`, a new Snakemake pipeline has been added to process snpEff results for the purpose of calculating mutational load
Software updates
- Snakemake has been upgraded to version 8 with some larger changes in the source code. Most importantly for GenErode, the execution on slurm clusters has been implemented in Snakemake itself.
- Update QualiMap version to 2.3 (container from Seqera)
- Switch Plink container to the container from GalaxyProject
- Update BCFtools version to 1.20 (container from GalaxyProject)
- Update seqtk version to 1.4 (container from Seqera)
- Switch GERP++ container to the container from GalaxyProject
- Calculate memory inside of rules based on `mem_mb` provided under `resources` instead of based on `threads`
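As a rough illustration of deriving memory from `mem_mb` rather than from `threads`, the sketch below converts a memory budget in MB into a Java `-Xmx` flag. The helper name `java_heap_flag` and the 10 % headroom left for non-heap memory are assumptions made for this example, not GenErode's actual implementation.

```python
def java_heap_flag(mem_mb, overhead_percent=10):
    """Return a Java -Xmx flag sized from a job's memory budget in MB.

    Hypothetical helper: reserves `overhead_percent` of the budget for
    memory outside the Java heap. Integer arithmetic avoids rounding
    surprises.
    """
    if mem_mb <= 0:
        raise ValueError("mem_mb must be positive")
    if not 0 <= overhead_percent < 100:
        raise ValueError("overhead_percent must be in [0, 100)")
    heap_mb = max(1, mem_mb * (100 - overhead_percent) // 100)
    return f"-Xmx{heap_mb}m"


# Example: a job that requests 8000 MB gets a 7200 MB Java heap.
print(java_heap_flag(8000))
```

In a Snakemake rule, a function like this would typically be fed the value of `resources.mem_mb`, so the heap size follows the memory request directly instead of being inferred from the number of threads.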
Software versions
- python 3.12.3
- snakemake 8.14.0
- biopython 1.83
- matplotlib 3.8.4
- pandas 2.2.2
- numpy 1.26.4
- snakemake-executor-plugin-slurm 0.6.0
- bwa 0.7.17
- samtools 1.9
- picard 2.26.6
- repeatmodeler 2.0.4
- repeatmasker 4.1.5
- bedtools 2.29.2
- fastqc 0.12.1
- multiqc 1.9
- fastp 0.22.0
- qualimap 2.3
- gatk 3.7
- mapdamage 2.0.9
- bcftools 1.20
- mlrho 2.9
- plink 1.9
- vcftools v0.1.16
- snpeff 4.3.1
- seqtk 1.4
- gerp 2.1
New features
- Option to remove sex chromosome-linked scaffolds/contigs from the final BCF files and downstream analyses. Can also be used to remove any other scaffolds/contigs.
Minor bug fixes and upgrades
- Update RepeatModeler version to 2.0.4 to be able to handle large genomes. With the new version, the rules to copy RepeatModeler libraries and to run RepeatClassifier are no longer required and have been removed.
- Update RepeatMasker version to 4.1.5 (from the new RepeatModeler container)
- Fix the input for rule missingness_filtered_vcf_multiqc so that it also works when GenErode is only run with modern or only with historical samples
- Remove `*.bai` files from mlRho rule input to avoid triggering re-runs of mapping
- Update FastQC version to 0.12.1 with larger default memory allocation
- Replace the `rescale_gerp` rule with the gerpcol parameter `-s 0.001` in the `compute_gerp` rule. The same functionality is ensured with fewer intermediate files, and users can change the scaling parameter themselves if necessary for their project.
- Fix file path for temporary fastp output file
- Fix the documentation regarding the input tree scaling for GERP which should be in millions of years, as (correctly) provided from timetree.org
- Add MultiQC reports for merged VCF file to the pipeline report
- Multiple changes to avoid triggering re-runs or duplication of files: keep merged VCF file for testing of missingness filters, do not copy the repeatmask-BED file from the reference location to the GenErode results directory
- Automatically determine memory allocation to `-Xmx` in GATK RealignerTargetCreator and IndelRealigner for more efficient memory use
- Remove flag `-a` from RepeatMasker command so that *.fasta.align file is not created since it is not needed by downstream analyses
See https://github.com/NBISweden/GenErode/pull/58 for the code changes
Bug fixes
- Fix filter for missing data in merged VCF across all samples for `f_missing: 0.0` (no missing data allowed) and `f_missing: 1.0` (any level of missing data allowed)
- Correct input file name for rule `index_realigned_bams`
See https://github.com/NBISweden/GenErode/pull/42 for the code changes
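The two `f_missing` boundary cases fixed in this release can be sketched as follows. This is a simplified illustration, not the pipeline's code: it assumes a site is kept when its fraction of missing genotypes is less than or equal to the threshold, and the function names and genotype strings are hypothetical.

```python
def fraction_missing(genotypes):
    """Fraction of samples with a missing genotype at one site.

    A genotype counts as missing if it contains '.', e.g. './.'.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    missing = sum(1 for gt in genotypes if "." in gt)
    return missing / len(genotypes)


def passes_missingness_filter(genotypes, f_missing):
    """Keep a site if its missing fraction does not exceed f_missing.

    Assumed inclusive threshold: 0.0 keeps only sites without any
    missing genotype, 1.0 keeps every site.
    """
    if not 0.0 <= f_missing <= 1.0:
        raise ValueError("f_missing must be between 0.0 and 1.0")
    return fraction_missing(genotypes) <= f_missing


# Boundary cases with made-up genotypes for three samples.
complete_site = ["0/0", "0/1", "1/1"]
empty_site = ["./.", "./.", "./."]
print(passes_missingness_filter(complete_site, 0.0))
print(passes_missingness_filter(empty_site, 1.0))
```

With an inclusive comparison, both boundary values behave as described in the changelog entry without needing special-case branches.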
Bug fixes and upgrades
- Update file names of output files and corresponding code in the mitogenome mapping step to solve conflicts
- Upgrade conda environment file to install Snakemake version 7.20.0
- Add a slurm profile configuration file, compatible with current slurm profile and Snakemake version 7
- Update the GenErode pipeline report code to be compatible with Snakemake version 7.20.0
- Fix rule names in `cluster.yaml` file
- Add a rule to `localrule` in the mitogenome mapping step
See https://github.com/NBISweden/GenErode/pull/35 for the code changes
Updates related to large genome sizes and/or large sample sizes
- Run snpEff with option to specify -Xmx for large genomes and add the rules to cluster.yaml
- Fix y-axis labels for mutational load plot so that there is no overlap for large sample sizes
- Create new Docker images with bedtools and htslib (bgzip) so that VCF files filtered with bedtools can be compressed in a pipe to reduce intermediate file sizes
Minor bug fixes
- Update conda in GitHub actions to reduce run time
- Shorten run time and lower number of cores for mutational load calculations in `cluster.yaml`
- Remove `temp` flag from bam index file of rescaled bam files
- Embed pipeline logo into GenErode pipeline report via link to file on repository so that the pipeline report can be moved to a different location
- Fix "rerun incomplete" warning for rule `make_reference_bed` by separating it from the group job `reference_prep_group`
See https://github.com/NBISweden/GenErode/pull/23 for the code changes
- Release of public version of GenErode
Changes since version 0.4.0 (unpublished):
- Bug fix of python code to create output file lists for different CpG filtering methods
- Updated documentation
- Removed legacy code