Merge pull request #161 from sanger-tol/dev
2.1.0 release
tkchafin authored Jan 13, 2025
2 parents d0ec90c + ee78df8 commit 26392b0
Showing 129 changed files with 6,071 additions and 758 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -31,7 +31,7 @@ jobs:
uses: actions/checkout@v3

- name: Install Nextflow
uses: nf-core/setup-nextflow@v1
uses: nf-core/setup-nextflow@v2
with:
version: "${{ matrix.NXF_VER }}"

35 changes: 35 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,41 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[2.1.0](https://github.com/sanger-tol/genomenote/releases/tag/2.1.0)] - Pembroke Welsh Corgi [2024-12-11]

### Enhancements & fixes

- New annotation_statistics subworkflow, which runs BUSCO in protein mode and generates basic statistics on the annotated gene set when a GFF3 file of gene annotations is supplied via the `--annotation_set` option.
- The genome_metadata subworkflow now queries Ensembl's GraphQL API to determine whether Ensembl has released a gene annotation for the assembly being processed.
- Module updates and removal of Anaconda channels
- Removed the MerquryFK completeness metric

### Parameters

| Old parameter | New parameter |
| ------------- | ---------------- |
| | --annotation_set |

> **NB:** Parameter has been **updated** if both old and new parameter information is present. <br> **NB:** Parameter has been **added** if just the new parameter information is present. <br> **NB:** Parameter has been **removed** if new parameter information isn't present.

### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ---------------------------------------- | ---------------------------------------- |
| `agat` | | 1.4.0 |
| `bedtools` | 2.30.0 | 2.31.1 |
| `busco` | 5.5.0 | 5.7.1 |
| `cooler` | 0.8.11 | 0.9.2 |
| `fastk` | 427104ea91c78c3b8b8b49f1a7d6bbeaa869ba1c | 666652151335353eef2fcd58880bcef5bc2928e1 |
| `gffread` | | 0.12.7 |
| `merquryfk` | d00d98157618f4e8d1a9190026b19b471055b22e | |
| `multiqc` | 1.14 | 1.25.1 |
| `samtools` | 1.17 | 1.21 |

> **NB:** Dependency has been **updated** if both old and new version information is present. <br> **NB:** Dependency has been **added** if just the new version information is present. <br> **NB:** Dependency has been **removed** if version information isn't present.

## [[2.0.0](https://github.com/sanger-tol/genomenote/releases/tag/2.0.0)] - English Cocker Spaniel [2024-10-10]

### Enhancements & fixes
Expand Down
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -8,8 +8,8 @@ message: >-
metadata from this file.
type: software
authors:
- given-names: Sandra
family-names: Babiyre
- given-names: Sandra Ruth
family-names: Babirye
affiliation: Wellcome Sanger Institute
orcid: "https://orcid.org/0009-0004-7773-7008"
- given-names: Tyler
12 changes: 10 additions & 2 deletions CITATIONS.md
@@ -12,6 +12,10 @@
## Pipeline tools

- [AGAT](https://github.com/NBISweden/AGAT)

> Dainat J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. (Version v1.4.0). Zenodo. https://www.doi.org/10.5281/zenodo.3552717
- [BedTools](https://bedtools.readthedocs.io/en/latest/)

> Quinlan, Aaron R., and Ira M. Hall. “BEDTools: A Flexible Suite of Utilities for Comparing Genomic Features.” Bioinformatics, vol. 26, no. 6, 2010, pp. 841–842., https://doi.org/10.1093/bioinformatics/btq033.
@@ -30,6 +34,10 @@
- [FastK](https://github.com/thegenemyers/FASTK)

- [GFFREAD](https://github.com/gpertea/gffread)

> Pertea G and Pertea M. "GFF Utilities: GffRead and GffCompare [version 1; peer review: 3 approved]". F1000Research 2020, 9:304 https://doi.org/10.12688/f1000research.23297.1
- [MerquryFK](https://github.com/thegenemyers/MERQURY.FK)

- [MultiQC](https://multiqc.info)
Expand All @@ -48,9 +56,9 @@
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
- [Conda](https://conda.org/)

> Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.
> conda contributors. conda: A system-level, binary package and environment manager running on all major operating systems and platforms. Computer software. https://github.com/conda/conda
- [Bioconda](https://bioconda.github.io)

8 changes: 5 additions & 3 deletions README.md
@@ -4,7 +4,7 @@
[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.7949384-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.7949384)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A522.10.1-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=conda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/sanger-tol/genomenote)
@@ -13,7 +13,7 @@

## Introduction

**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and chromosomal grid using Cooler, and display on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, and (4) HiC primary mapped percentage from samtools flagstat.
**sanger-tol/genomenote** is a bioinformatics pipeline that takes aligned HiC reads, creates contact maps and a chromosomal grid using Cooler, and displays them on a [HiGlass server](https://genome-note-higlass.tol.sanger.ac.uk/app). The pipeline also collates (1) assembly information, statistics and chromosome details from NCBI datasets, (2) genome completeness from BUSCO, (3) consensus quality and k-mer completeness from MerquryFK, (4) HiC primary mapped percentage from samtools flagstat and, optionally, (5) annotation statistics from AGAT and BUSCO. The pipeline combines the calculated statistics and collated assembly metadata with a template document to output a genome note document.

<!--![sanger-tol/genomenote workflow](https://raw.githubusercontent.com/sanger-tol/genomenote/main/docs/images/sanger-tol-genomenote_workflow.png)-->

@@ -25,7 +25,9 @@
6. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
7. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
8. Collated summary table ([`createtable`](bin/create_table.py))
9. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
9. Optionally calculates annotation statistics and completeness ([`AGAT`](https://github.com/NBISweden/AGAT), [`BUSCO`](https://busco.ezlab.org))
10. Combines calculated statistics and assembly metadata with a template file to produce a genome note document.
11. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

## Usage

Binary file modified assets/genome_note_template.docx
Binary file not shown.
2 changes: 2 additions & 0 deletions bin/combine_parsed_data.py
@@ -21,6 +21,7 @@
("COPO_BIOSAMPLE_HIC", "copo_biosample_hic_file"),
("COPO_BIOSAMPLE_RNA", "copo_biosample_rna_file"),
("GBIF_TAXONOMY", "gbif_taxonomy_file"),
("ENSEMBL_ANNOTATION", "ensembl_annotation_file"),
]


@@ -42,6 +43,7 @@ def parse_args(args=None):
parser.add_argument("--copo_biosample_hic_file", help="Input parsed COPO HiC biosample file.", required=False)
parser.add_argument("--copo_biosample_rna_file", help="Input parsed COPO RNASeq biosample file.", required=False)
parser.add_argument("--gbif_taxonomy_file", help="Input parsed GBIF taxonomy file.", required=False)
parser.add_argument("--ensembl_annotation_file", help="Input parsed Ensembl annotation file.", required=False)
parser.add_argument("--out_consistent", help="Output file.", required=True)
parser.add_argument("--out_inconsistent", help="Output file.", required=True)
parser.add_argument("--version", action="version", version="%(prog)s 1.0")
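The lookup-table pattern added here — a list of `(FILE_TYPE, argument_name)` tuples whose values are later fetched with `getattr()` — can be sketched in isolation. This is a minimal illustration, not the pipeline's exact code; the file names and invocation below are invented:

```python
import argparse

# Sketch of combine_parsed_data.py's lookup table: each optional input maps
# a label to an argparse destination name, and getattr() retrieves whichever
# files were actually supplied on the command line.
files = [
    ("GBIF_TAXONOMY", "gbif_taxonomy_file"),
    ("ENSEMBL_ANNOTATION", "ensembl_annotation_file"),
]

parser = argparse.ArgumentParser()
for _, arg_name in files:
    # "--gbif_taxonomy_file" becomes attribute args.gbif_taxonomy_file
    parser.add_argument(f"--{arg_name}", required=False)

# hypothetical invocation: only the GBIF file is provided
args = parser.parse_args(["--gbif_taxonomy_file", "gbif.csv"])

# keep only the inputs that were actually given
provided = {label: getattr(args, arg_name)
            for label, arg_name in files
            if getattr(args, arg_name) is not None}
print(provided)  # {'GBIF_TAXONOMY': 'gbif.csv'}
```

Adding a new optional input (as this commit does for `ENSEMBL_ANNOTATION`) then only requires appending one tuple and one `add_argument` call.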
18 changes: 14 additions & 4 deletions bin/combine_statistics_data.py
@@ -8,7 +8,8 @@

files = [
("CONSISTENT", "in_consistent"),
("STATISITCS", "in_statistics"),
("GENOME_STATISTICS", "in_genome_statistics"),
("ANNOTATION_STATISITCS", "in_annotation_statistics"),
]


@@ -19,7 +20,13 @@ def parse_args(args=None):
parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
parser.add_argument("--in_consistent", help="Input consistent params file.", required=True)
parser.add_argument("--in_inconsistent", help="Input inconsistent params file.", required=True)
parser.add_argument("--in_statistics", help="Input parsed genome statistics params file.", required=True)
parser.add_argument("--in_genome_statistics", help="Input parsed genome statistics params file.", required=True)
parser.add_argument(
"--in_annotation_statistics",
help="Input parsed annotation statistics params file.",
required=False,
default=None,
)
parser.add_argument("--out_consistent", help="Output file.", required=True)
parser.add_argument("--out_inconsistent", help="Output file.", required=True)
parser.add_argument("--version", action="version", version="%(prog)s 1.0")
@@ -36,7 +43,7 @@ def process_file(file_in, file_type, params, param_sets):
reader = csv.reader(infile)

for row in reader:
if row[0] == "#paramName":
if row[0].startswith("#"):
continue

key = row.pop(0)
@@ -95,7 +102,10 @@ def main(args=None):
params_inconsistent = {}

for file in files:
(params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)
if file[0] == "ANNOTATION_STATISITCS" and args.in_annotation_statistics is None:
continue
(params, param_sets) = process_file(getattr(args, file[1]), file[0], params, param_sets)

for key in params.keys():
value_set = {v for v in params[key]}
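The merge logic this hunk extends can be sketched on its own: comment lines (anything starting with `#`) are skipped, every value is recorded against its `paramName` together with its source, and a key counts as consistent only when all sources agree. The CSV contents below are invented sample data, not real pipeline output:

```python
import csv
import io

# Sketch (not the pipeline's exact code) of combine_statistics_data.py's
# merge step across per-source parameter CSVs.
def process_rows(rows, file_type, params):
    for row in csv.reader(rows):
        if not row or row[0].startswith("#"):
            continue  # skip comment and header lines
        key, value = row[0], row[1]
        params.setdefault(key, []).append((file_type, value))
    return params

# invented inputs mirroring the CONSISTENT / GENOME_STATISTICS pairing
consistent_csv = "#paramName,paramValue\nASSEMBLY_ID,GCA_000001\nTAXON,9606\n"
stats_csv = "# generated upstream\nTAXON,9606\nPCG,20000\n"

params = {}
process_rows(io.StringIO(consistent_csv), "CONSISTENT", params)
process_rows(io.StringIO(stats_csv), "GENOME_STATISTICS", params)

# a key is consistent when every source reported the same value
consistent = {k: vs[0][1] for k, vs in params.items() if len({v for _, v in vs}) == 1}
inconsistent = {k: vs for k, vs in params.items() if len({v for _, v in vs}) > 1}
print(consistent)
print(inconsistent)
```

Matching on `row[0].startswith("#")` rather than the literal `"#paramName"` — the change made in this commit — is what lets the per-key description comments written by `extract_annotation_statistics_info.py` pass through safely.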
154 changes: 154 additions & 0 deletions bin/extract_annotation_statistics_info.py
@@ -0,0 +1,154 @@
#!/usr/bin/env python3
import re
import csv
import sys
import argparse
import json


# Extract CDS information from mrna and transcript sections
def extract_cds_info(file):
# Define regex patterns for different statistics
patterns = {
"TRANSC_MRNA": re.compile(r"Number of mrna\s+(\d+)"),
"PCG": re.compile(r"Number of gene\s+(\d+)"),
"CDS_PER_GENE": re.compile(r"mean mrnas per gene\s+([\d.]+)"),
"EXONS_PER_TRANSC": re.compile(r"mean exons per mrna\s+([\d.]+)"),
"CDS_LENGTH": re.compile(r"mean mrna length \(bp\)\s+([\d.]+)"),
"EXON_SIZE": re.compile(r"mean exon length \(bp\)\s+([\d.]+)"),
"INTRON_SIZE": re.compile(r"mean intron in cds length \(bp\)\s+([\d.]+)"),
}

# Initialize a dictionary to store content for different sections
section_content = {"mrna": "", "transcript": ""}

# Variable to keep track of the current section being processed
current_section = None

with open(file, "r") as f:
lines = f.read().splitlines() # read all lines in the file

for line in lines:
line = line.strip() # Remove any leading/trailing whitespace including newline characters

if "---------------------------------- mrna ----------------------------------" in line:
current_section = "mrna" # Switch to 'mrna' section
elif "---------------------------------- transcript ----------------------------------" in line:
current_section = "transcript" # Switch to 'transcript' section
elif "----------" in line:
current_section = None # End of current section
elif current_section:
section_content[current_section] += (
line + " "
) # Accumulate content for the current section, separate lines by a space

cds_info = {}

for label, pattern in patterns.items():
text_to_search = section_content["mrna"] if label != "EXONS_PER_TRANSC" else section_content["transcript"]
match = re.search(pattern, text_to_search)
if match:
cds_info[label] = match.group(1)

return cds_info


# Function to extract the number of non-coding genes from the second file
def extract_non_coding_genes(file):
non_coding_genes = {"ncrna_gene": 0}

with open(file, "r") as f:
for line in f:
parts = line.split()
if len(parts) < 2:
continue

gene_type = parts[0]
try:
count = int(parts[1])
except ValueError:
continue

if gene_type in non_coding_genes:
non_coding_genes[gene_type] += count

NCG = sum(non_coding_genes.values())
return {"NCG": NCG}


# Extract the one_line_summary from a BUSCO JSON file
def extract_busco_results(busco_stats_file):
try:
with open(busco_stats_file, "r") as file:
busco_data = json.load(file)
# Extract the one_line_summary from the results section
one_line_summary = busco_data.get("results", {}).get("one_line_summary")
if one_line_summary:
# Use regex to extract everything after the first colon
match = re.search(r':\s*"(.*)"', one_line_summary)
if match:
one_line_summary = match.group(1) # Get text after the colon
return {"BUSCO_PROTEIN_SCORES": one_line_summary} if one_line_summary else {}
except (json.JSONDecodeError, FileNotFoundError) as e:
print(f"Error loading BUSCO JSON file: {e}")
return {}


# Function to write the extracted data to a CSV file
def write_to_csv(data, output_file, busco_stats_file):
busco_results = extract_busco_results(busco_stats_file)

descriptions = {
"TRANSC_MRNA": "The number of transcribed mRNAs",
"PCG": "The number of protein coding genes",
"NCG": "The number of non-coding genes",
"CDS_PER_GENE": "The average number of coding transcripts per gene",
"EXONS_PER_TRANSC": "The average number of exons per transcript",
"CDS_LENGTH": "The average length of coding sequence",
"EXON_SIZE": "The average length of a coding exon",
"INTRON_SIZE": "The average length of coding intron size",
"BUSCO_PROTEIN_SCORES": "BUSCO results summary from running BUSCO in protein mode",
}

with open(output_file, "w", newline="") as csvfile:
writer = csv.writer(csvfile)

# Write descriptions at the top of the CSV file
for key, description in descriptions.items():
csvfile.write(f"# {key}: {description}\n")

# Write the Variable and Value columns header
writer.writerow(["#paramName", "paramValue"])

# Write the data
for key, value in data.items():
writer.writerow([key, value])

# Add the BUSCO results summary
for key, value in busco_results.items():
writer.writerow([key, value])


# Main function to take input files and output file as arguments
def main():
Description = "Parse contents of the agat_spstatistics, buscoproteins and agat_sqstatbasic to extract relevant annotation statistics information."
Epilog = (
"Example usage: python extract_annotation_statistics_info.py <basic_stats> <other_stats> <busco_stats> <output>"
)

parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
parser.add_argument("basic_stats", help="Input txt file with basic_feature_statistics.")
parser.add_argument("other_stats", help="Input txt file with other_feature_statistics.")
parser.add_argument("busco_stats", help="Input JSON file for the BUSCO statistics.")
parser.add_argument("output", help="Output file.")
parser.add_argument("--version", action="version", version="%(prog)s 1.0")
args = parser.parse_args()

cds_info = extract_cds_info(args.other_stats)
non_coding_genes = extract_non_coding_genes(args.basic_stats)
data = {**cds_info, **non_coding_genes}
write_to_csv(data, args.output, args.busco_stats)


if __name__ == "__main__":
sys.exit(main())
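The regex-driven extraction in `extract_cds_info()` can be exercised in isolation. The fragment below is an invented, heavily shortened stand-in for AGAT statistics text (already joined into one space-separated string, as the script does when accumulating section content):

```python
import re

# Two of the patterns from extract_cds_info(), run against an invented
# fragment of AGAT-style statistics text.
patterns = {
    "TRANSC_MRNA": re.compile(r"Number of mrna\s+(\d+)"),
    "CDS_LENGTH": re.compile(r"mean mrna length \(bp\)\s+([\d.]+)"),
}

mrna_section = "Number of mrna 21408 mean mrna length (bp) 1523.4 "

cds_info = {}
for label, pattern in patterns.items():
    match = pattern.search(mrna_section)
    if match:
        cds_info[label] = match.group(1)  # captured numeric value, kept as a string

print(cds_info)  # {'TRANSC_MRNA': '21408', 'CDS_LENGTH': '1523.4'}
```

Because the values stay strings, they can be written straight into the `paramName,paramValue` CSV that `combine_statistics_data.py` consumes downstream.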
