Merge pull request #146 from sanger-tol/dev

Release 2.0.0
sanger-tol · Oct 10, 2024 · d0ec90c
2 parents 2208ff8 + 5a54958
commit d0ec90c

Showing 49 changed files with 3,403 additions and 57 deletions.
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -19,7 +19,7 @@ jobs:
       - uses: actions/setup-node@v3
 
       - name: Install editorconfig-checker
-        run: npm install -g editorconfig-checker
+        run: npm install -g editorconfig-checker@3.0.2
 
       - name: Run ECLint check
         run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|cff\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile')
@@ -32,7 +32,7 @@ jobs:
       - uses: actions/setup-node@v3
 
       - name: Install Prettier
-        run: npm install -g prettier
+        run: npm install -g prettier@3.1.0
 
       - name: Run Prettier --check
         run: prettier --check ${GITHUB_WORKSPACE}

diff --git a/.nf-core.yml b/.nf-core.yml
@@ -20,3 +20,4 @@ lint:
   multiqc_config:
     - report_comment
   actions_ci: false
+  template_strings: False
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,36 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[2.0.0](https://github.com/sanger-tol/genomenote/releases/tag/2.0.0)] - English Cocker Spaniel [2024-10-10]
+
+### Enhancements & fixes
+
+- New genome_metadata subworkflow to fetch metadata linked to the genome assembly from various sources (COPO, GoaT, GBIF, ENA, NCBI). The options `--assembly`, `--biosample_wgs`, `--biosample_hic` and `--biosample_rna` specify what metadata to fetch and process.
+- Now outputs a partially completed genome note document based on a template file which contains placeholder parameters. These placeholders are replaced with data generated by the pipeline. The template file to use can be specified using the `--note_template` option.
+- Added the `--write_to_portal` option to write a set of key-value data parameters to a Genome Notes database.
+- Added the `--upload_higlass_data` option to automatically upload the Hi-C map to a Kubernetes-hosted HiGlass server.
+- Bugfix: don't rely on fasta file name to correctly set assembly accession needed for use with `ncbi datasets`.
+- Bugfix: ensure meta.id is used consistently.
+
+### Parameters
+
+| Old parameter | New parameter              |
+| ------------- | -------------------------- |
+|               | --assembly                 |
+|               | --biosample_wgs            |
+|               | --biosample_hic            |
+|               | --biosample_rna            |
+|               | --write_to_portal          |
+|               | --genome_notes_api         |
+|               | --note_template            |
+|               | --upload_higlass_data      |
+|               | --higlass_url              |
+|               | --higlass_deployment_name  |
+|               | --higlass_namespace        |
+|               | --higlass_kubeconfig       |
+|               | --higlass_upload_directory |
+|               | --higlass_data_project_dir |
+
 ## [[1.2.2](https://github.com/sanger-tol/genomenote/releases/tag/1.2.2)] - Pyrenean Mountain Dog (patch 2) - [2024-09-10]
 
 ### Enhancements & fixes

diff --git a/CITATION.cff b/CITATION.cff
@@ -2,16 +2,24 @@
 # Visit https://bit.ly/cffinit to generate yours today!
 
 cff-version: 1.2.0
-title: sanger-tol/genomenote v1.2.2
+title: sanger-tol/genomenote v2.0.0
 message: >-
     If you use this software, please cite it using the
     metadata from this file.
 type: software
 authors:
+    - given-names: Sandra
+      family-names: Babiyre
+      affiliation: Wellcome Sanger Institute
+      orcid: "https://orcid.org/0009-0004-7773-7008"
     - given-names: Tyler
       family-names: Chafin
       affiliation: Wellcome Sanger Institute
       orcid: "https://orcid.org/0000-0001-8687-5905"
+    - given-names: Chau
+      family-names: Duong
+      affiliation: Wellcome Sanger Institute
+      orcid: "https://orcid.org/0009-0001-0649-2291"
     - given-names: Matthieu
       family-names: Muffato
       affiliation: Wellcome Sanger Institute
@@ -38,5 +46,5 @@ identifiers:
 repository-code: "https://github.com/sanger-tol/genomenote"
 license: MIT
 commit: TODO
-version: 1.2.2
+version: 2.0.0
 date-released: "2022-10-07"
diff --git a/README.md b/README.md
@@ -17,14 +17,15 @@
 
 <!--![sanger-tol/genomenote workflow](https://raw.githubusercontent.com/sanger-tol/genomenote/main/docs/images/sanger-tol-genomenote_workflow.png)-->
 
-1. Summary statistics ([`NCBI datasets summary genome accession`](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/summary/genome/datasets_summary_genome_accession/))
-2. Convert alignment to BED ([`samtools view`](https://www.htslib.org/doc/samtools-view.html), [`bedtools bamtobed`](https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html))
-3. Filter BED ([`GNU sort`](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html), [`filter bed`](https://raw.githubusercontent.com/sanger-tol/genomenote/main/bin/filter_bed.sh))
-4. Contact maps ([`Cooler cload`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-cload-pairs), [`Cooler zoomify`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-zoomify), [`Cooler dump`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-dump))
-5. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
-6. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
-7. Collated summary table ([`createtable`](bin/create_table.py))
-8. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
+1. Fetches genome metadata from [ENA](https://www.ebi.ac.uk/ena/browser/api/#/ENA_Browser_Data_API), [NCBI](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/rest-api), and [GoaT](https://goat.genomehubs.org/api-docs/)
+2. Summary statistics ([`NCBI datasets summary genome accession`](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/summary/genome/datasets_summary_genome_accession/))
+3. Convert alignment to BED ([`samtools view`](https://www.htslib.org/doc/samtools-view.html), [`bedtools bamtobed`](https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html))
+4. Filter BED ([`GNU sort`](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html), [`filter bed`](https://raw.githubusercontent.com/sanger-tol/genomenote/main/bin/filter_bed.sh))
+5. Contact maps ([`Cooler cload`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-cload-pairs), [`Cooler zoomify`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-zoomify), [`Cooler dump`](https://cooler.readthedocs.io/en/latest/cli.html#cooler-dump))
+6. Genome completeness ([`NCBI API`](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/rest-api/), [`BUSCO`](https://busco.ezlab.org))
+7. Consensus quality and k-mer completeness ([`FASTK`](https://github.com/thegenemyers/FASTK), [`MERQURY.FK`](https://github.com/thegenemyers/MERQURY.FK))
+8. Collated summary table ([`createtable`](bin/create_table.py))
+9. Present results and visualisations ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
 
 ## Usage
 
@@ -52,6 +53,9 @@ nextflow run sanger-tol/genomenote \
    -profile <docker/singularity/.../institute> \
    --input samplesheet.csv \
    --fasta genome.fasta \
+   --assembly GCA_922984935.2 \
+   --bioproject PRJEB49353 \
+   --biosample SAMEA7524400 \
    --outdir <OUTDIR>
 ```
 
@@ -69,8 +73,9 @@ sanger-tol/genomenote was originally written by [Priyanka Surana](https://github
 We thank the following people for their assistance in the development of this pipeline:
 
 - [Matthieu Muffato](https://github.com/muffato)
+- [Beth Yates](https://github.com/BethYates)
 - [Shane McCarthy](https://github.com/mcshane) and [Yumi Sims](https://github.com/yumisims) for providing software and algorithm guidance.
-- [Cibin Sadasivan Baby](https://github.com/cibinsb) and [Beth Yates](https://github.com/BethYates) for providing reviews.
+- [Cibin Sadasivan Baby](https://github.com/cibinsb) for providing reviews.
 
 ## Contributions and Support
 

diff --git a/assets/genome_metadata_template.csv b/assets/genome_metadata_template.csv
@@ -0,0 +1,9 @@
+#File_source,File_type,Url,Output_type
+ENA,Assembly,https://www.ebi.ac.uk/ena/browser/api/xml/ASSEMBLY_ACCESSION,xml
+ENA,Bioproject,https://www.ebi.ac.uk/ena/browser/api/xml/BIOPROJECT_ACCESSION,xml
+ENA,Biosample,https://www.ebi.ac.uk/ena/browser/api/xml/BIOSAMPLE_ACCESSION,xml
+ENA,Taxonomy,https://www.ebi.ac.uk/ena/browser/api/xml/TAXONOMY_ID,xml
+NCBI,Assembly,https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/ASSEMBLY_ACCESSION/dataset_report?filters.exclude_atypical=false&filters.assembly_version=current&chromosomes=1&chromosomes=2&chromosomes=3&chromosomes=X&chromosomes=Y&chromosomes=M,json
+NCBI,Taxonomy,https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=TAXONOMY_ID,xml
+GOAT,Assembly,http://goat.genomehubs.org/api/v2/record?recordId=ASSEMBLY_ACCESSION&result=assembly&taxonomy=ncbi,json
+COPO,Biosample,https://copo-project.org/api/sample/biosampleAccession/BIOSAMPLE_ACCESSION?standard=tol&return_type=json,json
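Review note: each URL in this template carries an upper-case token (`ASSEMBLY_ACCESSION`, `BIOPROJECT_ACCESSION`, `BIOSAMPLE_ACCESSION`, `TAXONOMY_ID`) that presumably gets swapped for a real value before the request is made. The code that does this is not part of this section of the diff, so the following is only a minimal sketch of one way it could work. The function name `fill_urls` and the taxonomy ID `12345` are hypothetical; the assembly accession is the one used in the README example.

```python
import csv
import io

# Two rows copied from assets/genome_metadata_template.csv in this diff.
TEMPLATE = """#File_source,File_type,Url,Output_type
ENA,Assembly,https://www.ebi.ac.uk/ena/browser/api/xml/ASSEMBLY_ACCESSION,xml
ENA,Taxonomy,https://www.ebi.ac.uk/ena/browser/api/xml/TAXONOMY_ID,xml
"""


def fill_urls(template_text, values):
    """Return template rows with placeholder tokens in the URL replaced.

    `values` maps a token (e.g. "ASSEMBLY_ACCESSION") to its replacement.
    This helper is hypothetical, not the pipeline's own implementation.
    """
    rows = []
    for row in csv.reader(io.StringIO(template_text)):
        # Skip blank lines and the commented header line.
        if not row or row[0].startswith("#"):
            continue
        source, file_type, url, output_type = row
        for token, value in values.items():
            url = url.replace(token, value)
        rows.append((source, file_type, url, output_type))
    return rows


rows = fill_urls(
    TEMPLATE,
    {
        "ASSEMBLY_ACCESSION": "GCA_922984935.2",  # from the README usage example
        "TAXONOMY_ID": "12345",  # hypothetical value for illustration
    },
)
for source, file_type, url, output_type in rows:
    print(source, file_type, url, output_type)
```

Plain string replacement is enough here because the tokens are distinct and none is a substring of another.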
diff --git a/assets/genome_note_template.docx b/assets/genome_note_template.docx
diff --git a/assets/genome_note_template.xml b/assets/genome_note_template.xml
@@ -0,0 +1,34 @@
+<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article>
+<article>
+    <body>
+        <sec>
+            <title>Species taxonomy</title>
+            <p>{{ TAX_STRING }};
+                <italic>{{ GENUS }}</italic>;
+                <italic>{{ GENUS_SPECIES }}</italic> ($TAXONOMY_AUTHORITY) (NCBI:txid{{ NCBI_TAXID }}) {{ TEST_NOT_REPLACED }}.
+            </p>
+        </sec>
+        <sec>
+            <table>
+                <thead>
+                    <tr>
+                        <th align="center" valign="top">INSDC accession</th>
+                        <th align="center" valign="top">Chromosome</th>
+                        <th align="center" valign="top">Length (Mb)</th>
+                        <th align="center" valign="top">GC%</th>
+                    </tr>
+                </thead>
+                <tbody>
+                    {% for chromosome in CHR_TABLE %}
+                    <tr>
+                        <td align="left" valign="top">{{ chromosome.get('Accession') }}</td>
+                        <td align="center" valign="top">{{ chromosome.get('Chromosome') }}</td>
+                        <td align="center" valign="top">{{ chromosome.get('Length') }}</td>
+                        <td align="center" valign="top">{{ chromosome.get('GC') }}</td>
+                    </tr>
+                    {% endfor %}
+                </tbody>
+            </table>
+        </sec>
+    </body>
+</article>
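Review note: the template mixes Jinja-style `{{ NAME }}` placeholders with a `{% for %}` loop, and the `{{ TEST_NOT_REPLACED }}` entry suggests that placeholders without a value are expected to survive rendering. The rendering code is not shown in this section of the diff, so the sketch below is a simplified stand-in built on the standard library, not the pipeline's method and not Jinja2. It handles only simple `{{ NAME }}` substitution and ignores loops. The helper name `fill_placeholders` and the taxon ID `12345` are hypothetical.

```python
import re

# Matches {{ NAME }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(text, values):
    """Replace {{ NAME }} with values[NAME]; leave unknown names untouched."""

    def substitute(match):
        name = match.group(1)
        # Keep the original placeholder text when no value is available.
        return str(values[name]) if name in values else match.group(0)

    return PLACEHOLDER.sub(substitute, text)


line = "<italic>{{ GENUS_SPECIES }}</italic> (NCBI:txid{{ NCBI_TAXID }}) {{ TEST_NOT_REPLACED }}."
filled = fill_placeholders(
    line,
    {
        "GENUS_SPECIES": "Ceramica pisi",  # species named in the test samplesheet paths
        "NCBI_TAXID": "12345",  # hypothetical value for illustration
    },
)
print(filled)
```

Leaving unmatched placeholders in the output makes missing data easy to spot in the partially completed genome note.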
diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,5 +1,4 @@
 sample,datatype,datafile
-uoEpiScrs1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/genomic_data/uoEpiScrs1/pacbio/m64228e_220617_134154.ccs.bc1015_BAK8B_OA--bc1015_BAK8B_OA.rmdup.subset.bam
-uoEpiScrs1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/genomic_data/uoEpiScrs1/pacbio/m64016e_220621_193126.ccs.bc1008_BAK8A_OA--bc1008_BAK8A_OA.rmdup.subset.bam
-uoEpiScrs1c,hic,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/analysis/uoEpiScrs1.1/read_mapping/hic/GCA_946965045.1.unmasked.hic.uoEpiScrs1.subsampled.cram
-uoEpiScrs1b,hic,https://tolit.cog.sanger.ac.uk/test-data/Epithemia_sp._CRS-2021b/analysis/uoEpiScrs1.1/read_mapping/hic/GCA_946965045.1.unmasked.hic.uoEpiScrs1.subsampled.bam
+ilCerPisi1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Ceramica_pisi/genomic_data/ilCerPisi1/pacbio/m84047_230817_174414_s3.ccs.bc2048.subsampled.bam
+ilCerPisi1,pacbio,https://tolit.cog.sanger.ac.uk/test-data/Ceramica_pisi/genomic_data/ilCerPisi1/pacbio/m64097e_230309_154741.ccs.bc1012_BAK8A_OA--bc1012_BAK8A_OA.subsampled.bam
+ilCerPisi1,hic,https://tolit.cog.sanger.ac.uk/test-data/Ceramica_pisi/analysis/ilCerPisi1.1/read_mapping/hic/GCA_963859965.1.unmasked.hic.ilCerPisi2.subsampled.cram
diff --git a/bin/check_parameters.py b/bin/check_parameters.py
@@ -0,0 +1,145 @@
+#!/usr/bin/env python3
+
+import os
+import sys
+import requests
+import argparse
+
+
+def parse_args(args=None):
+    Description = "Use the genome assembly accession to fetch additional information about the genome from ENA"
+    Epilog = "Example usage: python check_parameters.py --assembly <accession> --wgs_biosample <accession> --output <file>"
+
+    parser = argparse.ArgumentParser(description=Description, epilog=Epilog)
+    parser.add_argument("--assembly", required=True, help="The INSDC accession for the assembly")
+    parser.add_argument("--wgs_biosample", required=True, help="The biosample accession for the WGS data")
+    parser.add_argument("--hic_biosample", required=False, help="The biosample accession for the Hi-C data")
+    parser.add_argument("--rna_biosample", required=False, help="The biosample accession for the RNASeq data")
+    parser.add_argument("--output", required=True, help="Output file path")
+    return parser.parse_args(args)
+
+
+def make_dir(path):
+    if len(path) > 0:
+        os.makedirs(path, exist_ok=True)
+
+
+def fetch_assembly_data(assembly, wgs_biosample, hic_biosample, rna_biosample, output_file):
+    url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=assembly_set_accession%3D%22{assembly}%22&result=assembly&fields=assembly_set_accession%2Ctax_id%2Cscientific_name%2Cstudy_accession&limit=0&download=true&format=json"
+    response = requests.get(url)
+
+    if response.status_code == 200:
+        assembly_data = response.json()
+        taxon_id = assembly_data[0].get("tax_id", None)
+        species = assembly_data[0].get("scientific_name", None).replace(" ", "_")
+        study = assembly_data[0].get("study_accession", None)
+        params = [assembly, species, taxon_id]
+        header = ["assembly", "species", "taxon_id"]
+
+        if study:
+            study_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=study_accession%3D%22{study}%22&result=study&fields=parent_study_accession&limit=0&download=true&format=json"
+            study_response = requests.get(study_url)
+
+            if study_response.status_code == 200:
+                study_data = study_response.json()
+                studies = study_data[0].get("parent_study_accession").split(";")
+                params.append(studies[0])
+                header.append("bioproject")
+
+            else:
+                raise AssertionError(f"Could not determine the Bioproject linked to this assembly {assembly}\n")
+        else:
+            raise AssertionError(f"Could not determine the Bioproject linked to this assembly {assembly}\n")
+
+        # Validate wgs_biosample
+        wgs_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=sample_accession%3D%22{wgs_biosample}%22&result=sample&fields=sample_accession%2Ctax_id&limit=0&download=true&format=json"
+        wgs_response = requests.get(wgs_url)
+
+        if wgs_response.status_code == 200:
+            wgs_data = wgs_response.json()
+            tax_id = wgs_data[0].get("tax_id")
+
+            if tax_id != taxon_id:
+                raise AssertionError(
+                    f"The WGS biosample taxon id: {tax_id} does not match the assembly taxon id: {taxon_id}\n"
+                )
+            else:
+                params.append(wgs_biosample)
+                header.append("wgs_biosample")
+
+        else:
+            raise AssertionError(f"The WGS biosample id: {wgs_biosample} could not be retrieved from ENA\n")
+
+        # Validate hic_biosample
+        if hic_biosample and hic_biosample != "null":
+            hic_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=sample_accession%3D%22{hic_biosample}%22&result=sample&fields=sample_accession%2Ctax_id&limit=0&download=true&format=json"
+            hic_response = requests.get(hic_url)
+
+            if hic_response.status_code == 200:
+                hic_data = hic_response.json()
+                hic_tax_id = hic_data[0].get("tax_id")
+
+                if hic_tax_id != taxon_id:
+                    raise AssertionError(
+                        f"The Hi-C biosample taxon id: {hic_tax_id} does not match the assembly taxon id: {taxon_id}\n"
+                    )
+                else:
+                    header.append("hic_biosample")
+                    params.append(hic_biosample)
+
+            else:
+                raise AssertionError(f"The Hi-C biosample id: {hic_biosample} could not be retrieved from ENA\n")
+        else:
+            header.append("hic_biosample")
+            params.append("null")
+
+        # Validate rna_biosample
+        if rna_biosample and rna_biosample != "null":
+            rna_url = f"https://www.ebi.ac.uk/ena/portal/api/search?query=sample_accession%3D%22{rna_biosample}%22&result=sample&fields=sample_accession%2Ctax_id&limit=0&download=true&format=json"
+            rna_response = requests.get(rna_url)
+
+            if rna_response.status_code == 200:
+                rna_data = rna_response.json()
+                rna_tax_id = rna_data[0].get("tax_id")
+
+                if rna_tax_id != taxon_id:
+                    raise AssertionError(
+                        f"The RNASeq biosample taxon id: {rna_tax_id} does not match the assembly taxon id: {taxon_id}\n"
+                    )
+                else:
+                    header.append("rna_biosample")
+                    params.append(rna_biosample)
+
+            else:
+                raise AssertionError(f"The RNASeq biosample id: {rna_biosample} could not be retrieved from ENA\n")
+
+        else:
+            header.append("rna_biosample")
+            params.append("null")
+
+        with open(output_file, "w") as fout:
+            # Write header
+            fout.write(",".join(header) + "\n")
+            fout.write(",".join(params) + "\n")
+
+            return output_file
+    else:
+        raise AssertionError(f"The assembly accession: {assembly} was not found\n")
+
+
+def main(args=None):
+    args = parse_args(args)
+    hic_biosample = args.hic_biosample
+    rna_biosample = args.rna_biosample
+    fetch_assembly_data(
+        args.assembly,
+        args.wgs_biosample,
+        hic_biosample,
+        rna_biosample,
+        args.output,
+    )
+
+
+if __name__ == "__main__":
+    sys.exit(main())
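Review note: the WGS, Hi-C and RNASeq checks in `fetch_assembly_data` repeat the same fetch-and-compare logic three times. As a suggestion only, they could share one helper along the lines sketched below. The names `check_biosample`, `fetch_json` and `fake_fetch` are hypothetical and not part of this PR; the sketch raises `ValueError` where the script raises `AssertionError`, and it treats a missing accession as `"null"` as the Hi-C and RNASeq branches do (the script itself requires `--wgs_biosample`). The lookup is passed in as a callable so the logic can be exercised without network access. The taxon IDs are made up; the accession is the one from the README example.

```python
def check_biosample(label, biosample, assembly_taxon_id, fetch_json):
    """Return the accession if its taxon matches the assembly, else raise.

    fetch_json is a hypothetical callable that takes a sample accession and
    returns the parsed ENA response (a list of records), or None on failure.
    """
    # Optional biosamples that were not supplied are recorded as "null".
    if not biosample or biosample == "null":
        return "null"

    records = fetch_json(biosample)
    if not records:
        raise ValueError(f"The {label} biosample id: {biosample} could not be retrieved from ENA")

    tax_id = records[0].get("tax_id")
    if tax_id != assembly_taxon_id:
        raise ValueError(
            f"The {label} biosample taxon id: {tax_id} does not match the assembly taxon id: {assembly_taxon_id}"
        )
    return biosample


def fake_fetch(accession):
    """Offline stand-in for the ENA lookup, with made-up data."""
    known = {"SAMEA7524400": [{"tax_id": "12345"}]}
    return known.get(accession)


print(check_biosample("WGS", "SAMEA7524400", "12345", fake_fetch))
print(check_biosample("Hi-C", None, "12345", fake_fetch))
```

In the script itself, `fetch_json` would wrap the existing `requests.get` call and return `response.json()` on a 200 status.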