Merge pull request #282 from labgem/dev

Merge dev branch into master to release version 2.1.2
labgem · Sep 16, 2024 · bc50066 · bc50066
2 parents 83c7d75 + fb539ff
commit bc50066
Show file tree

Hide file tree

Showing 35 changed files with 709 additions and 281 deletions.
diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-2.1.1
+2.1.2
diff --git a/docs/_static/tile_plot.png b/docs/_static/tile_plot.png
diff --git a/docs/_static/tile_plot_dendro.png b/docs/_static/tile_plot_dendro.png
diff --git a/docs/_static/tile_plot_no_dendro.png b/docs/_static/tile_plot_no_dendro.png
diff --git a/docs/user/PangenomeAnalyses/pangenomeCluster.md b/docs/user/PangenomeAnalyses/pangenomeCluster.md
@@ -124,17 +124,17 @@ flowchart TD
 
 #### Specify the representative gene
 
-It's possible to indicate which gene is the representative gene by adding a column between the gene family name.
+It's possible to indicate which gene is the representative gene by adding a third column.
 
 
-Here is a minimal example of your clustering file:
+Here is a minimal example of your clustering file with the third column being the representative gene:
 
 ```
-Family_A    Gene_2  Gene_1
+Family_A    Gene_1  Gene_2
 Family_A    Gene_2  Gene_2
-Family_A    Gene_2  Gene_3
+Family_A    Gene_3  Gene_2
+Family_B    Gene_4  Gene_4
 Family_B    Gene_5  Gene_4
-Family_B    Gene_5  Gene_5
 Family_C    Gene_6  Gene_6
 ```
 
@@ -174,25 +174,42 @@ flowchart TD
 
 #### Indicate fragmented gene
 
-It's also possible to indicate if the gene is fragmented, by adding a new column in last position. Fragmented gene are tag with an 'F' in the last column.
+You can indicate if a gene is fragmented by adding a new column. Fragmented genes are marked with an 'F' in this final column.
 
-You can add this column when you assume or not the representative gene. PPanGGOLiN will guess that this column is to precise the fragmented gene and assume if it must assert the representative gene
+The position of this column depends on whether you include a representative gene column:
+- Without a representative gene column, the fragmented gene column should be in the **third position**.
+- With a representative gene column, it should appear in the **fourth position**.
 
-Here is a minimal example of your clustering file with fragmented gene precise:
+##### Example 1: Clustering file without representative gene column (fragmented gene in 3rd column):
+```
+Family_A	Gene_1
+Family_A	Gene_2
+Family_A	Gene_3	F
+Family_B	Gene_4
+Family_B	Gene_5
+Family_C	Gene_6	F
+```
 
+##### Example 2: Clustering file with representative gene column (fragmented gene in 4th column):
 ```
-Family_A    Gene_2  Gene_1  
-Family_A    Gene_2  Gene_2  
-Family_A    Gene_2  Gene_3  'F'
-Family_B    Gene_5  Gene_4  
-Family_B    Gene_5  Gene_5  
-Family_C    Gene_6  Gene_6  'F'
+Family_A	Gene_1	Gene_2
+Family_A	Gene_2	Gene_2
+Family_A	Gene_3	Gene_2	F
+Family_B	Gene_4	Gene_4
+Family_B	Gene_5	Gene_4
+Family_C	Gene_6	Gene_6	F
 ```
 
 ```{warning}
-*The column order is
-important !*  You must first provide the cluster identifier, the representative id, and then the gene id to finish with
-the fragmented status of the gene.
+*Attention: Column Order Matters!*
+
+Please ensure that your columns follow the correct order:
+1. Cluster identifier
+2. Gene ID
+3. Representative gene ID (if present)
+4. Fragmented status ('F' if the gene is fragmented, or leave blank)
+
+If no representative gene column is included, the fragmented status should be placed in the **third column**.
 ```
 
 ### Defragmentation

diff --git a/docs/user/PangenomeAnalyses/pangenomeFigures.md b/docs/user/PangenomeAnalyses/pangenomeFigures.md
@@ -17,26 +17,37 @@ ppanggolin draw -p pangenome.h5 --ucurve
 
 #### Tile plot
 
-A tile plot is similar to a heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in a genome (the color depends on the number of gene copies) and uncolored if absent. The gene families are ordered by partition and then by a hierarchical clustering, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically two genomes that are close together in terms of gene family composition will be close together on the figure).
+A tile plot is a kind of heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in a genome (either in blue or red if the gene family has multiple gene copies) and uncolored if absent. The gene families are ordered by partition and then by their number of presences (increasing order), and the genomes are ordered by a hierarchical clustering based on their shared gene families via a Jaccard distance (basically two genomes that are close together in terms of gene family composition will be close together on the figure).
 
-This plot is quite helpful to observe potential structures in your pangenome, and can help you identify eventual outliers. You can interact with it, and mousing over a tile in the plot will indicate to you the gene identifier(s), the gene family and the genome corresponding to the tile.
+This plot is quite helpful to observe potential structures in your pangenome, and can help you identify eventual outliers. You can interact with it, and mousing over a tile in the plot will indicate the gene identifier(s), the gene family and the genome corresponding to the tile. As detailed below, additional metadata can be displayed.
 
-If you build your pangenome using a workflow subcommands (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 500 genomes, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.
+If you build your pangenome using a workflow subcommands (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 60k gene families, the plot will not be drawn; if you have more than 32 767 gene families, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise. Beyond the workflow subcommand, you can generate the plot with any number of gene families or genomes. However, no browser currently supports visualizing a tile plot containing more than 65 535 gene families or more than 65 535 genomes (for more information, refer to [this Stack Overflow discussion](https://stackoverflow.com/questions/78431835/plotly-heatmap-has-limit-on-data-size)
+ ).
 
-It can be generated using the 'draw' subcommand as such : 
+To generate a tile plot, use the 'draw' subcommand as follows:
 
 ```bash
 ppanggolin draw -p pangenome.h5 --tile_plot
 ```
 
-![tile plot figure](../../_static/tile_plot.png)
+![Tile plot figure](../../_static/tile_plot_no_dendro.png)
 
-and if you do not want the 'cloud' gene families as it is a lot of data and can be hard to open with a browser sometimes, you can use the following option : 
+If you prefer not to include 'cloud' gene families, which can make the plot difficult to render in a browser, you can use the `--nocloud` option:
 
 ```bash
 ppanggolin draw -p pangenome.h5 --tile_plot --nocloud
 ```
 
+To include a dendrogram on top of the tile plot, use the `--add_dendrogram` option:
+
+```bash
+ppanggolin draw -p pangenome.h5 --tile_plot --add_dendrogram
+```
+
+![Tile plot with dendrogram](../../_static/tile_plot_dendro.png)
+
+If you have added metadata to the gene elements of your pangenome (see [metadata documentation](../metadata.md) for details), you can display this metadata in the hover text by using the `--add_metadata` argument.
+
 #### Rarefaction curve
 This figure is not drawn by default in the 'workflow' subcommand as it requires a lot of computations. It represents the evolution of the number of gene families for each partition as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of the diversity that you are missing with your dataset on your taxonomic group (Tettelin et al., 2005). The idea is that if at some point when you keep adding genomes to your pangenome you do not add any more gene families, you might have access to your entire taxonomic group's diversity. On the contrary, if you are still adding a lot of genes you may be still missing a lot of gene families. 
 

diff --git a/docs/user/PangenomeAnalyses/pangenomeStat.md b/docs/user/PangenomeAnalyses/pangenomeStat.md
@@ -144,10 +144,19 @@ To generate these files, use the `write_pangenome` subcommand with the `--partit
 `ppanggolin write_pangenome -p pangenome.h5 --partitions`
 
 
-#### Gene Families to Genes Associations
 
-The `gene_families.tsv` file mirrors the format provided by [MMseqs2](https://github.com/soedinglab/MMseqs2) through its `createtsv` subcommand. This file structure comprises three columns: the gene family name in the first column, the gene names in the second, and a third column that remains empty or contains an "F" to denote potential gene fragments instead of complete genes. This indication appears only if the [defragmentation](./pangenomeCluster.md#defragmentation) pipeline has been used.
+#### Gene Families to Gene Associations
+
+The `gene_families.tsv` file consists of four columns:
+
+1. **Gene Family ID**: The identifier for the gene family.
+2. **Gene ID**: The identifier for the gene.
+3. **Local ID** (if applicable): This column is used when gene IDs from annotation files are not unique. In such cases, `ppanggolin` assigns an internal ID to genes, and this column helps map the internal ID to the local ID from the annotation file.
+4. **Fragmentation Status**: This column indicates whether the gene is fragmented. It is either empty or contains an "F" to signify potential gene fragments instead of complete genes. This status is provided only if the [defragmentation](./pangenomeCluster.md#defragmentation) pipeline has been used, which is the default behavior.
 
 To generate this file, use the `write_pangenome` subcommand with the `--families_tsv` flag:
 
-`ppanggolin write_pangenome -p pangenome.h5 --families_tsv`
+```bash
+ppanggolin write_pangenome -p pangenome.h5 --families_tsv
+```
+
diff --git a/ppanggolin/RGP/rgp_cluster.py b/ppanggolin/RGP/rgp_cluster.py
@@ -132,7 +132,7 @@ def compute_grr(rgp_a_families: Set[GeneFamily], rgp_b_families: Set[GeneFamily]
     :return: GRR value between 0 and 1
     """
 
-    grr = len((rgp_a_families & rgp_b_families)) / mode(len(rgp_a_families), len(rgp_b_families))
+    grr = len(rgp_a_families & rgp_b_families) / mode(len(rgp_a_families), len(rgp_b_families))
 
     return grr
 
@@ -147,7 +147,7 @@ def compute_jaccard_index(rgp_a_families: set, rgp_b_families: set) -> float:
     :return : Jaccard index
     """
 
-    jaccard_index = len((rgp_a_families & rgp_b_families)) / len(rgp_a_families | rgp_b_families)
+    jaccard_index = len(rgp_a_families & rgp_b_families) / len(rgp_a_families | rgp_b_families)
 
     return jaccard_index
 
@@ -351,7 +351,7 @@ def dereplicate_rgp(rgps: Set[Union[Region, IdenticalRegions]],
     families_to_rgps = defaultdict(list)
 
     for rgp in tqdm(rgps, total=len(rgps), unit="RGP", disable=disable_bar):
-        families_to_rgps[tuple(sorted((f.ID for f in rgp.families)))].append(rgp)
+        families_to_rgps[tuple(sorted(f.ID for f in rgp.families))].append(rgp)
 
     dereplicated_rgps = []
     identical_region_count = 0
@@ -362,7 +362,7 @@ def dereplicate_rgp(rgps: Set[Union[Region, IdenticalRegions]],
             families = set(rgps[0].families)
 
             # identical regions object is considered on a contig border if all rgp are contig border
-            is_contig_border = all([rgp.is_contig_border for rgp in rgps])
+            is_contig_border = all(rgp.is_contig_border for rgp in rgps)
 
             # create a new object that will represent the identical rgps
             identical_rgp = IdenticalRegions(name=f"identical_rgps_{identical_region_count}",
@@ -548,7 +548,7 @@ def cluster_rgp(pangenome, grr_cutoff: float, output: str, basename: str,
     dereplicated_rgps = dereplicate_rgp(valid_rgps, disable_bar=disable_bar)
 
     grr_graph = nx.Graph()
-    grr_graph.add_nodes_from((rgp.ID for rgp in dereplicated_rgps))
+    grr_graph.add_nodes_from(rgp.ID for rgp in dereplicated_rgps)
 
     # Get all pairs of RGP that share at least one family
 

diff --git a/ppanggolin/align/alignOnPang.py b/ppanggolin/align/alignOnPang.py
@@ -116,7 +116,7 @@ def map_input_gene_to_family_all_aln(aln_res: Path, outdir: Path,
     aln_file_clean = outdir / "alignment_input_seqs_to_all_pangenome_genes.tsv"  # write the actual result file
     logging.getLogger("PPanGGOLiN").debug(f'Writing alignment file in {aln_file_clean}')
 
-    with open(aln_res, "r") as alnFile, open(aln_file_clean, "w") as aln_outfl:
+    with open(aln_res) as alnFile, open(aln_file_clean, "w") as aln_outfl:
         for line in alnFile:
             line_splitted = line.split()
 
@@ -151,7 +151,7 @@ def map_input_gene_to_family_rep_aln(aln_res: Path, outdir: Path,
 
     logging.getLogger("PPanGGOLiN").debug(f'Writing alignment file in {aln_file_clean}')
 
-    with open(aln_res, "r") as alnFile, open(aln_file_clean, "w") as aln_outfl:
+    with open(aln_res) as alnFile, open(aln_file_clean, "w") as aln_outfl:
         for line in alnFile:
             line_splitted = line.split()
 

diff --git a/ppanggolin/annotate/annotate.py b/ppanggolin/annotate/annotate.py
@@ -73,7 +73,7 @@ def create_gene(org: Organism, contig: Contig, gene_counter: int, rna_counter: i
 
     start, stop = coordinates[0][0], coordinates[-1][1]
 
-    if any((dbxref.startswith('MaGe:') or dbxref.startswith('SEED:') for dbxref in dbxrefs)):
+    if any(dbxref.startswith('MaGe:') or dbxref.startswith('SEED:') for dbxref in dbxrefs):
         if gene_name == "":
             gene_name = gene_id
 
@@ -378,7 +378,7 @@ def combine_contigs_metadata(contig_to_metadata: Dict[str, Dict[str, str]]) -> T
                              all_tag_to_value.count(tag_and_value) == contig_count}
 
     # Create a dictionary for shared metadata
-    genome_metadata = {tag: value for tag, value in shared_tag_and_values}
+    genome_metadata = dict(shared_tag_and_values)
 
     contig_to_uniq_metadata = {}
     for contig, tag_to_value in contig_to_metadata.items():
@@ -793,7 +793,10 @@ def check_chevrons_in_start_and_stop(start: str, stop: str) -> Tuple[int, int, b
     correct_putative_overlaps(org.contigs)
 
     # GET THE FASTA SEQUENCES OF THE GENES
-    if has_fasta and fasta_string != "":
+    if fasta_string == "":
+        has_fasta = False
+
+    if has_fasta:
         contig_sequences = read_fasta(org, fasta_string.split('\n'))  # _ is total contig length
         for contig in org.contigs:
             if contig.length != len(contig_sequences[contig.name]):
@@ -1073,9 +1076,10 @@ def read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, p
                 futures.append(future)
 
             for future in futures:
-                org, flag = future.result()
+                org, has_dna_sequence = future.result()
                 pangenome.add_organism(org)
-                if not flag:
+
+                if not has_dna_sequence:
                     pangenome.status["geneSequences"] = "No"
 
     # decide whether we use local ids or ppanggolin ids.
@@ -1093,11 +1097,11 @@ def read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, p
     pangenome.parameters["annotate"]["use_pseudo"] = pseudo
     pangenome.parameters["annotate"]["# read_annotations_from_file"] = True
 
-    if any((genome.has_metadata() for genome in pangenome.organisms)):
+    if any(genome.has_metadata() for genome in pangenome.organisms):
         pangenome.status["metadata"]["genomes"] = "Computed"
         pangenome.status["metasources"]["genomes"].append("annotation_file")
 
-    if any((contig.has_metadata() for contig in pangenome.contigs)):
+    if any(contig.has_metadata() for contig in pangenome.contigs):
         pangenome.status["metadata"]["contigs"] = "Computed"
         pangenome.status["metasources"]["contigs"].append("annotation_file")
 
@@ -1151,7 +1155,7 @@ def get_gene_sequences_from_fastas(pangenome: Pangenome, fasta_files: Path, disa
                     msg = f"Fasta file for genome {org.name} did not have the contig {contig.name} " \
                           f"that was read from the annotation file. "
                     msg += f"The provided contigs in the fasta were : " \
-                           f"{', '.join([contig for contig in fasta_dict[org].keys()])}."
+                           f"{', '.join(fasta_dict[org].keys())}."
                     raise KeyError(msg)
     pangenome.status["geneSequences"] = "Computed"