Skip to content

Commit

Permalink
Merge pull request #282 from labgem/dev
Browse files Browse the repository at this point in the history
 Merge dev branch into master to release version 2.1.2
  • Loading branch information
JeanMainguy authored Sep 16, 2024
2 parents 83c7d75 + fb539ff commit bc50066
Show file tree
Hide file tree
Showing 35 changed files with 709 additions and 281 deletions.
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
2.1.1
2.1.2
Binary file removed docs/_static/tile_plot.png
Binary file not shown.
Binary file added docs/_static/tile_plot_dendro.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/_static/tile_plot_no_dendro.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
51 changes: 34 additions & 17 deletions docs/user/PangenomeAnalyses/pangenomeCluster.md
Original file line number Diff line number Diff line change
Expand Up @@ -124,17 +124,17 @@ flowchart TD

#### Specify the representative gene

It's possible to indicate which gene is the representative gene by adding a column between the gene family name.
It's possible to indicate which gene is the representative gene by adding a third column.


Here is a minimal example of your clustering file:
Here is a minimal example of your clustering file with the third column being the representative gene:

```
Family_A Gene_2 Gene_1
Family_A Gene_1 Gene_2
Family_A Gene_2 Gene_2
Family_A Gene_2 Gene_3
Family_A Gene_3 Gene_2
Family_B Gene_4 Gene_4
Family_B Gene_5 Gene_4
Family_B Gene_5 Gene_5
Family_C Gene_6 Gene_6
```

Expand Down Expand Up @@ -174,25 +174,42 @@ flowchart TD

#### Indicate fragmented gene

It's also possible to indicate if the gene is fragmented, by adding a new column in last position. Fragmented gene are tag with an 'F' in the last column.
You can indicate if a gene is fragmented by adding a new column. Fragmented genes are marked with an 'F' in this final column.

You can add this column when you assume or not the representative gene. PPanGGOLiN will guess that this column is to precise the fragmented gene and assume if it must assert the representative gene
The position of this column depends on whether you include a representative gene column:
- Without a representative gene column, the fragmented gene column should be in the **third position**.
- With a representative gene column, it should appear in the **fourth position**.

Here is a minimal example of your clustering file with fragmented gene precise:
##### Example 1: Clustering file without representative gene column (fragmented gene in 3rd column):
```
Family_A Gene_1
Family_A Gene_2
Family_A Gene_3 F
Family_B Gene_4
Family_B Gene_5
Family_C Gene_6 F
```

##### Example 2: Clustering file with representative gene column (fragmented gene in 4th column):
```
Family_A Gene_2 Gene_1
Family_A Gene_2 Gene_2
Family_A Gene_2 Gene_3 'F'
Family_B Gene_5 Gene_4
Family_B Gene_5 Gene_5
Family_C Gene_6 Gene_6 'F'
Family_A Gene_1 Gene_2
Family_A Gene_2 Gene_2
Family_A Gene_3 Gene_2 F
Family_B Gene_4 Gene_4
Family_B Gene_5 Gene_4
Family_C Gene_6 Gene_6 F
```

```{warning}
*The column order is
important !* You must first provide the cluster identifier, the representative id, and then the gene id to finish with
the fragmented status of the gene.
*Attention: Column Order Matters!*
Please ensure that your columns follow the correct order:
1. Cluster identifier
2. Gene ID
3. Representative gene ID (if present)
4. Fragmented status ('F' if the gene is fragmented, or leave blank)
If no representative gene column is included, the fragmented status should be placed in the **third column**.
```

### Defragmentation
Expand Down
23 changes: 17 additions & 6 deletions docs/user/PangenomeAnalyses/pangenomeFigures.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,26 +17,37 @@ ppanggolin draw -p pangenome.h5 --ucurve

#### Tile plot

A tile plot is similar to a heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in a genome (the color depends on the number of gene copies) and uncolored if absent. The gene families are ordered by partition and then by a hierarchical clustering, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically two genomes that are close together in terms of gene family composition will be close together on the figure).
A tile plot is a kind of heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in a genome (either in blue or red if the gene family has multiple gene copies) and uncolored if absent. The gene families are ordered by partition and then by their number of presences (increasing order), and the genomes are ordered by a hierarchical clustering based on their shared gene families via a Jaccard distance (basically two genomes that are close together in terms of gene family composition will be close together on the figure).

This plot is quite helpful to observe potential structures in your pangenome, and can help you identify eventual outliers. You can interact with it, and mousing over a tile in the plot will indicate to you the gene identifier(s), the gene family and the genome corresponding to the tile.
This plot is quite helpful to observe potential structures in your pangenome, and can help you identify eventual outliers. You can interact with it, and mousing over a tile in the plot will indicate the gene identifier(s), the gene family and the genome corresponding to the tile. As detailed below, additional metadata can be displayed.

If you build your pangenome using a workflow subcommands (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 500 genomes, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.
If you build your pangenome using a workflow subcommands (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 60k gene families, the plot will not be drawn; if you have more than 32 767 gene families, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise. Beyond the workflow subcommand, you can generate the plot with any number of gene families or genomes. However, no browser currently supports visualizing a tile plot containing more than 65 535 gene families or more than 65 535 genomes (for more information, refer to [this Stack Overflow discussion](https://stackoverflow.com/questions/78431835/plotly-heatmap-has-limit-on-data-size)
).

It can be generated using the 'draw' subcommand as such :
To generate a tile plot, use the 'draw' subcommand as follows:

```bash
ppanggolin draw -p pangenome.h5 --tile_plot
```

![tile plot figure](../../_static/tile_plot.png)
![Tile plot figure](../../_static/tile_plot_no_dendro.png)

and if you do not want the 'cloud' gene families as it is a lot of data and can be hard to open with a browser sometimes, you can use the following option :
If you prefer not to include 'cloud' gene families, which can make the plot difficult to render in a browser, you can use the `--nocloud` option:

```bash
ppanggolin draw -p pangenome.h5 --tile_plot --nocloud
```

To include a dendrogram on top of the tile plot, use the `--add_dendrogram` option:

```bash
ppanggolin draw -p pangenome.h5 --tile_plot --add_dendrogram
```

![Tile plot with dendrogram](../../_static/tile_plot_dendro.png)

If you have added metadata to the gene elements of your pangenome (see [metadata documentation](../metadata.md) for details), you can display this metadata in the hover text by using the `--add_metadata` argument.

#### Rarefaction curve
This figure is not drawn by default in the 'workflow' subcommand as it requires a lot of computations. It represents the evolution of the number of gene families for each partition as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of the diversity that you are missing with your dataset on your taxonomic group (Tettelin et al., 2005). The idea is that if at some point when you keep adding genomes to your pangenome you do not add any more gene families, you might have access to your entire taxonomic group's diversity. On the contrary, if you are still adding a lot of genes you may be still missing a lot of gene families.

Expand Down
15 changes: 12 additions & 3 deletions docs/user/PangenomeAnalyses/pangenomeStat.md
Original file line number Diff line number Diff line change
Expand Up @@ -144,10 +144,19 @@ To generate these files, use the `write_pangenome` subcommand with the `--partit
`ppanggolin write_pangenome -p pangenome.h5 --partitions`


#### Gene Families to Genes Associations

The `gene_families.tsv` file mirrors the format provided by [MMseqs2](https://github.com/soedinglab/MMseqs2) through its `createtsv` subcommand. This file structure comprises three columns: the gene family name in the first column, the gene names in the second, and a third column that remains empty or contains an "F" to denote potential gene fragments instead of complete genes. This indication appears only if the [defragmentation](./pangenomeCluster.md#defragmentation) pipeline has been used.
#### Gene Families to Gene Associations

The `gene_families.tsv` file consists of four columns:

1. **Gene Family ID**: The identifier for the gene family.
2. **Gene ID**: The identifier for the gene.
3. **Local ID** (if applicable): This column is used when gene IDs from annotation files are not unique. In such cases, `ppanggolin` assigns an internal ID to genes, and this column helps map the internal ID to the local ID from the annotation file.
4. **Fragmentation Status**: This column indicates whether the gene is fragmented. It is either empty or contains an "F" to signify potential gene fragments instead of complete genes. This status is provided only if the [defragmentation](./pangenomeCluster.md#defragmentation) pipeline has been used, which is the default behavior.

To generate this file, use the `write_pangenome` subcommand with the `--families_tsv` flag:

`ppanggolin write_pangenome -p pangenome.h5 --families_tsv`
```bash
ppanggolin write_pangenome -p pangenome.h5 --families_tsv
```

10 changes: 5 additions & 5 deletions ppanggolin/RGP/rgp_cluster.py
Original file line number Diff line number Diff line change
Expand Up @@ -132,7 +132,7 @@ def compute_grr(rgp_a_families: Set[GeneFamily], rgp_b_families: Set[GeneFamily]
:return: GRR value between 0 and 1
"""

grr = len((rgp_a_families & rgp_b_families)) / mode(len(rgp_a_families), len(rgp_b_families))
grr = len(rgp_a_families & rgp_b_families) / mode(len(rgp_a_families), len(rgp_b_families))

return grr

Expand All @@ -147,7 +147,7 @@ def compute_jaccard_index(rgp_a_families: set, rgp_b_families: set) -> float:
:return : Jaccard index
"""

jaccard_index = len((rgp_a_families & rgp_b_families)) / len(rgp_a_families | rgp_b_families)
jaccard_index = len(rgp_a_families & rgp_b_families) / len(rgp_a_families | rgp_b_families)

return jaccard_index

Expand Down Expand Up @@ -351,7 +351,7 @@ def dereplicate_rgp(rgps: Set[Union[Region, IdenticalRegions]],
families_to_rgps = defaultdict(list)

for rgp in tqdm(rgps, total=len(rgps), unit="RGP", disable=disable_bar):
families_to_rgps[tuple(sorted((f.ID for f in rgp.families)))].append(rgp)
families_to_rgps[tuple(sorted(f.ID for f in rgp.families))].append(rgp)

dereplicated_rgps = []
identical_region_count = 0
Expand All @@ -362,7 +362,7 @@ def dereplicate_rgp(rgps: Set[Union[Region, IdenticalRegions]],
families = set(rgps[0].families)

# identical regions object is considered on a contig border if all rgp are contig border
is_contig_border = all([rgp.is_contig_border for rgp in rgps])
is_contig_border = all(rgp.is_contig_border for rgp in rgps)

# create a new object that will represent the identical rgps
identical_rgp = IdenticalRegions(name=f"identical_rgps_{identical_region_count}",
Expand Down Expand Up @@ -548,7 +548,7 @@ def cluster_rgp(pangenome, grr_cutoff: float, output: str, basename: str,
dereplicated_rgps = dereplicate_rgp(valid_rgps, disable_bar=disable_bar)

grr_graph = nx.Graph()
grr_graph.add_nodes_from((rgp.ID for rgp in dereplicated_rgps))
grr_graph.add_nodes_from(rgp.ID for rgp in dereplicated_rgps)

# Get all pairs of RGP that share at least one family

Expand Down
4 changes: 2 additions & 2 deletions ppanggolin/align/alignOnPang.py
Original file line number Diff line number Diff line change
Expand Up @@ -116,7 +116,7 @@ def map_input_gene_to_family_all_aln(aln_res: Path, outdir: Path,
aln_file_clean = outdir / "alignment_input_seqs_to_all_pangenome_genes.tsv" # write the actual result file
logging.getLogger("PPanGGOLiN").debug(f'Writing alignment file in {aln_file_clean}')

with open(aln_res, "r") as alnFile, open(aln_file_clean, "w") as aln_outfl:
with open(aln_res) as alnFile, open(aln_file_clean, "w") as aln_outfl:
for line in alnFile:
line_splitted = line.split()

Expand Down Expand Up @@ -151,7 +151,7 @@ def map_input_gene_to_family_rep_aln(aln_res: Path, outdir: Path,

logging.getLogger("PPanGGOLiN").debug(f'Writing alignment file in {aln_file_clean}')

with open(aln_res, "r") as alnFile, open(aln_file_clean, "w") as aln_outfl:
with open(aln_res) as alnFile, open(aln_file_clean, "w") as aln_outfl:
for line in alnFile:
line_splitted = line.split()

Expand Down
20 changes: 12 additions & 8 deletions ppanggolin/annotate/annotate.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@ def create_gene(org: Organism, contig: Contig, gene_counter: int, rna_counter: i

start, stop = coordinates[0][0], coordinates[-1][1]

if any((dbxref.startswith('MaGe:') or dbxref.startswith('SEED:') for dbxref in dbxrefs)):
if any(dbxref.startswith('MaGe:') or dbxref.startswith('SEED:') for dbxref in dbxrefs):
if gene_name == "":
gene_name = gene_id

Expand Down Expand Up @@ -378,7 +378,7 @@ def combine_contigs_metadata(contig_to_metadata: Dict[str, Dict[str, str]]) -> T
all_tag_to_value.count(tag_and_value) == contig_count}

# Create a dictionary for shared metadata
genome_metadata = {tag: value for tag, value in shared_tag_and_values}
genome_metadata = dict(shared_tag_and_values)

contig_to_uniq_metadata = {}
for contig, tag_to_value in contig_to_metadata.items():
Expand Down Expand Up @@ -793,7 +793,10 @@ def check_chevrons_in_start_and_stop(start: str, stop: str) -> Tuple[int, int, b
correct_putative_overlaps(org.contigs)

# GET THE FASTA SEQUENCES OF THE GENES
if has_fasta and fasta_string != "":
if fasta_string == "":
has_fasta = False

if has_fasta:
contig_sequences = read_fasta(org, fasta_string.split('\n')) # _ is total contig length
for contig in org.contigs:
if contig.length != len(contig_sequences[contig.name]):
Expand Down Expand Up @@ -1073,9 +1076,10 @@ def read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, p
futures.append(future)

for future in futures:
org, flag = future.result()
org, has_dna_sequence = future.result()
pangenome.add_organism(org)
if not flag:

if not has_dna_sequence:
pangenome.status["geneSequences"] = "No"

# decide whether we use local ids or ppanggolin ids.
Expand All @@ -1093,11 +1097,11 @@ def read_annotations(pangenome: Pangenome, organisms_file: Path, cpu: int = 1, p
pangenome.parameters["annotate"]["use_pseudo"] = pseudo
pangenome.parameters["annotate"]["# read_annotations_from_file"] = True

if any((genome.has_metadata() for genome in pangenome.organisms)):
if any(genome.has_metadata() for genome in pangenome.organisms):
pangenome.status["metadata"]["genomes"] = "Computed"
pangenome.status["metasources"]["genomes"].append("annotation_file")

if any((contig.has_metadata() for contig in pangenome.contigs)):
if any(contig.has_metadata() for contig in pangenome.contigs):
pangenome.status["metadata"]["contigs"] = "Computed"
pangenome.status["metasources"]["contigs"].append("annotation_file")

Expand Down Expand Up @@ -1151,7 +1155,7 @@ def get_gene_sequences_from_fastas(pangenome: Pangenome, fasta_files: Path, disa
msg = f"Fasta file for genome {org.name} did not have the contig {contig.name} " \
f"that was read from the annotation file. "
msg += f"The provided contigs in the fasta were : " \
f"{', '.join([contig for contig in fasta_dict[org].keys()])}."
f"{', '.join(fasta_dict[org].keys())}."
raise KeyError(msg)
pangenome.status["geneSequences"] = "Computed"

Expand Down
Loading

0 comments on commit bc50066

Please sign in to comment.