Category_importance.tsv
Cook_unique_hosts.txt
README.md
all.json
assembly_summary_20220601.txt.gz
bac120_metadata_r207.tsv.gz
bac120_metadata_r95.tsv.gz
bac120_r207.tree.gz
base_pp.tsv.gz
categories.tsv.gz
checkv.tsv.gz
country_importance.txt
country_importance_table.tsv.gz
dnaA_locations.tsv.gz
genbank.besthits.tsv.gz
genome_lengths.tsv.gz
insertion_lengths.tsv.gz
matched.json
patric_genome_metadata.tsv.gz
phage_lengths.tsv.gz
phage_locations_20220620.tsv.gz
phage_protein_counts.tsv.gz
phage_stats.20220620.tsv.gz
phages_per_genome.tsv.gz
prophage_locations.tsv.gz
prophage_statistics.tsv.gz
smallphages.txt
transposase_per_phage.tsv.gz
transposon_counts.tsv.gz
unprocessed_phages.txt
vogs_hits_counts.txt.gz


Data from different sources

This data is sourced from a variety of web locations and compiled to analyse the prophage counts.

We keep the data as gzip-compressed files and read them into dataframes by passing compression='gzip' to pandas.read_csv.
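As a minimal sketch of that read step, the snippet below writes a tiny gzip-compressed TSV and loads it back with pandas.read_csv. The helper name `read_gzipped_table`, the file name `example.tsv.gz`, and the columns `genome` and `phage_count` are made up for illustration and are not taken from the real files; the tab separator is also an assumption that suits the .tsv.gz files.

```python
import gzip
import os
import tempfile

import pandas as pd


def read_gzipped_table(path, sep="\t"):
    """Read a gzip-compressed, delimited text file into a DataFrame."""
    return pd.read_csv(path, sep=sep, compression="gzip")


# Build a tiny gzip-compressed TSV so the example runs without the real data.
# The column names below are hypothetical.
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example.tsv.gz")
with gzip.open(example_path, "wt") as handle:
    handle.write("genome\tphage_count\n")
    handle.write("genome_A\t3\n")
    handle.write("genome_B\t0\n")

example_df = read_gzipped_table(example_path)
print(example_df)
```

pandas can also infer the compression from a .gz extension, so passing compression='gzip' explicitly mainly documents the intent.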

GenBank: assembly_summary.txt.gz. This is the GenBank metadata associated with all the assemblies.
- Available from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
PATRIC: patric_genome_metadata.tsv.gz. PATRIC compiles metadata from a variety of sources. This is a little tricky because the rows are not guaranteed to be unique per assembly ID, so we need to take the first instance. In addition, some of the fields, like date and country, need cleaning up.
- Available from ftp://ftp.patricbrc.org/RELEASE_NOTES/genome_metadata
GTDB: bac120_metadata_r95.tsv.gz. The GTDB taxonomy, plus some of the metadata available in GenBank or PATRIC and some additional data. Some of the fields found in PATRIC (notably isolation_date) are missing from this file.
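
The PATRIC handling described above (take the first row per assembly ID, then tidy the date and country fields) might look roughly like the sketch below. It uses a toy dataframe rather than the real file, and the column names `assembly_accession`, `isolation_country`, and `isolation_date` are assumptions that should be checked against the actual header. The cleaning shown (trimming whitespace, coercing unparseable dates to missing values) is only one plausible approach, not necessarily the one used in the analysis.

```python
import pandas as pd

# Toy stand-in for the PATRIC metadata; all column names here are assumptions.
patric = pd.DataFrame(
    {
        "assembly_accession": ["GCA_1", "GCA_1", "GCA_2", "GCA_3"],
        "isolation_country": [" USA", "United States", "Australia ", None],
        "isolation_date": ["2015-06-01", "2015", "not a date", "2019-11-23"],
    }
)

# Keep only the first row seen for each assembly ID.
patric_unique = patric.drop_duplicates(
    subset="assembly_accession", keep="first"
).copy()

# Strip stray whitespace from the country names (missing values stay missing).
patric_unique["isolation_country"] = patric_unique["isolation_country"].str.strip()

# Parse dates; anything that cannot be parsed becomes NaT instead of raising.
patric_unique["isolation_date"] = pd.to_datetime(
    patric_unique["isolation_date"], errors="coerce"
)

print(patric_unique)
```

Taking the first instance is order-dependent, so sorting the table before calling drop_duplicates may be worth considering if a particular row should win.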

categories.tsv: A four-column table that we constructed with our own environmental categories and our biome category.
checkv.tsv.gz: The results from the CheckV analysis of all the prophages.
country_importance.txt.gz: The importance of each country in the analysis.