PSASS (Pooled Sequencing Analysis for Sex Signal) is a software to compare pooled sequencing datasets from two groups (usually two sexes). Results from PSASS can be easily visualized using the sgtr R package. PSASS is integrated in a Snakemake workflow to perform all required steps starting from a genome and reads files. PSASS was developed as part of a project by the LPGP lab from INRA, Rennes, France.
Citing PSASS
There is currently no paper officially describing PSASS. To cite PSASS, use the DOI provided by Zenodo.
PSASS is available in Bioconda. To install psass with conda, run the following command:
conda install -c bioconda psass
PSASS implements parsing of alignment files using the htslib library, which requires autotools to build and depends on zlib, libbz2, and liblzma to read CRAM files. All are installed by default on most linux distributions. Compilation was tested with gcc >= 5.3.0.
To build psass, follow these instructions:
# Clone the repository
git clone https://github.com/RomainFeron/PSASS.git
# Navigate to the PSASS directory
cd psass
# Build PSASS
make
In summary, psass takes as input one reads alignment file for each pool and computes the following metrics:
- Between-pools Fst in a sliding window
- Number of SNPs specific to each pool, defined as SNPs heterozygous in one pool and homozygous in the other, in a sliding window
- Absolute and relative depth for each pool in a sliding window
- (optional) The position of all bases with high between-pools Fst
- (optional) The position of all SNPs specific to each pool
- (optional) Number of SNPs specific to each pool, and absolute and relative depth in each pool for all genes in a provided GFF file, for both coding and noncoding parts of the genes
Currently, psass implements three commands:
pileup
: generate a nucleotides depth file from two alignment filesanalyze
: compute pool comparison metrics from a nucleotides depth fileconvert
: convert output from samtools mpileup to a nucleotides depth file (deprecated, usepileup
instead)
In this example, we will compare a pool of female individuals and a pool of male individuals with psass
. We assume the following input data:
- genome.fa: the assembly to which reads from each pool were aligned
- females.cram: alignment file for the reads from the female pool (sorted by genomic coordinates)
- males.cram: alignment file for the reads from the male pool (sorted by genomic coordinates)
psass pileup --reference genome.fa --output-file pileup.tsv females.cram males.cram
This command generates the file pileup.tsv which contains the nucleotide composition of each base in genome.fa for each pool.
psass analyze --window-size 10000 --output-resolution 1000 --snp-file psass_snps.tsv pileup.tsv psass_window.tsv
This command generates two output files:
- psass_window.tsv which contains between-pool FST, pool-specific SNPs, and depth for each pool in a sliding window of 10,000 bp, output every 1,000 bp.
- psass_snps.tsv which contains the position of each pool-specific SNP.
The pileup function generates a file with the nucleotide composition of all genomic positions for any number of alignment files. Alignment files can be provided in CRAM or BAM format and need to be sorted by coordinate. The output is a wig-like file with a header line giving the contig name and length followed by one line giving the nucleotide composition in each alignment file for each positions in the contig for all contigs in the alignment files.
Usage: psass pileup [OPTIONS] ALIGNMENT_FILES...
Arguments:
Argument | Type | Description | Default |
---|---|---|---|
ALIGNMENT_FILES | file |
One alignment file for each pool, in either CRAM or BAM format | |
--reference, -r | file |
Reference file in fasta format, required with CRAM input files | |
--output-file, -o | string |
Write output to this file instead of stdout | |
--min-map-quality, -q | string |
Minimum mapping quality to include a read in pileup | 0 |
--help | Display help message |
pileup generates a file with the following format:
#Files <first_input_file> <second_input_file> # Comment line
region=<contig name> len=<contig length> # Header line for the first contig, encoding the contig name and its length
nA,nT,nC,nG,nN,nO nA,nT,nC,nG,nN,nO # Count for each type of nucleotide (comma-separated) in each pool (tab-separated) for pos 0
nA,nT,nC,nG,nN,nO nA,nT,nC,nG,nN,nO # Count for each type of nucleotide (comma-separated) in each pool (tab-separated) for pos 1
...
region=<contig name> len=<contig length> # Header line for the second contig, encoding the contig name and its length
nA,nT,nC,nG,nN,nO nA,nT,nC,nG,nN,nO # Count for each type of nucleotide (comma-separated) in each pool (tab-separated) for pos 0
nA,nT,nC,nG,nN,nO nA,nT,nC,nG,nN,nO # Count for each type of nucleotide (comma-separated) in each pool (tab-separated) for pos 1
...
The analyze function computes between-pool FST, pool-specific SNPs, and depth in a sliding window from a nucleotide composition file generate with psass pileup
from two alignment files.
Usage: psass analyze [OPTIONS] INPUT_FILE OUTPUT_FILE
Arguments:
Argument | Type | Description | Default |
---|---|---|---|
INPUT_FILE | file |
Path to a nucleotides depth file generated by psass pileup or convert | |
OUTPUT_FILE | string |
Path to an output file for sliding window metrics | |
--pool1, -p | string |
Name of the first pool | females |
--pool2, -q | string |
Name of the second pool | males |
--snp-file, -s | string |
Output pool-specific SNPs to this file | |
--fst-file, -f | string |
Output high FST positions to this file | |
--genes-file, -g | string |
Output gene metrics to this file (requires a GFF file) | |
--gff-file, -G | string |
Path to a GFF file for gene-specific output | |
--popoolation | If set, assumes the input file was generated with popoolation2 | ||
--min-depth, -d | int |
Minimum depth to include a site in the analyses | 10 |
--window-size, -w | int |
Size of the sliding window (in bp) | 100000 |
--output-resolution, -r | int |
Output resolution for sliding window metrics (in bp) | 10000 |
--freq-het, -e | float |
Allele frequency to consider a SNP heterozygous in a pool | 0.5 |
--range-het, -u | float |
Range of allele frequency to consider a SNP heterozygous in a pool | 0.15 |
--freq-hom, -o | float |
Allele frequency to consider a SNP homozygous in a pool | 1 |
--range-hom, -v | float |
Range of allele frequency to consider a SNP homozygous in a pool | 0.05 |
--min-fst, -t | float |
Minimum FST to output a site in the FST positions file | 0.1 |
--group-snps | If set, group consecutive snps to count them as a single polymorphism | ||
--help | Print this help message and exit |
Sliding window output file
A tabulated file with contig, position on the contig, contig length, number of pool-specific SNPs, between-pool FST in the window, absolute depth, and relative depth for each pool in a sliding window of size given by --window-size
. Output every N bp, with N given by --output-resolution
.
Contig Position Length Snps_<pool1> Snps_<pool2> Fst Abs_depth_<pool1> Abs_depth_<pool2> Rel_depth_<pool1> Rel_depth_<pool2>
Contig1 0 6000 4 5 0.0000 166 174 0.73 0.74
Contig1 10000 6000 4 6 0.0000 156 165 0.68 0.70
Contig1 20000 6000 5 6 0.0000 181 193 0.79 0.82
Contig1 30000 6000 5 6 0.0000 167 178 0.74 0.76
Contig1 40000 6000 6 6 0.0000 154 164 0.68 0.70
...
Pool-specific SNPs output file
A tabulated file with contig, position on the contig, pool ID, and nucleotides frequencies at this position in each pool (O for 'other') for all positions called as pool-specific SNPs.
Contig Position Pool <pool1>_A <pool1>_T <pool1>_C <pool1>_G <pool1>_N <pool1>_O <pool2>_A <pool2>_T <pool2>_C <pool2>_G <pool2>_N <pool2>_I
Contig1 6170 <pool1> 0.06 0.19 0.12 0.62 0.00 0.00 0.05 0.00 0.00 0.95 0.00 0.00
Contig1 6176 <pool2> 0.62 0.08 0.23 0.08 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00
Contig1 7090 <pool2> 1.00 0.00 0.00 0.00 0.00 0.00 0.65 0.18 0.12 0.06 0.00 0.00
Contig1 7101 <pool1> 0.00 0.00 0.05 0.95 0.00 0.00 0.04 0.13 0.22 0.61 0.00 0.00
...
High FST positions output file
A tabulated file with contig, position on the contig, and between-pool FST at this position for all positions with FST higher than the threshold specified with --min-fst
.
Contig Position Fst
Contig1 4156 0.3333
Contig1 4157 0.6000
Contig1 4158 0.6000
Contig1 4159 0.3333
...
In this section, we will briefly describe the process implemented in PSASS to compute FST and pool-specific SNPs.
The input of PSASS analyze
is a file that contains nucleotide counts for each pool and each position in the reference genome. PSASS parses this input and computes:
- FST between the two pools for each position, using the formula implemented in Popoolation2.
- FST between pools in a sliding window, using the estimate defined in Karlsson et al., 2007 (in Supplementary information).
- The position of pool-specific SNPs. A pool-specific SNP is called at a given genomic position if, for any possible nucleotide, the frequency of this nucleotide is 1 in one pool and 0.5 in the other pool. Values for these thresholds can be adjusted with command-line arguments, and a range can be given; by default these values are: 1 - [0-0.05] in the homozygous pool, and 0.5 +/- 0.1 in the heterozygous pool).
- Number of pool-specific SNPs in a sliding window, using the definition from above.
Therefore, the definition of FST and pool-specific SNPs used in PSASS is only base on comparing the pools themselves; the reference used for the alignments is not used to call SNPs.
The convert function converts a pileup file generated by samtools mpileup into a psass pileup file. It has been replaced by the psass pileup
command.
Usage: psass convert [OPTIONS] INPUT
Arguments:
Argument | Type | Description | Default |
---|---|---|---|
INPUT | file |
Either the path to a samtools pileup output file or "-" for stdin | |
--output-file, -o | string |
Write output to this file instead of stdout | |
--help | Display help message |