The purpose of this program is to allow direct comparison of contig and scaffold files from SPAdes outputs. If you want to compare contig and scaffold files now, it is very difficult since the NODE names don't match.
The only dependencies are a couple of Perl modules I regularly use. I think these are actually available in the Perl download, so not sure you really need to install anything. The modules are: strict
, Getopt::Std
, and List::MoreUtils
.
Download the repository from GitHub: contig_scaffold_comparisons
After installation, pull up the help information:
perl contig_to_scaffold_comparitor.pl -h
This will give you:
Help called:
Options:
-c = contigs.paths file from spades
-f = contigs.fasta file from spades
-q = QC'ed contig fasta file
-s = scaffolds.fasta file from spades
-p = scaffolds.paths file from spades
-m = sample header prefix for simplified headers
-l = minimum sequence length to export (Note: will export two files: one with all sequences and one to this minimum length)
-h = This help message
Explanation of inputs:
- contigs.paths, contigs.fasta, scaffolds.fasta, and scaffolds.paths are all automatically produced by SPAdes.
- The QC'ed contig fasta file (
-q
), just a fasta file of the assembly with contigs passing your QC. For me, I QC contigs to remove contigs that don't have at least 1X read coverage over 90% of their length, this for reads mapping to their own assembly (i.e. the reads that went into the assembly in the first place). - Sample header prefix will export a file with the prefix + NODE + the node number (i.e. S1_NODE_1001). This is perfect for use with binning programs.
- The minimum length is helpful if you are looking to put a smaller subset into a binning program.
Here is an example command:
contig_to_scaffold_comparitor.pl -c ../contigs.paths -f ../contigs.fasta -q ../contig_qc/S1_qccontigs_simpname.fasta -s ../scaffolds.fasta -p ../scaffolds.paths -m S1 -l 1000
While running, you should get print statments as you pass waypoints:
Reading contig fasta file . . .
Reading scaffold fasta file . . .
Reading qc fasta file . . .
Reading contig paths file . . .
Reading scaffold paths file . . .
Tagging contigs dictionary with QC pass/fail . . .
Identifying multicontig scaffolds . . .
Adding new headers for single contig scaffolds . . .
Processing multicontig scaffolds . . .
Checking for problems . . .
Saving outfiles . . .
The most problematic part of this comparison is when you get to multicontig scaffolds (scaffolds with more than one contig). Essentially, we are matching scaffold and contig names based on the EDGE names (which stay the same between the two paths files). Sometimes a single contig will consist of multiple EDGEs, so we'll get an error, such that the predicted number of contigs within a scaffold (based on the number of N's in the sequence) is different from the "actual" number (which would ideally be 1:1 EDGE to contig). I've made the program go through one pass to try to fix this, which works when EDGEs are exclusive to contigs, but sometimes two different contigs can share 1 or more EDGEs. This gets a bit complicated to program, so I've made it display the options and have the user pick (I know a bit archaic, but easy enough).
This is an example where the problem was fixed automatically (it will go to the next problematic multicontig scaffold):
WARNING: Scaffold does not match the predicted number of contigs: NODE_131_length_18414_cov_126.217 Predicted = 2 Actual = 11
Scaffold to contig match fixed
Scaffold edges: 15731,19763,37746,37747,41622,56629,68457
Contig match: NODE_3096_length_4310_cov_127.455 NODE_285_length_13566_cov_132.077
Contig edges: 15731,19763,37746,37747,41622,68457 56629
This is an example where the user must decide which matches to keep:
WARNING: Scaffold does not match the predicted number of contigs: NODE_1456_length_6427_cov_12.7279 Predicted = 2 Actual = 3
NODE_1456_length_6427_cov_12.7279 THERE'S STILL A PROBLEM Predicted = 2 Actual = 3
Let's get some human input here:
Scaffold node edges: 7688,7689,17696
Possible matches are numbered:
Option 1: NODE_7615_length_2482_cov_12.448 7688,17696
Option 2: NODE_80098_length_132_cov_33.4 7688
Option 3: NODE_3644_length_3935_cov_13.359 7689
Input contig matches to keep - comma delimited (i.e. 1,4,5)
$
At the $
, the program will wait for input (it doesn't necessarily print a $
). The proper input here is 1,3
, which tells the program to keep options 1 and 3. Why these? There are only 2 contigs in the scaffold, and options 1 and 3 cover all three of the node's EDGEs, so is the proper match.
Here is another example:
WARNING: Scaffold does not match the predicted number of contigs: NODE_1413_length_14616_cov_3.59266 Predicted = 4 Actual = 6
NODE_1413_length_14616_cov_3.59266 THERE'S STILL A PROBLEM Predicted = 4 Actual = 6
Let's get some human input here:
Scaffold node edges: 21777,21778,106655,106657,243153,339709,339711
Possible matches are numbered:
Option 1: NODE_27323_length_3001_cov_4.81663 21777,339709
Option 2: NODE_363734_length_667_cov_6.11667 21778
Option 3: NODE_17531_length_3865_cov_3.11851 106655,106657,243153
Option 4: NODE_442491_length_136_cov_2.55556 106657
Option 5: NODE_434220_length_221_cov_23.5957 243153
Option 6: NODE_5816_length_7053_cov_3.35706 339711
Input contig matches to keep - comma delimited (i.e. 1,4,5)
$
And the answer: 1,2,3,6
This is a bit tedious if there are a lot of multicontig scaffolds with problems. I may fix it in future (but likely not).
Several outfiles are produced:
scaffolds_not_passing_QC.txt
scaffolds_old_new_names.txt
scaffolds_multicontig_stats.txt
scaffolds_newnames.fasta
scaffolds_newnames_simplified.fasta
scaffolds_newnames_QCpass.fasta
scaffolds_newnames_simplified_QCpass.fasta
scaffolds_newnames_QCpass_atleastXbp.fasta
scaffolds_newnames_simplified_QCpass_atleastXbp.fasta
Their purpose:
- List of scaffolds (old and new names, length) that don't contain contigs that pass the original QC run.
- List of scaffold old and new names.
- List of multicontig scaffolds (new names) along with all the contigs that are in the scaffolds (and their QC result).
- FASTA with ALL scaffolds with their new names.
- FASTA with ALL scaffolds with their new simplified names (i.e. S1_NODE_1001).
- FASTA with QC passing scaffolds with their new names.
- FASTA with QC passing scaffolds with their new simplified names.
- Same as #6, except filtering out scaffolds that aren't greater or equal to the min length cutoff.
- Same as #7, except filtering out scaffolds that aren't greater or equal to the min length cutoff.
Good luck!!
Please open an issue if you have any issues or questions. This program is provided without warranty or a guarantee of support. Thanks!!