graphbin2 doesn't seem to scale very well for large assemblies with a large number of contigs. Given that a big fraction of the contigs generated by metaSPAdes are usually small, and there's no contig length cutoff in SPAdes, would it be possible to add a contig length cutoff to graphbin2 (e.g., skip all contigs <1kb) in order to speed up the algorithm? Or does the algorithm require all contigs in order to function properly?
I believe that I created a method to pre-filter out short contigs and speed up graphbin2. In order to get the code running effectively, I had to make huge changes, so a PR doesn't make much sense. Some things that I changed in the code that I found to be beneficial for reading and running graphbin2:
Used an argparse command => subcommand structure to call graphbin2_SPAdes.py (or graphbin2_SGA.py) instead of calling the scripts through os.system. This makes exceptions much easier to debug, because an os.system call does not propagate exceptions raised in the script to the caller
Used the logging package for status output instead of print(), because on at least some machines the tqdm stderr output is written before the print() stdout output, which makes the log confusing to read
Used the "my string {}".format(integer) method for formatting strings
Where possible, caught specific exceptions (e.g., except ValueError) instead of using a bare except clause (i.e., except:)
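The changes listed above can be sketched roughly as follows. This is only an illustration of the general pattern, not the modified code itself: the program name `graphbin2-sketch`, the `--min-length` option, and the `run_spades` / `run_sga` functions are hypothetical placeholders and are not part of GraphBin2.

```python
import argparse
import logging
import sys

logger = logging.getLogger("graphbin2_sketch")  # hypothetical logger name


def run_spades(args):
    # Placeholder for the SPAdes-specific workflow (hypothetical).
    logger.info("SPAdes mode, minimum contig length = {}".format(args.min_length))
    return "spades"


def run_sga(args):
    # Placeholder for the SGA-specific workflow (hypothetical).
    logger.info("SGA mode, minimum contig length = {}".format(args.min_length))
    return "sga"


def parse_min_length(text):
    # Catch the specific exception instead of using a bare "except:".
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError("not an integer: {}".format(text))
    if value < 0:
        raise argparse.ArgumentTypeError("must be >= 0: {}".format(value))
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="graphbin2-sketch")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True
    # Each subcommand dispatches to a plain function call rather than
    # os.system, so exceptions reach the caller with a full traceback.
    for name, func in (("spades", run_spades), ("sga", run_sga)):
        sub = subparsers.add_parser(name)
        sub.add_argument("--min-length", type=parse_min_length, default=0)
        sub.set_defaults(func=func)
    return parser


def main(argv=None):
    # Send log records to stderr, the stream tqdm also uses by default,
    # so status messages are not reordered relative to progress output
    # by stdout buffering.
    logging.basicConfig(stream=sys.stderr, level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Calling `main(["spades", "--min-length", "1000"])` runs the placeholder SPAdes function in the same process, so any exception it raises can be inspected directly in a debugger.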
Thank you for the question. GraphBin2 was originally designed to recover short contigs as much as possible. Hence, we did not introduce a filter for short contigs. However, I understand that this can be a scaling issue with very large datasets. I'm glad you were able to modify the code as you needed. Thank you for sharing the details of the things you changed. I will add a fix providing the option to filter out contigs in the future.
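As a rough idea of what such a length filter could look like, here is a minimal sketch that drops contigs below a cutoff from FASTA input. The function names and the `min_length` parameter are hypothetical and not part of GraphBin2. It only filters the contig sequences; the thread does not say whether the assembly graph and the initial binning result would also need to be filtered consistently, so that part is left open.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length=1000):
    """Keep only contigs whose sequence has at least min_length bases."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq
```

With a real assembly, `read_fasta` could be given an open file handle, for example `filter_contigs(read_fasta(open("contigs.fasta")), min_length=1000)`, where the file name is only an example.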