-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338
Comments
Just to be clear, are you calling this "cat" as file1.bed.gz, file2.bed.gz, file3.bed.gz etc are internally sorted with the first item in file2 coming after the last in file1, and so on? (Eg chr1.bed, chr2.bed, etc)? I could understand that being called "cat". (Edit: but that doesn't feel like interleaving to me.) Or are you saying that file1.bed, file2.bed,... are sorted and may overlap, and the function just takes the next record from each, whichever should be sorted first? This is the impression I get when you say "interleave". If so that's "merge" and not "cat". Both choices have valid use cases IMO. |
Merge is indeed the word I was looking for. file1.bed and file2.bed are sorted, combine them so the output is still sorted (by only sorting the needed blocks and not the whole files). So this can be avoided (mainly the full sort step):
So basically |
I initially had the impression you meant a way to make a single index file that would enable queries over the set of input files (in place), doing some kind of gathered interleaving of records read directly from the individual input files. While just about plausible, this would be rather impractical to implement! A form of Because The major wrinkle is in comparing |
It's an interesting one. For SAM and VCF of course there are already dedicated merge tools. We could do another for BED, but tabix is totally generic. It has configurations stating which fields to index on. However it strikes me there is one thing missing here in that configuration - collation order. I think tabix can index something working on the assumption that the data is sorted and therefore it's just recording the change from one chromosome to another, but I don't think it would care if that order was chr 1,2,3,...9,10,11 or chr 1,10,11,12...,19,2,20,21,... (ie ascii sort order rather than alphanumeric). That works because the entire index is loaded into memory and a genome chromosome query just fetches that part of index. Maybe I'm wrong here, but I'm pretty sure our indices have no notion of collation. However, you do need to know the collation order when you're merging. For SAM and VCF that's done by the order of the SQ or contig lines in the headers, but no such thing exists in BED (for example), or any other arbitrary tab delimited file. Therefore this would require some substantial tabix additions I think, either a general purpose header parser capability of extracting file specific collation order, or a general purpose mechanism to describe collation orders. Edit: I guess if the files you're merging have themselves already been indexed, then you have a file specific collation order already defined in each index. I guess that's what John is referring to by the |
@jkbonfield would that still be true when usage of this feature would require index files to be there (and get the ordering from there)? |
If we require it to be indexed first, then it becomes an easier proposal and no additional collation order information is needed. (See my dawning realisation in the "Edit:" comment above). Thinking further, I'm assuming now you wanted this to work on regions too (hence indexed files). As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so. A new tool however really helps if you want to do this on parts of the data, which implies indices. That's a (simpler) on-the-fly alternative to the multi-file indexing problem that John mentioned above. I can see the benefit of having such a tool. It's not a totally trivial bit of work though, so I don't see this feature arriving in the immediate future at least. |
In my case the bgzipped BED files are several GB each. I assumed a merging approach should be much faster than a naive full sort of the data. |
Maybe :-) It feels more like an improved version of Unix Anyway in the short term you should probably use bedtools merge: https://bedtools.readthedocs.io/en/latest/content/tools/merge.html |
|
I was looking for using it on whole files. An example for which it would be useful, would be to merge
Any chance it would be considered now? |
Unfortunately we've got 2 new projects on the go in addition to samtools (and the team hasn't been given any more resources) so our resources are stretched. So we can't commit to any specific time line, if ever sadly. It is a reasonable idea though. |
It is a pity your are not getting more resources. Hopefully the situation improves in the future. |
Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting.
The text was updated successfully, but these errors were encountered: