-
-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cram support #54
Comments
I have no objection to CRAM support, it's just that I don't use it and the spec is pretty big. In the long run I guess we should support it. Would you like to write up a proposal for the API here? |
Yes, I will write up a proposal here. A cursory look shows that a cursory look is insufficient for me to get a good handle on what is needed, but I'll tackle this in January. |
For the codecs I listed above, do you agree that BinaryUnmarshaler is the obvious interface? there will be a lot of machinery around that for CRAM, but implementing each of those is no small task either... |
That seems fair in the first instance. |
Does that have implications on the bam writing method name for hts/bam and the implementation of BinaryUnmarshaler for sam.Record? |
I'm not sure.
might be a better interface unless we unsafely convert the []bytes. After that, I was struck by the amount of work this will be. |
Indeed. I spent a day reading and sketching about a year back and decided I didn't care enough. |
I'll note that the Sanger Institute now produces cram as our "raw" sequencing data, that the EBI's sequencing archive is pushing toward cram (and may have to go back and convert everything to cram to save space), and major collaborative projects tend to use cram as their interchange format. Sooner rather than later, cram support is going to be vital. I'd be interested in investigating the use of this go API for use at the Sanger, but without cram support it's a non-starter. I'd suggest looking at how C htslib added sort-of-seamless cram support (in a relatively short amount of time?), and make cram/bam support in this API even more seamless. |
@sb10 I had a look at the htslib cram fs and it is not trivial - as is the case for the spec. Yes, it's something that we should have (hence this issue), but developer resources are limited. Just the codec code is 6.3ksloc of dense htslib-style C and the entire fs is 14.5ksloc. An alternative I used previously would be to interface to the C (this is what sambamba does) - this is not an option here since we want hts to be pure Go and the data representation differences would require significant massaging at the interface. |
The are two C implementations with a shared common history: https://github.com/samtools/htslib/tree/develop/cram Io_lib was the original, which got copied to htslib, but some divergence happened immediately thereafter (htslib-ifying the code basically). Both have been maintained since then. My code can never generate gamma, elias gamma or subexponential coded data and frankly it was an accident that I left in the beta code. Furthermore it only uses huffman in the most trivial case of one single symbol with 0-bit length code. It's entirely up to you which you use for the encoder too - so start with the minimal necessary. I'm not sure which codecs the Java implementation can use, but it's reduced over time. The basic premise now is that ideally the CORE block shouldn't be used at all as it harms partial the decoding ability (eg "I don't care about qualities and aux tags, but get me the rest"). I'd also suggest you use the PS. Learn from my mistake - separate out reference handling into its own module, preferably not CRAM specific as lots of tools need reference querying. It's a complicated enough beast by itself that it warrants being its own thing. I think that code has been my biggest source of bugs! |
@kortschak I agree that remaining pure Go is an important feature. If you do find the time and resources to work on this though, as @jkbonfield alludes to, the current implementations are more important than the spec (which may need to be improved). So just stick to codecs that are already implemented in Go (if possible); James told me the main ones used right now are gzip, bzip2, lzma, rANS, huffman in single-code-only mode (so not huffman at all; just "const"). As long as it passes round-trip tests with samtools, it will be good enough for use. |
@jkbonfield thanks for the insight. so what codecs are seen in the wild? @kortschak maybe biogo/hts could get a summer of code student to work on cram support. |
Here's my immediate method for supporting cram in the short term: type reader struct {
io.ReadCloser
cmd *exec.Cmd
}
func (r *reader) Close() error {
if err := r.cmd.Wait(); err != nil {
return err
}
return r.ReadCloser.Close()
}
// NewReader returns a bam.Reader from any path that samtools can read.
func NewReader(path string, rd int, fasta string) (*bam.Reader, error) {
cmd := exec.Command("samtools", "view", "-T", fasta, "-b", "-u", "-h", path)
cmd.Stderr = os.Stderr
pipe, err := cmd.StdoutPipe()
if err != nil {
return nil, err
}
if err = cmd.Start(); err != nil {
pipe.Close()
return nil, err
}
cr := &reader{ReadCloser: pipe, cmd: cmd}
return bam.NewReader(cr, rd)
} |
Before we have functionality, I'd like to see a proposal for the API. If it is the case that the API will be exactly the same as it is now, them that's fine, but it needs to be discussed. For an idea of what I'm looking for see some of the briefer golang proposals.
|
API-wise, things to consider for reading would be a way to indicate which fields are required by the caller and which are not (eg maybe we don't care about qualities, or aux tags) and whether there is a need to generate MD and NM tags. This is irrelevant in BAM as it costs the same time regardless, but can be significant in CRAM (consider On the encoding side there are more options than BAMs compression level; basic CRAM compression level too of course but also how many reads/bases per slice, whether to do reference based encoding, embedded references or referenceless, what compression codecs you are willing to make use of (zlib for sure, but also maybe bzip2 and/or lzma have their uses and are part of CRAM spec). These (except currently bzip2/lzma) are all controllable directly from samtools command line too via the --input-fmt-option and --output-fmt-option arguments, or as -I and -O format adjustments. Eg |
The issue with the API above is that it takes a path rather than an |
At the moment the Are you thinking of per field or at the level of granularity we have here? |
I have started a stub proposal here that's intentionally vague: https://gist.github.com/brentp/57850c10342c21e9b60fa2f83999464f please let me know what should be expanded and/or changed. |
Sorry to take so long. Would you send that as a proposal PR adding it as |
@brentp ping |
shoot. this slipped past me somehow. will open a PR today. |
@jkbonfield I have made a start on writing the major plumbing for cram support here, working from the spec. I have a few questions.
thanks |
|
Thanks. The issue in 3. is related to "core" blocks. The figure in section 8.6 shows the problem; how can I know where CRAM record n starts? While it may be the case that this block type is avoided, I still need to be able to handle it if the code encounters it. |
The answer is that you don't know where the CRAM record n starts without decoding record 0 to n-1 first as there isn't an index within a block. Each record starts with the next bit in CORE or the next byte in the EXTERNAL blocks, starting from 0 for the first. I believe the original intention in the early days of CRAM was to entirely use CORE and to have a bit-level index so that full random access was possible, but that was before I got involved and it had changed considerably by then. |
OK. So the follow up for that is how does a reader of the bit stream in record n know when that record has finished having reached the beginning of it. This is just a function of the codec behaviour on the bit stream? |
The bits read are entirely defined by the CRAM code structure itself. This is one of the things I struggled with early on and it's still not explicitly defined in the spec I fear for the order that things are encoded/decoded in. This really needs fixing, but it's not something I've had time to do and it's a large task to follow all the code paths and document them (sadly). Basically see https://github.com/samtools/htslib/blob/develop/cram/cram_decode.c#L2386 for the start of the decode loop. It does things like this:
etc Each of the codecs is defined by the compression header. It gets given the core "blk" but the codecs are also able to read from the external blocks. The code structure itself is what defines where a read finishes and the next starts. It ends with decoding of the QS records. We'd welcome spec improvements though. At the very least the tables of data series should be rewritten to be in the same order that the code uses. |
Howdy, is there likely to be CRAM support in the near future? |
There is some work towards it in this branch. It is not likely that there will be CRAM support in the near future; I don't have either the work need or the available time (nor strength to cope with the spec) to do this at the moment. Someone with at least two of the previous three items should be able to pick it up without too much trouble though - a significant portion of the work has been done. |
I have merged the CRAM support branch in. It is still a work in progress. The broad architecture of the format/container handling is implemented, but there needs to be work done on handling the details within blocks and slices. |
For CRAM 4 (ongoing - it's a "technology demo" at the moment rather than a recognised standard) I've been documenting the codecs as pseudocode and producing some simple implementations in Javascript (as a language with minimal syntactic sugar to get in the way of an explanatory algorithm). Sorry for slowly moving the goal posts! Some of those changes though have also trickled down to improving the CRAM 3 documentation. The latest official CRAM spec, incase you haven't downloaded it recently, is therefore far easier to understand for things like rANS implementations due to the addition of the pseudocode. I'd also encourage you to look at the GMOD/cram-js for a clean easy to follow implementation, unlike the C version which is rather convoluted! GMOD/cram-js is read only, but if you want to see my Javascript rANS encoder too then the current progress is over at https://github.com/jkbonfield/hts-specs/blob/cram_reference/cram_reference/rans.js. |
Oh I forgot to add, see https://github.com/samtools/hts-specs/pull/383/files too for the rANS pseudocode in the spec. It needed some edits! |
we are starting to have a push to shift to CRAM over BAM for storage reasons. I know this will be a huge development effort, but perhaps we can start with a few discrete tasks. Here are some obvious ones:
I could prototype some of those if we agree on an API and if there is interest in having cram.
The text was updated successfully, but these errors were encountered: