Request for pgen files to include a version number in the header #258
Comments
In the current design, the storage-mode (3rd) and header-format byte(s) (usually just the 12th byte, but allowed by the spec to be multibyte) serve this function. (Note that variant-major .bed files are treated as an old pgen version under this scheme.)
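To make the byte positions concrete, here is a minimal Python sketch of pulling out the two fields that currently act as the version. It is an illustration rather than pgenlib's actual logic: the function name `read_version_bytes` is hypothetical, the two magic bytes are assumed to be the same `0x6c 0x1b` pair used by PLINK 1 .bed files, and it assumes only storage-mode 0x10 carries a single header-format byte at position 12.

```python
# Assumption: .pgen shares the two magic bytes of PLINK 1 .bed files.
PGEN_MAGIC = b"\x6c\x1b"


def read_version_bytes(header: bytes) -> dict:
    """Return the bytes that currently serve as the pgen 'version'.

    `header` is the start of the file (at least 3 bytes; 12 for mode 0x10).
    """
    if len(header) < 3 or header[:2] != PGEN_MAGIC:
        raise ValueError("missing magic number: not a .pgen or .bed file")

    storage_mode = header[2]  # 3rd byte of the file
    info = {"storage_mode": storage_mode, "header_format": None}

    # Assumption: only mode 0x10 is handled here, and its header-format
    # field is the single 12th byte (the spec allows it to be multibyte).
    if storage_mode == 0x10:
        if len(header) < 12:
            raise ValueError("truncated header for storage-mode 0x10")
        info["header_format"] = header[11]  # 12th byte of the file
    return info
```

A reader built this way can reject any storage-mode value it does not recognize, which is how an incompatible file would be detected under the current scheme.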
One explicit question raised by the current design is: what happens when the multiallelic-dosage storage format is finalized? I have tried to set things up so that an older pgen reader will be able to read multiallelic possibly-phased hardcalls from a future pgen generated by a writer that has also chosen to store dosages. This is similar to how one can write a working pgen reader that basically ignores all dosage-related parts of the specification, or all dosage- and phase-related parts, or all dosage-, phase-, and multiallelic-variant-related parts. So there is one future backward-compatible update that is essentially hardcoded into the current spec. But the general case is handled by defining new storage-modes, which may or may not have nice backward-compatibility properties. E.g. the existing pgenlib code includes comments expressing an intent to make 0x11 identical to 0x10 except for a new phase-set data track; if something like that ends up happening, that amounts to a mostly-backward-compatible version update, and it won't be difficult to patch an existing reader to accept the new version while ignoring the new data track.
I am open to making a near-future spec change defining storage-mode 0x11 as "0x10 with a version number", with guarantees on what won't change with each type of version number update.
It does seem like your last suggestion (0x11 as "0x10 with a version number") would be a step in the right direction, though ideally I would go one step further and put a real spec version number right after the magic number in the header. As a related issue, I would also have liked some way for my pgen writer to tag the pgen files I generate as produced by my writer (as opposed to by plink directly), so that they can be distinguished in the future in case there is any issue. In my case, it's mostly pgenlib code doing the actual writing anyway, but I have a C++ layer in front of that, and a Java JNI layer in front of that. So I settled for stamping the VCF header in the accompanying .pvar, similar to what plink2 does, but with my code's version number in it, i.e.,
Planning to make the following addition to the specification soon (and release the corresponding plink2 / pgenlibr / Python pgenlib forward-compatibility updates); let me know if you see any problems. 0x11: Mode 0x10 with ignorable extensions. This adds a few bytes to the end of the header, possibly a few bytes to the end of the .pgen file, and can in principle introduce references to other files. The body of the header (outside this third byte) is as in mode 0x10. The following is appended:
The body of the footer corresponds to (4) followed by (5) above. The PGEN writer identifier is a UTF-8 string, with no terminator. 0x21: Mode 0x20 with ignorable extensions. Header file has mode byte 0x31; extensions work the same way as for mode 0x11.
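As a rough illustration of what "ignorable extensions" means for an existing reader, the Python sketch below maps each extended mode byte back to the mode it extends, so the rest of the reader can proceed as before and skip the appended bytes. The names `EXTENDED_TO_BASE` and `normalize_storage_mode` are hypothetical, and pairing 0x31 with a base mode of 0x30 is an inference from the proposal above rather than something stated in it.

```python
from typing import Tuple

# Extended mode byte -> base mode it extends.
# 0x11 and 0x21 come from the proposal text; 0x31 -> 0x30 is an assumption.
EXTENDED_TO_BASE = {0x11: 0x10, 0x21: 0x20, 0x31: 0x30}


def normalize_storage_mode(mode: int) -> Tuple[int, bool]:
    """Return (base_mode, has_extensions) for a storage-mode byte.

    A reader that does not understand the extensions can parse the file
    as `base_mode` and ignore the extra header/footer bytes.
    """
    base = EXTENDED_TO_BASE.get(mode)
    if base is None:
        return mode, False  # not an extended mode; use as-is
    return base, True
```

For example, `normalize_storage_mode(0x11)` yields `(0x10, True)`, so a reader written for mode 0x10 only needs this one extra step to accept the new files.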
Seems reasonable enough - thanks for the updates. Will the 0x2 PGEN writer identifier be exposed via the API, i.e.,
Yes, this will be exposed soon in the C/C++ API.
Specification has been updated. Sample reading logic (can be invoked by --pgen-info):
We (Broad Institute/All of Us) are planning to start generating some large pgen datasets using a pgen writer that we've implemented, and are a bit concerned that the pgen file format doesn't seem to include an embedded version number. Our pgen writer code is based on an alpha version of plink2, and we're concerned that if the file format definition ever needs to change, there is no way for plink2 or other consumers to detect that a given file is from a past or future version of the format, and is therefore incompatible with the executing code.
Is there any way to handle this currently, and if not, can a version number be added?