JSON schema #44
@ondovb This looks like a great start! A few items we've been tracking here that it'd be great to include (we were literally just whiteboarding this):
A few more tactical items and questions:
|
@boydgreenfield Thanks for the feedback.
Counts make sense, and are in the Mash Cap'n Proto. What kind of trimming parameters did you have in mind?
Seems like as good a way as any.
Usually, but for Mash it could also be a fasta tag if
It is the raw sequence length (or, for read sets, a genome size estimate based on k-mer content). We only use this for p-value calculation. The number of valid k-mers is currently not tracked in Mash, but I agree that allowing more information about this would be better.
I assume you mean only allowing nucleotide/protein options? For Mash we kept it as generic as possible in case someone would want to do text mining or something, but I could see the argument for that.
Makes sense. Regarding the hash function itself, I have it as an enum but am not sure if that's the best way. It would ensure specificity but would also require a new schema version to use any other hash function. In general we could also allow additional fields ( |
The suggested JSON format appears to be extremely inefficient. BBMap's sketches look like this: #SZ:30 CD:AD GS:1430 ID:393251 NM:Paenibacillus nanensis NM0:gi|343200804|ref|NR_041491.1| Paenibacillus nanensis strain MX2-3 16S ribosomal RNA gene, partial sequence ...etc. They are coded in 2-bit format ASCII-48 with delta-compression so even when you gzip them the size is only reduced by ~30%. So, they are extremely efficient to store and load. I suggest, if the goal is to make an efficient interoperable standard, that you adopt something similar and abandon JSON. |
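To make the delta-compression idea concrete, here is a minimal sketch of delta-encoding a sorted list of hash values. It is not BBMap's actual encoder (the 2-bit / ASCII-48 text coding is not reproduced here), and the function names `delta_encode` and `delta_decode` are made up for illustration.

```python
from itertools import accumulate

def delta_encode(hashes):
    """Sort the hash values and store each one as the gap from its predecessor."""
    deltas = []
    previous = 0
    for value in sorted(hashes):
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas):
    """Rebuild the sorted hash values by taking a running sum of the gaps."""
    return list(accumulate(deltas))

encoded = delta_encode([110, 100, 103])
decoded = delta_decode(encoded)
```

Because a bottom-k sketch keeps the smallest hash values, the gaps tend to be much smaller than the values themselves, which is presumably where the space saving comes from.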
@bbushnell : binary formats are pretty much always more efficient than text formats, and I'd expect tools that care about performance to use their own representation. IIUC this is an initial attempt at a data exchange format between tools. I'd say that JSON is attractive because of the ubiquitous availability of tools around it (libraries exist for all languages), and because it defines basic structures like arrays and key-value maps, which lets us focus on the content first (what should the format contain). As use-cases emerge there might be a need to optimize, maybe in incremental steps (2-bit packing of k-mers, although this is limited to {A,(T|U),C,G} sequences; encoding the arrays of minhash values as byte-packed strings; etc.), but this would be for later? |
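For reference, the 2-bit packing mentioned above could look roughly like the following. This is only an illustrative sketch: the base-to-code mapping (A=0, C=1, G=2, T/U=3) and the names `pack_kmer` and `unpack_kmer` are assumptions, not part of any proposed schema.

```python
# Assumed mapping for illustration; T and U share a code.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}
BASES = "ACGT"

def pack_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer.upper():
        if base not in CODES:
            # Anything outside {A, C, G, T/U} (e.g. N, amino acids) cannot be packed.
            raise ValueError(f"cannot 2-bit encode base {base!r}")
        value = (value << 2) | CODES[base]
    return value

def unpack_kmer(value, k):
    """Recover a k-mer of length k from its packed integer (U comes back as T)."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))

packed = pack_kmer("ACGT")
```

The length k has to be stored separately, since leading A's pack to zero bits, and the `ValueError` branch is exactly the limitation noted above for non-ACGT alphabets.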
BBMap's sketch format is text, not binary, as you can see from my post - that is the exact, literal, first 9 lines of a BBMap sketch. Binary might be more efficient, but then you can't look at the sketches in a text editor, so I'm not really interested in that. They already are only 150% of their gzipped size, so I don't think it's much of a problem. |
Oh - as for noncanonical bases... yes, you're right. My sketch format can only accommodate ACGT. I don't think this is a problem, though, because... well, what is the goal of sketches? It's to rapidly evaluate whether sequences are similar. Does anyone care whether you have a poly-N sequence that matches everything? ...no. |
@bbushnell I agree with @lgautier that interoperability is the primary goal of this effort, above efficiency. I think the point is that if parsing requires any other custom code or less-than-mainstream libraries, one might as well maximize efficiency with a binary format. This was certainly the motivation behind our use of Cap'n Proto serialization (which actually does provide a schema and libraries for several languages). The ASCII encoding is an interesting middle ground if we want to compress the string within the JSON in the future, but any such solution would have to support the protein alphabet at the very least. |
@boydgreenfield's suggestion to add count-based trimming points out that, for DNA, RNA, or protein data, the definition of a minhash sketch extends beyond the definition of a hash function (which the redundancy of sharing k-mers/n-tuples alongside their hash values would empirically verify when a sketch is shared) and should also cover, to some extent, the nature of the data shared and the pre-processing that led to the minhash sketch. In a way this is part of the "metadata" that was also suggested as an addition. Fully defining it is a complex problem that should probably stay out of scope, but at the same time the information might be important for making use of the sketch / signature (one of the reasons they are exchanged in the first place). For example, whether a DNA minhash sketch is built from a complete assembled genome or from shotgun sequencing reads for a given genome influences what the sketch means and how it can be used. I am thinking more specifically of the use-case where the subset of k-mers constituted by a sketch is used to query a database / service about whether it has a matching signature. With a convention, the server might be able to answer in the best way (e.g., prioritize / adjust thresholds when searching). My initial feeling is that this looks like opening a Pandora's box, but I also think that the major use-cases can be defined and covered well enough to have a practical exchange format. Does the notion of hash-value-level metadata and minhash-sketch-level metadata seem like an interesting starting point?
|
I think I agree with @ondovb: the alphabet explicitly defines the space of k-mers / n-grams. It is not space-optimal (e.g. all amino acids are repeated with each minhash sketch of polypeptides), but the minhash sketch itself is likely to take much more space anyway. It would also allow exotic bases, and all sorts of oddities that synthetic biology may come up with. |
The definition of the hash function can be relaxed to a string. There can be commonly agreed-upon hash functions, but even so the redundancy of sharing hash values along with their originating k-mers/n-tuples is there to empirically double-check it. |
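A rough sketch of what that empirical double-check could look like on the reading side: look up the hash function by its name string, re-hash the shared k-mers, and compare against the shared hash values. The registry, the name `"toy-sha1-64"`, and the SHA-1-based stand-in hash are all assumptions made up for this example; it is not the hash function Mash uses.

```python
import hashlib

def toy_hash(kmer):
    """Stand-in hash for illustration: first 8 bytes of SHA-1 as an unsigned int."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")

# Hypothetical registry mapping the hash-function string in a sketch to code.
HASH_FUNCTIONS = {"toy-sha1-64": toy_hash}

def verify_sketch(hash_name, kmers, hashes):
    """Return True if every shared k-mer re-hashes to its shared hash value."""
    hash_fn = HASH_FUNCTIONS.get(hash_name)
    if hash_fn is None:
        raise KeyError(f"unknown hash function: {hash_name!r}")
    if len(kmers) != len(hashes):
        return False
    return all(hash_fn(k) == h for k, h in zip(kmers, hashes))

kmers = ["ACGT", "TTGA"]
hashes = [toy_hash(k) for k in kmers]
ok = verify_sketch("toy-sha1-64", kmers, hashes)
```

A reader that does not recognise the name string can still load the sketch; it just cannot run the check.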
That's fine. All I care about is efficiency, which is the point of min-hash-sketch. I'm surprised that you guys are willing to compromise efficiency for a basically intangible benefit of interoperability which may or may not happen. Good luck! |
@ondovb |
@lgautier @ondovb I agree on both fronts re: the @bbushnell I think the point here is to get to something easy enough to use for interoperability, and so we should try to optimize for parse-ability and ease-of-correct-implementation over efficiency. E.g., we've actually been storing all of our min-hashes as binary data in Postgres. |
If you are interested in interoperability, doesn't it make more sense to store data as text? Personally, I consider binary formats to be inherently non-interoperable. |
To emphasize this - I have written a lot of tools. All of them support text formats. I have zero interest in writing programs to read custom binary formats that are language-specific or format-specific, when they are less efficient than a text-based protocol. |
@bbushnell There may be a slight misunderstanding here. Text vs. binary was maybe not the best way to describe it, and it was (inaccurately) implied that your format was the binary one. In other words, everyone has a text format, and this is not why JSON is being considered. |
@boydgreenfield Yes, I was suggesting the |
I've updated
We plan on updating Mash to read and write the format as proposed soon, but others are welcome to continue working on standards related to metadata or to create a shared repo. For a name, I would like to propose Jam (JSON MinHash), in keeping with the edibles theme :P |
"Jam" has a nice ring, as it can also mean an informal and spontaneous musical performance. Visibility in search engines might be another matter though. I am about ready to write read/write code for that format, but I have a question about the license for the JSON definition being discussed: what is it released under? (CC-like would seem to make sense.) |
@lgautier Public domain (I'm a govt employee). If someone else wants to open a repo and merge contributions from others, then I'd vote for CC0. |
Thanks. Public domain is good to start. We can see later on whether anything else is needed because of contributions or the like... |
In case anyone is looking for the schema: the URL at the top of this thread appears to be no longer valid. It is here: https://github.com/marbl/Mash/blob/master/src/mash/schema-1.0.0.json |
A first pass of the JSON schema is in the Mash repo:
https://github.com/marbl/Mash/blob/master/src/mash/schema.json
For now, I put k-mers as a separate array parallel to hashes rather than an array of tuples, since the latter seemed unwieldy, especially if they are optional.
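To illustrate the parallel-array layout, here is a small reading example. The field names (`"kmer"`, `"hashes"`, `"kmers"`) and the values are placeholders chosen for this sketch and may not match the actual schema; the point is only that the k-mer array can be dropped without changing the shape of the hash array.

```python
import json

# Hypothetical sketch record; field names and values are assumptions.
sketch_json = """
{
  "kmer": 4,
  "hashes": [11, 42, 97],
  "kmers": ["ACGT", "CCGA", "TTAG"]
}
"""

def iter_pairs(sketch):
    """Return (hash, k-mer) pairs; the k-mer is None when the optional array is absent."""
    hashes = sketch["hashes"]
    kmers = sketch.get("kmers")
    if kmers is None:
        return [(h, None) for h in hashes]
    if len(kmers) != len(hashes):
        raise ValueError("kmers and hashes must have the same length")
    return list(zip(hashes, kmers))

sketch = json.loads(sketch_json)
pairs = iter_pairs(sketch)
```

With an array of tuples, every entry would need a placeholder when k-mers are omitted; with parallel arrays the only extra requirement is the length check.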