"Product" formats/natures (formats that are both X and Y) #103

julik · 2018-04-17T20:49:50Z

We have formats that parse ambiguously. For example, a Keynote document is a JPEG "at the head" and a ZIP with a specific structure "at the tail". A CR2 is a TIFF until considered otherwise. A TIFF is somewhat CR2-ish until considered otherwise. An Office document is a ZIP initially...

The number of these is only ever going to increase (see the library grounding principles). Currently we are at the stage where we litter the code with workarounds like "if this is also a CR2, bail out", "if this is also a ZIP, it is a Keynote file so bail out..." and so forth. What if, instead of doing this, we were to do the following:

1. Apply all the low-level parsers, always.
2. Apply some "folder" or "matcher" strategy to the flat list of results. For example, if something is matched as a JPEG and a ZIP and has a specific file structure, we can assume it is Keynote. We then take the two results and smash them together into one which states the Keynote file type unambiguously. If we see the Office ZIP filenames in the file, we convert the result into a Word file result.
3. Return the folded list to the caller.

So the procedure would look somewhat like this:

initial_results = parsers.map { |p| p.call(io) } # => [JPEG, ZIP]
results_with_complex_types = fold_complex_filetypes(initial_results) # => [Keynote]
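
To make the folding step a bit more concrete, here is a minimal sketch of what `fold_complex_filetypes` could look like. Everything in it is an assumption for illustration: the `Result` struct, the `FOLD_RULES` table, the `:nature` and `:entries` fields, and the marker filenames (`index.apxl`, `word/document.xml`) used as structure checks are hypothetical and not the library's actual API.

```ruby
# Hypothetical result object; the library's real parser results may look different.
Result = Struct.new(:nature, :entries)

# Hypothetical rule table. Each rule names the low-level natures it consumes,
# the product nature it yields, and a structural check on the matched results.
FOLD_RULES = [
  {
    consumes: [:jpeg, :zip],
    produces: :keynote,
    # Assumed marker entry, used here only as a stand-in for "a specific file structure".
    check: ->(matched) { matched.any? { |r| r.entries.include?("index.apxl") } }
  },
  {
    consumes: [:zip],
    produces: :docx,
    # Assumed marker entry standing in for "the Office ZIP filenames".
    check: ->(matched) { matched.any? { |r| r.entries.include?("word/document.xml") } }
  }
].freeze

# Takes the flat list of low-level results and collapses the first matching
# combination into a single product result. Unmatched lists pass through unchanged.
def fold_complex_filetypes(results)
  results = results.compact # parsers that did not match are assumed to return nil
  FOLD_RULES.each do |rule|
    matched = results.select { |r| rule[:consumes].include?(r.nature) }
    # Every nature the rule consumes must be present in the results
    next unless matched.map(&:nature).uniq.sort == rule[:consumes].sort
    next unless rule[:check].call(matched)
    return (results - matched) + [Result.new(rule[:produces], [])]
  end
  results
end

jpeg = Result.new(:jpeg, [])
zip  = Result.new(:zip, ["index.apxl"])
p fold_complex_filetypes([jpeg, zip]).map(&:nature) # prints [:keynote]
```

Ordering the rules from most specific to least specific means the JPEG plus ZIP combination is tried before the plain ZIP rule, which is one way to drop the scattered "if this is also a ZIP, bail out" checks from the individual parsers.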

This does clash with the idea of running "at most as many parsers as were requested", but we would get much more intuitive operation in return, and we could remove quite a few hacks.

julik added enhancement apidesign labels Apr 17, 2018

julik changed the title ~~Idea: "product" formats/natures~~ "Product" formats/natures (formats that are both X and Y) Apr 22, 2018
