We have formats that parse ambiguously. For example, a Keynote document is a JPEG "at the head" and a ZIP with a specific structure "at the tail". A CR2 is a TIFF until considered otherwise. A TIFF is somewhat CR2-ish until considered otherwise. An Office document is a ZIP initially...
The number of these is only ever going to increase (see the library grounding principles). Currently we are at the stage where we litter the code with workarounds like "if this is also a CR2, bail out", "if this is also a ZIP, it is a Keynote file so bail out..." and so forth. What if, instead of doing this, we were to do the following:
1. Apply all the low-level parsers, always.
2. Apply some "folder" or "matcher" strategy to the flat list of results. For example, if something is matched as both a JPEG and a ZIP and has a specific file structure, we can assume it is Keynote. We then take the two results and smash them together into one that states the Keynote file type unambiguously. Likewise, if we see the Office ZIP filenames in the file, we convert the ZIP result into a Word file result.
This does clash with the idea of applying "at most as many parsers as were requested", but we would get much more intuitive operation in return, and we could remove quite a few hacks.
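To make the two-step idea concrete, here is a minimal sketch in Python. It is not the library's actual API: the `Result` shape, the `fold_*` functions, the `detect` entry point, and the format labels are all hypothetical, and the two marker entry names are illustrative assumptions about what the ZIP parser might report.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """One parser's verdict (hypothetical shape, not the library's real result type)."""
    format: str
    details: dict = field(default_factory=dict)


# Marker entry names are assumptions for illustration only.
KEYNOTE_MARKER = "Index/Document.iwa"
WORD_MARKER = "word/document.xml"


def _find(results, fmt):
    return next((r for r in results if r.format == fmt), None)


def fold_keynote(results):
    """JPEG at the head plus a ZIP with the Keynote marker becomes one Keynote result."""
    jpeg, zip_ = _find(results, "jpg"), _find(results, "zip")
    if jpeg is None or zip_ is None:
        return results
    if KEYNOTE_MARKER not in zip_.details.get("entries", []):
        return results
    merged = Result("keynote", {**jpeg.details, **zip_.details})
    return [r for r in results if r is not jpeg and r is not zip_] + [merged]


def fold_office(results):
    """A ZIP result containing the Word marker entry is relabelled as a Word result."""
    return [
        Result("docx", r.details)
        if r.format == "zip" and WORD_MARKER in r.details.get("entries", [])
        else r
        for r in results
    ]


def fold_cr2(results):
    """If the CR2 parser matched, the generic TIFF match is redundant and is dropped."""
    if _find(results, "cr2") is None:
        return results
    return [r for r in results if r.format != "tif"]


# Order matters: Keynote consumes the ZIP result before the Office folder sees it.
FOLDERS = [fold_keynote, fold_office, fold_cr2]


def detect(data, parsers):
    # Step 1: apply every low-level parser, always; a parser returns None on no match.
    results = [r for parse in parsers if (r := parse(data)) is not None]
    # Step 2: fold the flat list of results into unambiguous ones.
    for fold in FOLDERS:
        results = fold(results)
    return results
```

With this layout, each ambiguity lives in one small folder function rather than as a "bail out" check inside an unrelated parser, and adding a new ambiguous format means adding one more entry to `FOLDERS`.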