ofisette/Formats.jl

Formats

Julia package to read, write and detect formatted data, based on MIME types.

The Formats module provides functions to infer and to specify the format (e.g. image/png) and, optionally, the coding (e.g. application/gzip) associated with a filename or an IO stream. The Formatted objects created by these functions can be passed to the standard IO functions open, close, read and write. Four convenience functions are also provided for interactive use: openf, readf, readf! and writef behave like their counterparts from the Base module, but also detect formatted data automatically. Package developers can integrate their IO routines with Formats by specializing functions and registering the formats or codings they support. If multiple loaded packages support the same format, a “favorite” system lets the user choose which reader/writer to use. Formats also reports ambiguities, such as multiple formats using the same filename extension.

License

You can use Formats under the terms of the MIT “Expat” License; see LICENSE.md.

Installation

Formats is not a registered package. You can add it to your Julia environment by giving the URL to its repository:

using Pkg
Pkg.add(PackageSpec(url="https://github.com/ofisette/Formats.jl"))

Documentation

This documentation gives an overview of the types and functions that form Formats’s public interface. For details, refer to the documentation of individual functions and types, available in the REPL. The basic usage section of the documentation is also accessible from the REPL:

?Formats

Formats provides a basic framework to manage formatted data. However, Formats itself does not define any specific format, nor any function to read or write objects in specific formats. All examples below rely on FormatCodecs for gzip transcoding, and on Dorothy for reading/writing PDB and GRO molecular structures.

Basic usage

using Formats       # Framework for data formats and codings (this package)
using FormatCodecs  # Codecs for common codings (separate package)

To infer the format of a file from its filename:

f = infer("insulin.pdb")

The coding (such as data compression) is also automatically detected:

f = infer("myoglobin.gro.gz")

To infer a format based on a stream’s content:

f = infer(open("insulin.pdb"))

To specify a known file format and, optionally, a coding:

f1 = specify("insulin.dat", "structure/x-pdb")
f2 = specify("myoglobin.dat", "structure/x-gro", "application/gzip")

To check if a format was specified or inferred:

f = infer("kitten.png")
isspecified(f)  # -> false
isinferred(f)   # -> true
isunknown(f)    # -> false
isambiguous(f)  # -> false

To get the detected or specified format/coding:

h = infer("myoglobin.gro.gz")
getformat(h)  # -> "structure/x-gro"
getcoding(h)  # -> "application/gzip"

Functions infer and specify return a Formatted object which can be passed to read and write:

mol = read(specify("insulin.pdb", "structure/x-pdb"))
write(infer("insulin.gro.gz"), mol)

Function read! targets a pre-allocated output object:

read!(infer("insulin.gro.gz"), mol)

Formatted objects created from a filename can be opened, returning a new Formatted object wrapping the underlying IO stream, and those wrapping IO streams can be closed:

f1 = infer("insulin.pdb")
f2 = open(f1)
close(f2)

Functions infer and specify can be called with an existing Formatted object. This will infer format and coding again, or override previous guesses with the specified information:

f1 = infer("insulin.dat")
f2 = specify(f1, "structure/x-pdb")
f3 = infer(open(f2))

Convenience functions openf, readf, readf! and writef automatically infer format and coding if passed a filename or IO stream, but preserve the existing format/coding information if passed a Formatted object:

mol1 = readf("insulin.pdb")
writef("insulin.gro", mol1)
mol2 = read(openf("myoglobin.gro"))
writef(specify("myoglobin.dat", "structure/x-pdb"), mol2)
readf!("lysozyme.pdb", mol2)  # read into the existing object mol2

MIME types

A MIME type (also media type or content type) is a two-part identifier for file formats (see https://en.wikipedia.org/wiki/Media_type). Common examples are “application/gzip”, “image/png” and “text/html”. The Base module of Julia defines the MIME parametric type. For convenience, most functions in Formats accept and return strings (e.g. "image/png"), which are converted to and from MIME as necessary.
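As a brief illustration of the Base MIME type mentioned above, the following sketch uses only functions from Julia's Base module (nothing here is specific to Formats). It shows the two usual ways of constructing a MIME object and the conversion back to a string.

```julia
# Construct a MIME object from a string, or with the MIME"..." string macro.
# The macro expands to the type, so it is called with () to get an instance.
png1 = MIME("image/png")
png2 = MIME"image/png"()

# The media type is stored as a type parameter, so both are the same singleton.
same = png1 === png2

# Converting back to a plain string yields the original identifier.
name = string(png1)   # "image/png"
```

Because the media type is part of the type itself, methods can be dispatched on a specific MIME type, which is presumably why string arguments are converted to MIME internally.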

Guessing and specifying formats and codings

Functions infer and specify can be used to guess or to state explicitly the format/coding associated with a filename or IO stream. These two functions return Formatted (abstract type) objects which wrap the original resource and add format and coding information. A third function, formatted, will infer the format and coding if called with a filename or IO, but will return any Formatted object unchanged, preserving existing format/coding information.

Functions isspecified and isinferred can be used on Formatted objects to check if a format was specified or inferred. Function isunknown tests whether the format of a resource was neither specified nor inferred successfully. If a file extension or stream signature is associated with multiple possible formats, ambiguities can arise; there will be multiple guesses as to the possible format of a resource. This can be checked with isambiguous.
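To make the notions of unknown and ambiguous guesses concrete, here is a toy sketch of extension-based inference. It is not how Formats is implemented: the lookup tables, the guess function and the two "application/x-toy-*" types are hypothetical names invented for this illustration, and only splitext comes from Base.

```julia
# Hypothetical lookup tables (Formats maintains its own registry).
const EXT_FORMATS = Dict(
    ".pdb" => ["structure/x-pdb"],
    ".gro" => ["structure/x-gro"],
    # Made-up case: one extension claimed by two formats.
    ".dat" => ["application/x-toy-a", "application/x-toy-b"],
)
const EXT_CODINGS = Dict(".gz" => "application/gzip")

# Guess the candidate formats and the coding from a filename.
function guess(filename::AbstractString)
    stem, ext = splitext(filename)
    coding = get(EXT_CODINGS, ext, nothing)
    if coding !== nothing
        # The outer extension was a coding; look at the inner one for the format.
        stem, ext = splitext(stem)
    end
    formats = get(EXT_FORMATS, ext, String[])
    return (formats = formats, coding = coding)
end

g = guess("myoglobin.gro.gz")
unknown = isempty(g.formats)         # no candidate format at all
ambiguous = length(g.formats) > 1    # more than one candidate format
```

In this toy model, guess("results.dat") would be ambiguous and guess("notes.xyz") would be unknown, which mirrors the situations that isambiguous and isunknown report.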

Functions getformat and getcoding can be used on Formatted objects, unless their format is unknown, in which case an error is thrown.

Reading and writing formatted data

Formatted objects can be passed to standard IO functions read, read! and write. In addition, Formatted objects wrapping filenames can be opened, and those wrapping IO streams can be closed. No other IO functions are supported on Formatted objects, but resource can be used to get the underlying filename or IO stream of a Formatted object to query or manipulate it directly.

Formats offers four convenience functions: openf, readf, readf! and writef (where f stands for formatted). These operate just like their counterparts open, read, read! and write, but will automatically infer the file format when acting on a filename or IO stream. When acting on a Formatted object, they will preserve the existing format/coding information.

Adding formats and codings

To integrate your own packages with Formats, you first need to add the formats/codings you wish to support to the Formats registry. This is done via the addformat and addcoding functions, which are not exported by default. Multiple registrations of a format or coding will be ignored, so you do not need to worry about other packages also registering the formats you support.

To take advantage of the infer function, you should use addextention and addsignature (not exported by default) to register the filename extensions and stream signatures associated with your formats. Multiple registrations of the same extension or signature for the same format will be ignored. However, associating the same extension or signature to multiple formats will result in a warning since it introduces ambiguities when inferring formats.

Implementing readers, writers and codecs

Once you have registered the necessary formats, extensions and signatures, you need to create a type that identifies your reader/writer. This should be a singleton specializing the FormatHandler abstract type (not exported by default). This new type should then be registered using addreader and addwriter (also not exported by default). Finally, you must specialize read, read! and/or write; the specific signatures are documented.

Codings can be associated with their appropriate decoder/encoder using setencoder and setdecoder (not exported by default). Note that there is no addencoder/adddecoder; only a single decoder and encoder can be associated with a given coding.

Registration inside a module must happen at initialization time

Inside a module, functions that modify the global registry in Formats must be called inside __init__, the optional special function that initializes the module. This means that any call to addformat, addcoding, addextention, addsignature, addreader, addwriter, setencoder, or setdecoder in your packages must happen inside __init__. This is also true of preferreader and preferwriter, but these functions should not be called outside the main environment anyway since choosing readers/writers should be done by the users rather than package developers.

Calling the above-mentioned functions in the global scope of a module rather than inside __init__ will give unexpected results: in the main environment, the Formats registry will be empty. This is because these functions modify the global variable registry in Formats from a different module. This must happen at run-time and not when pre-compiling the module.
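The general pattern can be sketched with a self-contained toy example. The modules ToyRegistry and ToyPlugin and the function addentry! are hypothetical stand-ins invented for this illustration; they do not reproduce the actual signatures of addformat and related functions, which are documented in the REPL.

```julia
# Stand-in for a package that owns a global registry (hypothetical).
module ToyRegistry
const registry = Dict{String,Vector{String}}()

# Register an extension for a format; repeated registrations are ignored.
function addentry!(format::AbstractString, ext::AbstractString)
    exts = get!(registry, format, String[])
    ext in exts || push!(exts, ext)
    return nothing
end
end

# Stand-in for a package that supports a format (hypothetical).
module ToyPlugin
using ..ToyRegistry

# Runs when the module is loaded at run-time, not during pre-compilation,
# so the registration ends up in the live registry.
function __init__()
    ToyRegistry.addentry!("image/png", ".png")
end
end

registered = ToyRegistry.registry["image/png"]
```

Had the addentry! call been placed at the top level of ToyPlugin in a pre-compiled package, it would have modified the registry only in the pre-compilation process, leaving it empty in the session that later loads the package.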

Selecting a specific reader/writer

When multiple readers or writers are available for a given format, a specific reader or writer can be selected, either for a single format or on a global basis. This is done via preferreader and preferwriter (not exported by default).

See also

  • FormatCodecs: Integrate common transcoders with Formats (recommended).

  • FormatStreams: Read and write series of formatted objects in IO streams.

  • TranscodingStreams: The basis for codings, encoders and decoders in Formats (dependency).

  • FileIO: The inspiration for Formats; a different package that provides similar functionality, but with a more centralized approach.
