Proposal implementation of IOResource #45

Open
wants to merge 1 commit into
base: master

Conversation

magicDGS
Member

Following today's discussion, I made some changes to the idea in #42 and am providing an implementation instead of a document to illustrate my point. The idea is:

  • Each resource is identified by its URI scheme. The file scheme and the null (absent) scheme are reserved for the default implementation, which is java.nio.file.Path. This has the following advantages:
    • Non-path resources can wrap a path handled by java.nio.file.Path. Use cases include GenomicsDB, which could be specified as gendb:hdfs://${path} or gendb:${local_path}, and SRA accessions/files.
    • Supports Windows paths, which cannot be parsed as URIs
    • Out-of-the-box support for installed FileSystemProviders
  • Each resource is responsible for defining how its data is read and written. This is designed as a low-level abstraction that hides the implementation, and most library users will not need to cast the interfaces.
  • Resources should be retrieved ONLY through the IOResourceFactory get methods. An IOResourceProvider can be registered either through the service API or directly in the factory. The default implementation (java.nio.file.Path) is not exposed and cannot be overridden. This makes it possible to fall back to java.nio.file.Path in every case, while staying flexible enough to let users define their own providers.
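To make the dispatch rule above concrete, here is a minimal sketch of scheme-based lookup with a Path fallback. It is not the code in this PR: every type and method name (ResourceSketch, PathResourceSketch, OpaqueResourceSketch, ResourceFactorySketch, schemeOf) is hypothetical, and treating a single leading letter as a Windows drive rather than a scheme is an assumption of the sketch.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical names throughout; this is not the code from the PR.
interface ResourceSketch {
    String scheme();
}

// Default resource backed by java.nio.file.Path; providers cannot replace it.
final class PathResourceSketch implements ResourceSketch {
    final Path path;
    PathResourceSketch(Path path) { this.path = path; }
    @Override public String scheme() { return "file"; }
}

// Example custom resource that keeps the scheme-specific part of the input.
final class OpaqueResourceSketch implements ResourceSketch {
    final String scheme;
    final String rest;
    OpaqueResourceSketch(String scheme, String rest) { this.scheme = scheme; this.rest = rest; }
    @Override public String scheme() { return scheme; }
}

final class ResourceFactorySketch {
    private static final Map<String, Function<String, ResourceSketch>> PROVIDERS =
            new ConcurrentHashMap<>();

    private ResourceFactorySketch() { }

    // Registers a provider for a scheme; "file" stays reserved for the default.
    static void register(String scheme, Function<String, ResourceSketch> provider) {
        String key = scheme.toLowerCase(Locale.ROOT);
        if (key.equals("file")) {
            throw new IllegalArgumentException("the file scheme is reserved for the default");
        }
        PROVIDERS.put(key, provider);
    }

    static ResourceSketch get(String spec) {
        String scheme = schemeOf(spec);
        if (scheme == null) {
            // No scheme (plain or Windows-style path): use the default file system.
            return new PathResourceSketch(Paths.get(spec));
        }
        Function<String, ResourceSketch> provider = PROVIDERS.get(scheme);
        if (provider != null) {
            // The provider receives everything after "scheme:".
            return provider.apply(spec.substring(scheme.length() + 1));
        }
        // Unregistered scheme: defer to whatever FileSystemProvider is installed.
        return new PathResourceSketch(Paths.get(URI.create(spec)));
    }

    // Returns the lower-cased scheme, or null when the prefix does not look like one.
    static String schemeOf(String spec) {
        int colon = spec.indexOf(':');
        if (colon < 2 || !Character.isLetter(spec.charAt(0))) {
            return null; // no colon, or a single letter such as a drive ("C:")
        }
        for (int i = 1; i < colon; i++) {
            char c = spec.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '+' && c != '-' && c != '.') {
                return null;
            }
        }
        return spec.substring(0, colon).toLowerCase(Locale.ROOT);
    }
}
```

In this sketch a nested specifier such as gendb:hdfs://${path} would reach a registered gendb provider with the hdfs part intact, so the provider could in turn resolve it through the factory. An unregistered scheme with no installed FileSystemProvider would surface as a FileSystemNotFoundException from Paths.get.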

A custom implementation of an IOResource might look like this:

import java.io.IOException;

public class SRAAccession implements IOResource<SRAReader, Object> {
    private final String accNumber;
    public SRAAccession(final String accNumber) { this.accNumber = accNumber; }
    public String uriIdentifier() { return "sra"; }
    public SRAReader openReader() throws IOException { return new SRAReader(accNumber); }
    // SRA accessions are read-only, so opening a writer always fails.
    public Object openWriter() throws IOException { throw new IOException("Accession cannot be overwritten"); }
}

A consumer would use it like this (this makes many assumptions, because we don't know yet how the high-level record interfaces will be implemented, but the reader/writer will take care of the IOResource: its scheme and/or input/output):

// Reads from an SRA accession, resolved through the "sra" scheme.
final RecordReader<Read> sraReader = new RecordReaderBuilder(Read.class)
                                          .addReadsSource(IOResourceFactory.get("sra:SRA011223"))
                                          .build();
// Reads from a local CRAM file, with the reference on HDFS.
final RecordReader<Read> cramReader = new RecordReaderBuilder(Read.class)
                                          .addReadsSource(IOResourceFactory.get("~/reads/sample1.cram"))
                                          .addReferenceSource(IOResourceFactory.get("hdfs://example.com/references/hg19.fasta"))
                                          .build();

This is the alternative that I have in mind instead of #34.

@cmnbroad
Collaborator

cmnbroad commented Jan 16, 2019

One observation: there isn't actually that much overlap between this PR and mine (though there is some), and in some ways, I think the two branches complement each other.

Much of my PR is about taking a raw, unstructured input string provided by a user and turning it into a structured (PathSpecifier/URI) object that is always guaranteed to have a valid scheme, and which can subsequently be used to locate a "reader" that can handle that scheme. The reader resolution service in progress in my other branch (not yet a PR) looks remarkably like the one in this branch, except that the providers are at a higher level than IOResource is here; they're more like a SAMReader. There is a registry service that, given a PathSpecifier, queries registered providers to find one that claims to be able to render records (of whatever type is being requested) from that PathSpecifier. The winner is then instantiated using the same PathSpecifier.
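As a rough illustration of the registry described above, the sketch below asks each registered provider whether it claims a specifier for a requested record type and returns the first match. It is not the code from either branch: all names (RecordReaderProviderSketch, SchemeProviderSketch, ReaderRegistrySketch) are hypothetical, a plain java.net.URI stands in for PathSpecifier, and "first registered provider wins" is an assumption.

```java
import java.net.URI;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical provider contract; java.net.URI stands in for PathSpecifier.
interface RecordReaderProviderSketch {
    // True if this provider claims it can render records of the given type.
    boolean canRead(URI specifier, Class<?> recordType);

    // Instantiates the reader from the same specifier that was used for lookup.
    AutoCloseable open(URI specifier);
}

// Example provider that claims exactly one scheme for one record type.
final class SchemeProviderSketch implements RecordReaderProviderSketch {
    private final String scheme;
    private final Class<?> recordType;

    SchemeProviderSketch(String scheme, Class<?> recordType) {
        this.scheme = scheme;
        this.recordType = recordType;
    }

    @Override public boolean canRead(URI specifier, Class<?> requested) {
        return scheme.equalsIgnoreCase(specifier.getScheme()) && recordType.equals(requested);
    }

    @Override public AutoCloseable open(URI specifier) {
        return () -> { }; // placeholder reader with nothing to release
    }
}

final class ReaderRegistrySketch {
    private final List<RecordReaderProviderSketch> providers = new CopyOnWriteArrayList<>();

    void register(RecordReaderProviderSketch provider) {
        providers.add(provider);
    }

    // Returns the first registered provider that claims the specifier, if any.
    Optional<RecordReaderProviderSketch> resolve(URI specifier, Class<?> recordType) {
        return providers.stream()
                .filter(p -> p.canRead(specifier, recordType))
                .findFirst();
    }
}
```

A caller would then do something like registry.resolve(spec, Read.class).map(p -> p.open(spec)), which keeps the originating specifier available to the reader for error reporting and sibling resolution.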

Having said all that:

  • Using something more strongly typed than "string" as the "identifier" currency, one that maintains the invariant that there is always a scheme, is super useful. It enables finding a matching plugin, finding sibling files, etc.
  • We should try to preserve the "raw input string -> URI" capability present in the other PR; there are many cases that make this challenging (see all the test cases in my branch), and we should provide a single, canonical way to do that transformation.
  • The actual "readers" (by which I mean something like a SAMReader) will need to know the originating identifier/source, so they can use it for error reporting, sibling resolution, etc. I also imagined they would need some service that can be used to turn a PathSpecifier, which is an identifier, into something more concrete, such as a stream. The "IOResourceProvider" service in this branch is basically that middle layer. So I think it should be possible to reconcile these two branches.
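For the "raw input string -> URI" point, a much-simplified sketch of the invariant (the result always has a scheme) could look like the following. It is not the PathSpecifier code from the other branch and covers only a few of the cases mentioned there; the class name SpecifierSketch, the scheme pattern, and the rule that anything without a multi-letter scheme is a local path are all assumptions.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper; not the PathSpecifier implementation.
final class SpecifierSketch {
    // At least two scheme characters, so "C:" is read as a drive letter.
    private static final Pattern SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]+:");

    private SpecifierSketch() { }

    // Converts raw user input into a URI that is guaranteed to have a scheme.
    static URI toUri(String raw) {
        if (SCHEME.matcher(raw).find()) {
            try {
                return new URI(raw);
            } catch (URISyntaxException e) {
                // Looks like a scheme but is not a valid URI: treat it as a path below.
            }
        }
        // No usable scheme: resolve against the default file system, which
        // yields a URI with the "file" scheme.
        return Paths.get(raw).toAbsolutePath().toUri();
    }
}
```

With this sketch, toUri("sra:SRA011223") keeps the sra scheme, while a relative path such as reads/sample1.cram comes back as an absolute file: URI.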
