Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI and Path specifier classes for input/output resources. #34

Merged
merged 3 commits into from
Feb 13, 2019

Conversation

cmnbroad
Copy link
Collaborator

Not sure if there are better names or package locations.

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Oct 29, 2018

Arghh... the tests fail on Windows because I hardcoded "/" in my test path names. Are we really supporting Windows ?

// URIs with an embedded fragemnt delimiter ("#"); if the file scheme is included, the rest of the path
// must already be escaped; ifno file scheme is included, the path will be escaped by the URI class
{"../data/utils/testDirWith#InName/testTextFile.txt", "Test file."},
{"file:///" + getCWD() + "../data/utils/testDirWith%23InName/testTextFile.txt", "Test file."},
Copy link
Collaborator Author

@cmnbroad cmnbroad Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to resolve how to reference these paths without "..". Need to resolve #19.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you can use Paths.get(getCWD() + "../data/utils/testDirWith%23InName/testTextFile.txt").toUri().toString(), if the problem is with Windows. I can test if you want locally and let you know.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several problems here; the original one about where/how we should store test data, and the Windows one where the path separator is different. The latter has two parts 1) how we specify paths in tests 2) How we test Windows paths (UNC paths, dos drive specifiers, encoded names with embedded spaces etc).

Also, I know you're not yet convinced we need these classes, but for these tests in particular, I want to make sure the raw strings are passed directly to the classes being tested, not strings that have been pre-processed by Paths and URI, since I'm trying to test that the classes themselves do this correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there is the issue that Windows paths (with forward slashes) don't appear to be valid URIs ?

@magicDGS
Copy link
Member

@cmnbroad - I would like to support Windows, also because of the hardcoded "/" that are all around htsjdk2. One of the major points of supporting Windows is that, as we are using java, it will be nice to be multi platform (and I have a project in mind for a desktop app that I will test mostly in Windows). There is no much to do to support windows if we don't use hardcoded "/", as far as we are using java.nio.Path.

By the way, I will start the review now.

Copy link
Member

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit against the class in the first place because this looks more like a way to distinguish when a passed String should be considered a path or something else that should be handled with a different reader (record-provider) implementation. I think that handle URI-specific thinks like this should be done at the reader-factory level; we can provide a factory for GenomicDB that takes the URI with the scheme and handle a Path as if it has it (as a different artifact or in a different repository). Then GATK or other framework that requires to provide a common implementation for handling resources could do their own URI class, which I think that it's out-of-scope for htsjdk3 itself.

@@ -0,0 +1,86 @@
package org.htsjdk.core.api;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that a better place will be org.htsjdk.core.api.io, to keep things more organized.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/**
* Interface representing htsjdk-next input/output resources as URIs.
*/
public interface PathURI {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this will represent a resource, what's about IOResource? I'm not sure if it is better, but I don't like the PathURI name...

Copy link
Collaborator Author

@cmnbroad cmnbroad Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I don't love the names. IOResource might be an improvement - we'd still need a name for the default implementation class (currently called PathSpecifier for lack of a better name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to IOResource; still not sure about PathSpecifier.

/**
* @return true if this URI has a scheme that has an installed NIO file system provider. This does not
* guarantee the URI can be converted into a {@code java.nio.file.Path}, since the URI can be syntactically
* valid, and specify a valid file system provider, but still fail to be semantically meaningful.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to have properly formatted javadocs. Most of the information should be at the begining, and the @return tag should contain just the information about the return value (in this case, {@code true} if the URI has a installed NIO filesystem provider; {@code false} otherwise).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* guarantee the URI can be converted into a {@code java.nio.file.Path}, since the URI can be syntactically
* valid, and specify a valid file system provider, but still fail to be semantically meaningful.
*/
boolean isNIO();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the new disq, I don't like this isNio name in the method. I prefer something like hasFileSystem or throw a meaningful exception if a Path could not be retrieved to be catched by client code...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clients can ignore this and do exactly as you describe (just call toPath, which will throw).

*
* hdfs://namenode/to/file
*
* The current implementation returns false for these cases (toPath will fail, getInvalidPathReason
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no current implementation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@Override
public OutputStream getOutputStream() {
if (!isPath()) {
throw new HtsjdkException(getToPathFailureReason());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No raw HtsjdkException

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - perhaps we need a separate discussion about how we want to partition the exception namespace. #36.

try {
return Files.newOutputStream(resourcePath);
} catch (IOException e) {
throw new HtsjdkException(String.format("Could not open output stream for %s (as URI %s)", getRawInputString(), getURIString()), e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No raw HtsjdkException

@@ -0,0 +1,254 @@
package org.htsjdk.core.utils;

//TODO: NAMING: is there any useful distinction between Unit/Integration tests in this repo ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should set this naming conventions, as we did not though about that yet.

{"file:project/gvcf-pcr/23232_1#1/1.g.vcf.gz", "file:project/gvcf-pcr/23232_1#1/1.g.vcf.gz", true, false},
{"file:/project/gvcf-pcr/23232_1#1/1.g.vcf.gz", "file:/project/gvcf-pcr/23232_1#1/1.g.vcf.gz", true, false},

{"gendb://somegdb", "gendb://somegdb", false, false},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that this is the main reason to have the class in the first place (see the overall comment).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my overall comment describing the reasons for proposing these classes.

@@ -0,0 +1,2 @@
Test file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also decide on the naming conventions for test data. Also, as I commented before, it will be nice that the test-resources are located in the src to be packaged in the test artifact; otherwise, test won't run if the test jar is isolated.

@@ -30,6 +30,9 @@ subprojects {

dependencies {
testCompile "org.testng:testng:6.14.3"
testCompile "com.google.jimfs:jimfs:1.1"

compile 'commons-io:commons-io:2.5'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to mention: is this needed in this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - its left over from previous changes and can be removed.

@cmnbroad
Copy link
Collaborator Author

@magicDGS thanks for the review comments - I'll go through these individually, but I created #35 to discuss issues related to Windows since I think there are likely some other issues to resolve.

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Oct 31, 2018

I think there is value in having an interface thats more strongly typed than String, and more general than File or even Path, for specifying input/output resources in the new interfaces. Using URI as the specifier syntax is compatible with regular old local file paths, and allows for extensibility via query params and custom schemes. The classes proposed in this PR provide both an interface and a default concrete implementation that can handle most cases (local files and other NIO provider-based file systems), but which can also be extended via overrides to allow consumers to provide custom resolution to streams.

@magicDGS
Copy link
Member

I have a proposal for the IOResource and how to provide other implementations by downstream projects. I don't know the rest of @samtools/htsjdk-next-maintainers, but I prefer to keep it simpler (e.g., only java.nio.Path implementation) and allow better extensibility. The implementation in this PR looks a bit cumbersome.

In #42 I included a proposal; maybe we can discuss on that PR or just here. I did not add to the design a method for the getting the raw/string representation, but actually that could make sense to be able to retrieve it. I will copy-paste the generated HTML from #42 here to show what I propose.

@magicDGS
Copy link
Member

Sorry, I could only get a temporary PDF because I had some problems with the computer. But it should be enough for the discussion: IOResourceProvider.pdf

@lbergelson
Copy link
Member

@magicDGS I'm sorry I haven't gotten a chance to comment on this properly yet.

I think having a fairly lightweight interface that is essentially a more convenient URI is a good thing to have. My thought is that we have essentially 2 different things that need to be referred to by uri's of some sort.

  1. is a file system like thing that can be opened and read as a block of bytes. I think we should support these with pluggable nio Path providers.
  2. is something isn't readable as a block of bytes, these are things like SRI accessions and GenomicsDB. We will want another layer of plugins that can understand these.
    I think we want an interface below these that can be plugged into either layer as an identifier. Your proposal seems closer to what I'm imaging as the layer above this interface, the reader layer which will support both pluggable file systems and pluggable readers. Particularly, I think the concept of companion files is a really important one that has been poorly implemented in htsjdk currently, but I think we still need a path representation for a single file.

@magicDGS
Copy link
Member

@lbergelson - I am designing (like in my other comment, in my mind by now due to lack of time for personal reasons) the whole interface for IO in a similar way as #42 (that was just a proof-of-concept on how to write and propose this kind of things in a live document in GitHub, which will be nice for several reasons that we can discuss in Slack):

  1. Low-level IO: as the one that I described in Add docs project for technical documentation #42. This is the layer which contains how to read byte-data from different URIs or path. As I explain there, the main idea is to have a factory which provide different kind of resources. We can provide the one for java.nio.Path in the core, and then have other projects (in this repo or somewhere else) to provide on top of them a resource for GenomicsDB URIs or SRA accessions (or even configuration files with bundled data - resource/index/etc). For the purpose of an user, this is the layer that we don't expect to be implemented outside this couple of examples (or a new resource format, I don't know, based on HDF5 or something that crazy).
  2. Low-level record: by now, just an Record marker interface (I am thinking how to abstract record metadata properly for advance usage). This will be the base class for concrete records, such reads or variants. I am against using the Locatable as a record, because it might be used for other purposes too (e.g., FASTA-records, which are not locating themselves on themselves). We expect that this is just used as a marker but not really containing too much logic - it is the bridge between the low-level IO and the high-level IO to faciltate its implementation. This includes the codec interface, that requires only the concrete implementation.
  3. High-level IO: layer between the low-level IO and the low-level record. That will be like a record provider, including factories and factory-builders. This will be the main extensible point for API users, that could implement their own way to translate the low-level IO into records: from the byte-stream, parse the FASTQ/BED/SAM/BAM/CRAM or whatever format.
  4. High-level record. This are the specialized records that should be implemented, including the codecs.

The idea behind this design is:

  • An user implementing a new way of reading byte-data, should only implement a new resource that pass a stream to the codecs that is understandable (unless specific parsing is required).
  • An user interested in a new kind of file should only implement the record/codec and register it. They should not care about the byte data at all and from where it comes, just for parsing (except special cases).
  • Both could be combined for more complex data sources.

public class PathSpecifier implements IOResource, Serializable {
private static final long serialVersionUID = 1L;

private final String rawInputString; // raw input string provided by th user; may or may not have a scheme

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private final String rawInputString; // raw input string provided by th user; may or may not have a scheme
private final String rawInputString; // raw input string provided by the user; may or may not have a scheme


private final String rawInputString; // raw input string provided by th user; may or may not have a scheme
private final URI uri; // working URI; always has a scheme (assume "file" if not provided)
private transient Path cachedPath; // cache the Path associated with this URI if its "Path-able"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private transient Path cachedPath; // cache the Path associated with this URI if its "Path-able"
private transient Path cachedPath; // cache the Path associated with this URI if it's "Path-able"

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the name of progress I approve this PR.

@lbergelson lbergelson merged commit 5924ef3 into samtools:master Feb 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants