TripleSerializer treatment of prefixes #12

Open
andrewufrank opened this issue May 29, 2014 · 26 comments

@andrewufrank

The documentation gives the impression that the conversion from triples to an RDF graph (e.g. TripleGraph) will handle the prefixes defined in the namespace mappings. In my tests (and my perusal of the code) this seems not to be the case.
I suggest updating the documentation accordingly (or implementing the mapping of prefixes).
Thank you for the very useful code!
andrew frank

@robstewart57
Owner

Duly noted, thanks. I will take a look at this in a few weeks.

@cordawyn
Collaborator

cordawyn commented Jun 4, 2014

@robstewart57 I have some time to spare, I can take a look at it in the meantime.

@robstewart57
Owner

@cordawyn That'd be great!

@cordawyn
Collaborator

cordawyn commented Jun 7, 2014

So here's a summary of the issue:

RDF graphs that are created by parsers have their "namespaced" nodes expanded into full URIs. However, if an RDF graph is created manually (i.e. by building sets of Triples and using mkRdf directly), no conversion is performed (@andrewufrank, could you confirm that this is how you came across this issue?). Btw, the same applies to relative URIs - they are not resolved against the baseURI (but should be).

I tried to hook an automatic namespace expansion to mkRdf (and it worked), but there is a performance issue with it:

  1. Since mkRdf is the first place where I can get hold of all PrefixMappings, I'm getting all Triples already built by that time. I have to "unpack" them, detect and expand UNodes with URIs containing namespaces, wrap it all back into UNodes and Triples.
  2. mkRdf is used by parsers, so this results in a redundant call to the expansion routine and dramatically increases the RDF building time. As I mentioned above, parsers already perform that expansion on their own.

So here's what I suggest to do about it:

  1. Give up on "automagic" namespace expansion when building RDF graphs directly via mkRdf (besides, we won't be able to trigger this expansion if TripleGraph is built directly via its constructor, anyway).
  2. Note it in the docs that UNode created manually should have fully-qualified, absolute URIs (so that users don't expect UNode("rdf:type") and UNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") to be equal). Let users take care of the namespace expansion themselves.
  3. Optionally, provide a function to create a new RDF graph by remapping the namespaced graph triples into "expanded" ones (a rough sketch follows this list).
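
To make point 3 concrete, here's a rough sketch of what such a remapping function could look like. Note that the Node / Triple / PrefixMappings types below are simplified stand-ins for illustration only - the real rdf4h constructors and signatures differ, so treat this as a sketch rather than a patch:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as Map
import           Data.Text (Text)
import qualified Data.Text as T

-- Simplified stand-ins for rdf4h's types, for illustration only.
data Node   = UNode Text | OtherNode Text deriving (Eq, Show)
data Triple = Triple Node Node Node       deriving (Eq, Show)
type PrefixMappings = Map.Map Text Text   -- prefix -> namespace URI

-- Expand e.g. "rdf:type" into the full URI when the prefix is known;
-- leave the node untouched otherwise (full URIs such as "http://..."
-- pass through, because "http" is not a declared prefix).
expandNode :: PrefixMappings -> Node -> Node
expandNode pms (UNode uri) =
  case T.breakOn ":" uri of
    (prefix, rest)
      | not (T.null rest)
      , Just ns <- Map.lookup prefix pms -> UNode (ns <> T.drop 1 rest)
    _ -> UNode uri
expandNode _ n = n

expandTriple :: PrefixMappings -> Triple -> Triple
expandTriple pms (Triple s p o) =
  Triple (expandNode pms s) (expandNode pms p) (expandNode pms o)

main :: IO ()
main = print (expandNode rdfPrefix (UNode "rdf:type"))
  -- prints: UNode "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
  where rdfPrefix = Map.fromList [("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#")]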

I also ponder refactoring the whole RDF graph building routine, so as to make it possible to add triples to it iteratively (and perform namespace expansion as they're added). Perhaps, wrap it in some "RDFGraph" (writer) monad and make it work in a similar fashion to building SPARQL queries in hsparql. Sounds cool to me, but this is a big overhaul to the whole rdf4h package, esp. the parser code.
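
For what it's worth, here's a very rough sketch of that builder idea, using mtl's Writer to accumulate triples. Again, the types are simplified stand-ins rather than the rdf4h API; a real version would also thread PrefixMappings through so that expansion can happen as each triple is added:

{-# LANGUAGE OverloadedStrings #-}
import Control.Monad.Writer (Writer, execWriter, tell)
import Data.Text (Text)

data Node   = UNode Text deriving (Eq, Show)   -- simplified stand-in
data Triple = Triple Node Node Node deriving (Eq, Show)

type GraphBuilder = Writer [Triple]

-- Namespace expansion could be hooked in here, one triple at a time.
addTriple :: Node -> Node -> Node -> GraphBuilder ()
addTriple s p o = tell [Triple s p o]

exampleTriples :: [Triple]
exampleTriples = execWriter $ do
  addTriple (UNode "http://www.example.com/foo")
            (UNode "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
            (UNode "http://www.example.com/Bar")

main :: IO ()
main = mapM_ print exampleTriples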

So, what are your thoughts on this?

@robstewart57
Owner

Hi,

RDF graphs that are created by parsers have their "namespaced" nodes expanded into full URIs. However, if an RDF graph is created manually (i.e. by building sets of Triples and using mkRdf directly), no conversion is performed.

I agree that this is confusing and will probably lead to unexpected differences between RDF graphs.

Since mkRdf is the first place where I can get hold of all PrefixMappings, I'm getting all Triples already built by that time. I have to "unpack" them, detect and expand UNodes with URIs containing namespaces, wrap it all back into UNodes and Triples.

That's less than ideal...

So here's what I suggest to do about it:

Give up on "automagic" namespace expansion when building RDF graphs directly via mkRdf (besides, we won't be able to trigger this expansion if TripleGraph is built directly via its constructor, anyway).

So you're suggesting that a triple passed to mkRdf should look like:

UNode("example:foo")  UNode("rdf:type") UNode("example:Bar")

And extracting this triple from the RDF graph after mkRdf should look exactly the same? I.e. not

UNode("http://www.example.com/foo")  UNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") UNode("http://www.example.com/Bar")

That's fair enough. However, I feel that the Triples returned by the following three type class methods should have their namespace prefixes fully expanded. That is, move prefix expansion from mkRdf to these three instead -- applied when triples are asked for.

triplesOf :: RDF rdf => rdf -> Triples
select    :: RDF rdf => rdf -> NodeSelector -> NodeSelector -> NodeSelector -> Triples
query     :: RDF rdf => rdf -> Maybe Node -> Maybe Node -> Maybe Node -> Triples

Note it in the docs that UNode created manually should have fully-qualified, absolute URIs (so that users don't expect UNode("rdf:type") and UNode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type") to be equal). Let users take care of the namespace expansion themselves.

Or what about splitting unode into two functions:

unode :: Text -> Node
unodePrefixed :: Text -> Text -> Node
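
Roughly like this - just a sketch of intent with placeholder definitions, since unodePrefixed doesn't exist yet, and this is only one possible reading of its semantics (namespace URI supplied up front, expansion done at construction time):

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)

newtype Node = UNode Text deriving (Eq, Show)   -- simplified stand-in

-- An already-absolute URI passes through untouched.
unode :: Text -> Node
unode = UNode

-- A prefixed name carries the namespace URI and the local part separately,
-- so expansion happens when the node is constructed.
unodePrefixed :: Text -> Text -> Node
unodePrefixed ns local = UNode (ns <> local)

main :: IO ()
main = print (unodePrefixed "http://www.w3.org/1999/02/22-rdf-syntax-ns#" "type"
              == unode "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")  -- True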

I also ponder refactoring the whole RDF graph building routine, so as to make it possible to add triples to it iteratively (and perform namespace expansion as they're added). Perhaps, wrap it in some "RDFGraph" (writer) monad and make it work in a similar fashion to building SPARQL queries in hsparql. Sounds cool to me, but this is a big overhaul to the whole rdf4h package, esp. the parser code.

I'm less certain that such a dramatic change would yield benefits big enough to justify the effort. Feel free to try an implementation of an RDFWriter monad :-)

So, I'm suggesting:

  1. We move namespace expansion from mkRdf to triplesOf, select and query.
  2. We separate unode into unode (e.g. for "http://www.w3.org/1999/02/22-rdf-syntax-ns#type") and unodePrefixed (e.g. for the namespace "http://www.w3.org/1999/02/22-rdf-syntax-ns#" plus the local name "type").

Thoughts?

@robstewart57
Owner

I should add: to have far greater confidence in this library, I really think we should add the W3C RDF test suite. That is, add the test files under data/ and add them as HUnit tests. See http://www.w3.org/TR/rdf11-testcases .

@cordawyn
Collaborator

cordawyn commented Sep 6, 2014

Okay, I'll start with adding the tests first, then switch to the namespace expansion issue.

As a programmer, though, I'd prefer two nodes to compare equal, whether namespaced or expanded, whenever we have to compare such instances:

(unode "rdf:type") == (unode "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

Having two node constructors (unode and unodePrefixed) sort of solves the problem, but what we would actually be doing is moving the "namespaced node detection routine" out of the scope of our lib and into the lib user's hands. I'd rather this functionality stayed with our lib, to keep things uniform across packages that might use rdf4h.

As a side note: after re-examining the code in TriplesGraph.hs#mkRdf' (and seeing what the comments say), I now realize that your approach (to move the expansion from mkRdf to the other 3 functions) was implied by the original author. We'll just have to keep that in mind when adding new functions that work with nodes, and remember to add node expansion where needed.

So, to summarize, I'm fine with all of your suggestions, but will try an improved solution based on them ;-)

@robstewart57
Owner

I'll start with adding the tests first

Great, thanks. I've added you as a collaborator so you should have direct push access.

I now realize that your approach (to move the expansion from mkRdf to the other 3 functions) was implied by the original author.

OK, does that give us enough confidence then to go ahead and do this?

@cordawyn
Collaborator

cordawyn commented Sep 6, 2014

I now realize that your approach (to move the expansion from mkRdf to the other 3 functions) was implied by the original author.

OK, does that give us enough confidence then to go ahead and do this?

I haven't completely given up on the alternative yet, but I think your approach would work just fine for now. We can start implementing your suggestion, yes.

If you want to start right now, feel free to go ahead. Otherwise, I can get down to coding in about a day.

@robstewart57
Owner

I'm travelling over the next few days, so please feel free to get started adding the W3C tests and deferring namespace expansion.

@andrewufrank
Author

Dear Rob,

Thank you for your reply.
I have since changed more, as I worked on an RDF editor (GUI) and I have to rethink the design - the original intention and what practical requirements arise.
I will be back to you in a week or so.

andrew


@robstewart57
Owner

@cordawyn Hi, how are the W3C unit tests coming along? Do we fail most of them, or pass most of them?

@cordawyn
Collaborator

@robstewart57 I got distracted from RDF4H a few weeks ago, so I haven't finished writing that stuff yet. I made a few fixes to our parser to actually read the manifest files that list the tests, define the expected results and so on. You can take a look at my w3tests branch here: https://github.com/cordawyn/rdf4h/tree/w3tests -- there are several commits there. I'll definitely get back to coding the tests, but cannot tell when exactly. If you feel like continuing, you can get that branch for yourself.
Sorry for the lack of updates.

P.S.: As a side note, those fixes are not in line with the "written" W3C specs, but W3C actually has unit tests specifically for those exceptions. So now we should be passing 2 more W3C unit tests while also being able to read/parse their manifest files ;-)

@cordawyn
Collaborator

Actually, I think I'll get back to it tomorrow or so. I've just recalled that I was getting to "the good parts" there :-)

@andrewufrank
Author

Thank you for the progress report. I have looked at the code and made some (local) changes which I use, but I am not certain whether they are of general benefit. I work (intermittently) on a project which uses rdf4h, but progress is slow and requirements emerge even more slowly.


@robstewart57
Owner

@cordawyn Your w3tests branch looks like a great start 👍

@andrewufrank I'd be interested to hear about your use case for rdf4h.

I notice that @jutaro has forked to create commit df671af. @jutaro, is there anything else you'd like to commit before I merge? It'd be nice to avoid duplicating code contributions.

@cordawyn
Collaborator

@robstewart57 As a side task for this branch (while I'm busy coding the actual tests), I'm thinking about "automatic" fetching of the W3C test files. They take up about 6 MB and I don't think we should place them in the repo (they are probably updated occasionally on the W3C server, too). My tests expect them to be placed in the local data/w3c dir, so we could write some code to fetch them from the W3C servers.

I don't know if there's a blessed way to do this in Haskell test suites, but I would probably do it as follows:

Running cabal test would raise an exception and say something like "Missing data files for W3C test suite. Please run xyz to fetch them, then launch cabal test again."

This xyz thing is something like make fetch_w3c_files or rake test:fetch:w3c (for those coming from the Ruby world), which just launches "wget" or "curl" to fetch the files from W3C and place them where needed. I believe "cabal" should have that ability too. Perhaps cabal test --test-option=fetch_w3c is the right way to introduce auxiliary build tasks? As an alternative, we could just write a bash script to perform the downloading.

Re-launching cabal test would now detect the files and start the tests normally.

I hope I don't sound too confusing ;-) Does this all make sense?

@cordawyn
Collaborator

I've completed the test suite for the Turtle parser. Here are the results:

         Test Cases    Total
 Passed  177           177
 Failed  114           114
 Total   291           291

The updated test suite code is available in the "w3tests" branch.
I'll continue working on the remaining 4 test suites from W3C.

@robstewart57
Owner

@cordawyn Wow, it looks like the effort to include the W3C tests has been vindicated :-) 114 failing tests out of 291 leaves room for improvement.

RE: the 6 MB of test data, it looks like we have two choices:

  1. We separate the tests out into a cabal project called rdf4h-tests, maintained in a separate repository. The main advantage of this is that cabal install rdf4h doesn't pull in 6 MB of test data. The disadvantage is that a user must know about the separation in order to run the tests, and also that rdf4h would no longer be self-contained. This practice is used with the Cloud Haskell code base:

https://github.com/haskell-distributed/distributed-process-tests
https://github.com/haskell-distributed/network-transport-tests

  2. In testsuite/tests/Test.hs, before defaultMain, we create a data/w3c/ directory and download the W3C test files if the directory doesn't exist. Then, in defaultMain, we run all the W3C unit tests (a minimal sketch of this follows the list). The main advantage is that rdf4h would be self-contained -- the library together with the test suite. The disadvantage is that test suites should not rely on external resources, i.e. an internet connection to obtain the W3C unit test files.
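
A minimal sketch of that second option, assuming an external curl is on the path; the URL and archive name below are placeholders rather than the real W3C file layout:

import Control.Monad (unless)
import System.Directory (createDirectoryIfMissing, doesDirectoryExist)
import System.Process (callProcess)

-- Fetch the W3C test files into data/w3c/ unless they are already present.
fetchW3CTestFiles :: IO ()
fetchW3CTestFiles = do
  present <- doesDirectoryExist "data/w3c"
  unless present $ do
    createDirectoryIfMissing True "data/w3c"
    -- Placeholder URL: the real test archives would be listed explicitly.
    callProcess "curl"
      [ "-L", "-o", "data/w3c/TurtleTests.tar.gz"
      , "http://www.w3.org/2013/TurtleTests/TESTS.tar.gz" ]

main :: IO ()
main = fetchW3CTestFiles   -- in Test.hs this would run just before defaultMain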

Which do you think we should opt for?

@andrewufrank
Author

Dear Robert,

I prefer the first route -

  1. I have multiple computers and run tests only on one (if at all - often I just trust, not a good policy..). Tests often pull in a large amount of data (as here) and sometimes introduce new dependencies (which in turn draw in others).
  2. Some of my machines are ARM architectures (Cubieboard/Cubietruck) and there not all of GHC is available (no Template Haskell), which is often used in tests. In such cases I have to manually cut the tests out of the cabal files to complete the installation.

I think a clear message in the cabal description pointing to the test suite as a separate installation is the most appropriate step.

Thank you for the effort to improve rdf4h!

andrew


@cordawyn
Collaborator

@robstewart57 I don't like the idea of having the W3C data files in the repository. I believe we need a dedicated task (or a shell script) to download them, regardless of how we arrange the test suite - together with the main project or separately.

As for the 1st choice that you suggest, it sounds like it is in line with the policy of having -dev or -devel packages on Linux, which is probably a good thing. However, I'm afraid that the test suite will start lagging behind the main package as soon as they are separated. Because, you know -- humans ;-)

Do we have some "good practices" in Haskell to address these issues (test data files and separation of core and test packages)? I'll have to ask around.

@andrewufrank I think it is possible to avoid downloading and building test dependencies by running cabal as cabal install <package> --disable-tests; you don't need to edit the .cabal file.

@cordawyn
Collaborator

@robstewart57 Breaking news! Our TurtleParser failed to parse http://www.w3.org/2013/rdf-mt-tests/manifest.ttl (RDF 1.1 Semantics). I'll create a separate issue for this (so that someone can fix it while I'm busy), and will move on to the other W3C tests.

There are also (minor, so far) errors in the "manifest.ttl" files; I'm sending reports to W3C to get them fixed (if anyone has quicker r/w access to those test suite files, please help ;-) ).

@robstewart57
Owner

So now we're propagating bug reports up to W3C? Success!

@cordawyn
Collaborator

All thanks to the "strictly typed power" of Haskell ;-)

Anyway, I'm somewhat done with the tests for now. We're running 2 out of 6 test suites from W3C: Turtle tests and RDF/XML tests. The remaining issues are:

  • TriG tests (not applicable to RDF4H)
  • N-Triples tests (we fail to parse their manifest.ttl)
  • N-Quads tests (we fail to parse their manifest.ttl too)
  • RDF Semantics (we parse manifest.ttl but build incorrect tree, see above)

RDF/XML test results are:

         Test Cases    Total
 Passed  40            40
 Failed  122           122
 Total   162           162

I added the parsing issues to our bug tracker. I cannot continue with the Haskell part of the test suites until they're fixed, but I'll add a task to our Makefile to download the W3C data files.

Btw, we should probably switch to a dedicated issue for this W3C test suite before we go off-topic any further ;-) Here it is: #17

@cordawyn
Collaborator

cordawyn commented Dec 3, 2014

I'm back on the original issue. So I'm going to add namespace expansion to the select, query and triplesOf functions. Stay tuned.

@cordawyn
Collaborator

@andrewufrank Several tests and trials later, we've come to the agreement that URI expansion (expanding prefixes and resolving relative URIs) and removing duplicates (if necessary) should be performed by library users and will not be part of the graph implementations in RDF4H. We're going to export the functions necessary for that and update the README with a note and examples.

Stay tuned for the next update!
