Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance #56

Open
wismill opened this issue Jul 23, 2018 · 4 comments
Open

Improve performance #56

wismill opened this issue Jul 23, 2018 · 4 comments

Comments

@wismill
Copy link
Contributor

wismill commented Jul 23, 2018

Now that rdf4h has a complete support for NTriples & Turtle, it may be a good time to focus on performance:

  • Parsers
  • Graph implementation
  • IRI handling

As mentioned in #35 and #44, there are several places where we could improve the parsers. I think it would be a good idea to keep only one modern parser library (attoparsec or megaparsec) to keep the implementation simple and make it more efficient.

I think the handling of prefixes in UNode is not satisfying. For instance, several important operations require expandTriples which is very expensive. I propose that we remove expandTriples and make use of a smart constructor unode :: Text -> Either IRIError UNode for IRI (currently merely a constructor synonym) that ensure the IRI is a valid absolute IRI. Then have a function mkIRI that accept a namespace (or a prefix mapping using a new type class) to create IRIs, from a relative IRI or a prefixed IRI (see expandURI).

Edit: change the proposed signature of unode to use Either rather than Maybe.

@wismill
Copy link
Contributor Author

wismill commented May 22, 2019

@robstewart57 what do you think about this proposal?

I would like to keep only Megaparsec for the parsing, as it is now very fast and robust.

@robstewart57
Copy link
Owner

@wismill the idea of #36 was to generalise rdf4h across numerous parsers, specificlaly attoparsec and parsec. E.g.

-- |'NTriplesParser' is an instance of 'RdfParser' using parsec based parsers.
instance RdfParser NTriplesParser where
  parseString _  = parseStringParsec
  parseFile   _  = parseFileParsec
  parseURL    _  = parseURLParsec

-- |'NTriplesParser' is an instance of 'RdfParser'.
instance RdfParser NTriplesParserCustom where
  parseString (NTriplesParserCustom Parsec)     = parseStringParsec
  parseString (NTriplesParserCustom Attoparsec) = parseStringAttoparsec
  parseFile   (NTriplesParserCustom Parsec)     = parseFileParsec
  parseFile   (NTriplesParserCustom Attoparsec) = parseFileAttoparsec
  parseURL    (NTriplesParserCustom Parsec)     = parseURLParsec
  parseURL    (NTriplesParserCustom Attoparsec) = parseURLAttoparsec

-- |'NTriplesParser' is an instance of 'RdfParser' using parsec based parsers.
.

This functionality comes from this PR: #36

The motivation of this flexibility between parsers is that each of they have trade offs. E.g. "Megaparsec vs Attoparsec" : https://github.com/mrkkrp/megaparsec#megaparsec-vs-attoparsec

Attoparsec is sometimes faster but not that feature-rich. It should be used when you want to process large amounts of data where performance matters more than quality of error messages.

This is a realistic assumption when working with RDF data, which might be 10s/100s MegaBytes or millions of triples.

There's megaparsec instances in parsers-megaparsec that we could use: https://hackage.haskell.org/package/parsers-megaparsec

The issue arises when attoparsec, parsec and megaparsec instances of the typeclasses in the parsers have different parsing semantics (which presumably shouldn't happen), meaning the rdf4h tests against w2c-tests might pass for one instance of the parsers typeclasses, e.g. parsec, but fail for others, megaparsec.

@wismill
Copy link
Contributor Author

wismill commented May 31, 2019

Ok, but at least we could drop parsec and keep only attoparsec and megaparsec. Then we could use parser-combinators.

I think we should also provide a way to stream parsing results. As there are several framework for this, I wonder if we should provide new packages such as: rdf4h-pipes, rdf4h-conduit, rdf4h-streamly, etc.

@robstewart57
Copy link
Owner

@wismill

I think we should also provide a way to stream parsing results. As there are several framework for this, I wonder if we should provide new packages such as: rdf4h-pipes, rdf4h-conduit, rdf4h-streamly, etc.

I would also really like to see this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants