Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move MemoryOverflowModel and SailSourceModel to rdf4j-sail-base #4555

Open
kenwenzel opened this issue May 16, 2023 · 7 comments
Open

Move MemoryOverflowModel and SailSourceModel to rdf4j-sail-base #4555

kenwenzel opened this issue May 16, 2023 · 7 comments
Labels
📶 enhancement issue is a new feature or improvement

Comments

@kenwenzel
Copy link
Contributor

Problem description

Both, NativeStore and LmdbStore, use the classes MemoryOverflowModel and SailSourceModel with own implementations.

Preferred solution

Move MemoryOverflowModel and SailSourceModel to rdf4j-sail-base.

Are you interested in contributing a solution yourself?

None

Alternatives you've considered

No response

Anything else?

No response

@kenwenzel kenwenzel added the 📶 enhancement issue is a new feature or improvement label May 16, 2023
@github-project-automation github-project-automation bot moved this to 🆕 Triage in RDF4J Planning May 16, 2023
@hmottestad
Copy link
Contributor

We should make sure that we don't break backwards compatibility.

@abrokenjester
Copy link
Contributor

I personally think we should get rid of SailSourceModel entirely, as it has quite poor performance, and replace with a MapDB-backed model implementation.

@kenwenzel
Copy link
Contributor Author

I am not sure if MapDB will help here as we need multiple triple indexes and also some kind of value store. In the end we will be forced to implement something like NativeStore or LmdbStore.

@hmottestad
Copy link
Contributor

I'm not sure if we need all that much safety wise with the MemoryOverflowModel. We could overflow to a n-quads file. The ShaclSail wouldn't enjoy the performance hit though, since it would have to query the data in the current transaction which would mean reading from the n-quads file.

@kenwenzel
Copy link
Contributor Author

But this would require a linear scan to match different triple patterns like (S, _, ), (, P, ) or (, _, O).
As overflow is only triggered for large transactions it should be expected that the n-quads file is also rather large.
We could also experiment with Parquet files that support faster scanning by skipping row groups.

@hmottestad
Copy link
Contributor

hmottestad commented May 17, 2023

True, which is why it would be bad for the ShaclSail. In general I would assume that users who load in large files don't typically run queries before committing. We could try something like the DynamicModel, where we use a very simple and fast overflow until the user runs queries at which point we would migrate the data to a more advanced data structure.

@kenwenzel
Copy link
Contributor Author

@hmottestad I think we could improve the current situation:
#4557

Maybe you can help with the logic for switching to disk-based storage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📶 enhancement issue is a new feature or improvement
Projects
None yet
Development

No branches or pull requests

3 participants