OwlLoadTask should always work #45

Open
mprather opened this issue May 10, 2023 · 7 comments

@mprather

Description

OML Gradle tasks attempt to use caching to prevent repeated work; this is desired in most cases. However, OwlLoadTask can at times do nothing. This is problematic if you are using an in-memory database and you want to load data. The task will often look at its dependencies, determine that no change was made, and then skip the actual load.

Steps to Reproduce

The repro steps are scenario based.

  1. Make a change to an OML project.
  2. Set up a Fuseki server to use a memory store (this could be a standalone instance or the Fuseki server that is typically managed by StartFusekiTask).
  3. Run tasks to build, load, and run a query.

Interim Result: everything works as planned/expected

  1. Stop the server. Since there are many different ways to do this, let's assume the server is not stopped by using StopFusekiTask.
  2. Start the server
  3. Run OwlLoadTask

Final Result: Nothing was loaded.

Expected Behavior

OwlLoadTask should not have logic that prevents it from calling the actual database load commands. When developing SPARQL queries or items that are database-centric, it is common for the project to remain static wrt vocabularies and descriptions. This means we may go through long periods of time where the task sees no changes and hence will not load a database.

@NicolasRouquette
Member

First, I want to confirm that I understand the problem you described.

  1. There is a SPARQL endpoint with no data loaded.
  2. Run OwlLoadTask: it loads the OWL data as expected.
  3. Restart the SPARQL endpoint; again, with no data loaded.
  4. Run OwlLoadTask again: it does not load anything.

If this corresponds to the problem you described, I can explain the rationale behind this behavior.

When developing complex query/reduction workflows and the only changes are to *.sparql files, we wanted to run a Gradle workflow and skip doing unnecessary work when the *.oml files have not changed, including loading the converted OWL files. To achieve this behavior, the Gradle OwlLoadTask uses Gradle's input caching mechanism to skip running the task if the inputs have not changed. This caching strategy is very effective as long as we keep the same SPARQL endpoint across multiple Gradle runs.

The case you describe involves a situation where we only change the SPARQL endpoint. Unfortunately, there is no easy way to tell from the client side that we're working with a different SPARQL endpoint.

There are two ways to deal with this:

  1. Tell Gradle to force executing the OwlLoadTask even if caching suggests skipping the task.

See: https://docs.gradle.org/current/userguide/command_line_interface.html#sec:rerun_tasks

  2. Change the logic of the OwlLoadTask so that it does not use Gradle's input caching mechanism and instead checks with the SPARQL endpoint if a graph needs to be loaded.

Unfortunately, checking whether a graph needs to be loaded is an expensive I/O operation.

  • In the simple case, the graph IRI does not exist; the graph needs to be loaded.
  • In the expensive case, the graph IRI exists and since we no longer use Gradle's input caching mechanism, we either need to delete/reload the graph or do something fancy like query the graph contents for a normalized representation, compute a checksum, and load the graph if the checksum of the local data does not match the checksum computed from querying the server.

With the second approach, OwlLoadTask becomes an expensive I/O operation. In the use case where we want to work quickly on elaborating queries/reductions, we'd have to pay a significant price in computing a bunch of server graph checksums to realize that we do not need to load the data. Unfortunately, this calculation effectively requires transmitting all graphs. So, if we have, e.g., 1Gb of data on the server, each time we run Gradle, we have to transfer 1Gb of data to compute these checksums only to realize that the server has the same data as what we have locally.

Unless you can suggest a cheaper alternative to option (2), I recommend following option (1).
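
As a point of reference for option (1), a single task's up-to-date check can also be disabled in the build script, which avoids rerunning the whole task graph the way --rerun-tasks does. The fragment below is a minimal sketch, not the project's configuration: it assumes the load task is registered under the name owlLoad (the name that appears in the console output later in this thread) and relies only on Gradle's standard outputs.upToDateWhen API.

```groovy
// build.gradle fragment (sketch).
// Assumption: a task named 'owlLoad' has already been declared above this point.
tasks.named('owlLoad') {
    // Never report this task as UP-TO-DATE, so every invocation performs the load,
    // while upstream tasks (conversion, reasoning) keep their normal caching.
    outputs.upToDateWhen { false }
}
```

With this in place, running ./gradlew owlLoad should execute the load each time, at the cost of also reloading when nothing on the server has changed.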

The main caveat with the second approach is

@mprather
Author

mprather commented Oct 20, 2023

I think the task is too conservative and really doesn't work well with users that are not changing models, but instead may be solely working on reasoned data in the database. Instructing gradle to rerun tasks is overkill. We really don't want to run all the tasks from point 0 just to reload data.

Moving away from load based on cache status should be easy. At worst, you issue a drop all command. That will clean out existing data from the database. Then load. At best, there are no resources spent on drop all. I would not worry about trying to validate what content is in the database. I agree that is a waste of time and bound to be wrong in some scenarios.

I'm not sure that I follow your argument about iops as it relates to query development. Every caesar workflow I've seen breaks queries apart from load. Thus, if you are iterating through a dev cycle, it's quite easy to churn through many queries without ever having to load. We do that today. The problem is that if you need to load data and the model doesn't change, then the task is brain dead. Why do we need to load data? It's a different day, machines reboot, processes get killed, etc. In short, a load should work independently of a somewhat suspicious cache state that relies on a changing model.

Again, our workflow is centered on working with reasoned data that may arguably not change for days/weeks on end. We're exploring how to leverage that data but we're not in the business of creating/modifying any part of the source oml in that project. Our only recourse today is to throw away everything (i.e. similar to rerun) and force the entire task chain to fire just to reload the same data. This is expensive and a waste of time.

@melaasar
Member

melaasar commented Oct 23, 2023

In Gradle 7.6+, there is a --rerun flag you can pass to force the tasks explicitly named when running gradle to rerun without forcing their dependencies to also rerun. See below for the scenario (I used --console=verbose to see whether each task ran or was up to date).


(base) MT-319039:kepler16b-example elaasar$ ./gradlew clean

BUILD SUCCESSFUL in 599ms
1 actionable task: 1 up-to-date


(base) MT-319039:kepler16b-example elaasar$ ./gradlew owlQuery --console=verbose

Task :downloadDependencies
Task :omlToOwl
Task :owlReason
Task :startFuseki
Fuseki server has now successfully started with pid=15860, listening on http://localhost:3030
Task :owlLoad
Task :owlQuery

BUILD SUCCESSFUL in 4s
6 actionable tasks: 6 executed


KILL FUSEKI and RESTART IT WITHOUT STOP/START FUSEKI


(base) MT-319039:kepler16b-example elaasar$ ./gradlew owlLoad --console=verbose

Task :downloadDependencies UP-TO-DATE
Task :omlToOwl UP-TO-DATE
Task :owlReason UP-TO-DATE
Task :startFuseki UP-TO-DATE
Task :owlLoad UP-TO-DATE

BUILD SUCCESSFUL in 601ms
5 actionable tasks: 5 up-to-date


(base) MT-319039:kepler16b-example elaasar$ ./gradlew owlLoad --console=verbose --rerun

Task :downloadDependencies UP-TO-DATE
Task :omlToOwl UP-TO-DATE
Task :owlReason UP-TO-DATE
Task :startFuseki UP-TO-DATE
Task :owlLoad

BUILD SUCCESSFUL in 1s
5 actionable tasks: 1 executed, 4 up-to-date

@melaasar
Member

The above solution will work regardless of whether Fuseki was started with start/stopFuseki. However, for the cases where start/stopFuseki is used, you can modify owlLoad to force rerunning whenever startFuseki reruns.

https://github.com/opencaesar/oml-template/blob/3086e4af076587680a02f93cbe107c8dc7d87dff/build.gradle#L140
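
For readers who cannot follow the link, one way to express "rerun owlLoad whenever startFuseki reruns" with standard Gradle APIs is sketched below. This is an illustration under assumptions and not necessarily what the linked line does: it assumes tasks named owlLoad and startFuseki exist, and that owlLoad runs after startFuseki in the same build.

```groovy
// build.gradle fragment (sketch; task names are assumed from the console output above).
tasks.named('owlLoad') {
    // Treat owlLoad as up to date only when startFuseki did no work in this build.
    // If startFuseki actually (re)started the server, the closure returns false
    // and owlLoad executes again.
    outputs.upToDateWhen { !tasks.getByName('startFuseki').didWork }
}
```

Note that this only helps when the server is managed through startFuseki. If Fuseki is killed and restarted outside Gradle, startFuseki is reported UP-TO-DATE and owlLoad is still skipped, as the console output above shows.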

@mprather
Author

What is OwlLoad optimizing for? I don't quite understand what use case is being facilitated with the current design.

Let me address the suggestion to use --rerun, first. Gradle v7.6+ is not a requirement to use CAESAR as far as I know. Availability of newer Gradle command line options should not be used to overcome a design deficiency in a custom task.

Secondly, how or who is the cache dependency helping?

In the CI/CD case: Taking away the cache dependency would not hurt this case at all. In fact, removing this probably would help/simplify that workload.

In the iterative query development case where the oml regularly changes: This makes some sense but the clear assumption here is that the use case is centered on a changing model. In this case, cache dependency doesn't really help.

In the iterative query development case where the model does not change: OwlLoad is quite broken. The difference between this case and the one above is time + lack of model changes. If the model doesn't change, then we're at the crux of the problem. Time, as I've described before, really comes into play as iterations on the db side of the process may require days/weeks of effort with the same reasoned data.

A load task should load. That should be the default. At best, the design should include a property that instructs the task to consider an optional cache dependency, but the default should not utilize it. In all my years of working with db projects, I've never worked with a toolset where "load" might not work b/c it was cache-centric. Such behaviors are managed before the call to "load".

The cache dependency should be removed as that will better fit all cases and actually make it easier to use custom gradle tasks. For example, if I want to use existing files and completely remove the dependsOn characteristics, the following should always work.

task JustLoadIt (type:io.opencaesar.owl.load.OwlLoadTask) {
  // all other properties remain the same as the version that utilizes a typical dependsOn: reasonerTask
}

However, this won't suffice because of the additional "smarts" that the task attempts to enforce. Users that don't alter models shouldn't be forced to run the same tasks over and over. Some builds take the better part of 15-20 minutes, rely on vpn connections, etc. The use case here is that the project has been built and all the reasoned data is sitting there waiting to be deserialized.

I would recommend simplifying this task.

  1. The "load" task should always load. Remove the caching behavior. If the files are available, then the task should load the database. If not, then signal that upstream tasks should fire.
  2. Don't rely on a set of command line options from a tool version that is not required.
  3. Don't assume that start and stop fuseki are used (our team never uses those tasks nowadays because of a seemingly fragile nature on dev machines and past quirky interactions with load) and therefore the load task can be cleanly separated from those tasks.
  4. Optionally add a property to enable cachiness if there is a use case that absolutely needs it. This option should not be on by default.

If the decision is to leave this as-is, let me know and I will close this out.

@melaasar
Member

melaasar commented Nov 7, 2023

Several changes are proposed to OwlLoad by this PR to address the above issue and improve it in other ways:

Incrementality

  • The task has been changed to non-incremental by default. This means it will (re)load all named graphs in the dataset and will delete any ones that should not be there.
  • Added a new boolean flag incremental which can be set to true in the gradle task to request incremental behavior, meaning to limit the named graphs that are processed to those deemed changed by Gradle (based on timestamp). It will reload them only if they are not already loaded. This should be faster than non-incremental.

Performance

  • Added a new parameter irisPath as an alternative to the existing iris parameter. While the iris parameter allows providing the root iris of the dataset, it forces loading them in memory to determine their import closure. The new irisPath parameter specifies a path to an iris.txt file that contains a list of all iris (in the import closure) already computed. This should avoid loading the ontologies in memory by owlload (Fuseki will still load the ontologies uploaded to it).
  • The iris.txt file can be generated by a new parameter on the owlReason task called outputOntologyIrisPath.
  • Owlload's incremental flag will be set to true, and the irisPath parameter will be configured, in the oml-template repo, all existing vocabulary/example repos in openCAESAR, and in new project wizard in Rosetta.
  • Reimplemented how OwlLoad queries loaded graphs and counts their triples to avoid downloading the dataset from Fuseki.
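
The settings described above might be wired together roughly as follows. This is a hedged sketch based only on the parameter names given in this comment (incremental, irisPath, outputOntologyIrisPath); the task names, the file location, and the assumption that the path parameters accept a File value are illustrative and may differ in the actual plugin.

```groovy
// build.gradle fragment (sketch). Only the parameters discussed above are shown;
// all other task properties are assumed to be configured elsewhere.
// The iris.txt location below is hypothetical.
def irisFile = file('build/owl/iris.txt')

tasks.named('owlReason') {
    // Ask the reasoning task to write the import closure of ontology IRIs to a file.
    outputOntologyIrisPath = irisFile
}

tasks.named('owlLoad') {
    // Opt in to incremental loading; per the comment above, the default is non-incremental.
    incremental = true
    // Read the precomputed IRIs instead of loading ontologies in memory to derive them.
    irisPath = irisFile
}
```

Leaving incremental unset would keep the proposed default, where every run reloads all named graphs in the dataset.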

Others

  • Added a queryService param to both OwlLoad and OwlQuery (with a default of sparql), and a shaclService param to OwlShacl (with a default of shacl), to allow setting the short names of the query/shacl services to values that match those specified in the .fuseki.ttl file.
  • Added username and password parameters to OwlLoad to allow loading to endpoints protected by credentials. These new parameters are names of env vars that have the credentials values.
  • Added a loadToDefaultGraph parameter to OwlLoad to load the named graphs to the default graph instead of to correspondingly named graphs. Implemented a workaround to do this for quad graphs due to the lack of a Fuseki API for that.

@melaasar
Member

melaasar commented Nov 7, 2023

@mprather would you like to check out branch issue-55, build it locally and publish to Maven local, and try out whether it addresses your original issue?
