OwlLoadTask should always work #45
Comments
First, I want to confirm that I understand the problem you described.
If this corresponds to the problem you described, I can explain the rationale behind this behavior. When developing complex query/reduction workflows and the only changes are to ...

The case you describe involves a situation where only the SPARQL endpoint changes. Unfortunately, there is no easy way to tell from the client side that we're working with a different SPARQL endpoint. There are two ways to deal with this:
See: https://docs.gradle.org/current/userguide/command_line_interface.html#sec:rerun_tasks
Unfortunately, checking whether a graph needs to be loaded is an expensive I/O operation.
Unless you can suggest a cheaper alternative to option (2), I recommend following option (1). The main caveat with the second approach is ...
I think the task is too conservative and really doesn't work well for users who are not changing models, but instead may be working solely on reasoned data in the database. Instructing Gradle to rerun tasks is overkill. We really don't want to run all the tasks from point 0 just to reload data.

Moving away from a load based on cache status should be easy. At worst, you issue a drop all command. That will clean out existing data from the database. Then load. At best, no resources are spent on drop all. I would not worry about trying to validate what content is in the database. I agree that is a waste of time and bound to be wrong in some scenarios.

I'm not sure that I follow your argument about I/O operations as it relates to query development. Every CAESAR workflow I've seen breaks queries apart from load. Thus, if you are iterating through a dev cycle, it's quite easy to churn through many queries without ever having to load. We do that today. The problem is that if you need to load data and the model doesn't change, then the task is brain dead. Why do we need to load data? It's a different day, machines reboot, processes get killed, etc.

In short, a load should work independently of a somewhat suspicious cache state that relies on a changing model. Again, our workflow is centered on working with reasoned data that may arguably not change for days or weeks on end. We're exploring how to leverage that data, but we're not in the business of creating or modifying any part of the source OML in that project. Our only recourse today is to throw away everything (i.e. similar to rerun) and force the entire task chain to fire just to reload the same data. This is expensive and a waste of time.
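The "drop all, then load" sequence suggested above can be sketched as a list of SPARQL 1.1 Update operations. This is only an illustration in Python, not how OwlLoad is implemented: the function name `build_reload_updates` and the graph and file IRIs are hypothetical. `DROP ALL` and `LOAD ... INTO GRAPH` are standard SPARQL 1.1 Update operations.

```python
def build_reload_updates(graphs):
    """Build the update operations for an unconditional reload.

    `graphs` maps a named-graph IRI to the IRI of the document to load
    into it. Both are hypothetical values chosen by the caller.
    """
    # Start from a clean dataset, regardless of what the endpoint holds.
    updates = ["DROP ALL"]
    # Then load every graph, without consulting any cache state.
    for graph_iri, source_iri in sorted(graphs.items()):
        updates.append(f"LOAD <{source_iri}> INTO GRAPH <{graph_iri}>")
    return updates


if __name__ == "__main__":
    example = {
        "http://example.com/vocabulary": "file:///build/owl/vocabulary.owl",
    }
    for update in build_reload_updates(example):
        print(update)
```

Because the list always begins with `DROP ALL`, the result does not depend on whether the endpoint was restarted or already holds data, which is the property argued for in this comment.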
In Gradle 7.6+, there is a --rerun option that forces a single task to rerun:

(base) MT-319039:kepler16b-example elaasar$ ./gradlew clean
BUILD SUCCESSFUL in 599ms
(base) MT-319039:kepler16b-example elaasar$ ./gradlew owlQuery --console=verbose
BUILD SUCCESSFUL in 4s
KILL FUSEKI and RESTART IT WITHOUT STOP/START FUSEKI
(base) MT-319039:kepler16b-example elaasar$ ./gradlew owlLoad --console=verbose
BUILD SUCCESSFUL in 601ms
(base) MT-319039:kepler16b-example elaasar$ ./gradlew owlLoad --console=verbose --rerun
BUILD SUCCESSFUL in 1s
The above solution will work regardless of whether Fuseki was started with startFuseki/stopFuseki. However, for the cases where startFuseki/stopFuseki are used, you can modify owlLoad to force a rerun whenever startFuseki reruns.
What is OwlLoad optimizing for? I don't quite understand what use case is being facilitated with the current design.

Let me address the suggestion to use --rerun first. Gradle v7.6+ is not a requirement to use CAESAR as far as I know. Availability of newer Gradle command line options should not be used to overcome a design deficiency in a custom task.

Secondly, how, or whom, is the cache dependency helping?

In the CI/CD case: Taking away the cache dependency would not hurt this case at all. In fact, removing it would probably help/simplify that workload.

In the iterative query development case where the OML regularly changes: This makes some sense, but the clear assumption here is that the use case is centered on a changing model. In this case, the cache dependency doesn't really help.

In the iterative query development case where the model does not change: OwlLoad is quite broken. The difference between this case and the one above is time plus the lack of model changes. If the model doesn't change, then we're at the crux of the problem. Time, as I've described before, really comes into play, as iterations on the db side of the process may require days or weeks of effort with the same reasoned data.

A load task should load. That should be the default. At best, the design should include a property that tells the task you would like to consider an optional cache dependency, but the default should not use it. In all my years of working with db projects, I've never worked with a toolset where "load" might not work because it was cache-centric. Such behaviors are managed before the call to "load".

The cache dependency should be removed, as that will better fit all cases and actually make it easier to use custom Gradle tasks. For example, if I want to use existing files and completely remove the dependsOn characteristics, the following should always work.
However, this won't suffice because of the additional "smarts" that the task attempts to enforce. Users who don't alter models shouldn't be forced to run the same tasks over and over. Some builds take the better part of 15-20 minutes, rely on VPN connections, etc. The use case here is that the project has been built and all the reasoned data is sitting there waiting to be deserialized. I would recommend simplifying this task.
If the decision is to leave this as-is, let me know and I will close this out.
Several changes are proposed to OwlLoad by this PR to address the above issue and improve it in other ways:
Incrementality
Performance
Others
@mprather would you like to check out branch
Description
OML Gradle tasks attempt to use caching to prevent repeated work; this is desired in most cases. However, OwlLoadTask can at times do nothing. This is problematic if you are using an in-memory database and you want to load data. The task will often look at its dependencies, determine that no change was made, and then skip the actual load.
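The failure mode described above can be reproduced with a small model. This is a minimal sketch in Python under stated assumptions, not the actual OwlLoadTask logic: `InMemoryDatabase`, `CachedLoadTask`, and the file name are hypothetical stand-ins. The point is that an up-to-date check based only on the input files cannot see that the database lost its contents.

```python
import hashlib


class InMemoryDatabase:
    """Hypothetical stand-in for an in-memory triple store."""

    def __init__(self):
        self.graphs = {}

    def restart(self):
        # An in-memory store loses all data when its process restarts.
        self.graphs.clear()


class CachedLoadTask:
    """Hypothetical load task that skips work when its inputs are unchanged."""

    def __init__(self):
        self._last_fingerprint = None

    @staticmethod
    def _fingerprint(files):
        digest = hashlib.sha256()
        for name in sorted(files):
            digest.update(name.encode("utf-8"))
            digest.update(files[name].encode("utf-8"))
        return digest.hexdigest()

    def run(self, files, db):
        fingerprint = self._fingerprint(files)
        # The check looks only at the inputs, never at the database.
        if fingerprint == self._last_fingerprint:
            return "UP-TO-DATE"
        db.graphs.update(files)
        self._last_fingerprint = fingerprint
        return "LOADED"


def demo():
    files = {"vocabulary.owl": "<triples>"}
    db = InMemoryDatabase()
    task = CachedLoadTask()
    results = [task.run(files, db)]    # first run loads the data
    db.restart()                       # database is killed and restarted
    results.append(task.run(files, db))  # inputs unchanged, load is skipped
    return results, len(db.graphs)


if __name__ == "__main__":
    print(demo())  # prints (['LOADED', 'UP-TO-DATE'], 0)
```

After the restart the second run reports that it is up to date while the database holds no graphs, which matches the "Nothing was loaded" result in the repro steps.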
Steps to Reproduce
The repro steps are scenario based.
Interim Result: everything works as planned/expected
Final Result: Nothing was loaded.
Expected Behavior
OwlLoadTask should not have logic that prevents it from calling the actual database load commands. When developing SPARQL queries or items that are database-centric, it is common for the project to remain static with respect to vocabularies and descriptions. This means we may go through long periods of time where the task sees no changes and hence will not load a database.