Support for Crunch automation #395
I started working on this: https://github.com/Crunch-io/scrunch/tree/dataset-scripts
Great. In the docstring for …
Another question: is there any functional difference between reverting to a pre-script savepoint and reverting to any other savepoint? Are script savepoints simply a convenience wrapper around regular dataset savepoints, managed as part of the script execution process?
We've seen some users of automation run several single-command scripts, quickly accumulating hundreds of scripts. Each script carries the cost of a dataset savepoint to allow reverting, so even a one-line script takes some storage space. The expectation is to perform most (ideally all) transformations in one script. The revert-repeat cycle works well: you run the script and, if you don't like the results, you revert, adjust the script, and re-run. Reverting a script deletes its entry, so reverted attempts don't accumulate; the scripts list only contains successfully executed scripts.
Scripts generate a savepoint before executing. These savepoints are the same as all other savepoints in the system: if you go to your savepoints list and revert to a savepoint that was created by a script, you'll bring the world back to that point, same as with any other savepoint. Scripts are part of this world, so reverting also brings the scripts list back in time to that mark. Script entities do have a dedicated … We are still testing these behaviors; they can be confusing, so feedback on them will be helpful. Just to make this comment more complicated, scripts also expose an … I opted to only expose one of these implementations to spare users from this confusion.
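The revert semantics described above can be sketched with a toy model. This is plain Python, not the actual scrunch/Crunch API; every class and method name here is illustrative only:

```python
# Toy model of the described behavior: each script execution creates a
# savepoint first; reverting a script restores that savepoint and deletes
# the script's entry, so reverted attempts don't accumulate.
# NOT the scrunch/Crunch API -- all names are made up for illustration.

class ToyDataset:
    def __init__(self):
        self.rows = []          # stand-in for dataset state
        self.savepoints = []    # (state_snapshot, created_by_script)
        self.scripts = []       # only successfully executed scripts

    def run_script(self, body):
        # A savepoint is taken before execution, same as any other savepoint.
        self.savepoints.append((list(self.rows), True))
        self.rows.append(body)  # pretend the script transformed the data
        self.scripts.append(body)

    def revert_last_script(self):
        # Restore the pre-script savepoint and drop the script entry.
        state, _ = self.savepoints.pop()
        self.rows = state
        self.scripts.pop()

ds = ToyDataset()
ds.run_script("CREATE CATEGORICAL ...")
ds.run_script("RENAME ...")
ds.revert_last_script()
print(len(ds.scripts), len(ds.savepoints))  # 1 1
```

After the revert, only the first script remains in the list, matching the "the scripts list only contains successfully executed scripts" behavior above.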
OK, thanks for the details. We'll make sure our training covers this, minimizing the number of automations where possible by collecting all adjacent actions into a single script. Say a project had 40 countries to update in the master dataset every month. For each update, for each country, a script would be used, because the process would be: fork streaming > script > append to fork of master > merge back. Does this mean the master dataset would end up with 38+ scripts/savepoints each month? What if the update frequency increased to twice-monthly or even weekly (2-4x as many scripts being used per month)? What does the extreme end of this look like from your POV? What if the project went on for 5 years? Essentially, can you anticipate a point at which we'd encounter an issue, and what should we know/do from the beginning to mitigate any adverse effects? Is it possible to clean up savepoints/executed scripts some time in the future, once we know they're never going to be reverted again?
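The accumulation in question is easy to quantify. Using the scenario's own numbers (40 countries, one script per country per update, five years; these figures come from the comment above, not from any Crunch limit):

```python
# Back-of-the-envelope script/savepoint accumulation for the scenario above.
countries = 40
updates_per_month = 1          # bump to 2 or 4 for the higher frequencies asked about
months = 12 * 5                # a five-year project

scripts_per_month = countries * updates_per_month
total_scripts = scripts_per_month * months
print(scripts_per_month, total_scripts)  # 40 2400
```

At weekly updates the same arithmetic gives roughly 9,600 scripts (and pre-script savepoints) over five years, which is why the cleanup question matters.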
I believe yes: a script execution gets recorded as an action, so when you merge the fork, that action (and all the steps the script performed) will get replayed. You will end up with all the scripts from all the forks. The fact that a dataset has hundreds of executed scripts isn't a problem for the system, but it is a problem for the user, because of how hard it becomes to make sense of what's going on with so many scripts there. In the master tracker I suppose they can be ignored, since nobody would be reverting/re-running scripts there. The scripts API provides a …
Thanks @jjdelc. It might not make sense to collapse these scripts, because they couldn't be replayed all at once. Since automation scripts offer only intra-dataset functionality, they will be punctuated by inter-dataset actions as directed by other parts of scrunch (e.g. forking, merging back, joining, appending, comparing, etc.). I suppose looking only at a universe of scripts housed in a dataset ignores these kinds of actions.
@jamesrkg I need to make a correction: what I said above is incorrect. I ran some tests last night to verify how scripts work on forks and merges. It turns out that the action that records the script creation does not get replayed when the fork gets merged back, but all the steps executed by the script do. So you can proceed with forking, scripting, and merging, and your scripts will only live in the forks. The main dataset will not keep a record that those modifications happened via scripts.
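The corrected behavior can be illustrated with another toy model: merging replays the data-changing steps a script performed, but not the script-creation action itself, so the script entry stays on the fork. Again, this is a sketch in plain Python, not the real scrunch/Crunch API:

```python
# Toy illustration: merge replays a fork's data steps, but the script
# entry itself is not replayed onto the target dataset.
# NOT the scrunch/Crunch API -- all names are made up for illustration.

class Toy:
    def __init__(self, rows=None):
        self.rows = list(rows or [])
        self.scripts = []   # script entries recorded only on this dataset
        self.actions = []   # replayable data-changing steps

    def fork(self):
        return Toy(self.rows)

    def run_script(self, steps):
        self.scripts.append(steps)   # the script entity lives here
        for step in steps:
            self.rows.append(step)   # each step is a replayable action
            self.actions.append(step)

    def merge(self, fork):
        # Replay the fork's data steps; script entries are NOT replayed.
        for step in fork.actions:
            self.rows.append(step)

master = Toy(rows=["base"])
f = master.fork()
f.run_script(["derive_var", "rename_var"])
master.merge(f)
print(master.rows)     # ['base', 'derive_var', 'rename_var']
print(master.scripts)  # []
print(f.scripts)       # [['derive_var', 'rename_var']]
```

The master dataset ends up with the data changes but an empty scripts list, matching "the main dataset will not keep a record that those modifications happened via scripts."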
Thanks for confirming @jjdelc, I think overall that is a good thing. 👍
@jjdelc @malecki can we please get the features needed to manage Crunch automation from scrunch added soon? We have plans to use this heavily in the near future. In the scheduling/repetition of project processing we want to migrate all intra-dataset actions to automation scripts; however, we need to push these using scrunch, because they will be the punctuation between inter-dataset actions (still performed with traditional scrunch calls).