nextflow spec questions #241
The Nextflow scripts uploaded by the user will not contain program code. Things that have to be solved with program code should be implemented separately and provided by core as modules so the user can just call them in the Nextflow script.
The Nextflow scripts will not have any path details inside. Required paths (i.e., mets path) will be passed as parameters to the Nextflow scripts. The user should just place a parameter placeholder in the script. The
Same as for
The validation should be done based on what is called or used. I think there isn't a clear definition of what we should and should not allow inside Nextflow scripts, is there? The Nextflow script validator should be written based on what is allowed.
To which spec exactly are you referring?
It is not - was clarified above.
It is enough to use strings, i.e.,
The handling of the METS is done by the ocr-d processors called inside the Nextflow script. There are no explicit checks by the Nextflow scripts written so far based on produced output files.
I think you are referring to a specific Nextflow script, could you share which? Then I can provide a better answer.
With the
In order to share previous steps, this should be done in the Nextflow script itself. The main workflow can contain sub-workflows and decide which sub-workflow to execute based on some parameter.
I have not used Tower before because its free plan is limited. There is no other way to monitor job status than querying the logs. Every time a Nextflow script is executed, a cache folder, process logs, process folders, etc. are created.
This is the final report which contains useful data about the workflow run. It is produced based on the created artifacts in my answer above.
Every process inside the workflow and the workflow itself has a unique ID. This is how Nextflow knows what and where to resume based on where the workflow has failed. To be more precise - it restarts the last failed process.
If you say that caching workflow steps, keeping separate logs of everything, producing execution reports, programming in shell rather than in Groovy, limiting resource usage for specific processes, multitasking, and the HPC-related interactions are all easier in shell scripts than in Nextflow, then I don't see other benefits either. You can still use your available shell scripts as separate processes in Nextflow, and use Nextflow as a higher-level entry point.
Understood. Nice! (This should at least be mentioned in the current spec!)
Ideally this parameter is passed by default in the included module scripts. The workflow definition should not require dealing with METS path (even as placeholder / variable).
Ok, great.
I don't know what's possible within NF, but we used to have So if you understand the Workflow Format as "anything Nextflow allows", then indeed there is not much you can validate beforehand. But the original idea was to have a strict format (only processor calls in a "chain") that can be checked.
To the current state of https://ocr-d.de/en/spec/nextflow
IIUC, at least for the CLI-based workflows currently implemented and in the spec, NF checks only for the processor's exit status and its file artifacts. File artifacts in this case are merely the existence of the fileGrp directories. So if the processor failed, or after a partial run on some pages, NF "thinks" the result for that step is already complete (because it has no notion of the actual METS fileGrps).
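To make the gap concrete, here is a minimal sketch of the kind of check NF does not perform: it reads the METS and reports which pages have no file in a given output fileGrp. The function name `missing_pages`, the sample METS and the fileGrp names are made up for illustration; this is not part of core or of the spec.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical two-page workspace: only the first page has a binarized file.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001"/>
      <mets:file ID="IMG_0002"/>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-BIN">
      <mets:file ID="BIN_0001"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P0001">
        <mets:fptr FILEID="IMG_0001"/>
        <mets:fptr FILEID="BIN_0001"/>
      </mets:div>
      <mets:div TYPE="page" ID="P0002">
        <mets:fptr FILEID="IMG_0002"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
""".strip()


def missing_pages(mets_xml: str, file_grp: str) -> list[str]:
    """Return the IDs of physical pages that reference no file in file_grp."""
    root = ET.fromstring(mets_xml)
    # IDs of all files registered in the requested fileGrp
    file_ids = {
        f.get("ID")
        for f in root.findall(f".//mets:fileGrp[@USE='{file_grp}']/mets:file", NS)
    }
    missing = []
    pages = root.findall(
        ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']", NS
    )
    for page in pages:
        referenced = {p.get("FILEID") for p in page.findall("mets:fptr", NS)}
        if not referenced & file_ids:
            missing.append(page.get("ID"))
    return missing
```

With the sample above, a step writing to `OCR-D-BIN` would count as incomplete because page `P0002` is missing, even though the fileGrp directory would already exist on disk.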
I'm talking of the example given in the spec, and also used in the Quiver workflows. It's tedious to give each step a new name. In contrast, examples in the NF documentation often use the
Ah, understood. Sounds great, but see – this is why it's problematic that NF does not get to know anything about the METS.
I know this is possible, it's nice to have, but in general I disagree: Look at it from the user's perspective: They can see workspaces (with some fileGrps) and workflows (with some fileGrps). Naturally, when they send processing jobs, they assume that existing fileGrps are skipped (and some error handling in case there's an actual conflict, i.e. same names but different processors/parameters).
I see.
By caller I did not mean NF, but the caller of NF. Sounds like everything is a black box.
The spec and the examples inside are greatly outdated, so I now understand why the questions numbered 1 to 6 above were raised.
Understood.
Yes, that was the original idea - to have a strict format that is then translated into the Nextflow script. That's the reason why the workflow format description was started in #208. If you follow the discussions there - that idea was then dropped because we agreed on Nextflow scripts (
I agree that process names should be simplified to just
Yes, but the examples in the NF documentation are greatly limited and deal just with strings, integers, and single channels... When the result of a process has to be passed to the next process, it's hardly possible to use pipes without making the script hard to read and understand. Also, the input/output dependency of processes defines their execution order. In order to chain processes successfully, something has to be passed from the output of the previous process to the input of the next process, even if it's a dummy variable, to prevent them from running in parallel.
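To illustrate the point about ordering, here is a small Python sketch (not Nextflow code) of how an execution order falls out of input/output dependencies alone. The step names and fileGrp names are hypothetical, and `execution_order` is just an illustrative helper; Nextflow's own dataflow model does this through channels.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def execution_order(steps: dict[str, tuple[str, str]]) -> list[str]:
    """Order steps so that each runs after the step producing its input.

    steps maps a step name to (input fileGrp, output fileGrp).
    """
    # Which step produces which fileGrp
    producers = {out_grp: name for name, (_, out_grp) in steps.items()}
    graph = {}
    for name, (in_grp, _) in steps.items():
        producer = producers.get(in_grp)
        # A step with no producing step (e.g. reading the initial images)
        # has no predecessors and may start immediately.
        graph[name] = {producer} if producer else set()
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-step chain, deliberately listed out of order
steps = {
    "recognize": ("OCR-D-SEG", "OCR-D-OCR"),
    "binarize": ("OCR-D-IMG", "OCR-D-BIN"),
    "segment": ("OCR-D-BIN", "OCR-D-SEG"),
}
order = execution_order(steps)
```

Steps that share no dependency would have no edge between them here, which corresponds to the case where NF is free to run them in parallel unless a (possibly dummy) value links them.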
Yes, I see it's problematic that NF knows nothing about the content of the METS. However, searching inside the METS file or checking paths for existing files feels too low-level to be dealt with in the NF script itself. Ideally, inside the workflow description, there should be API calls for doing the mentioned steps. NF will then manage those steps.
This again goes in the direction of my comment just above this one. The problem I have here is that there isn't a defined and clear path on what are potential errors, how to handle those errors, what are users' expectations for error handling, etc, in order to address them appropriately.
That's the workflow server then. Right, to continue execution of a failed workflow, the workflow server should keep track of the Nextflow script name, and potentially the unique ID of that workflow run. Robustness and rerunning workflows were not considered thoroughly enough.
BTW, the NF resume docs state that:
IIUC this means that we must effectively manage NF working directories explicitly (via temporary directories created and removed by the workflow server).
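A rough sketch of what that could look like on the workflow server side, assuming one dedicated working directory per job. The helper names, the `job_id` scheme and the `--mets` parameter name are hypothetical; `-work-dir`, `-resume` and `-with-report` are regular `nextflow run` options. The command is only assembled here, not executed.

```python
import tempfile
from pathlib import Path


def job_work_dir(base_dir, job_id: str) -> Path:
    """Create (or reuse) a dedicated NF working directory for one job."""
    work_dir = Path(base_dir) / job_id
    work_dir.mkdir(parents=True, exist_ok=True)
    return work_dir


def build_nextflow_command(script, mets_path, work_dir, resume=False) -> list[str]:
    """Assemble the nextflow invocation for one job as an argument list."""
    cmd = ["nextflow", "run", str(script)]
    if resume:
        # Resuming only works if the same working directory still exists
        cmd.append("-resume")
    cmd += ["-work-dir", str(work_dir)]
    cmd += ["-with-report", str(Path(work_dir) / "report.html")]
    # Assumption: the script reads the METS path from params.mets
    cmd += ["--mets", str(mets_path)]
    return cmd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        wd = job_work_dir(base, "job-0001")
        print(build_nextflow_command("workflow.nf", "/data/ws/mets.xml", wd))
```

One consequence of this design: the directory can only be removed once the job has finished successfully or been abandoned, since `-resume` depends on its contents.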
Yes, that sounds prudent. NF just needs to know whether or not a step succeeded (whatever that step is) or should be repeated. No file artifacts ideally (so no
Indeed. And that needs to be discussed for the lower levels first and foremost, i.e. on the Python API:
We could use NF workflow introspection in our module scripts to talk to the outside consumers (i.e. workflow server or processing server or whoever called NF). For example, update the MongoDB with the But somehow it must be possible to manage and monitor NF jobs from outside. See https://github.com/SciDAS/nextflow-api
Not really. I think the documentation is concerned with cases where two scripts in the same directory are used to start workflow jobs concurrently. Unless some wacky parallelization needs to be achieved, of which I currently cannot think of an example, this is not a problem from the Workflow Server's perspective. In the current state of the WebAPI impl, each Nextflow script uploaded through the Workflow Server gets stored under a separate directory named
Yes, exactly.
Yes. Nextflow itself cannot magically fix things unless it knows what options are coming from the lower levels. So error handling goes from low to high level, I see 5 different levels: OCR-D processor -> Processing Worker -> Processing Server -> Nextflow Manager -> Workflow Server (i.e., the reverse of the call order)
Sure, but we should also be aware of potential problems related to the handlers. I remember that at some point I had the described issue as well. That's why I would rather use a module and report inside the process right before the process finishes - just to be on the safe side.
That's a good idea - to have a look at what useful runtime metadata we can store additionally in the DB.
When we had the original discussions about a new workflow format to replace the de-facto standard ocrd process syntax in the core implementation, there was a general understanding that the spec must not fall short of the following features (met by the implementation):
However, as it stands, the Workflow Format spec does not seem to meet these criteria. It raises questions …
… venv (and even de/activate it) and the workspace/mets path in NF?
… reads and outs formulated as absolute paths in NF (instead of just fileGrp names)?
… -with-tower or -with-report arguments? If so, how does the caller get to know which job is which during runtime?)
I understand that you tried to apply Nextflow to the OCR-D CLI directly. But currently I don't see a benefit over running the shell scripts directly (from a custom Workflow executor in core).