A Python tool to describe and run tools and workflows for processing SPARC datasets in accordance with FAIR principles.
- About
- Introduction
- The problem
- Our solution - sparc-flow
- Impact and vision
- Future developments
- Setting up sparc-flow
- Using sparc-flow
- Reporting issues
- Contributing
- Cite us
- FAIR practices
- License
- Team
- Acknowledgements
This is the repository of Team sparc-flow (Team #3) of the 2023 SPARC Codeathon. Click here to find out more about the SPARC Codeathon 2023. Check out the Team Section of this page to find out more about our team members.
No work was done on this project prior to the Codeathon.
The NIH Common Fund program on Stimulating Peripheral Activity to Relieve Conditions (SPARC) focuses on understanding peripheral nerves (nerves that connect the brain and spinal cord to the rest of the body), how their electrical signals control internal organ function, and how therapeutic devices could be developed to modulate electrical activity in nerves to improve organ function. This may provide a powerful way to treat a diverse set of common conditions and diseases such as hypertension, heart failure, gastrointestinal disorders, and more. 60 research groups spanning 90 institutions and companies contribute to SPARC and work across over 15 organs and systems in 8 species.
The SPARC Portal provides a single user-facing online interface to resources that can be shared, cited, visualized, computed, and used for virtual experimentation. A key offering of the portal is the collection of well-curated datasets in a standardised format, including anatomical and computational models that are being generated by both SPARC-funded researchers and the international scientific community. These datasets can be found under the "Find Data" section of the SPARC Portal. Information regarding how to navigate a SPARC dataset and how a dataset is formatted can be found on the SPARC Portal.
Workflows can be developed that apply tools (e.g. segmentation of images, or running of computational physiology simulations) in a series of steps to process the original data and generate new results, outcomes, and knowledge. These results (derived data) can be stored in a new standardised dataset and potentially be contributed to the SPARC Portal to support further scientific advances.
There is currently no option for users to:
- easily describe workflows and tools, which process SPARC data, in a FAIR manner
- easily run such workflows locally or from cloud computing platforms such as oSPARC
- easily reproduce workflow results
- reuse tools developed for processing SPARC data in new workflows (tools are currently bundled within and tailored to specific SPARC datasets).
To address this problem, we have developed a Python module called SPARC Flow (sparc-flow) that can be used to describe tools and workflows for processing SPARC datasets in accordance with FAIR principles. sparc-flow:
- Provides an easy-to-use Python-based application programming interface (API) to enable tools and workflows to be described in a language-agnostic manner.
- Enables the parameters used for running workflows to be stored with the standardised workflow description along with a copy of its associated tools to enable workflow results to be easily reproduced.
- Enables workflows and tool descriptions to be independently stored in SDS datasets, ready to be contributed to the SPARC portal to enable reuse by others.
- Provides the ability to save and load workflows and tools directly to/from SDS datasets via sparc-me.
- Provides the ability to run workflows:
- locally;
- on existing cloud computing platforms such as oSPARC; or
- via Dockstore, by helping prepare the workflow for submission so that its standardised workflow interfaces can be used to run it directly from the command line or through the cloud computing platforms available from Dockstore.org (currently AnVIL, Cavatica, CGC, DNAnexus, Galaxy, Nextflow Tower, and Terra).
- Provides tutorials that demonstrate each of the above features.
- Proposes guidelines for FAIR use of tools and workflows.
- Provides best practices guidance in tutorials on how to use these guidelines.
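The idea of storing run parameters together with the workflow description can be illustrated with a small sketch. This is not the sparc-flow API or its on-disk format: the function name `make_run_record`, the record fields, and the example file name are all hypothetical, chosen only to show how a checksum of the workflow plus the parameters used gives enough information to repeat a run.

```python
import hashlib
import json


def make_run_record(workflow_name: str, workflow_text: str, parameters: dict) -> str:
    """Return a JSON run record (hypothetical layout, not sparc-flow's format).

    The record ties the parameters of a run to the exact workflow text via a
    SHA-256 checksum, so a later reader can confirm they are re-running the
    same workflow with the same inputs.
    """
    digest = hashlib.sha256(workflow_text.encode("utf-8")).hexdigest()
    record = {
        "workflow": workflow_name,
        "workflow_sha256": digest,
        "parameters": parameters,
    }
    # sort_keys makes the serialised record stable across runs.
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical workflow text and parameter, for illustration only.
    print(make_run_record("workflow.cwl", "cwlVersion: v1.2\n", {"threshold": 0.5}))
```

Writing the returned string to a file next to the workflow description would keep the two together when the dataset is shared.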
If you find sparc-flow useful, please add a GitHub Star to support developments!
The sparc-flow API has been designed to be agnostic to the language used to describe tools & workflows and the services it adopts to run the workflows. The following languages and services are currently supported:
- The Common Workflow Language (CWL) is an open standard and specification used in the field of bioinformatics and scientific computing to describe and execute workflows. CWL provides a way to define and share complex computational tasks and data processing pipelines in a portable and platform-independent manner. CWL documents are written in YAML or JSON and describe input data, processing steps, and output data, allowing researchers to collaborate and share reproducible analyses across different computing environments. CWL aims to enhance the ease of defining, sharing, and executing computational workflows, particularly in the context of data-intensive scientific research.
- Dockstore is an open platform used for sharing, publishing, and discovering bioinformatics tools and workflows. It allows researchers and scientists to easily find, collaborate on, and reproduce analyses involving complex data processing pipelines. Dockstore provides a standardized way to describe and share tools and workflows using CWL and Workflow Description Language (WDL). It facilitates reproducibility in bioinformatics research by enabling users to access and execute these tools and workflows in various computational environments, such as cloud platforms, containers (e.g. Docker), or local clusters. Dockstore is supported by 58 organisations including the Global Alliance for Genomics and Health (GA4GH), the Broad Institute, the Human Cell Atlas, the Human BioMolecular Atlas Program (HuBMAP), NIH Cloud Platform Interoperability Effort, the Imaging Data Commons, and Biosimulators.
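To make the CWL description above more concrete, here is a minimal sketch of what a CWL tool description looks like when built as a plain Python dictionary and serialised to JSON. It deliberately does not use the sparc-flow API. The function name `make_echo_tool` and the choice of `echo` as the wrapped command are illustrative assumptions. The field names (`cwlVersion`, `class`, `baseCommand`, `inputs`, `outputs`) come from the CWL specification.

```python
import json


def make_echo_tool() -> dict:
    """Build a minimal CWL CommandLineTool description as a plain dict.

    The tool wraps the `echo` command and takes a single string input that
    is passed as the first positional argument.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {
            "message": {
                "type": "string",
                "inputBinding": {"position": 1},
            }
        },
        "outputs": {},
    }


if __name__ == "__main__":
    # CWL documents may be written as JSON or YAML; JSON needs no extra packages.
    print(json.dumps(make_echo_tool(), indent=2))
```

Saving the printed JSON to a file gives a tool description that a CWL runner can read without any knowledge of Python.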
We have compared FAIR-use guidelines for data and research software, and based on the literature, we have proposed guidelines for enabling FAIR workflows. Furthermore, we have also provided examples of how the technologies used in sparc-flow apply these guidelines.
sparc-flow will elevate the impact of the SPARC program by providing the fundamental tools needed by users to describe the tools and workflows they are building/using with SPARC data for generating novel results, outcomes, and knowledge. The breadth of impact spans across:
- Supporting SPARC Data and Resource Centre (DRC) and community developments including:
- sparc-flow automatically generates SDS datasets for workflows and tools that could be submitted to "workflow" and "tool" sections of the SPARC portal's "Find data and models" page.
- Improving efficiency of software developments (e.g. future codeathons and SPARC portal roadmap developments) by reducing the need to reimplement common functions.
- Supporting and promoting harmonisation/interoperability with other research initiatives. For example, sparc-flow enables running workflows on different platforms including those being developed in other NIH-funded initiatives such as the Common Fund’s NIH Data Commons program. This contributes to the developers' vision of enabling workflows and tools to be described in a platform-agnostic manner to increase the accessibility of services provided by these platforms. For example, users could send their workflows and tools to, and run them on, platforms that:
- restrict access to datasets to specific territories to adhere to data-sovereignty requirements.
- have large-scale HPC facilities that are not available in their country.
- Supporting reuse of tools created by users for developing novel workflows without expending limited resources in re-inventing the wheel.
Ultimately, our vision is to include standardised workflow and tool descriptions in knowledge bases to support automated assembly and execution of workflows (e.g. for creating digital twins for precision medicine applications).
- automating the generation of API documentation.
- supporting the WDL, Nextflow, and Galaxy workflow languages that are used in scientific research platforms.
- integrating workflow and tool validators and checkers.
- integrating workflow and tool descriptions into knowledge graphs such as SCKAN to support the identification of workflow and tools that are related to specific biological concepts.
- incorporating approaches for automatically assessing adherence to FAIR-ness guidelines for workflows and tools.
- tagging workflows using, e.g., Software Ontology (SWO) descriptions, which will make it easy to identify and search for workflows with, e.g., specific license restrictions.
- Git
- Python. Tested on:
- 3.9
- Operating system. Tested on:
- Ubuntu 18
Here is the link to our project on PyPI
pip install sparc-flow
Clone the sparc-flow repository from GitHub, e.g.:
git clone https://github.com/SPARC-FAIR-Codeathon/2023-team-3.git
- Setting up a virtual environment (optional but recommended). In this step, we will create a virtual environment in a new folder named venv, and activate the virtual environment.
- Linux
python3 -m venv venv
source venv/bin/activate
- Windows
python3 -m venv venv
venv\Scripts\activate
- Installing dependencies via pip
pip install -r requirements.txt
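After installation, it can be useful to confirm that the package is visible in the active environment. The sketch below uses only the standard library (`importlib.metadata`, available in the Python 3.9 version listed above). The helper name `installed_version` is an illustrative assumption. The distribution name `sparc-flow` is taken from the `pip install` command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("sparc-flow")
    if version is None:
        print("sparc-flow is not installed in this environment")
    else:
        print(f"sparc-flow version: {version}")
```

If the package is reported as missing, check that the virtual environment is activated before running `pip install`.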
Guided Jupyter Notebook tutorials have been developed describing how to use sparc-flow in different scenarios:
Tutorial | Description |
---|---|
1 | Provides a typical data processing example that downloads an existing curated SDS dataset from the SPARC portal (Electrode design characterization for electrophysiology from swine peripheral nervous system) using sparc-me and performs post-processing to generate a new derived SDS dataset. This example will be used in subsequent tutorials. |
2 | Use sparc-flow to programmatically describe the example in Tutorial 1 in a standard workflow language (Common Workflow Language). This tutorial incorporates best practice guidelines to ensure tools used in the workflow and the workflow itself are FAIR. |
3 | Use sparc-flow to run the standardised workflow described in Tutorial 2 locally using cwltool (reference implementation provided by the CWL Organisation). |
4 | Use sparc-flow to run the standardised workflow described in Tutorial 2 locally using Dockstore. |
5 | Use sparc-flow to run the standardised workflow described in Tutorial 2 via the cloud using a Dockstore-compatible cloud computing platform (e.g. AnVIL, Cavatica, CGC, DNAnexus, Galaxy, Nextflow Tower, and Terra). |
6 | Use sparc-flow to run the standardised workflow described in Tutorial 2 on oSPARC. |
7 | Use sparc-flow to run the standardised workflow described in Tutorial 2 on the 12 Labours Digital Twin Platform (To be completed in future developments). |
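For orientation, the local run in Tutorial 3 ultimately relies on cwltool, which can also be invoked directly. The sketch below shows one way to call the cwltool command-line interface from Python. It does not use the sparc-flow API, and the helper names and the file names `workflow.cwl` and `inputs.yml` are hypothetical placeholders. It assumes cwltool has been installed separately (e.g. with `pip install cwltool`).

```python
import shutil
import subprocess
from typing import List, Optional


def build_cwltool_command(
    workflow: str, job: Optional[str] = None, outdir: Optional[str] = None
) -> List[str]:
    """Assemble a cwltool command line: cwltool [--outdir DIR] WORKFLOW [JOB]."""
    cmd = ["cwltool"]
    if outdir:
        cmd += ["--outdir", outdir]
    cmd.append(workflow)
    if job:
        cmd.append(job)
    return cmd


def run_locally(
    workflow: str, job: Optional[str] = None, outdir: Optional[str] = None
) -> int:
    """Run the workflow with cwltool and return its exit code."""
    if shutil.which("cwltool") is None:
        raise RuntimeError("cwltool was not found on PATH")
    return subprocess.run(build_cwltool_command(workflow, job, outdir)).returncode


if __name__ == "__main__":
    # Hypothetical file names; only the command is printed here, nothing is run.
    print(" ".join(build_cwltool_command("workflow.cwl", "inputs.yml", "results")))
```

Keeping command construction separate from execution makes the command easy to inspect or log before anything is run.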
To report an issue or suggest a new feature, please use the issues page. Issue templates are provided to allow users to report bugs, and documentation or feature requests. Please check existing issues before submitting a new one.
Fork this repository and submit a pull request to contribute. Before doing so, please read our Code of Conduct and Contributing Guidelines. Pull request templates are provided to help guide developers in describing their contribution, mentioning the issues related to the pull request and describing their testing environment.
- /sparc_flow/ - Parent directory of the sparc-flow Python module.
- /sparc_flow/core/ - Core classes of sparc-flow.
- /resources/ - Resources that are used in tutorials (e.g. SDS datasets containing workflow and tool descriptions).
- /tutorials/ - Parent directory of tutorials for using sparc-flow.
- /development_examples/ - Parent directory of examples that were created during the development of sparc-flow.
- /docs/images/ - Images used in sparc-flow tutorials.
If you use sparc-flow to make new discoveries or use the source code, please cite us as follows:
Jiali Xu, Linkun Gao, Michael Hoffman, Matthew French, Thiranja Prasad Babarenda Gamage, Chinchien Lin (2023). sparc-flow: v1.0.0 - A Python tool to create tools and workflows for processing SPARC datasets in accordance with FAIR principles.
Zenodo. https://doi.org/XXXX/zenodo.XXXX.
We have assessed the FAIRness of our sparc-flow tool against the FAIR Principles established for research software. The details are available in the following document.
sparc-flow is fully open source and distributed under the very permissive Apache License 2.0. See LICENSE for more information.
- Jiali Xu (Developer, Writer - Documentation)
- Linkun Gao (Developer, Writer - Documentation)
- Michael Hoffman (Developer, Writer - Documentation)
- Matthew French (Developer, Writer - Documentation)
- Thiranja Prasad Babarenda Gamage (Writer - Documentation)
- Chinchien Lin (Lead, SysAdmin)
- We would like to thank the organizers of the 2023 SPARC Codeathon for their guidance and support during this Codeathon.