This is an automated snakemake pipeline for NEID data processing.
- Make sure you can connect to GitHub (e.g., have an account, SSH keys set up, `ssh-agent` running, and your key added with `ssh-add`). (Optional checks for these prerequisites follow this list.)
- Make sure julia is installed and in your path (e.g., `module load julia` on Roar).
- Make sure your julia depot is somewhere you'll have sufficient storage space and that isn't slow (i.e., not your home directory on Roar).

  mkdir /storage/work/USERID/julia_depot
  ln -s /storage/work/USERID/julia_depot ~/.julia
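Optional sanity checks for the items above: the first command prints a greeting from GitHub if your SSH key is working, and the second shows where the julia depot symlink points.

$ ssh -T [email protected]
$ ls -ld ~/.julia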
If you're truly starting from scratch, then create a new pipeline_dir. (On Roar/Roar Collab, this should be in group storage, not your home directory.)
If someone else has already set up a pipeline_dir, then just change into it (on Roar Collab, this is /storage/group/ebf11/default/pipeline/) and skip to step 4.
$ cd pipeline_dir
If your pipeline_dir is not already prepared, the next few sub-steps set it up with the following structure:
├── shared (code with Snakemake pipeline and configuration files)
├── venv (provides python environment)
├── neid_solar
│   ├── NeidSolarScripts.jl (code for neid_solar analysis)
│   ├── data (will contain many large data files)
│   └── work (copy neid_solar workflow files here and submit jobs)
└── neid_night
    ├── NeidNightScripts (code for neid_night analysis)
    ├── SSOFApplication (SSOF-specific code)
    ├── data (will contain many large data files)
    └── work (copy neid_night workflow files here and submit jobs)
- Clone the snakemake pipeline and starter configuration files. Rename the folder to shared.
$ git clone [email protected]:RvSpectML/RISE_NEID_pipeline.git
$ mv RISE_NEID_pipeline shared
- Create the virtual environment and install packages such as snakemake and pyNeid
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install --upgrade pip
$ pip install -r shared/envs/requirements.txt
$ mkdir neid_solar
$ cd neid_solar
$ git clone [email protected]:RvSpectML/NeidSolarScripts.jl.git
Instantiate the NeidSolarScripts project so that julia downloads and installs dependencies. (We temporarily removed RCall from the Project.toml as R has not been installed on Roar Collab.)
$ julia --project=NeidSolarScripts.jl
julia> import Pkg
julia> Pkg.instantiate()
julia> exit()
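Equivalently, the same instantiation can be run non-interactively (handy in batch scripts):

$ julia --project=NeidSolarScripts.jl -e 'import Pkg; Pkg.instantiate()'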
Copy the following into the NeidSolarScripts.jl folder (TODO: probably should move the following to the work directories):
- scripts/data/linelist_*.csv

(On RoarCollab, it's /storage/group/ebf11/default/pipeline/neid_solar/data.)
$ mkdir data
Copy the following data into the data sub-directory (see the example after this list for what this can look like on RoarCollab):
- days_to_exclude.csv (column name date_to_exclude, entries look like 2021-10-02)
- pyrheliometer/*.tel (keeping the pyrheliometer subdirectory; those can be downloaded from https://neid.ipac.caltech.edu/pyrheliometer.php; see NeidSolarScripts.jl/scripts/download_pyrheliometer_tel_files.jl for a download script)
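For example, on RoarCollab, if the shared copies already exist in the group data directory mentioned above, the copy might look like the following (adjust the paths if your layout differs):

$ cp /storage/group/ebf11/default/pipeline/neid_solar/data/days_to_exclude.csv data/
$ cp -r /storage/group/ebf11/default/pipeline/neid_solar/data/pyrheliometer data/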
Optionally, create a parallel data directory on the scratch file system and add symlinks to it inside the data directory, so L0 files are stored on scratch. E.g.,
mkdir -p /scratch/ebf11/pipeline/neid_solar/data/v1.1/L0;
cd data
mkdir v1.1
cd v1.1
ln -s /scratch/ebf11/pipeline/neid_solar/data/v1.1/L0 .
cd ../..
$ mkdir -p work
$ mkdir neid_night
$ cd neid_night
$ git clone [email protected]:christiangil/SSOFApplication.git
(On RoarCollab, it's /storage/group/ebf11/default/pipeline/neid_night/data.)
$ mkdir data
$ mkdir -p work
Running neid_solar and neid_night jobs is very similar. We will use neid_solar as an example unless otherwise noted.
If you have userid USERID and a run to be named test1, then you would create:
$ mkdir -p work/USERID/test1
$ cd work/USERID/test1
Copy the template slurm script (pipeline.slurm), Snakefile (Snakefile) and configuration parameters file (config.yaml) into the workspace for your analysis run (replace neid_solar_workflow with neid_night_workflow in the following commands for neid_night jobs).
$ cp ../../../../shared/neid_solar_workflow/job_submission/slurm/pipeline.slurm .
$ cp ../../../../shared/neid_solar_workflow/Snakefile .
$ cp ../../../../shared/neid_solar_workflow/config.yaml .
Create a file nexsci_id.toml in the workspace that includes the username and password for NEID.
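A minimal sketch of creating that file is below; the key names are an assumption here, so check the scripts in NeidSolarScripts.jl (or an existing user's file) for the exact format they expect.

$ cat > nexsci_id.toml <<'EOF'
# Assumed key names -- verify against whatever script reads this file.
user = "YOUR_NEID_USERNAME"
password = "YOUR_NEID_PASSWORD"
EOF
$ chmod 600 nexsci_id.toml   # keep credentials readable only by you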
Create an empty data_paths.jl (`touch data_paths.jl`) in NeidSolarScripts.jl. (TODO: Update make_manifest_solar.jl so it doesn't need this file. Or if it really does, make it a toml file.)
Change the parameters as needed for your run:
- config.yaml
- pipeline.slurm (see the "UPDATE VARIABLES HERE" section)
$ sbatch pipeline.slurm
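Once submitted, standard Slurm commands can be used to keep an eye on the run (the log file name shown is the Slurm default; pipeline.slurm may specify a different one):

$ squeue -u USERID           # is the pipeline job queued or running?
$ tail -f slurm-JOBID.out    # follow the job's log output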
Each user should create their own workspace, and within their workspace, they can further create subfolders for different runs, e.g., PIPELINE_ROOT/work/USERID/test1. The user then copies the Snakefile, config.yaml and pipeline slurm/shell script from the shared repository to this folder, and makes changes to the files in their own folder. A file nexsci_id.toml that includes the username and password for NEID might also be needed here.
Once snakemake is started, a .snakemake directory will be created in this work subfolder that keeps track of the snakemake jobs. A lock will be placed while a snakemake pipeline is running, and no other runs are allowed in this subfolder. However, users can start another run in a different workspace as long as the output files will not interfere with each other (see below).
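If a run is interrupted (e.g., the Slurm job is killed), the lock can be left behind. Snakemake's --unlock option, run from within that workspace, removes a stale lock so the next run can proceed:

$ snakemake --unlock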
The L0/L2 data, pyrheliometer files and manifest files are relatively stable and once created, they can be shared between different users and different pipeline runs.
- data/INPUT_VERSION/L0/
- data/INPUT_VERSION/L2/
- data/INPUT_VERSION/pyrheliometer/
- data/INPUT_VERSION/manifest/
The downstream analysis often varies with different parameter sets, so we put those outputs in specific USER_ID and PIPELINE_ID subfolders.
- data/INPUT_VERSION/outputs/USER_ID/PIPELINE_ID/
In this way, different users can run their own versions of the downstream analysis without interfering with each other.
In pipeline.slurm, there is a variable "PIPELINE_MODE".
If PIPELINE_MODE=DAILY, daily data will be downloaded and processed for each day between the given start_date and end_date. Specifically, this mode includes downloading L0/L2 data, generating pyrheliometer and manifest files, calculating ccfs and rvs, and generating daily reports.
If PIPELINE_MODE=SUMMARY, the steps that generate summary reports, including report_monthly, report_all and combine_rvs, will be run. The report_monthly step runs on each month between (and including) the given start_date and end_date. The report_all and combine_rvs steps run on all the data in the output folder, regardless of the given start_date and end_date.
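A typical sequence is therefore two submissions: one in DAILY mode to download and process the data, and a second in SUMMARY mode once the daily processing has finished. The mode is changed by editing pipeline.slurm (in its "UPDATE VARIABLES HERE" section) before resubmitting:

# first submission, with PIPELINE_MODE=DAILY set in pipeline.slurm
$ sbatch pipeline.slurm
# after the daily run finishes, edit pipeline.slurm to set PIPELINE_MODE=SUMMARY, then
$ sbatch pipeline.slurm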
Sometimes we need to re-execute all or part of the pipeline, while at other times we want to avoid unnecessary re-execution. Here we list some common scenarios.
L0 files are large and they are not needed in the steps beyond prep_pyro. To save storage space, we can safely discard the old L0 files that have already been processed. This will not trigger re-execution of the pipeline as long as we do not add the "--forceall" option.
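As an illustration, using the v1.1 layout from the setup steps above (how files are organized under L0/ is an assumption; check your own data tree before deleting anything):

$ du -sh data/v1.1/L0             # see how much space the raw L0 files are using
# then remove already-processed L0 files as appropriate for your layout, e.g.
$ rm -r data/v1.1/L0/2021-10-02   # hypothetical per-date subdirectory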
Note that although removing the input file of a rule does not trigger re-execution, changes to the input file will trigger the re-execution of the downstream steps.
The input raw data on https://neid.ipac.caltech.edu/search.php may be updated from time to time. When its swversion is updated to a new major and/or minor version, we may want to download the newer version of the data and re-run the pipeline. To do so, set the new swversion in config.yaml, and snakemake will detect the change and re-run the pipeline on all the dates between the given start_date and end_date, including downloading the new version of the data and the downstream data processing.
When only the patch version is updated or you simply want to re-run the pipeline, add the "--forceall" option to the snakemake command.
When a rule is updated in the Snakefile, or a parameter used in the rule (whether set in the Snakefile directly, passed in from the config file, or given via the "snakemake --config" command) is changed, that step and all the downstream steps will be re-run the next time snakemake is run. The updates will only occur for dates within the specified date range. If the specified date range does not include previous dates, then the output files would be a mix of files generated with the old and new rules/parameters. To avoid this, after changing a rule or pipeline parameter, one should generally run the pipeline over the full date range to make sure all affected outputs get updated.
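For reference, if you ever invoke snakemake directly inside a workspace (rather than via sbatch), the flags discussed above look like the sketch below; the start_date/end_date keys mirror the config.yaml parameters (assuming the Snakefile reads them from the config), the dates are placeholders, and the core count is arbitrary:

$ source ../../../../venv/bin/activate
$ snakemake --cores 4 --forceall --config start_date=2021-01-01 end_date=2021-12-31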