diff --git a/.travis.yml b/.travis.yml index 31c7f70b..087fdb66 100644 --- a/.travis.yml +++ b/.travis.yml @@ -34,12 +34,12 @@ env: # before_install: install: - - sh install-mpi.sh + - sh ci/travis/install-mpi.sh - export MPI_PREFIX="${HOME}/opt/${MPI_LIBRARY}-${MPI_LIBRARY_VERSION}" - export PATH="${HOME}/.local/bin:${MPI_PREFIX}/bin${PATH:+":${PATH}"}" - export LD_LIBRARY_PATH="${MPI_PREFIX}/lib${LD_LIBRARY_PATH:+":${LD_LIBRARY_PATH}"}" - pip install --upgrade pip - - pip install -r requirements-travis.txt + - pip install -r envs/pip-requirements-travis.txt # before_script: @@ -54,6 +54,8 @@ stages: notifications: email: + recipients: + - felker@anl.gov on_success: change on_failure: always slack: diff --git a/jenkins-ci/jenkins.sh b/ci/jenkins/jenkins.sh similarity index 100% rename from jenkins-ci/jenkins.sh rename to ci/jenkins/jenkins.sh diff --git a/jenkins-ci/run_jenkins.py b/ci/jenkins/run_jenkins.py similarity index 100% rename from jenkins-ci/run_jenkins.py rename to ci/jenkins/run_jenkins.py diff --git a/jenkins-ci/validate_jenkins.py b/ci/jenkins/validate_jenkins.py similarity index 100% rename from jenkins-ci/validate_jenkins.py rename to ci/jenkins/validate_jenkins.py diff --git a/jenkins-ci/validate_jenkins.sh b/ci/jenkins/validate_jenkins.sh similarity index 100% rename from jenkins-ci/validate_jenkins.sh rename to ci/jenkins/validate_jenkins.sh diff --git a/install-mpi.sh b/ci/travis/install-mpi.sh similarity index 100% rename from install-mpi.sh rename to ci/travis/install-mpi.sh diff --git a/data/signals.py b/data/signals.py index 12e26e4b..18e4e521 100644 --- a/data/signals.py +++ b/data/signals.py @@ -85,12 +85,9 @@ def get_units(str): if found: if rank > 1: xdata = c.get('dim_of(_s,1)').data() - # xunits = get_units('dim_of(_s,1)') ydata = c.get('dim_of(_s)').data() - # yunits = get_units('dim_of(_s)') else: xdata = c.get('dim_of(_s)').data() - # xunits = get_units('dim_of(_s)') # MDSplus seems to return 2-D arrays transposed. Change them back. if np.ndim(data) == 2: @@ -406,6 +403,11 @@ def fetch_nstx_data(signal_path, shot_num, c): # 'tmamp1':tmamp1, 'tmamp2':tmamp2, 'tmfreq1':tmfreq1, 'tmfreq2':tmfreq2, # 'pechin':pechin, # 'rho_profile_spatial':rho_profile_spatial, 'etemp':etemp, + # ----- + # TODO(KGF): replace this hacky workaround + # IMPORTANT: must comment-out the following line when preprocessing for + # training on JET CW and testing on JET ILW (FRNN 0D). + # Otherwise 1K+ CW shots are excluded due to missing profile data 'etemp_profile': etemp_profile, 'edens_profile': edens_profile, # 'itemp_profile':itemp_profile, 'zdens_profile':zdens_profile, # 'trot_profile':trot_profile, 'pthm_profile':pthm_profile, diff --git a/docs/ALCF.md b/docs/ALCF.md new file mode 100644 index 00000000..828e4aae --- /dev/null +++ b/docs/ALCF.md @@ -0,0 +1,385 @@ +# ALCF Theta `plasma-python` FRNN Notes + +**Author: Rick Zamora (rzamora@anl.gov)** + +This document is intended to act as a tutorial for running the [plasma-python](https://github.com/PPPLDeepLearning/plasma-python) implementation of the Fusion recurrent neural network (FRNN) on the ALCF Theta supercomputer (Cray XC40; Intel KNL processors). The steps followed in these notes are based on the Princeton [Tiger-GPU tutorial](https://github.com/PPPLDeepLearning/plasma-python/blob/master/docs/PrincetonUTutorial.md#location-of-the-data-on-tigress), hosted within the main GitHub repository for the project. 
## Environment Setup

Choose a *root* directory for FRNN-related installations on Theta:

```
export FRNN_ROOT=<root-directory-path>
cd $FRNN_ROOT
```

*Personal Note: Using FRNN_ROOT=/home/zamora/ESP*

Create a simple directory structure allowing experimental *builds* of the `plasma-python` Python code/library:

```
mkdir build
mkdir build/miniconda-3.6-4.5.4
cd build/miniconda-3.6-4.5.4
```

### Custom Miniconda Environment Setup

Copy the Miniconda installation script to your working directory and run it:

```
cp /lus/theta-fs0/projects/fusiondl_aesp/FRNN/rzamora/scripts/install_miniconda-3.6-4.5.4.sh .
./install_miniconda-3.6-4.5.4.sh
```

The `install_miniconda-3.6-4.5.4.sh` script will install `miniconda-4.5.4` (using `Python-3.6`), as well as `Tensorflow-1.12.0` and `Keras 2.2.4`.

Update your environment variables to use this Miniconda installation:

```
export PATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/bin:$PATH
export PYTHONPATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/lib/python3.6/site-packages/:$PYTHONPATH
```

Note that the previous lines (as well as the definition of `FRNN_ROOT`) can be appended to your `$HOME/.bashrc` file if you want to use this environment on Theta by default.

## Installing `plasma-python`

Here, we assume the installation is performed within the custom Miniconda environment set up in the previous section. We also assume the following commands have already been executed:

```
export FRNN_ROOT=<root-directory-path>
export PATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/bin:$PATH
export PYTHONPATH=${FRNN_ROOT}/build/miniconda-3.6-4.5.4/miniconda3/4.5.4/lib/python3.6/site-packages/:$PYTHONPATH
```

*Personal Note: Using `export FRNN_ROOT=/lus/theta-fs0/projects/fusiondl_aesp/zamora/FRNN_project`*

If the environment is set up correctly, installation of `plasma-python` is straightforward:

```
cd ${FRNN_ROOT}/build/miniconda-3.6-4.5.4
git clone https://github.com/PPPLDeepLearning/plasma-python.git
cd plasma-python
python setup.py build
python setup.py install
```

## Data Access

Sample data and metadata are available in `/lus/theta-fs0/projects/FRNN/tigress/alexeys/signal_data` and `/lus/theta-fs0/projects/FRNN/tigress/alexeys/shot_lists`, respectively. It is recommended that users create their own symbolic links to these directories. I recommend doing this within a directory called `/lus/theta-fs0/projects/fusiondl_aesp/<username>/`. For example:

```
ln -s /lus/theta-fs0/projects/fusiondl_aesp/FRNN/tigress/alexeys/shot_lists /lus/theta-fs0/projects/fusiondl_aesp/<username>/shot_lists
ln -s /lus/theta-fs0/projects/fusiondl_aesp/FRNN/tigress/alexeys/signal_data /lus/theta-fs0/projects/fusiondl_aesp/<username>/signal_data
```

For the examples included in `plasma-python`, there is a configuration file that specifies the root directory of the raw data. Change the `fs_path: '/tigress'` line in `examples/conf.yaml` to the following:

```
fs_path: '/lus/theta-fs0/projects/fusiondl_aesp'
```

It is also a good idea to change `num_gpus: 4` to `num_gpus: 1`. I am also using the `jet_data_0D` dataset:

```
paths:
  data: jet_data_0D
```
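Before moving on to preprocessing, it is worth confirming that these edits are actually picked up. The following is a minimal sanity-check sketch; it assumes PyYAML is available and that the key names match the snippets above (`fs_path`, `num_gpus`, and `data` under `paths`), which may differ between versions of `conf.yaml`:

```python
# Re-read examples/conf.yaml and print the values edited above.
# Illustrative check only; adjust the file path and key names to your setup.
import yaml

with open("examples/conf.yaml") as f:
    conf = yaml.safe_load(f)

print(conf.get("fs_path"))                 # expect: /lus/theta-fs0/projects/fusiondl_aesp
print(conf.get("num_gpus"))                # expect: 1
print(conf.get("paths", {}).get("data"))   # expect: jet_data_0D
```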
### Data Preprocessing

#### The SLOW Way (On Theta)

Theta is KNL-based and is **not** the best resource for processing many text files in Python. However, the preprocessing step *can* be run on Theta using the following steps (although it may need to be repeated several times to get through the whole dataset within the 60-minute debug queue):

```
cd ${FRNN_ROOT}/build/miniconda-3.6-4.5.4/plasma-python/examples
cp /lus/theta-fs0/projects/fusiondl_aesp/FRNN/rzamora/scripts/submit_guarantee_preprocessed.sh .
```

Modify the paths defined in `submit_guarantee_preprocessed.sh` to match your environment.

Note that the preprocessing module uses Pathos multiprocessing (not MPI/mpi4py). Therefore, the script will see every compute core (all 256 per node) as an available resource. Since the Lustre file system is unlikely to perform well with 256 processes (on the same node) opening/closing/creating files at once, it might improve performance to make a slight change to line 85 of `~/plasma-python/plasma/preprocessor/preprocess.py`, capping the number of worker processes:

```
line 85: use_cores = min( <max-processes>, max(1,mp.cpu_count()-2) )
```

Here `<max-processes>` is whatever upper bound you want to place on the number of processes hitting the file system at once. After optionally re-building and installing `plasma-python` with this change, submit the preprocessing job:

```
qsub submit_guarantee_preprocessed.sh
```

#### The FAST Way (On Cooley)

You will find it much less painful to preprocess the data on Cooley, because its Haswell processors are much better suited to this task. Log on to the ALCF Cooley machine:

```
ssh <username>@cooley.alcf.anl.gov
```

Copy my `cooley_preprocess` example directory to whatever directory you choose to work in:

```
cp -r /lus/theta-fs0/projects/fusiondl_aesp/FRNN/rzamora/scripts/cooley_preprocess .
cd cooley_preprocess
```

This directory has a Singularity image with everything you need to run the code on Cooley. Assuming you have created symbolic links to the `shot_lists` and `signal_data` directories in `/lus/theta-fs0/projects/fusiondl_aesp/<username>/`, you can simply submit the included COBALT script (to specify the data you want to process, just modify the included `conf.yaml` file):

```
qsub submit.sh
```

For me, this finishes in less than 10 minutes, and creates 5523 `.npz` files in the `/lus/theta-fs0/projects/fusiondl_aesp/<username>/processed_shots/` directory. The output file of the COBALT submission ends with the following message:

```
5522/5523Finished Preprocessing 5523 files in 406.94421911239624 seconds
Omitted 5523 shots of 5523 total.
0/0 disruptive shots
WARNING: All shots were omitted, please ensure raw data is complete and available at /lus/theta-fs0/projects/fusiondl_aesp/zamora/signal_data/.
4327 1196
```

# Notes on Revisiting Preprocessing

## Preprocessing Information

To understand what might be going wrong with the preprocessing step, let's walk through what the code is actually doing (a minimal driver sketch follows these steps).

**Step 1** Call `guarantee_preprocessed(conf)`, which is defined in `plasma/preprocessor/preprocess.py`. This function first initializes a `Preprocessor()` object (whose class definition is in the same file), and then checks whether the preprocessing was already done (by looking for a file). The preprocessor object is called `pp`.

**Step 2** Assuming preprocessing is needed, we call `pp.clean_shot_lists()`, which loops through each file in the `shot_lists` directory and calls `self.clean_shot_list()` (not plural) for each text-file item. I do not believe this function is doing anything when I run it, because all the shot-list files have already been "cleaned." Cleaning a shot-list file just means the data is corrected to have two columns, and the file is renamed (to include "clear" in the name).

**Step 3** We call `pp.preprocess_all()`, which parses some of the config file, and ultimately calls `self.preprocess_from_files(shot_files_all, use_shots)` (where I believe `shot_files_all` is the output directory, and `use_shots` is the number of shots to use).

**Step 4** The `preprocess_from_files()` function does the actual preprocessing. It does this by creating a multiprocessing pool and mapping the work to the `self.preprocess_single_file` function (note that the code for the `ShotList` class is in `plasma/primitives/shots.py`, and the preprocessing code is still in `plasma/preprocessor/preprocess.py`).

**Important:** It looks like the code uses the path definitions in `data/shot_lists/signals.py` to define the location/path of the signal data. I believe that some of the signal data is missing, which is causing every "shot" to be labeled as incomplete (and consequently thrown out).
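For reference, this entire chain is kicked off by a single call, so a tiny driver script is enough to reproduce the preprocessing step interactively. The sketch below assumes `plasma-python` is installed and that `examples/conf.yaml` has been edited as described above; the `from plasma.conf import conf` import is an assumption based on how the bundled example scripts load their configuration:

```python
# Minimal preprocessing driver (a sketch of the call chain in Steps 1-4).
# Assumes plasma-python is installed and conf.yaml points at your data.
from plasma.conf import conf
from plasma.preprocessor.preprocess import guarantee_preprocessed

if __name__ == "__main__":
    # Step 1: builds a Preprocessor and checks whether preprocessing already ran.
    # Steps 2-4: cleans the shot lists, then preprocesses all shots in parallel
    # using a Pathos multiprocessing pool (no MPI is involved).
    guarantee_preprocessed(conf)
```

Because the parallelism here is node-local multiprocessing, this is also where the `use_cores` cap discussed earlier takes effect.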
### Possible Issues

From the preprocessing output, it is clear that the *Signal Radiated Power Core* data was not downloaded correctly. According to the `data/shot_lists/signals.py` file, the data *should* be in `/lus/theta-fs0/projects/fusiondl_aesp/<username>/signal_data/jet/ppf/bolo/kb5h/channel14`. However, the only subdirectory of `~/jet/ppf/` is `~/jet/ppf/efit`.

Another possible issue is that the `data/shot_lists/signals.py` file specifies the **name** of the directory containing the *Radiated Power* data incorrectly (*I THINK*). Instead of the following line:

`pradtot = Signal("Radiated Power",['jpf/db/b5r-ptot>out'],[jet])`

We might need this:

`pradtot = Signal("Radiated Power",['jpf/db/b5r-ptot\>out'],[jet])`

The issue has to do with the `>` character in the directory name (without a `\` escape character, Python may be looking in the wrong path). **NOTE: I need to confirm that there is actually an issue with the way the code is actually using the string.** (In a plain Python string literal, `>` needs no escaping and `'\>'` is just a backslash followed by `>`, so an escape should only matter if the path is later passed through a shell or some other parser.)

## Singularity/Docker Notes

Recall that the data preprocessing step was PAINFULLY slow on Theta, and so I decided to use Cooley. To simplify the process of using Cooley, I created a Docker image with the necessary environment. **Personal Note:** I performed this work on my local machine (Mac) in `/Users/rzamora/container-recipes`.

In order to use a Docker image within a Singularity container (required on ALCF machines), it is useful to build the image on your local machine and push it to Docker Hub:

**Step 1:** Install Docker if you don't have it. [Docker-Mac](https://www.docker.com/docker-mac) works well for Mac.

**Step 2:** Build a Docker image using the recipe discussed below.

```
export IMAGENAME="test_image"
export RECIPENAME="Docker.centos7-cuda-tf1.12.0"
docker build -t $IMAGENAME -f $RECIPENAME .
```

You can check that the image is functional by starting an interactive shell session and checking that the necessary Python modules are available. For example (using `-it` for an interactive session):

```
docker run --rm -it -v $PWD:/tmp -w /tmp $IMAGENAME:latest bash
# python -c "import keras; import plasma; print(plasma.__file__)"
```

Note that the `plasma-python` source code will be located in `/root/plasma-python/` for the recipe described below.

**Step 3:** Push the image to [Docker Hub](https://hub.docker.com/).

Using your Docker Hub username:

```
docker login --username=<username>
```

Then, "tag" the image using the `IMAGE ID` value displayed by `docker image ls`:

```
docker tag <IMAGE ID> <username>/<repository>:<tag>