Merge pull request #17 from NREL/packaging
Packaging
malihass authored Nov 30, 2023
2 parents 9f9b278 + aa63fef commit 601c2c1
Showing 129 changed files with 3,903 additions and 2,868 deletions.
18 changes: 8 additions & 10 deletions .github/workflows/ci.yml
@@ -31,19 +31,17 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          python -m pip install -r requirements.txt
+          python -m pip install .
+          pip install pytest
       - name: Formatting
         run: |
           source .github/linters/formatting.sh
-          format *.py true
-          format utils true
+          format . true
           codespell
-      - name: Normalizing flow test
+      - name: Pytests
         run: |
-          python main_iterative.py -i tests/input_test
-      - name: Bins test
+          python -m pytest tests -v --disable-warnings
+      - name: Parallel test
         run: |
-          python main_iterative.py -i tests/input_test_bins
-      - name: Parallel normalizing flow test
-        run: |
-          mpiexec -np 2 python main_iterative.py -i tests/input_test
+          cd tests
+          mpiexec -np 2 python main_from_input.py -i ../uips/inputs/input_test
11 changes: 11 additions & 0 deletions .gitignore
@@ -136,3 +136,14 @@ dmypy.json

 # Cython debug symbols
 cython_debug/
+
+# tmp
+*.swp
+
+# data
+*.npz
+.DS_Store
+
+# output
+Figures
+TrainingLog*
108 changes: 80 additions & 28 deletions README.md
@@ -1,50 +1,94 @@
# Phase-space sampling of large datasets [![UIPS-CI](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml/badge.svg)](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml)
# Phase-space sampling of large datasets [![UIPS-CI](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml/badge.svg)](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml) [![UIPS-pypi](https://badge.fury.io/py/uips.svg)](https://badge.fury.io/py/uips)

## Installation for NREL HPC users
1. `module load openmpi/4.1.0/gcc-8.4.0`
2. `conda activate /projects/mluq/condaEnvs/uips`

## Installation for other users

### Option 1: From `conda` (recommended)
## Installation for developers

1. `conda create --name uips python=3.10`
2. `conda activate uips`
3. `pip install -r requirements.txt`
3. `pip install .`

### Option 2: From `poetry`
Test the installation:

This requires [poetry](https://python-poetry.org/docs/#installation)
1. `poetry update`
```
bash tutorials/run2D.sh
```

## Purpose
## Installation for users

The purpose of the tool is to perform a smart downselection of a large number of datapoints. Typically, large numerical simulations generate billions, or even trillions, of datapoints. However, there may be redundancy in the dataset, which unnecessarily inflates the memory and computing requirements. Here, redundancy is defined as closeness in feature space. The method is called phase-space sampling.
`pip install uips`

## Running the example without poetry
Test the installation:

`bash run2D.sh`: Example of downsampling a 2D combustion dataset. First the downsampling is performed (`mpiexec -np 4 python main_iterative.py input`). Then the loss function for each flow iteration is plotted (`python plotLoss.py input`). Finally, the samples are visualized (`python visualizeDownSampled_subplots.py input`). All figures are saved under the folder `Figures`.
```
import os

import numpy as np
from prettyPlot.parser import parse_input_file
from prettyPlot.plotting import plt, pretty_labels, pretty_legend

from uips import UIPS_INPUT_DIR
from uips.wrapper import downsample_dataset_from_input
import uips.utils.parallel as par

# Generate a synthetic 2D Gaussian dataset and save it to disk.
ndat = int(1e5)
nSampl = 100
ds = np.random.multivariate_normal([0, 0], np.eye(2), size=ndat)
np.save("norm.npy", ds)

# Start from the provided 2D input file and override a few entries.
inpt = parse_input_file(os.path.join(UIPS_INPUT_DIR, "input2D"))
inpt["dataFile"] = "norm.npy"
inpt["nWorkingData"] = f"{ndat} {ndat}"
inpt["nEpochs"] = "5 20"
inpt["nSamples"] = f"{nSampl}"

# Downsample; best_files maps each requested sample size to an .npz file.
best_files = downsample_dataset_from_input(inpt)

# Only the root rank loads and plots the downsampled result.
if par.irank == par.iroot:
    downsampled_ds = {}
    for nsamp in best_files:
        downsampled_ds[nsamp] = np.load(best_files[nsamp])["data"]
    fig = plt.figure()
    plt.plot(ds[:, 0], ds[:, 1], "o", color="k", label="full DS")
    plt.plot(
        downsampled_ds[nSampl][:, 0],
        downsampled_ds[nSampl][:, 1],
        "o",
        color="r",
        label="downsampled",
    )
    pretty_labels("", "", 14)
    pretty_legend()
    plt.savefig("normal_downsample.png")
```

## Running the example with poetry
## Purpose

Add `poetry run` before `python ...`
The purpose of the tool is to perform a smart downselection of a large number of datapoints. Typically, large numerical simulations generate billions, or even trillions, of datapoints. However, there may be redundancy in the dataset, which unnecessarily inflates the memory and computing requirements. Here, redundancy is defined as closeness in feature space. The method is called phase-space sampling.

## Running the example

`bash tutorials/run2D.sh`: Example of downsampling a 2D combustion dataset. First the downsampling is performed (`mpiexec -np 4 python tests/main_from_input.py -i inputs/input2D`). Then the loss function for each flow iteration is plotted (`python postProcess/plotLoss.py -i inputs/input2D`). Finally, the samples are visualized (`python postProcess/visualizeDownSampled_subplots.py -i inputs/input2D`). All figures are saved under the folder `Figures`.

## Parallelization

The code is GPU+MPI-parallelized: a) the dataset is loaded and shuffled in parallel, b) the probability evaluation (the most expensive step) is done in parallel, c) the downsampling is done in parallel, and d) only the training is offloaded to a GPU, if available. Memory usage of the root processor is higher than that of the other processors, since it is the only one in charge of the normalizing flow training and the sampling-probability adjustment. To run the code in parallel, use `mpiexec -np num_procs python main_iterative.py input`.
The code is GPU+MPI-parallelized: a) the dataset is loaded and shuffled in parallel, b) the probability evaluation (the most expensive step) is done in parallel, c) the downsampling is done in parallel, and d) only the training is offloaded to a GPU, if available. Memory usage of the root processor is higher than that of the other processors, since it is the only one in charge of the normalizing flow training and the sampling-probability adjustment. To run the code in parallel, use `mpiexec -np num_procs python tests/main_from_input.py -i inputs/input2D`.

In the code, arrays with suffix `_` denote data distributed over the processors.
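
As an illustration of this convention, the minimal `mpi4py` sketch below (not the package's internal code; the toy data and the placeholder log-probability are made up) shows each rank holding its own shard while only reduced scalars travel to the root:

```
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

# Each rank owns a shard of the (toy) dataset; the "_" suffix marks rank-local arrays.
n_total = 1_000_000
n_local = n_total // nprocs + (1 if rank < n_total % nprocs else 0)
data_ = np.random.rand(n_local, 2)

# The expensive step (probability evaluation) runs independently on each shard.
logp_ = -0.5 * np.sum(data_**2, axis=1)  # placeholder for the flow's log-probability

# Only small reduced quantities are sent to the root rank.
local_sum = np.array([logp_.sum()])
global_sum = np.zeros(1)
comm.Reduce(local_sum, global_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("mean log-probability:", global_sum[0] / n_total)
```

Such a script would be launched the same way as the package scripts, e.g. `mpiexec -np 4 python <script>.py`.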

The computation of the nearest neighbor distance is parallelized using the sklearn implementation. It will be accelerated on systems where hyperthreading is enabled (your laptop, but NOT the Eagle HPC).
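
For reference, a minimal sketch of such a threaded nearest-neighbor distance computation with scikit-learn (the array and the parameter choices are illustrative, not the package's internal settings):

```
import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.random.rand(100_000, 2)  # stand-in for a downsampled dataset
# n_jobs=-1 lets scikit-learn use every available core for the query.
nn = NearestNeighbors(n_neighbors=2, n_jobs=-1).fit(data)
dist, _ = nn.kneighbors(data)
# Column 0 is each point's zero distance to itself; column 1 is the nearest neighbor.
print("mean nearest-neighbor distance:", dist[:, 1].mean())
```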

When using GPU+MPI-parallelism on Eagle, you need to specify the number of MPI tasks (`srun -n 36 python main_iterative.py`).
When using MPI-parallelism alone on Eagle, you do not need to specify the number of MPI tasks (`srun python main_iterative.py`).
When using GPU+MPI-parallelism on Eagle, you need to specify the number of MPI tasks (`srun -n 36 python tests/main_from_input.py`).
When using MPI-parallelism alone on Eagle, you do not need to specify the number of MPI tasks (`srun python tests/main_from_input.py`).

Running on a GPU only accelerates execution by ~30% for the examples provided here. Running with many MPI tasks linearly decreases the execution time for probability evaluation, as well as the per-core memory requirements.

Parallelization tested with up to 36 cores on Eagle.

Parallelization tested with up to 4 cores on MacOS Catalina v10.15.7.
Parallelization tested with up to 4 cores on MacOS Monterey v12.7.1.

## Data

@@ -56,9 +56,9 @@ The dataset to downsample has size $N \times d$ where $N \gg d$. The first dimen

## Hyperparameters

All hyperparameters can be controlled via an input file (see `run2D.sh`).
All hyperparameters can be controlled via an input file (see `tutorials/run2D.sh`).
We recommend fixing the number of flow calculation iterations to 2.
When increasing the number of dimensions, we recommend adjusting the hyperparameters. A 2-dimensional example (`input`) and an 11-dimensional (`highdim/input11D`) example are provided to guide the user.
When increasing the number of dimensions, we recommend adjusting the hyperparameters. A 2-dimensional example (`inputs/input2D`) and an 11-dimensional (`inputs/highdim/input11D`) example are provided to guide the user.
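
The hyperparameters can also be overridden programmatically before calling the wrapper, reusing the keys shown in the installation example above. This is a sketch: the location of the 11D input file under `UIPS_INPUT_DIR` and the chosen values are assumptions, not recommended settings.

```
import os

from prettyPlot.parser import parse_input_file
from uips import UIPS_INPUT_DIR
from uips.wrapper import downsample_dataset_from_input

# Start from a provided input file, then adjust a few entries in place.
inpt = parse_input_file(os.path.join(UIPS_INPUT_DIR, "highdim/input11D"))
inpt["nEpochs"] = "10 40"   # more epochs for the higher-dimensional case
inpt["nSamples"] = "1000"   # size of the downselected dataset
best_files = downsample_dataset_from_input(inpt)
```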

## Sanity checks

@@ -72,39 +72,47 @@ The computational cost associated with the nearest neighbor computations scales

During training of the normalizing flow, the negative log likelihood is displayed. The user should ensure that the normalizing flow has learned something useful about the distribution by checking that the loss is close to convergence. The log of the loss is written as a csv file in the folder `TrainingLog`. The loss of the second training iteration should be higher than that of the first iteration. If this is not the case, or if more iterations are needed, the trained normalizing flow may need to be better converged. A warning message will be issued in that case.

A script is provided to visualize the losses. Execute `python plotLoss.py input` where `input` is the name of the input file used to perform the downsampling.
A script is provided to visualize the losses. Execute `python plotLoss.py -i inputs/input2D` where `input2D` is the name of the input file used to perform the downsampling.
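
For a quick manual check, the training log can also be inspected directly. The sketch below assumes the files under `TrainingLog` are CSV files with a header row whose first column indexes the epoch and whose last column holds the loss; adapt it to the actual layout:

```
import glob

import numpy as np
from prettyPlot.plotting import plt, pretty_labels, pretty_legend

for logfile in sorted(glob.glob("TrainingLog/*.csv")):
    log = np.genfromtxt(logfile, delimiter=",", names=True)
    cols = log.dtype.names
    # Assume the first column indexes the epoch and the last column holds the loss.
    plt.plot(log[cols[0]], log[cols[-1]], label=logfile)
pretty_labels("epoch", "negative log likelihood", 14)
pretty_legend()
plt.savefig("loss_check.png")
```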

## Example 2D

Suppose one wants to downsample a dataset where $N=10^7$ and $d=2$. First, the code estimates the probability map of the data in order to identify where redundant data points are located. An example dataset (left) and the associated probability map (right) are shown below:

<p float="left">
<img src="readmeImages/fulldataset.png" width="350"/>
<img src="readmeImages/probabilityMap.png" width="350"/>
<img src="documentation/readmeImages/fulldataset.png" width="350"/>
<img src="documentation/readmeImages/probabilityMap.png" width="350"/>
</p>

Next, the code uses the probability map to define a sampling probability that downselects samples which uniformly span the feature space. The probability map is obtained by training a Neural Spline Flow, whose implementation was taken from the [Neural Spline Flow repository](https://github.com/bayesiains/nsf). The number of samples in the final dataset can be controlled via the input file.

<p float="left">
<img src="readmeImages/103_phaseSpaceSampling.png" width="350"/>
<img src="readmeImages/104_phaseSpaceSampling.png" width="350"/>
<img src="documentation/readmeImages/103_uips.png" width="350"/>
<img src="documentation/readmeImages/104_uips.png" width="350"/>
</p>

For comparison, a random sampling gives the following result

<p float="left">
<img src="readmeImages/103_randomSampling.png" width="350"/>
<img src="readmeImages/104_randomSampling.png" width="350"/>
<img src="documentation/readmeImages/103_randomSampling.png" width="350"/>
<img src="documentation/readmeImages/104_randomSampling.png" width="350"/>
</p>
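
Conceptually, the uniform-in-phase-space selection illustrated above boils down to sampling each point with a probability inversely proportional to its estimated density, so that densely populated regions are thinned out. The following is a minimal, self-contained sketch of that idea on a Gaussian cloud; it uses the exact analytical density in place of the flow's estimate and is not the package's implementation:

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.multivariate_normal([0, 0], np.eye(2), size=100_000)

# Stand-in for the flow's density estimate: the exact standard-normal density.
p = np.exp(-0.5 * np.sum(x**2, axis=1)) / (2.0 * np.pi)

# Sampling probability ~ 1 / p(x), normalized, then draw without replacement.
weights = 1.0 / np.maximum(p, 1e-12)
weights /= weights.sum()
idx = rng.choice(len(x), size=1000, replace=False, p=weights)
uniform_subset = x[idx]  # spans the feature space much more evenly than x
```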

## Example 11D

An input file is provided in `highdim/input11D`.
An input file is provided in `inputs/highdim/input11D`.

## Data efficient ML

The folder `data-efficientML` is NOT necessary for using the phase-space sampling package. It only contains the code needed to reproduce the results shown in the paper listed in the Reference section below.

## Formatting [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)

Code formatting and import sorting are done automatically with `black` and `isort`.

Fix imports and format: `pip install black isort; bash fixFormat.sh`

Spelling is checked, but not automatically fixed, using `codespell`.

## Reference

[Published version (open access)](https://www.cambridge.org/core/journals/data-centric-engineering/article/uniforminphasespace-data-selection-with-iterative-normalizing-flows/E6212E3FCB5A7EE7B1399BA49667B84C)
3 changes: 0 additions & 3 deletions cutils/__init__.py

This file was deleted.

71 changes: 0 additions & 71 deletions cutils/io.py

This file was deleted.

17 changes: 0 additions & 17 deletions cutils/misc.py

This file was deleted.

2 changes: 1 addition & 1 deletion data-efficientML/artificialCase/trainGP.py
@@ -7,10 +7,10 @@
 import os
 import warnings

+from prettyPlot.progressBar import print_progress_bar
 from sklearn.gaussian_process.kernels import RBF
 from sklearn.gaussian_process.kernels import ConstantKernel as C
 from sklearn.gaussian_process.kernels import WhiteKernel
-from prettyPlot.progressBar import print_progress_bar


 def partitionData(nData, nBatch):
9 changes: 5 additions & 4 deletions data-efficientML/artificialCase/trainNN.py
@@ -7,11 +7,12 @@
 # NN Stuff
 import tensorflow as tf
 from myNN_better import *
+from parallel import irank, iroot
+from prettyPlot.progressBar import print_progress_bar
 from sklearn.model_selection import train_test_split
 from tensorflow import keras
 from tensorflow.keras import layers, regularizers
-from prettyPlot.progressBar import print_progress_bar
-from parallel import irank, iroot

+
 def partitionData(nData, nBatch):
     # ~~~~ Partition the data across batches
@@ -43,7 +44,7 @@ def getPrediction(model, data):
         prefix="Eval " + str(0) + " / " + str(nBatch),
         suffix="Complete",
         length=50,
-        extraCond=(irank==iroot),
+        extraCond=(irank == iroot),
     )
     for ibatch in range(nBatch):
         start_ = startData_b[ibatch]
@@ -55,7 +56,7 @@ def getPrediction(model, data):
             prefix="Eval " + str(ibatch + 1) + " / " + str(nBatch),
             suffix="Complete",
             length=50,
-            extraCond=(irank==iroot),
+            extraCond=(irank == iroot),
         )

     return result