Packaging #17

Merged 32 commits on Nov 30, 2023

Commits
6a43b0a
make the tool a package
malihass Nov 29, 2023
cc73612
format and ensure that tests pass
malihass Nov 29, 2023
10fc5bf
format
malihass Nov 29, 2023
35319a2
fix circular imports
malihass Nov 29, 2023
c4303e9
format
malihass Nov 29, 2023
4839bdb
add file finder functions
malihass Nov 29, 2023
a41260b
make a function to downsample from input
malihass Nov 29, 2023
cd8f955
remove logs
malihass Nov 30, 2023
3f61d6d
remove training logs by default
malihass Nov 30, 2023
1577b54
rename dangerous global vars
malihass Nov 30, 2023
837f9b0
format
malihass Nov 30, 2023
20adb00
reorganize tests
malihass Nov 30, 2023
5a29b22
rename parallel test
malihass Nov 30, 2023
d0005bc
update tutorials
malihass Nov 30, 2023
ce81342
reorganize test and update doc
malihass Nov 30, 2023
4f24704
format
malihass Nov 30, 2023
b3ec394
rename package
malihass Nov 30, 2023
9de847f
dist script
malihass Nov 30, 2023
6700830
format
malihass Nov 30, 2023
abefaea
update version
malihass Nov 30, 2023
e6bde16
make sure mluq works
malihass Nov 30, 2023
5912a90
fix readme images link
malihass Nov 30, 2023
2f49ce6
add badges
malihass Nov 30, 2023
9b36d04
add standalone tutorials and update global var names
malihass Nov 30, 2023
6f78b9c
update docs and make sure test runs
malihass Nov 30, 2023
26ab455
fix doc
malihass Nov 30, 2023
217fc3f
fix tests
malihass Nov 30, 2023
3671b92
make sure test can run
malihass Nov 30, 2023
b0c26c8
move inputs to package
malihass Nov 30, 2023
94cb11e
update version
malihass Nov 30, 2023
fcf4b8d
make sure parallel test runs
malihass Nov 30, 2023
aa63fef
update tutorials
malihass Nov 30, 2023
18 changes: 8 additions & 10 deletions .github/workflows/ci.yml
@@ -31,19 +31,17 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          python -m pip install -r requirements.txt
+          python -m pip install .
+          pip install pytest
       - name: Formatting
         run: |
           source .github/linters/formatting.sh
-          format *.py true
-          format utils true
+          format . true
           codespell
-      - name: Normalizing flow test
+      - name: Pytests
         run: |
-          python main_iterative.py -i tests/input_test
-      - name: Bins test
-        run: |
-          python main_iterative.py -i tests/input_test_bins
-      - name: Parallel normalizing flow test
-        run: |
-          mpiexec -np 2 python main_iterative.py -i tests/input_test
+          python -m pytest tests -v --disable-warnings
+      - name: Parallel test
+        run: |
+          cd tests
+          mpiexec -np 2 python main_from_input.py -i ../uips/inputs/input_test
11 changes: 11 additions & 0 deletions .gitignore
@@ -136,3 +136,14 @@ dmypy.json

# Cython debug symbols
cython_debug/

# tmp
*.swp

# data
*.npz
.DS_Store

# output
Figures
TrainingLog*
108 changes: 80 additions & 28 deletions README.md
@@ -1,50 +1,94 @@
# Phase-space sampling of large datasets [![UIPS-CI](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml/badge.svg)](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml)
# Phase-space sampling of large datasets [![UIPS-CI](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml/badge.svg)](https://github.com/NREL/Phase-space-sampling/actions/workflows/ci.yml) [![UIPS-pypi](https://badge.fury.io/py/uips.svg)](https://badge.fury.io/py/uips)

## Installation for NREL HPC users
1. `module load openmpi/4.1.0/gcc-8.4.0`
2. `conda activate /projects/mluq/condaEnvs/uips`

## Installation for other users

### Option 1: From `conda` (recommended)
## Installation for developers

1. `conda create --name uips python=3.10`
2. `conda activate uips`
3. `pip install -r requirements.txt`
3. `pip install .`

### Option 2: From `poetry`
Test

This requires [poetry](https://python-poetry.org/docs/#installation)
1. `poetry update`
```
bash tutorials/run2D.sh
```

## Purpose
## Installation for users

The purpose of the tool is to perform a smart downselection of a large number of datapoints. Typically, large numerical simulations generate billions or even trillions of datapoints, but there may be redundancy in the dataset, which unnecessarily inflates memory and computing requirements. Here, redundancy is defined as closeness in feature space. The method is called phase-space sampling.
`pip install uips`

## Running the example without poetry
Test

`bash run2D.sh`: Example of downsampling a 2D combustion dataset. First the downsampling is performed (`mpiexec -np 4 python main_iterative.py input`). Then the loss function for each flow iteration is plotted (`python plotLoss.py input`). Finally, the samples are visualized (`python visualizeDownSampled_subplots.py input`). All figures are saved under the folder `Figures`.
```
import os

import numpy as np
from prettyPlot.parser import parse_input_file
from prettyPlot.plotting import plt, pretty_labels, pretty_legend

from uips import UIPS_INPUT_DIR
from uips.wrapper import downsample_dataset_from_input
import uips.utils.parallel as par

ndat = int(1e5)
nSampl = 100
ds = np.random.multivariate_normal([0, 0], np.eye(2), size=ndat)
np.save("norm.npy", ds)

inpt = parse_input_file(os.path.join(UIPS_INPUT_DIR, "input2D"))
inpt["dataFile"] = "norm.npy"
inpt["nWorkingData"] = f"{ndat} {ndat}"
inpt["nEpochs"] = f"5 20"
inpt["nSamples"] = f"{nSampl}"

best_files = downsample_dataset_from_input(inpt)


if par.irank == par.iroot:
downsampled_ds = {}
for nsamp in best_files:
downsampled_ds[nsamp] = np.load(best_files[nsamp])["data"]
fig = plt.figure()
plt.plot(ds[:, 0], ds[:, 1], "o", color="k", label="full DS")
plt.plot(
downsampled_ds[nSampl][:, 0],
downsampled_ds[nSampl][:, 1],
"o",
color="r",
label="downsampled",
)
pretty_labels("", "", 14)
pretty_legend()
plt.savefig("normal_downsample.png")
```
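If the snippet above is saved to a file (say `downsample_normal.py`, a name chosen here purely for illustration), it should run either serially with `python downsample_normal.py` or under MPI with `mpiexec -np 4 python downsample_normal.py`; the `par.irank == par.iroot` guard keeps the plotting on the root rank only.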

## Running the example with poetry
## Purpose

Add `poetry run` before `python ...`
The purpose of the tool is to perform a smart downselection of a large number of datapoints. Typically, large numerical simulations generate billions or even trillions of datapoints, but there may be redundancy in the dataset, which unnecessarily inflates memory and computing requirements. Here, redundancy is defined as closeness in feature space. The method is called phase-space sampling.

## Running the example

`bash tutorials/run2D.sh`: Example of downsampling a 2D combustion dataset. First the downsampling is performed (`mpiexec -np 4 python tests/main_from_input.py -i inputs/input2D`). Then the loss function for each flow iteration is plotted (`python postProcess/plotLoss.py -i inputs/input2D`). Finally, the samples are visualized (`python postProcess/visualizeDownSampled_subplots.py -i inputs/input2D`). All figures are saved under the folder `Figures`.

## Parallelization

The code is GPU+MPI-parallelized: (a) the dataset is loaded and shuffled in parallel, (b) the probability evaluation (the most expensive step) is done in parallel, (c) the downsampling is done in parallel, and (d) only the training is offloaded to a GPU, if available. Memory usage of the root processor is higher than that of the other processors, since it alone is in charge of the normalizing flow training and the sampling probability adjustment. To run the code in parallel, use `mpiexec -np num_procs python main_iterative.py input`.
The code is GPU+MPI-parallelized: (a) the dataset is loaded and shuffled in parallel, (b) the probability evaluation (the most expensive step) is done in parallel, (c) the downsampling is done in parallel, and (d) only the training is offloaded to a GPU, if available. Memory usage of the root processor is higher than that of the other processors, since it alone is in charge of the normalizing flow training and the sampling probability adjustment. To run the code in parallel, use `mpiexec -np num_procs python tests/main_from_input.py -i inputs/input2D`.

In the code, arrays with suffix `_` denote data distributed over the processors.
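As a minimal sketch of this convention (assuming an mpi4py-style rank/size split; the package's own helpers in `uips.utils.parallel` may differ in detail), each rank would own one contiguous slice of the full dataset:

```
# Sketch of the underscore-suffix convention: each MPI rank holds only its
# own slice of the full (ndat, d) dataset. Assumes mpi4py; illustrative only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

ndat = 10_000  # total number of datapoints (illustrative)
bounds = np.linspace(0, ndat, nprocs + 1).astype(int)
start, end = bounds[rank], bounds[rank + 1]

# ds_ : the locally owned rows of the dataset (here just random 2D data)
ds_ = np.random.normal(size=(end - start, 2))
print(f"rank {rank}: rows {start}:{end} -> ds_.shape = {ds_.shape}")
```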

The computation of nearest-neighbor distances is parallelized using the sklearn implementation. It will be accelerated on systems where hyperthreading is enabled (your laptop, but NOT the Eagle HPC).

When using GPU+MPI-parallelism on Eagle, you need to specify the number of MPI tasks (`srun -n 36 python main_iterative.py`)
When using MPI-parallelism alone on Eagle, you do not need to specify the number of MPI tasks (`srun python main_iterative.py`)
When using GPU+MPI-parallelism on Eagle, you need to specify the number of MPI tasks (`srun -n 36 python tests/main_from_input.py`)
When using MPI-parallelism alone on Eagle, you do not need to specify the number of MPI tasks (`srun python tests/main_from_input.py`)

Running on a GPU only accelerates execution by ~30% for the examples provided here. Running with many MPI tasks linearly decreases the execution time for the probability evaluation, as well as the per-core memory requirements.

Parallelization tested with up to 36 cores on Eagle.

Parallelization tested with up to 4 cores on MacOS Catalina v10.15.7.
Parallelization tested with up to 4 cores on MacOS Monterey v12.7.1.

## Data

@@ -56,9 +100,9 @@ The dataset to downsample has size $N \times d$ where $N \gg d$. The first dimen
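As a minimal sketch (mirroring the quick-start snippet above; the file name and feature count below are illustrative), the dataset is simply an $(N, d)$ NumPy array saved to disk and referenced by the `dataFile` key of the input file:

```
# Sketch: prepare an (N, d) dataset file for downsampling.
# "my_dataset.npy" and d = 3 are illustrative choices.
import numpy as np

N, d = 1_000_000, 3
data = np.random.normal(size=(N, d)).astype("float32")  # N datapoints, d features
np.save("my_dataset.npy", data)  # then point dataFile to my_dataset.npy
```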

## Hyperparameters

All hyperparameters can be controlled via an input file (see `run2D.sh`).
All hyperparameters can be controlled via an input file (see `tutorials/run2D.sh`).
We recommend fixing the number of flow calculation iterations to 2.
When increasing the number of dimensions, we recommend adjusting the hyperparameters. A 2-dimensional example (`input`) and an 11-dimensional (`highdim/input11D`) example are provided to guide the user.
When increasing the number of dimensions, we recommend adjusting the hyperparameters. A 2-dimensional example (`inputs/input2D`) and an 11-dimensional (`inputs/highdim/input11D`) example are provided to guide the user.
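For reference, the same hyperparameters can also be overridden programmatically, as in the quick-start snippet above; a minimal sketch, reusing only the keys that appear in that snippet (the full set of options is defined by the shipped input files):

```
# Sketch: load a packaged input file and override a few hyperparameters.
# Only keys shown in the README quick-start are used; the meanings of the
# two-value entries (one per flow iteration) are assumed from that example.
import os

from prettyPlot.parser import parse_input_file

from uips import UIPS_INPUT_DIR
from uips.wrapper import downsample_dataset_from_input

inpt = parse_input_file(os.path.join(UIPS_INPUT_DIR, "input2D"))
inpt["dataFile"] = "my_dataset.npy"     # path to an (N, d) array on disk
inpt["nWorkingData"] = "100000 100000"  # datapoints used at each flow iteration
inpt["nEpochs"] = "5 20"                # training epochs for the 2 flow iterations
inpt["nSamples"] = "1000"               # size of the downselected dataset

best_files = downsample_dataset_from_input(inpt)  # {n_samples: path to .npz}
```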

## Sanity checks

@@ -72,39 +116,47 @@ The computational cost associated with the nearest neighbor computations scales

During training of the normalizing flow, the negative log likelihood is displayed. The user should ensure that the normalizing flow has learned something useful about the distribution by checking that the loss is close to convergence. The log of the loss is written to a csv file in the folder `TrainingLog`. The loss of the second training iteration should be higher than that of the first iteration. If this is not the case, or if more iterations are needed, the trained normalizing flow may need to be better converged. A warning message will be issued in that case.

A script is provided to visualize the losses. Execute `python plotLoss.py input` where `input` is the name of the input file used to perform the downsampling.
A script is provided to visualize the losses. Execute `python plotLoss.py -i inputs/input2D` where `input2D` is the name of the input file used to perform the downsampling.

## Example 2D

Suppose one wants to downsample a dataset where $N=10^7$ and $d=2$. First, the code estimates the probability map of the data in order to identify where redundant data points are located. An example dataset (left) and associated probability map (right) are shown below

<p float="left">
<img src="readmeImages/fulldataset.png" width="350"/>
<img src="readmeImages/probabilityMap.png" width="350"/>
<img src="documentation/readmeImages/fulldataset.png" width="350"/>
<img src="documentation/readmeImages/probabilityMap.png" width="350"/>
</p>

Next, the code uses the probability map to define a sampling probability that downselects samples which uniformly span the feature space. The probability map is obtained by training a Neural Spline Flow, whose implementation was obtained from the [Neural Spline Flow repository](https://github.com/bayesiains/nsf). The number of samples in the final dataset can be controlled via the input file.

<p float="left">
<img src="readmeImages/103_phaseSpaceSampling.png" width="350"/>
<img src="readmeImages/104_phaseSpaceSampling.png" width="350"/>
<img src="documentation/readmeImages/103_uips.png" width="350"/>
<img src="documentation/readmeImages/104_uips.png" width="350"/>
</p>
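The underlying idea, not necessarily the package's exact implementation, is that once a probability estimate $p(x)$ is available for every datapoint, drawing samples with probability proportional to $1/p(x)$ flattens the selected points across feature space. A self-contained sketch, with a crude histogram density standing in for the normalizing-flow probability map:

```
# Illustrative inverse-probability downselection (principle only; the
# package's flow-based density estimate and selection criterion may differ).
import numpy as np

rng = np.random.default_rng(0)
ds = rng.multivariate_normal([0, 0], np.eye(2), size=100_000)  # (N, d) dataset

# Crude histogram density as a stand-in for the learned probability map
hist, xedges, yedges = np.histogram2d(ds[:, 0], ds[:, 1], bins=50, density=True)
ix = np.clip(np.searchsorted(xedges, ds[:, 0]) - 1, 0, 49)
iy = np.clip(np.searchsorted(yedges, ds[:, 1]) - 1, 0, 49)
p = hist[ix, iy] + 1e-12

# Sampling probability ~ 1/p(x): sparse regions of feature space are favored
weights = 1.0 / p
weights /= weights.sum()
keep = rng.choice(len(ds), size=1000, replace=False, p=weights)
downsampled = ds[keep]
```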

For comparison, a random sampling gives the following result

<p float="left">
<img src="readmeImages/103_randomSampling.png" width="350"/>
<img src="readmeImages/104_randomSampling.png" width="350"/>
<img src="documentation/readmeImages/103_randomSampling.png" width="350"/>
<img src="documentation/readmeImages/104_randomSampling.png" width="350"/>
</p>

## Example 11D

Input file is provided in `highdim/input11D`
Input file is provided in `inputs/highdim/input11D`

## Data efficient ML

The folder `data-efficientML` is NOT necessary for using the phase-space sampling package. It only contains the code needed to reproduce the results shown in the paper referenced below.

## Formatting [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)

Code formatting and import sorting are done automatically with `black` and `isort`.

Fix imports and format: `pip install black isort; bash fixFormat.sh`

Spelling is checked, but not automatically fixed, using `codespell`.

## Reference

[Published version (open access)](https://www.cambridge.org/core/journals/data-centric-engineering/article/uniforminphasespace-data-selection-with-iterative-normalizing-flows/E6212E3FCB5A7EE7B1399BA49667B84C)
3 changes: 0 additions & 3 deletions cutils/__init__.py

This file was deleted.

71 changes: 0 additions & 71 deletions cutils/io.py

This file was deleted.

17 changes: 0 additions & 17 deletions cutils/misc.py

This file was deleted.

2 changes: 1 addition & 1 deletion data-efficientML/artificialCase/trainGP.py
@@ -7,10 +7,10 @@
import os
import warnings

from prettyPlot.progressBar import print_progress_bar
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process.kernels import ConstantKernel as C
from sklearn.gaussian_process.kernels import WhiteKernel
from prettyPlot.progressBar import print_progress_bar


def partitionData(nData, nBatch):
9 changes: 5 additions & 4 deletions data-efficientML/artificialCase/trainNN.py
@@ -7,11 +7,12 @@
# NN Stuff
import tensorflow as tf
from myNN_better import *
from parallel import irank, iroot
from prettyPlot.progressBar import print_progress_bar
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from prettyPlot.progressBar import print_progress_bar
from parallel import irank, iroot


def partitionData(nData, nBatch):
# ~~~~ Partition the data across batches
@@ -43,7 +44,7 @@ def getPrediction(model, data):
prefix="Eval " + str(0) + " / " + str(nBatch),
suffix="Complete",
length=50,
extraCond=(irank==iroot),
extraCond=(irank == iroot),
)
for ibatch in range(nBatch):
start_ = startData_b[ibatch]
@@ -55,7 +56,7 @@
prefix="Eval " + str(ibatch + 1) + " / " + str(nBatch),
suffix="Complete",
length=50,
extraCond=(irank==iroot),
extraCond=(irank == iroot),
)

return result