Commit

Merge pull request #117 from ctlearn-project/dl1dh
Integrate DL1DataHandler
aribrill authored Jun 23, 2019
2 parents 60335ad + 3b50d61 commit 258c1aa
Showing 44 changed files with 875 additions and 4,421 deletions.
2 changes: 0 additions & 2 deletions MANIFEST.in
@@ -1,3 +1 @@
include ctlearn/pixel_pos_files/*.npy
include ctlearn/pixel_pos_files/*.fits
include ctlearn/default_models/*
90 changes: 43 additions & 47 deletions README.md
@@ -2,19 +2,22 @@

[![Build Status](https://travis-ci.com/ctlearn-project/ctlearn.svg?branch=master)](https://travis-ci.com/ctlearn-project/ctlearn)

![Validation Accuracy](images/CTLearnTextCTinBox_WhiteBkgd.png)
![CTLearn Logo](images/CTLearnTextCTinBox_WhiteBkgd.png)

CTLearn is a package under active development to run deep learning models to analyze data from all major current and future arrays of Imaging Atmospheric Cherenkov Telescopes (IACTs). CTLearn v0.3.0 can load data from [CTA](https://www.cta-observatory.org/) (Cherenkov Telescope Array), [FACT](https://www.isdc.unige.ch/fact/), [H.E.S.S.](https://www.mpi-hd.mpg.de/hfm/HESS/), [MAGIC](https://magic.mpp.mpg.de/), and [VERITAS](https://veritas.sao.arizona.edu/) telescopes processed using [DL1DataHandler v0.6.0](https://github.com/cta-observatory/dl1-data-handler).
CTLearn is a package under active development to run deep learning models to analyze data from all major current and future arrays of Imaging Atmospheric Cherenkov Telescopes (IACTs). CTLearn v0.3.0 can load data from [CTA](https://www.cta-observatory.org/) (Cherenkov Telescope Array), [FACT](https://www.isdc.unige.ch/fact/), [H.E.S.S.](https://www.mpi-hd.mpg.de/hfm/HESS/), [MAGIC](https://magic.mpp.mpg.de/), and [VERITAS](https://veritas.sao.arizona.edu/) telescopes processed using [DL1DataHandler v0.7.3+](https://github.com/cta-observatory/dl1-data-handler).

## Install CTLearn

### Clone Repository with Git

Clone the CTLearn repository:
Clone the CTLearn and DL1-Data-Handler repositories:

```bash
cd </installation/path>
cd </ctlearn/installation/path>
git clone https://github.com/ctlearn-project/ctlearn.git

cd </dl1-data-handler/installation/path>
git clone https://github.com/cta-observatory/dl1-data-handler.git
```

### Install Package with Anaconda
@@ -27,33 +30,37 @@ conda env create -f </installation/path>/ctlearn/environment-<MODE>.yml

where `<MODE>` is either 'cpu' or 'gpu' (for Linux systems) or 'macos' (for macOS systems), denoting the TensorFlow version to be installed. If installing the GPU version of TensorFlow, verify that your system fulfills all the requirements [here](https://www.tensorflow.org/install/install_linux#NVIDIARequirements). Note that there is no GPU-enabled TensorFlow version for macOS yet.

Finally, install CTLearn into the new conda environment with pip:
Finally, install DL1-Data-Handler and CTLearn into the new conda environment with pip:

```bash
source activate ctlearn
cd </installation/path>/ctlearn

cd </dl1-data-handler/installation/path>/dl1-data-handler
pip install --upgrade .

cd </ctlearn/installation/path>/ctlearn
pip install --upgrade .
```
NOTE for developers: If you wish to fork/clone the respository and make changes to any of the ctlearn modules, the package must be reinstalled for the changes to take effect.

The following error message due to incompatibilities between dependencies is expected and can be ignored: "ERROR: ctapipe unknown has requirement eventio==0.11.0, but you'll have eventio 0.21.2 which is incompatible."

NOTE for developers: If you wish to fork/clone the repository and edit the code, either install in editable mode with `pip install -e .` or reinstall the package after making changes so that they take effect.

### Dependencies

- Python 3.7.3
- TensorFlow 1.13.1
- DL1DataHandler
- NumPy
- AstroPy
- OpenCV
- PyTables
- PyYAML
- SciPy
- Libraries used only in plotting scripts (optional)
- Matplotlib
- Pillow
- Pandas
- Scikit-learn

## Download Data

CTLearn can load and process data in the HDF5 PyTables format produced from simtel files by [DL1DataHandler](https://github.com/cta-observatory/dl1-data-handler). Instructions for how to download CTA Prod3b data processed into this format are available on the [CTA internal wiki](https://forge.in2p3.fr/projects/cta_analysis-and-simulations/wiki/Machine_Learning_for_Event_Reconstruction#Common-datasets).
CTLearn can load and process data in the HDF5 PyTables format produced from simtel files by [DL1DataHandler](https://github.com/cta-observatory/dl1-data-handler).

## Configure a Run

@@ -65,32 +72,38 @@ Specify model directory to store TensorFlow checkpoints and summaries, a timesta

### Data

Describe the data to use, including the format, list of file paths, and whether to apply preprocessing. Includes subsections for **Loading** for parameters for selecting data such as the telescope type and pre-selection cuts to apply, **Processing** for data preprocessing settings such as cropping or normalization, and **Input** for parameters of the TensorFlow Estimator input function that converts the loaded, processed data into tensors.

Data may be loaded in two ways, either event-wise in `array` mode which yields data from all telescopes in a specified array as well as auxiliary information including each telescope's position, or one image at a time in `single_tel` mode. In `array` mode, data from either a single telescope type or multiple telescope types may be loaded.

By default, each input image has a single channel indicating integrated pulse intensity per pixel.
If the option `use_peak_times` is set to `True`, an additional channel with peak pulse arrival times per pixel will be loaded.
Describe the dataset to use and relevant settings for loading and processing it. The parameters in this section are used to initialize a DL1DataReader, which loads the data files, maps the images from vectors to arrays, applies preprocessing, and returns the data as an iterator. Data can be loaded in three modes:
- Mono: single images of one telescope type
- Stereo: events of one telescope type
- Multi-stereo: events including multiple telescope types
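
The difference between the three modes can be illustrated with a small, framework-free sketch. This is not DL1DataReader code: the record layout `(event_id, tel_type, image)` and the function names `mono_examples` and `stereo_examples` are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (event_id, tel_type, image).
# Real images are pixel arrays; short lists stand in for them here.
records = [
    (1, "LST", [0.1, 0.2]),
    (1, "LST", [0.3, 0.4]),
    (1, "MST", [0.5, 0.6]),
    (2, "LST", [0.7, 0.8]),
]


def mono_examples(records, tel_type):
    """Mono: every image of one telescope type is its own example."""
    return [image for _, t, image in records if t == tel_type]


def stereo_examples(records, tel_types):
    """Stereo (one type) or multi-stereo (several types).

    Returns one example per event, holding the images of the selected
    telescope types grouped by type.
    """
    events = defaultdict(lambda: defaultdict(list))
    for event_id, t, image in records:
        if t in tel_types:
            events[event_id][t].append(image)
    return {event_id: dict(by_type) for event_id, by_type in events.items()}


mono = mono_examples(records, "LST")
stereo = stereo_examples(records, ["LST"])
multi_stereo = stereo_examples(records, ["LST", "MST"])
```

In this toy layout, mono mode yields three separate LST images, while the stereo modes yield one example per event.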

### Image Mapping
Parameters in this section include telescope IDs to select, auxiliary parameters to return, pre-selection cuts, image mapping settings, and preprocessing to apply to the data. Image mapping is performed by the DL1DataReader and maps the 1D pixel vectors in the raw data into 2D images. The available mapping methods are oversampling, nearest interpolation, rebinning, bilinear interpolation, bicubic interpolation, image shifting, and axial addressing.
Preprocessing is performed using the DL1DataHandler Transform class.
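
As a rough illustration of what mapping a 1D pixel vector into a 2D image means, the sketch below places each pixel value at a precomputed (row, column) position. It is a simplified stand-in, not the DL1DataReader implementation: the function name `map_image` and the lookup-table layout are assumptions, and it only covers pixels that already sit on a square grid. The hexagonal-grid methods listed above additionally resample or interpolate, which is not shown.

```python
def map_image(pixel_values, pixel_positions, shape, fill=0.0):
    """Place each value of a 1D pixel vector at its (row, col) in a 2D image.

    pixel_positions[i] is the hypothetical (row, col) of pixel i.
    Cells not covered by any pixel keep the fill value (padding).
    """
    rows, cols = shape
    image = [[fill] * cols for _ in range(rows)]
    for value, (row, col) in zip(pixel_values, pixel_positions):
        image[row][col] = value
    return image


# Five pixels arranged as a plus sign inside a 3x3 image.
pixel_positions = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)]
pixel_values = [5.0, 1.0, 9.0, 2.0, 4.0]
image = map_image(pixel_values, pixel_positions, shape=(3, 3))
```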

Set parameters for mapping the 1D pixel vectors in the raw data into 2D images, including the hexagonal grid conversion algorithm to use and how much padding to apply. The available hexagonal conversion algorithms are oversampling, nearest interpolation, rebinning, bilinear interpolation and bicubic interpolation, image shifting, and axial addressing.
### Input
Set parameters of the TensorFlow Estimator input function that converts the loaded, processed data into tensors.

### Model

CTLearn works with any TensorFlow model obeying the signature `logits = model(features, params, training)` where `logits` is a vector of raw (non-normalized, pre-Softmax) predictions, `features` is a dictionary of tensors, `params` is a dictionary of training parameters and dataset metadata, and `training` is a Boolean that's True in training mode and False in testing mode. Since models in CTLearn v0.2.0 return only a single logits vector, they can perform only one classification task (e.g. gamma/hadron classification).
CTLearn works with any TensorFlow model obeying the signature `logits = model(features, params, example_description, training)` where `logits` is a vector of raw (non-normalized, pre-Softmax) predictions, `features` is a dictionary of tensors, `params` is a dictionary of model parameters, `example_description` is a DL1DataReader example description, and `training` is a Boolean that's True in training mode and False in testing mode.
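
To make the shape of this signature concrete, here is a toy stand-in written in plain Python so that it runs without TensorFlow. A real CTLearn model operates on tensors and builds trainable layers. The feature name `"image"` and the parameter name `"num_classes"` are hypothetical and chosen only for this sketch.

```python
def toy_model(features, params, example_description, training):
    """Return one raw score (logit) per class for a single example."""
    image = features["image"]            # hypothetical feature name
    num_classes = params["num_classes"]  # hypothetical parameter name
    # example_description and training are accepted to match the signature.
    # A real model would use them, e.g. to shape its inputs or to toggle
    # dropout; this toy ignores both.
    mean_intensity = sum(image) / len(image)
    # Fixed linear function of the mean intensity instead of trained layers.
    return [mean_intensity * (k + 1) for k in range(num_classes)]


logits = toy_model(
    features={"image": [1.0, 2.0, 3.0]},
    params={"num_classes": 2},
    example_description=None,
    training=False,
)
```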

Provide in this section the directory containing a Python file that implements the model and the module name (that is, the file name minus the .py extension) and name of the model function within the module. Everything in the **Model Parameters** section is directly included in the model `params`, so arbitrary configuration parameters may be passed to the provided model.
To use a custom model, provide in this section the directory containing a Python file that implements the model and the module name (that is, the file name minus the .py extension) and name of the model function within the module.
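
One way a function could be located from a directory, a module name, and a function name is sketched below with the standard library's `importlib`. This is an assumption about the general technique rather than a copy of CTLearn's loading code. The helper `load_model_function` and the file name `my_model_example.py` are made up for the example.

```python
import importlib
import os
import sys
import tempfile


def load_model_function(model_directory, module_name, function_name):
    """Import module_name from model_directory and return the named function."""
    if model_directory not in sys.path:
        sys.path.append(model_directory)
    importlib.invalidate_caches()  # pick up files created after startup
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# Demonstration with a throwaway module written to a temporary directory.
with tempfile.TemporaryDirectory() as model_directory:
    path = os.path.join(model_directory, "my_model_example.py")
    with open(path, "w") as f:
        f.write(
            "def my_model_fn(features, params, example_description, training):\n"
            "    return [0.0, 1.0]\n"
        )
    model_fn = load_model_function(
        model_directory, "my_model_example", "my_model_fn"
    )
    logits = model_fn({}, {}, None, False)
```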

In addition, CTLearn includes three [models](models) for gamma/hadron classification. CNN-RNN and Variable Input Network perform array-level classification by feeding the output of a CNN for each telescope into either a recurrent network, or a convolutional or fully-connected network head, respectively. Single Tel classifies single telescope images using a convolutional network. All three models are built on a simple, configurable convolutional network called Basic.

This section also specifies which values in the data to use as labels and, where applicable, the lists of class names.

### Model Parameters

This section in its entirety is directly included as the model `params`, enabling arbitrary configuration parameters to be passed to the provided model.

### Training

Set training parameters such as the number of validations to run and how often to evaluate on the validation set, as well as, in the **Hyperparameters** section, hyperparameters including the base learning rate and optimizer.
Set training parameters such as the training/validation split, the number of validations to run, and how often to evaluate on the validation set, as well as hyperparameters including the base learning rate and optimizer.
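
For illustration, a training/validation split over example indices could look like the sketch below. The parameter name `validation_split` and the shuffling strategy are assumptions for this example. The actual configuration keys and splitting logic should be taken from the CTLearn example configuration files.

```python
import random


def split_indices(num_examples, validation_split, seed=0):
    """Shuffle example indices and split them into training and validation.

    validation_split is the fraction of examples held out for validation.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    num_validation = int(round(num_examples * validation_split))
    return indices[num_validation:], indices[:num_validation]


train_indices, validation_indices = split_indices(10, validation_split=0.2)
```

A fixed seed keeps the split identical across runs, so validation metrics from different runs stay comparable.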

### Prediction

Specify prediction settings such as the path to write the prediction file.
Specify prediction settings such as the path to write the prediction file and whether to save the labels and example identifiers along with the predictions.

### TensorFlow

@@ -104,7 +117,7 @@ Run CTLearn from the command line:
CTLEARN_DIR=</installation/path>/ctlearn/ctlearn
python $CTLEARN_DIR/run_model.py myconfig.yml [--mode <MODE>] [--debug] [--log_to_file]
```
`--mode <MODE>`: Set run mode with `<MODE>` either `train` or `predict`. If not set, defaults to `train`.
`--mode <MODE>`: Set run mode with `<MODE>` as `train`, `predict`, or `load_only`. If not set, defaults to `train`.

`--debug`: Set logging level to DEBUG.

@@ -127,36 +140,19 @@ View training progress in real time with TensorBoard:
tensorboard --logdir=/path/to/my/model_dir
```

## Classes

**DataLoader and HDF5DataLoader** Load a set of IACT data and provide a generator yielding NumPy arrays of examples (data and labels) as well as additional information about the dataset. HDF5DataLoader is the specifc implementation of the abstract base class DataLoader for the DL1DataHandler v0.6.0 HDF5 format. Because it's prohibitive to store a large dataset in memory, HDF5DataLoader instead provides a method `get_example_generators()` that returns functions returning generators that yield example identifiers (run number, event number, and, in `single_tel` mode, tel id) as well as the class weights, and methods `get_example()` and `get_image()` to map these identifiers to examples of data and labels and to telescope images. HDF5DataLoader also provides methods `get_metadata()` and `get_auxiliary_data()` that return dictionaries of additional information about the dataset. A DataProcessor provided either at initialization or using the method `add_data_processor()` applies preprocessing to the examples and an ImageMapper provided at initialization maps the images.
## Inspect Data

**DataProcessor** Preprocess IACT data. DataProcessor has a method `process_example()` that accepts an example of a list of NumPy arrays of data and an integer label along with the telescope type and returns preprocessed data in the same format, and a method `get_metadata()` that returns a dictionary of information about the processed data. A DataProcessor with no options set leaves the example unchanged. Preprocessing methods implemented in CTLearn v0.2.0 include cropping an image about the shower centroid and applying logarithmic normalization.
Print dataset statistics only, without running a model:

**ImageMapper** Map vectors of pixel values (as stored in the raw data) to square camera images. This is done with the `map_image()` method that accepts a vector of pixel values and telescope type and returns the camera image converted to a square array. This is not a unique transformation for cameras with pixels laid out in a hexagonal grid, so the hexagonal conversion method is configurable. The implemented method are oversampling, nearest interpolation, rebinning, bilinear interpolation and bicubic interpolation. ImageMapper can convert data from all CTA telescope and camera combinations currently under development, as well as data from all IACTs (VERITAS, MAGIC, FACT, HESS-I and HESS-II.)

These classes may be used independently of the TensorFlow-based portion of CTLearn, e.g.:

```python
from ctlearn.data_loading import HDF5DataLoader

myfiles = ['myfile1.h5', 'myfile2.h5',...]
data_loader = HDF5DataLoader(myfiles)
train_generator, validation_generator, class_weights = data_loader.get_example_generators()
# Print a list of NumPy arrays of telescope data, a NumPy array of telescope position
# coordinates, and a binary label for the first example in the training set
example_identifiers = list(train_generator())[0]
print(data_loader.get_example(*example_identifiers))
```bash
python $CTLEARN_DIR/run_model.py myconfig.yml --mode load_only
```

## Supplementary Scripts

- **plot_classifier_values.py** Plot a histogram of gamma/hadron classification values from a CTLearn predictions file.
- **plot_roc_curves.py** Plot gamma/hadron classification ROC curves from a list of CTLearn predictions files.
- **plot_camera_image.py** Plot all cameras for all hexagonal conversion methods with dummy data.
- **print_dataset_metadata.py** Print metadata for a list of ImageExtractor HDF5 files using HDF5DataLoader.
- **run_multiple_configurations.py** Generate a list of configuration combinations and run a model for each, for example, to conduct a hyperparameter search or to automate training or prediction for a set of models. Parses a standard CTLearn configuration file with two additional sections for Multiple Configurations added. Has an option to resume from a specific run in case the execution is interrupted.
- **visualize_bounding_boxes.py** Plot IACT images with overlaid bounding boxes using DataProcessor's crop method. Useful for manually tuning cropping and cleaning parameters.
- **auto_configuration.py** Fill the path information specific to your computer and run this script from a folder with any number of configuration files to automatically overwrite them.
- **summarize_results.py** Run this script from the folder containing the `runXX` folders generated by the `run_multiple_configurations.py` script to generate a `summary.csv` file with key validation metrics after training of each run.
