Measuring the shape and brightness of galaxies with neural networks.
Harvard University
Class: CS 109B — Advanced Topics in Data Science
Project Advisors: Douglas Finkbeiner and Jun Yin
Deliverables: code report and an oral presentation
As astronomers collect more and more image data, there is a need for further development of automated, reliable, and fast analysis methods. Our work is a proof-of-concept study of deep neural network architectures to estimate galaxy parameters from simulated data. By showing that modern data-driven approaches can succeed in this problem, we hope to open the door to future work on real galaxies, including edge cases that traditional model-fitting methods handle poorly.
We use GalSim to generate simulated galaxy images, following these steps:
- Sérsic profile: Define a galaxy's Sérsic profile parametrized by the Sérsic index and the radius that encloses half of the total flux
- Flux & shear: Add flux and shear (defined by ellipticity and orientation) to complete the galaxy definition
- PSF: Convolve the galaxy profile with the Point Spread Function, which is determined by the telescope optics and the atmosphere (for ground-based telescopes)
- Noise: Add Poisson noise (i.e. detected photoelectrons) and Gaussian noise (i.e. read noise) to the generated image
- Signal-to-noise ratio: Compute the signal-to-noise ratio (SNR) based on the pixel values and the noise level (assuming faint galaxies). Preserve only the images with SNR between 10 and 100.
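The generation steps above use GalSim, but the core ideas can be sketched in plain NumPy. The following toy version (our construction, not the project's code) renders a circular Sérsic profile without a PSF, adds the two noise sources, and computes the sky-limited SNR; the Capaccioli approximation for b_n and the faint-limit SNR formula are assumptions of this sketch:

```python
import numpy as np

def sersic_image(n, r_e, flux, size=64, scale=1.0):
    """Render a circular Sersic profile on a pixel grid (toy version, no PSF).

    Uses the Capaccioli approximation b_n ~ 2n - 1/3 (valid for 0.5 < n < 10).
    """
    b_n = 2.0 * n - 1.0 / 3.0
    y, x = np.mgrid[:size, :size]
    r = np.hypot(x - size / 2, y - size / 2) * scale
    profile = np.exp(-b_n * ((r / r_e) ** (1.0 / n) - 1.0))
    return flux * profile / profile.sum()  # normalize to the requested total flux

def add_noise(image, read_sigma, rng):
    """Poisson noise on the photoelectron counts plus Gaussian read noise."""
    noisy = rng.poisson(np.clip(image, 0, None)).astype(float)
    noisy += rng.normal(0.0, read_sigma, image.shape)
    return noisy

def snr_faint(image, sigma):
    """Faint-galaxy (sky-limited) SNR estimate: sqrt(sum I_pix^2) / sigma."""
    return np.sqrt(np.sum(image ** 2)) / sigma

rng = np.random.default_rng(0)
img = sersic_image(n=1.5, r_e=5.0, flux=5000.0)
noisy = add_noise(img, read_sigma=2.0, rng=rng)
snr = snr_faint(img, sigma=2.0)  # keep the image only if 10 <= snr <= 100
```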
Our goal is to estimate the following five parameters from image data, which are recorded alongside the generated images:
- Sérsic profile: determined by the Sérsic index and the Sérsic radius
- Galaxy flux
- Reparameterized ellipticity and orientation: g1 and g2
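The (ellipticity, orientation) pair maps to the two regression targets via the standard double-angle shear convention. A minimal sketch of the round trip (function names are ours):

```python
import math

def to_g1g2(g, beta):
    """Map ellipticity magnitude g and position angle beta (radians)
    to the two shear components g1, g2 used as regression targets."""
    return g * math.cos(2 * beta), g * math.sin(2 * beta)

def from_g1g2(g1, g2):
    """Inverse map: recover (g, beta) from the components."""
    return math.hypot(g1, g2), 0.5 * math.atan2(g2, g1)

g1, g2 = to_g1g2(0.3, math.pi / 4)
g, beta = from_g1g2(g1, g2)  # round trip recovers (0.3, pi/4)
```

The (g1, g2) parameterization avoids the angular wrap-around at beta = 0 and pi, which makes it a smoother regression target than the raw orientation.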
As part of the analysis, we created an interactive web app that laid the foundation for understanding the data. We used it throughout our project to study the relationship between galaxy parameters and the resulting images.
The project team pursued multiple approaches to model building:
An autoencoder has two potential advantages:
- First, an autoencoder learns from both the labels and the noiseless images, and thereby might incorporate more information about the underlying relationships into the network.
- Second, an autoencoder can be used not only to predict the labels but also to denoise and reconstruct the images.
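One way to realize these two advantages (a hypothetical Keras sketch, not the project's actual architecture) is a shared convolutional encoder with two heads: a decoder trained against the noiseless image and a dense regressor for the five labels:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_autoencoder(img_size=64, n_params=5):
    """Illustrative two-head autoencoder: one loss term for image
    reconstruction (denoising) and one for parameter regression."""
    inp = layers.Input((img_size, img_size, 1))
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
    latent = layers.Flatten()(x)
    # Head 1: regress the five galaxy parameters from the latent code
    h = layers.Dense(64, activation="relu")(latent)
    params = layers.Dense(n_params, name="params")(h)
    # Head 2: reconstruct the noiseless image from the encoded features
    y = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    y = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(y)
    recon = layers.Conv2D(1, 3, padding="same", name="recon")(y)
    model = tf.keras.Model(inp, [recon, params])
    model.compile(optimizer="adam", loss={"recon": "mse", "params": "mse"})
    return model
```

Training against both targets at once lets the reconstruction loss regularize the features used by the parameter head.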
Using a small subset of the data, we run AutoKeras, an AutoML tool, to quickly test vanilla CNNs, ResNets, and Xception networks with different complexities, regularization, and normalization parameters. The search is guided by Bayesian optimization with Gaussian processes.
We pick several key hyperparameters of the best Xception CNN model, expand their ranges, and evaluate the effect on a small portion of the data. This allows us to significantly reduce the model size and the training time while maintaining or even improving predictive performance.
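A minimal version of this narrowed search can be expressed as an explicit grid; the hyperparameter names and ranges below are illustrative, not the ones we actually swept:

```python
from itertools import product

# Hypothetical search space around the best AutoKeras Xception trial
space = {
    "width_multiplier": [0.5, 1.0],
    "dropout": [0.0, 0.25],
    "learning_rate": [1e-3, 1e-4],
}

def grid(space):
    """Yield one dict per hyperparameter combination (Cartesian product)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

# 2 * 2 * 2 = 8 candidate configurations, each trained on a small data subset
configs = list(grid(space))
```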
Informed by a large gap between performance metrics for noiseless and noisy data, we test a two-stage pipeline described in Madireddy (2019) that uses a separate denoising network as the first step. We follow a similar approach by implementing two state-of-the-art algorithms for image denoising and restoration:
- EDSR (Enhanced Deep Residual Networks)
- RDN (Residual Dense Networks)
We train these models on a separate dataset of simulated galaxies, then apply them to our noisy inputs and proceed with an Xception CNN model.
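The published EDSR and RDN architectures are large; the residual-learning idea behind the denoising stage can be sketched far more compactly (an illustrative Keras toy, not our trained models): the network predicts the noise residual, which is subtracted from the input before the image is passed to the downstream Xception CNN.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_denoiser(img_size=64, n_filters=32, n_blocks=3):
    """Minimal residual-learning denoiser in the spirit of EDSR/RDN
    (much smaller than the published architectures)."""
    inp = layers.Input((img_size, img_size, 1))
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(inp)
    for _ in range(n_blocks):
        skip = x
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(n_filters, 3, padding="same")(x)
        x = layers.Add()([x, skip])          # residual block with skip connection
    noise = layers.Conv2D(1, 3, padding="same")(x)
    clean = layers.Subtract()([inp, noise])  # denoised image = input - predicted noise
    model = tf.keras.Model(inp, clean)
    model.compile(optimizer="adam", loss="mse")
    return model
```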
We perform a series of comparisons between the estimated and true parameters on the test dataset, analyzing the results in the report.
The conventional baseline approach in our case is to carry out non-linear optimization over the five parameters with the noisy galaxy image as input, using the L-BFGS-B bounded optimization algorithm. The objective function takes the noisy image, iteratively generates new images with GalSim, and computes the loss as the negative log-likelihood of the image given the current iteration of parameter values.
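The same fitting loop can be illustrated with SciPy on a toy 1D model standing in for the GalSim render (the Gaussian profile, its parameters, and the noise level are our assumptions for this sketch):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 101)
sigma_noise = 0.05

def model(theta):
    """Toy stand-in for a GalSim render: a 1D Gaussian with amplitude and width."""
    amp, width = theta
    return amp * np.exp(-0.5 * (x / width) ** 2)

true = np.array([1.2, 1.5])
data = model(true) + rng.normal(0, sigma_noise, x.size)

def neg_log_like(theta):
    """Gaussian negative log-likelihood (up to a constant) for known noise sigma."""
    resid = data - model(theta)
    return 0.5 * np.sum(resid ** 2) / sigma_noise ** 2

# Bounded non-linear optimization, as in the baseline
res = minimize(neg_log_like, x0=[1.0, 1.0], method="L-BFGS-B",
               bounds=[(0.1, 5.0), (0.1, 5.0)])
```

In the real baseline each evaluation of the objective re-renders a galaxy image, so the optimizer's run time is dominated by the GalSim calls.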
Our neural network models are estimators of five fixed yet unknown parameters. To understand the smallest errors we can expect, we compare our results to the Cramér–Rao bound, the lower bound on the variance of an unbiased estimator.
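For Gaussian pixel noise the bound has a simple closed form, Var(theta_hat) >= sigma^2 / sum_i (d mu_i / d theta)^2. A toy linear model (our construction, not the project's setup) shows an efficient estimator attaining it:

```python
import numpy as np

# Toy model: mu_i = theta * t_i (a flux-like scale parameter), Gaussian noise sigma.
t = np.linspace(0.0, 1.0, 50)
sigma = 0.1
dmu_dtheta = t                                # model derivative w.r.t. theta
crb = sigma ** 2 / np.sum(dmu_dtheta ** 2)    # Cramer-Rao lower bound on Var(theta_hat)

# Empirical check: the least-squares (here also maximum-likelihood) estimator
# of theta attains the bound in this linear-Gaussian model.
rng = np.random.default_rng(2)
theta_true = 2.0
est = []
for _ in range(2000):
    y = theta_true * t + rng.normal(0, sigma, t.size)
    est.append(np.dot(t, y) / np.dot(t, t))
empirical_var = np.var(est)                   # ~ crb, up to sampling error
```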
The project is computationally intensive and requires a GPU to run the experiments. The code is built using:
- TensorFlow
- AutoKeras
- HiPlot
- GalSim
- and a standard Python data science stack including NumPy, pandas, scikit-learn, matplotlib, Jupyter Notebook.
Use the provided conda environment specification to satisfy all dependencies:
$ conda env create -f environment.yml
$ conda activate galaxies
The interactive web app additionally makes use of Streamlit and is deployed to the Streamlit Community Cloud (originally on Heroku). The corresponding Poetry environment configuration files are provided in the app directory.
.
├── EDA
│ ├── app # Interactive web app
│ └── EDA.ipynb # Exploratory data analysis
├── Project
│ ├── autoencoder*.ipynb # Autoencoder building and training
│ ├── baseline.ipynb # Baseline model and evaluation
│ ├── final_report.ipynb # Final code report
│ └── presentation.ipynb # Slides for the oral presentation
├── data
│ ├── datasets.ipynb # Description of datasets used
│ ├── gen_test.py # Test dataset generation
│ └── generator.py # Train and validation set generation
├── experiments
│ ├── autokeras # AutoML experiments and saved models
│ ├── denoising # Implementation of denoising pipelines
│ └── gridsearch # Hyperparameter optimization
├── models
│ ├── xception.ipynb # Xception CNN architecture
│ └── xception_data* # Training and saved models
└── environment.yml # Conda environment