100 days of AI/ML. The requirement of the challenge is to study or apply AI/ML each day for 100 days. Previously I have done this twice, as documented in Round 1 and Round 2. The third round started on September 3rd.
Today I worked on another way of outputting probability distributions from the neural network. The approach I have taken so far is to create a grid in the 1D output space and predict the probability in each grid cell, training with a cross-entropy loss. Instead I experimented with using a Gaussian Mixture Model (GMM) as the last layer of the network. This is not provided by PyTorch, so I needed to partially implement it myself. Below is an example of an output:
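The custom part is essentially a head that outputs mixture weights, means and widths, trained with the negative log-likelihood. Something along these lines (a sketch, not the actual code; the name `GMMHead` and the number of components are my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMHead(nn.Module):
    """Sketch of a Gaussian mixture output layer: maps features to
    mixture weights, means and widths of K one-dimensional Gaussians."""
    def __init__(self, n_features, n_components=5):
        super().__init__()
        self.weights = nn.Linear(n_features, n_components)
        self.means = nn.Linear(n_features, n_components)
        self.log_sigmas = nn.Linear(n_features, n_components)

    def forward(self, x):
        pi = F.softmax(self.weights(x), dim=-1)   # mixture weights sum to one
        mu = self.means(x)                        # component means
        sigma = torch.exp(self.log_sigmas(x))     # widths kept positive
        return pi, mu, sigma

def gmm_nll(pi, mu, sigma, y):
    """Negative log-likelihood of y under the predicted mixture."""
    log_prob = torch.distributions.Normal(mu, sigma).log_prob(y.unsqueeze(-1))
    return -torch.logsumexp(torch.log(pi) + log_prob, dim=-1).mean()
```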
The important part here is not to automatically assume this is actually a probability distribution. It could simply be a continuous line peaking somewhere around the correct answer, but without a proper probabilistic interpretation. After quite some back and forth, it looks like it works when fitting straight lines. I also ran a test on some simulated galaxies, comparing to the more traditional way of outputting probability distributions. So far it seems promising, but I need to test on real data.
Read some further blog posts about mixture density networks (MDNs) and large parts of the original paper, Bishop 1994.
Experimented more with these types of networks. The Bishop 1994 paper had one very simple example, generating 1000 data points with two variables. For certain input values, two output values are equally likely; there is simply not enough information to fully determine the output, so the posterior should be multimodal. I have polished up the notebook. Below is a reproduction of Fig. 7 in the paper.
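For reference, the toy data looks roughly like this (a sketch of the construction as I recall it from the paper, not a verbatim copy): the forward mapping is easy, but the inverse problem is multi-valued for some inputs.

```python
import numpy as np

# Sample t uniformly, compute x = t + 0.3*sin(2*pi*t) + noise, then train the
# network on the *inverse* problem (predict t from x), which is multimodal.
rng = np.random.default_rng(42)
n = 1000
t = rng.uniform(0.0, 1.0, n)
x = t + 0.3 * np.sin(2 * np.pi * t) + rng.uniform(-0.1, 0.1, n)

# inputs and targets for the inverse problem
inputs, targets = x, t
```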
The mixture density network used an exponential to keep the width of the distribution positive. Alternatively, one can use 1+ELU, which I saw elsewhere. Here ELU is the exponential linear unit: above zero it equals ReLU (the identity function), but it has a smooth transition around zero and approaches -1 towards minus infinity. I tested training the network with both approaches five times each. The result is shown below.
For this case, the ELU version converges faster. Further, I was reading up on Machine learning in astronomy while waiting.
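The two links for the width can be written as small helper functions; a minimal sketch of the comparison (my own function names):

```python
import torch
import torch.nn.functional as F

def width_exp(raw):
    # exponential link: always positive, but can explode for large activations
    return torch.exp(raw)

def width_elu(raw):
    # 1 + ELU link: equals raw + 1 above zero, decays smoothly towards 0 below
    return 1.0 + F.elu(raw)

raw = torch.linspace(-3, 3, 7)
print(width_exp(raw))
print(width_elu(raw))
```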
Implemented a mixture density network in the distance determination pipeline, so that I can compare with the traditional results. Some of the results do not make sense at all, with the network not training properly.
Today I managed to get it working. In the end I am not fully sure what made the difference. The results started to make more sense when adding a non-linear layer (ReLU) directly before the MDN; the network is quite deep and I had skipped this single ReLU. Moreover, there was a significant performance problem when evaluating the MDN on the test set. Outputting the values on a grid was extremely fast, but for some reason evaluating the MDN was rather slow. This became a serious bottleneck, since I had been evaluating the test-set metrics after each epoch. Only doing this every 50th epoch led to a significant speedup. This allowed for a sufficiently large run and evaluating the performance. The MDN now gives sensible results for a simplified test, where some of the pretraining on simulations has been removed.
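One way to keep the MDN evaluation cheap is to evaluate the whole mixture on a common grid in a single broadcasted pass. A sketch (the parameter shapes are my assumption, not the pipeline's actual layout):

```python
import torch

def mdn_pdf_on_grid(pi, mu, sigma, grid):
    """Evaluate p(z) for a batch of 1D mixtures on a common grid.

    pi, mu, sigma: (batch, K) mixture parameters
    grid:          (n_grid,) evaluation points
    returns:       (batch, n_grid) probability densities
    """
    z = grid.view(1, -1, 1)                  # (1, n_grid, 1)
    mu = mu.unsqueeze(1)                     # (batch, 1, K)
    sigma = sigma.unsqueeze(1)
    pi = pi.unsqueeze(1)
    pdf = torch.exp(torch.distributions.Normal(mu, sigma).log_prob(z))
    return (pi * pdf).sum(dim=-1)            # sum over components
```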
Modified the part of the distance estimation that pretrains on simulations to also be able to use an MDN. I managed to make some different runs. While it worked, I did not achieve better results than constructing the posterior distribution by combining many different output classes.
Wanted to look at how to input errors to the network, but without much success. Read through a blog post and some other articles.
Read through the noise2noise paper.
Read the noise2void paper. In the noise2noise paper, one does not need clean examples to denoise an image, only multiple noisy realizations of the same underlying image. This is not always possible to get. Quite interesting for one of our applications.
Managed to find some references on how to treat uncertainties in the neural network. The Lightweight probabilistic deep learning paper explains how to let the activations be probabilistic, whereas other approaches interpret the weights as Gaussians. This approach is supposed to be significantly faster and also avoids the repeated forward passes that many probabilistic methods rely on. I also attempted another implementation of the MDN network. Evaluating the probability function is now extremely fast, but the training seems to be affected. Not sure why.
Found out that the amount of dropout used is critical for getting good performance. For classification tasks, you can easily use 20-50% dropout with good results. Here I ended up using 2% in the early layers; more dropout degraded the results.
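Concretely, the kind of setup that ended up working is roughly this (layer sizes are made up, only the dropout placement reflects what is described above):

```python
import torch.nn as nn

# Only a little dropout (2%) in the early layers, none in the later ones.
model = nn.Sequential(
    nn.Linear(40, 250), nn.ReLU(), nn.Dropout(p=0.02),
    nn.Linear(250, 250), nn.ReLU(), nn.Dropout(p=0.02),
    nn.Linear(250, 250), nn.ReLU(),
    nn.Linear(250, 1),
)
```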
Watched interview with Jeremy Howard.
Continued tweaking the model and experimenting with the results. The results are looking quite close to the previous ones when considering all objects. However, the result seems to improve compared to the previous ones when only considering the best 50%.
Worked on writing an application today, which will contain a significant amount about deep learning. Is deep learning hype? I found this article and the Howard interview quite constructive. Also read on transfer learning.
Quote from the paper: "As an example relevant to ICF, researchers at the National Ignition Facility (NIF) [18] have used transfer learning to classify images of different types of damage that occur on the optics at NIF. There are not enough labeled optics images to train a network from scratch, but transfer learning with a network pre-trained on ImageNet [13] produces models which classify optics damage with over 98% accuracy."
That is quite a stretch of domain.
Watched a Yann LeCun interview.
Read about transfer learning in astronomy and some other papers.
Looked at a video and presentation about transfer learning and multi-task learning.
Looked at a panel debate about probabilistic networks.
One idea I explored earlier was using neural networks for density estimation. The earlier attempts, some semesters ago, were not working very well. Attempting again, I read through a paper on density estimation and made an implementation in the notebook.
Read through a paper from Google AI on unsupervised learning. It was voted the best paper at ICML 2019. The most interesting part was actually the style: many papers focus on achieving a minor improvement on some benchmark, while here the authors gave some proofs and a lot of tests on whether a general unsupervised separation is possible. They have a ridiculous number of pages in the appendix.
Also read the Res2NeXt paper.
Continued with the notebook from day 218, attempting to get it working. It did not function either, after attempting different things for an hour. However, later I realized at least one problem. Hopefully that will solve the issue.
Watched the Regina Barzilay: Deep Learning for Cancer Diagnosis and Treatment interview. There were some interesting points, like her thinking about medical problems and the potential for using deep learning for early cancer diagnosis.
Continued working on the reweighting, deriving some expressions on paper and implementing them. As shown below, I have had no success yet.
Read through two papers (paper 1, paper 2) on using recursive neural networks for denoising gravitational wave observations.
Watched the Francois Chollet interview, at least partially.
Looked at the Turing Lecture talk.
Next video, with Susskind, in the AGI podcast series. He is a well-known person from the physics community. A bit too many videos in a row by now.
First test of actually using the PAUS spectra. Here I used the simulations to predict the distance from a subset of the observations. Without noise, the results were better than expected from what I had previously read in the literature. A bit unexpected.
Read through a paper which used denoising autoencoders for unsupervised feature extraction from galaxy data. This is relevant for what I am doing.
Worked on writing a proposal based on deep learning methods. I have been working on this application for a while; today I spent six hours on the main project part.
Wanted to look more into multi-task learning. I read an overview paper by Sebastian Ruder.
One problem I have looked at many times is using neural networks to speed up the calculation of simulations. Basically, you have a large training sample with noiseless examples and would like to train a network able to produce a billion new examples, conditioned on some parameters. Training a network kind of works, but the error on the output was 2-3% at best. This is not sufficiently good for our application. Finding relevant literature was not easy. In the end, I figured out the magic search word was "interpolation".
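The setup I have in mind is roughly a small fully connected network mapping the conditioning parameters to the simulated output, trained on the noiseless examples. A sketch (all names and sizes here are hypothetical, not the actual code):

```python
import torch
import torch.nn as nn

class Emulator(nn.Module):
    """Hypothetical emulator: map a few conditioning parameters to the
    simulated output vector."""
    def __init__(self, n_params=3, n_out=100, n_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_out),
        )

    def forward(self, params):
        return self.net(params)

def relative_error(pred, truth, eps=1e-8):
    # the 2-3% figure above refers to this kind of relative error on held-out data
    return (torch.abs(pred - truth) / (torch.abs(truth) + eps)).mean()
```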
Watched the Gary Marcus: Toward a Hybrid of Deep Learning and Symbolic AI interview. It was quite interesting hearing an insider talk about what he considers some limitations of current deep learning systems. Abstract concepts seem to be missing in deep learning networks and it is unclear how to build them in.
Continued looking at the interpolation network. Instead of only outputting the value, I also returned the error, hoping to use it to downweight problematic points. In the image below, the three columns are the training loss, the test loss and the relative error. It did not work very well. I also tried some other tweaks.
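The value-plus-error idea can be written as a heteroscedastic Gaussian likelihood, where points with a large predicted variance are automatically downweighted. A sketch under the assumption that the network predicts a mean and a log-variance per point (not the actual implementation):

```python
import torch
import torch.nn as nn

class MeanAndError(nn.Module):
    """Hypothetical two-headed output: a value and a log-variance."""
    def __init__(self, n_features):
        super().__init__()
        self.mean = nn.Linear(n_features, 1)
        self.log_var = nn.Linear(n_features, 1)

    def forward(self, x):
        return self.mean(x), self.log_var(x)

def gaussian_nll(mean, log_var, y):
    # the 1/variance factor downweights points the network marks as uncertain
    return (0.5 * log_var + 0.5 * (y - mean) ** 2 / torch.exp(log_var)).mean()
```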
Last full day of writing a proposal which has a strong machine learning component. Looked up references on multi-task learning, including the Multi-task learning book.
Continued obsessing about interpolating measurements using neural networks, looking at [this paper](https://arxiv.org/abs/1906.05661).
Also played around with the interpolation, reading up on some papers.
Continued trying to predict the fluxes, both using ELU and trying to reduce the prediction to using a single band.
This was after reading a paper which argued that a smooth non-linearity would work better than a simple ReLU. That did not seem to be the case.
Looked into various sources on transfer learning, including the paper A survey on transfer learning.
Not very effective. I searched whether there were some new trends, looking at various articles.
Read through quite a few papers, being away from the laptop today.
In the Shapley framework paper, the authors introduce a way to determine whether improved performance comes from a change in the algorithm or in the data. Quite technical.
The foggy scenes paper tested the improvements when degrading scenes with artificial fog. This is possible when having a 3D model and a simple model of the fog, where the transparency is distance dependent.
Hardware acceleration. Not the most interesting paper. They compared running on CPUs versus GPUs. The paper did not give a good impression.
Watched the Watson interview on how they constructed Watson and beat the best human players in Jeopardy. He had an interesting perspective on how the project was run. With a difficult task, it is easy to assume that achieving the goal would require inventing something completely new. Instead, they mostly used existing technology and let different groups work on separate parts.
I wanted to test a specific transfer learning technique. For this I needed a simple example, so I constructed a CNN which should determine the frequency of a wave. Testing this simple example, it did not work at all, which should not be the case. After 1.5 hours I had removed all complications, but the network was still not working.
Update: In the end, the problem was two matrices in the loss function being broadcast differently than expected. Not easy to detect, since I took the mean of the result directly afterwards.
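The failure mode was essentially the classic shape mismatch, something like this (a reconstruction of the pitfall, not the actual code):

```python
import torch

pred = torch.randn(1000)          # shape (1000,)
target = torch.randn(1000, 1)     # shape (1000, 1), e.g. straight from a Linear layer

# (pred - target) silently broadcasts to shape (1000, 1000): every prediction
# is compared with every target, not element-wise.
loss_wrong = ((pred - target) ** 2).mean()

# Making the shapes explicit gives the intended element-wise loss.
loss_right = ((pred.view(-1, 1) - target) ** 2).mean()

print(loss_wrong, loss_right)     # both are scalars, so the bug is easy to miss
```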
Worked on adapting a new set of simulations and then pretrained the network with these. It was looking good so far.
In the middle of the third round, between 200 and 300 days, I started losing interest. Around the same time I worked on finishing a research paper using deep learning techniques. Focusing on finishing this paper felt more productive at the time. The result was a very long period where I did not follow this good habit of working on deep learning each day. One pandemic later, I am finally back again.
Consider having a distribution and wanting to create a density estimator (like a KDE). Is it possible to do this with a neural network? My previous attempts at this failed. This time I tested creating a neural network which gets a single constant input, with the output given by a mixture density network. Below is a plot showing how this fits.
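The constant-input trick means the network only has to learn one set of mixture parameters, which turns the MDN into a plain density estimator for the sample. A minimal sketch of the idea (names, sizes and the training loop are my own, not the notebook's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstantMDN(nn.Module):
    """MDN head fed with a constant input, acting as a density estimator."""
    def __init__(self, n_components=10, n_hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, n_hidden), nn.ReLU())
        self.pi = nn.Linear(n_hidden, n_components)
        self.mu = nn.Linear(n_hidden, n_components)
        self.log_sigma = nn.Linear(n_hidden, n_components)

    def forward(self, x):
        h = self.body(x)
        return F.softmax(self.pi(h), -1), self.mu(h), torch.exp(self.log_sigma(h))

samples = torch.randn(5000)                      # data to be density-estimated
model = ConstantMDN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
const = torch.ones(len(samples), 1)              # the constant input
for _ in range(500):
    pi, mu, sigma = model(const)
    log_p = torch.distributions.Normal(mu, sigma).log_prob(samples.unsqueeze(-1))
    loss = -torch.logsumexp(torch.log(pi) + log_p, dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```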
Another problem with the MDN is handling a multivariate distribution. One can in a simple way return multiple independent predictions; the problem is when there are correlations between the various predictions, which is very often the case. Below is an example of two correlated Gaussians.
In general there was not a lot of useful literature on the topic. The technical note Training Mixture Density Networks with full covariance matrices had some useful tips. Some of it (the Cholesky factorization) was along the lines of what I thought of doing, but it also included some other ideas (eq. 10). Tomorrow I hope to make an implementation.
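The Cholesky idea amounts to letting the network output the entries of a lower-triangular factor L with a positive diagonal, so the covariance LLᵀ is positive definite by construction. A sketch for the 2D case (my own parametrization, not necessarily the note's exact one):

```python
import torch
import torch.nn.functional as F

def build_scale_tril(raw_diag, raw_offdiag):
    """Assemble the lower-triangular Cholesky factor L of a 2x2 covariance
    from raw network outputs: raw_diag (batch, 2), raw_offdiag (batch, 1)."""
    diag = F.softplus(raw_diag) + 1e-4            # positive diagonal entries
    L = torch.zeros(raw_diag.shape[0], 2, 2)
    L[:, 0, 0] = diag[:, 0]
    L[:, 1, 1] = diag[:, 1]
    L[:, 1, 0] = raw_offdiag[:, 0]                # unconstrained off-diagonal
    return L

# The multivariate normal can then be evaluated directly from L:
mu = torch.zeros(8, 2)
L = build_scale_tril(torch.randn(8, 2), torch.randn(8, 1))
dist = torch.distributions.MultivariateNormal(mu, scale_tril=L)
print(dist.log_prob(torch.randn(8, 2)).shape)     # (8,)
```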
Worked on actually making the implementation, fitting the MDN to a 2D Gaussian with a correlation between the variables. The result
shows some more work is needed.
Worked on installing TensorBoard. It should be simple, but for some reason it was not willing to connect.
Continued working on TensorBoard. After some struggle I had it running. Uploading to tensorboard.dev is a neat feature.
Watched the GTC intro video.
Experimented to find the problems with the GMM with covariance. The code looks fine. When reducing to one component, I manage to get the result below,
which visually looks quite similar (I did not test further). What is going on will be the topic for another day.
Worked on reading through the webpage that Christian sent.
Got the multi-dimensional MDN working, and also worked on a calibration network for astronomical images.
Worked on preparing the input when systematically removing individual images.
Trained an autoencoder to remove noise in the zero-points.
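A rough sketch of the kind of denoising autoencoder used here: noisy zero-points in, hopefully cleaner zero-points out. The input size and latent size are placeholders, not the real dimensionality:

```python
import torch
import torch.nn as nn

class ZeroPointDAE(nn.Module):
    """Hypothetical denoising autoencoder over a vector of zero-points."""
    def __init__(self, n_in=40, n_latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(),
                                     nn.Linear(64, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training pairs a noisy version of the zero-points on the input side with a
# cleaner target, so the bottleneck has to learn the underlying structure.
```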
Listened to an interview on Spotify about ML/DL at Heineken.
Started testing the zero-point autoencoder by downloading the data to be corrected and writing the code for joining the data. Unfortunately, I ended up having a problem where some data was not exactly what was expected, which took some time to figure out.
Continued with the autoencoder. It only gives quite a small improvement on the final numbers.
Listened to an interview on how the NSF is investing in deep learning.
Created a conference poster on the photometric redshift with deep learning paper I published last year.
Extended the zero-point autoencoder to work with multiple bands. In the end one only finds a tiny improvement. I presented these results to the PAUS collaboration.
Watched "Using Deep Learning and Simulation to Teach Robots Manipulation in Complex Environments" from GTC2021 and the start of another video.
Experimented more with TensorBoard. Among other things, I had a problem with using TensorBoard inline in the notebook, which did get fixed before deleting the content of the log directory.
Experimented with trying to map out instrumental zero-points by directly predicting the zero-point per star. Below is a pattern
which shows up for all image IDs. This trend is probably because the network has not been trained enough or correctly, and so far it focuses on getting the correct zero-point per image.
The problem yesterday seems to come from how the training set is sampled. We have 12,096,085 observations divided over 204,920 images. When selecting 1000 stars in a batch, each of them most likely belongs to a different image. The simplest way for the network to get a good fit is then adjusting the zero-point for each image, which means it will never learn the pattern as a function of position on the CCD.
Instead I have switched to loading all stars (~200) in a mosaic at once. Writing this custom dataloader ended up taking most of the time, since it needs to be computationally efficient. By now the network is training again, and a sketch of the idea is shown below.
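A sketch of the per-mosaic sampling (column names such as `image_id`, `x`, `y` and `zero_point` are placeholders, not the real schema): one dataset item is all stars belonging to a single mosaic, so the batch can no longer be fitted by per-image offsets alone.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MosaicDataset(Dataset):
    """One item = all stars of one mosaic (table assumed to be a DataFrame
    or a dict of arrays with the placeholder columns named above)."""
    def __init__(self, table):
        self.table = table
        self.image_ids = np.unique(table["image_id"])

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, i):
        sel = self.table["image_id"] == self.image_ids[i]
        pos = torch.tensor(np.stack([self.table["x"][sel],
                                     self.table["y"][sel]], axis=1),
                           dtype=torch.float32)
        zp = torch.tensor(self.table["zero_point"][sel], dtype=torch.float32)
        return pos, zp
```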
Continued looking at the AI for Medicine course on Coursera. Currently in the first week. When getting to the weighting of underrepresented classes, I had a look at an astronomical dataset with exactly this problem.
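The weighting trick from the course boils down to scaling the loss of each class by the inverse of its frequency; a minimal sketch with dummy labels (not the astronomical dataset itself):

```python
import torch
import torch.nn as nn

labels = torch.randint(0, 3, (1000,))                      # dummy class labels
counts = torch.bincount(labels, minlength=3).float()
weights = counts.sum() / (len(counts) * counts)            # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)            # rare classes count more
logits = torch.randn(1000, 3)
loss = criterion(logits, labels)
```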
First spent some time attempting to get the visual debugger in JupyterLab working. I have seen it being installed by default, but never found out how it worked. It turns out you need to install xeus-python. Installing the binary worked, but I did not figure out how to get one of the conda environments listed in JupyterLab. There did however exist an "xpython" option in the list, pointing to an Anaconda installation, which I could play around with. Also, the mamba package manager is a faster drop-in replacement for conda.
More importantly, I continued working on mapping out the zero-point variations across the CCD. Creating a custom "collate_fn" function, I can now process multiple mosaics at once. Running the code with 10 mosaics was for some reason much faster than expected. By now the CCD pattern is different for different images.
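The custom collate function roughly does the following (a sketch building on the hypothetical per-mosaic dataset above): concatenate the stars from several mosaics into flat tensors plus an index saying which mosaic each star belongs to, so per-mosaic quantities can still be computed in the loss.

```python
import torch

def collate_mosaics(batch):
    """batch is a list of (positions, zero_points) per mosaic."""
    pos = torch.cat([item[0] for item in batch])
    zp = torch.cat([item[1] for item in batch])
    mosaic_idx = torch.cat([torch.full((len(item[1]),), i, dtype=torch.long)
                            for i, item in enumerate(batch)])
    return pos, zp, mosaic_idx

# loader = DataLoader(MosaicDataset(table), batch_size=10, collate_fn=collate_mosaics)
```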
Continued working on predicting the zero-points. While one should predict this at the positions of the galaxies, I tested predicting on a 50x50 grid and then averaging over all positions to get an image-level prediction. This could then easily be compared with a classical algorithm. Plotting the histogram of image zero-points, I find that
the neural network prediction tends to center around the median. That seems to be a problem with using an L1 loss. Tomorrow I will attempt using a mixture density network.
Set up some broken environment again and tested converting the output to a mixture density network. By now it at least trains. Later I will try to optimize this further.
Read through the deblending paper before tomorrow's journal club.
Watched the GTC 2021 keynote.
Finished week 1 of AI for medical diagnosis on coursera.
And then week 2. It is quite simple.
Kind of finished week 3 and the first course. Because of some technical issue, Coursera does not let me submit the coursework right now. A bit annoying.
Worked through week 1 and 2 of the second course.
Trained the MDN network for zero-points. While training I ended up going through some blog posts about MLOps.
Watched one of the GTC talks.
Worked on the zero-point prediction with the MDN. By now it gives a reasonable distribution. Comparing the predictions directly, there is a lot of scatter, but it is uncertain what level is expected from the errors.