
Add PAE ranking method #13

Open
hannah-rae opened this issue Jun 7, 2021 · 13 comments

@hannah-rae
Contributor

No description provided.

@wkiri
Collaborator

wkiri commented Sep 13, 2021

@bdubayah Thank you for this! I ran it on the planetary rover Navcam images. We don't have the right CUDA drivers on our machine, so I think it is falling back to CPU mode (which is what we want). Do I need to provide any additional arguments to make this happen? I got a lot of info/warning messages on my console as follows. However, I did get selection results. Can I trust them? If these errors are harmless (just indicating fallback to CPU mode), could they be caught/suppressed (or moved to the log file) and replaced with a single message indicating "GPU support unavailable, falling back to CPU mode"?

```
2021-09-13 11:12:22.591780: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:12:22.591841: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Loading data_to_fit
Loading data_to_score
Feature extraction: 100%|█████████████████████████████████| 1/1 [00:01<00:00,  1.16s/it]
Feature extraction: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 15.73it/s]
Outlier detection:  50%|█████████████████                 | 2/4 [00:04<00:03,  1.88s/it]2021-09-13 11:13:01.516518: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-09-13 11:13:01.621985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:0a:00.0 name: Tesla M60 computeCapability: 5.2
coreClock: 1.1775GHz coreCount: 16 deviceMemorySize: 7.94GiB deviceMemoryBandwidth: 149.31GiB/s
2021-09-13 11:13:01.622467: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.623958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.625606: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.629197: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-09-13 11:13:01.631525: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-09-13 11:13:01.632709: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.633911: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.635503: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.635766: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-09-13 11:13:01.636569: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 11:13:01.639303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 11:13:01.639371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]
2021-09-13 11:13:02.467535: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-09-13 11:13:02.468648: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3196415000 Hz
2021-09-13 11:13:23.519885: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Outlier detection: 100%|██████████████████████████████████| 4/4 [00:33<00:00,  8.28s/it]
```
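A rough sketch of the kind of change being requested, assuming TensorFlow 2.x (the placement and wording are illustrative, not the actual DORA/PAE code; `TF_CPP_MIN_LOG_LEVEL` only silences TensorFlow's C++ logging):

```python
import os

# "2" hides TensorFlow's C++ INFO and WARNING output (including the dlopen
# messages above); it must be set before tensorflow is imported.
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")

import tensorflow as tf

# Report the CPU fallback once instead of emitting a wall of CUDA warnings.
if not tf.config.list_physical_devices("GPU"):
    print("GPU support unavailable, falling back to CPU mode")
```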

@wkiri wkiri reopened this Sep 13, 2021
@bdubayah
Contributor

@wkiri The scores should be fine; if training had failed, it would give all NaNs for the scores. I thought I had disabled logging/warnings as much as possible, but I'll go back in, see whether the output can be locked down further, and try to add a more descriptive message. Thanks for catching this!

@wkiri
Collaborator

wkiri commented Sep 14, 2021

Hm, I looked more deeply into the results files, and indeed, I see "nan" for all of the scores. So I suspect it is not working correctly in CPU mode. Could you look into this? Here is the file: https://github.com/nasaharvest/dora/blob/master/exp/planetary_rover/results/pae-latent_dim%3D5/selections-pae.csv

Here is how to reproduce the experiment I did (but I recommend (1) running on a machine without GPUs and (2) commenting out all algs in the config file except PAE to reduce your time waiting for results):

```
$ python3 dora_exp_pipeline/dora_exp.py -o exp/planetary_rover/results -l planetary-last10sols.log exp/planetary_rover/planetary-last10sols.config
```
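As a side note, a quick way to confirm the all-NaN scores mentioned above, assuming the selections file parses as a regular CSV (pandas is used here only for illustration; the path is taken from the results link above):

```python
import pandas as pd

# Count NaN values per column in the PAE selections file.
df = pd.read_csv("exp/planetary_rover/results/pae-latent_dim=5/selections-pae.csv")
print(df.isna().sum())
```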

@wkiri
Collaborator

wkiri commented Sep 14, 2021

In the short term, if you can generate updated PAE results while running on a GPU machine for this data set, that would work too! :) (You could just check in an updated selections-pae.csv file).

@bdubayah
Contributor

I think the NaNs are actually coming from the normalizing flow training failing because the pixel values aren't being scaled to [0, 1] (I've been training on a CPU most of the time and haven't had issues, so I don't think CPU mode itself is the problem). I talked about adding this to the PAE in the meeting yesterday, but after giving it some thought today, I think the best approach is to add a pixel normalization parameter to the image data loader. Also, I assume the experiment is being run on a larger set of images than just the ones in the sample_data dir? Are those available anywhere so I can make sure it runs?
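For illustration, a minimal sketch of the kind of loader-level normalization being proposed (the `normalize_pixels` parameter and `load_image` function are hypothetical, not DORA's actual API):

```python
import numpy as np
from PIL import Image

def load_image(path, normalize_pixels=True):
    """Load an image as float32, optionally scaling pixel values to [0, 1]."""
    raw = np.asarray(Image.open(path))
    data = raw.astype(np.float32)
    if normalize_pixels and np.issubdtype(raw.dtype, np.integer):
        # Scale integer pixel values (e.g. 0-255 for 8-bit images) into [0, 1]
        # so that downstream models such as the PAE's normalizing flow train stably.
        data /= np.iinfo(raw.dtype).max
    return data
```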

@wkiri
Collaborator

wkiri commented Sep 14, 2021

@bdubayah Yes, the normalization issue might be the culprit!

The files for this experiment are on the JPL servers. See config file here:
https://github.com/nasaharvest/dora/tree/master/exp/planetary_rover/

If you don't have JPL access, I can zip up the files and send them to you later today.

@bdubayah
Contributor

@wkiri I don't think I have JPL access, so it would be great if you could send them over! For what it's worth, the model converges for me even on the really small sample dataset (once I added normalization), but it would be nice to confirm on the bigger dataset too. I'll be able to push out the fix a bit later today.

@wkiri
Collaborator

wkiri commented Sep 15, 2021

@bdubayah Great, I just sent you an email with the (larger set of) image files.

bdubayah added a commit that referenced this issue Sep 15, 2021
- add pixel normalization to image loader
- add PAE config option to disable flow
- re-run planetary rover experiment
@bdubayah
Contributor

Hi @wkiri, I added the changes to the PAE and re-ran the experiment (see the most recent commit). My only concern is that I added an option to the flattened pixel values extractor to normalize pixels to [0, 1], and the MDRs for the algorithms decreased a little (https://github.com/nasaharvest/dora/blob/5cf124cea699d2ffc1d7c3d6156a25e667e7beb5/exp/planetary_rover/results/comparison_plot_combined.png). I'm not sure whether this is expected. I can move the normalization into the PAE instead; I was just thinking there might be some data types for which the user would not want values normalized to [0, 1].

@wkiri
Collaborator

wkiri commented Sep 15, 2021

@bdubayah Thanks! It is not surprising that the numeric scores would change for some algorithms (especially those that report reconstruction error, like PCA or DEMUD), but I am surprised that the order of selections has changed quite a bit. The MDRs have not only decreased; there is also much less performance separation between algorithms. The order has even changed for "random", which suggests to me that the differences may be due to the Python environment/packages rather than the normalization. This may be related to issue #44.

I ran with just the normalization change and got the same results for all non-PAE algorithms as without normalization. The PAE algorithm's performance improved significantly (and the scores are no longer NaNs).

I think you can proceed to open a PR and merge this fix. If anyone does not want pixel normalization (which doesn't affect most algorithms anyway), we can discuss or revert that global change later if needed.

@bdubayah
Contributor

bdubayah commented Sep 18, 2021

To do at this point:

  • Adjust the latent dim based on the number of features (see the sketch below)
  • Add more descriptive messages about the GPU (i.e., resolve the GPU warnings/messages); maybe @PaHorton's PyTorch implementation would resolve this?
  • Add a no-normalizing-flow option for the convolutional PAE (there are issues with loss propagation)
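For the first item, a minimal sketch of the intended behavior (the function name is hypothetical, not the actual DORA code):

```python
def choose_latent_dim(requested_latent_dim, n_features):
    # A latent space wider than the input adds parameters without adding
    # information, so cap the latent dimension at the number of features.
    return min(requested_latent_dim, n_features)
```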

@hannah-rae
Contributor Author

@bdubayah are you still working on the above tasks or is this ready to be closed?

@bdubayah
Contributor

@hannah-rae Still working on them!
