
Add PAE ranking method #13

Open
hannah-rae opened this issue Jun 7, 2021 · 13 comments

@hannah-rae
Contributor

No description provided.

@wkiri
Collaborator

wkiri commented Sep 13, 2021

@bdubayah Thank you for this! I ran it on the planetary rover Navcam images. We don't have the right CUDA drivers on our machine, so I think it is falling back to CPU mode (which is what we want). Do I need to provide any additional arguments to make this happen? I got a lot of info/warning messages on my console as follows. However, I did get selection results. Can I trust them? If these errors are harmless (just indicating fallback to CPU mode), could they be caught/suppressed (or moved to the log file) and replaced with a single message indicating "GPU support unavailable, falling back to CPU mode"?

```
2021-09-13 11:12:22.591780: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:12:22.591841: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Loading data_to_fit
Loading data_to_score
Feature extraction: 100%|█████████████████████████████████| 1/1 [00:01<00:00,  1.16s/it]
Feature extraction: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 15.73it/s]
Outlier detection:  50%|█████████████████                 | 2/4 [00:04<00:03,  1.88s/it]2021-09-13 11:13:01.516518: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-09-13 11:13:01.621985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:0a:00.0 name: Tesla M60 computeCapability: 5.2
coreClock: 1.1775GHz coreCount: 16 deviceMemorySize: 7.94GiB deviceMemoryBandwidth: 149.31GiB/s
2021-09-13 11:13:01.622467: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.623958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.625606: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.629197: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-09-13 11:13:01.631525: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-09-13 11:13:01.632709: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.633911: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.635503: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.635766: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-09-13 11:13:01.636569: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 11:13:01.639303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 11:13:01.639371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]
2021-09-13 11:13:02.467535: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-09-13 11:13:02.468648: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3196415000 Hz
2021-09-13 11:13:23.519885: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Outlier detection: 100%|██████████████████████████████████| 4/4 [00:33<00:00,  8.28s/it]
```
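A rough sketch of the kind of change being requested, assuming TensorFlow 2.x (the placement and wording are illustrative, not the actual DORA/PAE code; `TF_CPP_MIN_LOG_LEVEL` only silences TensorFlow's C++ logging):

```python
import os

# "2" hides TensorFlow's C++ INFO and WARNING output (including the dlopen
# messages above); it must be set before tensorflow is imported.
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")

import tensorflow as tf

# Report the CPU fallback once instead of emitting a wall of CUDA warnings.
if not tf.config.list_physical_devices("GPU"):
    print("GPU support unavailable, falling back to CPU mode")
```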

@wkiri wkiri reopened this Sep 13, 2021
@bdubayah
Contributor

@wkiri The scores should be fine; if training had failed, it would give all NaNs for the scores. I thought I had disabled logging/warnings as much as possible, but I'll go back in, see whether the output can be locked down further, and try to add a more descriptive message. Thanks for catching this!

@wkiri
Collaborator

wkiri commented Sep 14, 2021

Hm, I looked more deeply into the results files, and indeed, I see "nan" for all of the scores. So I suspect it is not working correctly in CPU mode. Could you look into this? Here is the file: https://github.com/nasaharvest/dora/blob/master/exp/planetary_rover/results/pae-latent_dim%3D5/selections-pae.csv

Here is how to reproduce the experiment I did (but I recommend (1) running on a machine without GPUs and (2) commenting out all algs in the config file except PAE to reduce your time waiting for results):

```
$ python3 dora_exp_pipeline/dora_exp.py -o exp/planetary_rover/results -l planetary-last10sols.log exp/planetary_rover/planetary-last10sols.config
```
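As a side note, a quick way to confirm the all-NaN scores mentioned above, assuming the selections file parses as a regular CSV (pandas is used here only for illustration; the path is taken from the results link above):

```python
import pandas as pd

# Count NaN values per column in the PAE selections file.
df = pd.read_csv("exp/planetary_rover/results/pae-latent_dim=5/selections-pae.csv")
print(df.isna().sum())
```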

@wkiri
Collaborator

wkiri commented Sep 14, 2021

In the short term, if you can generate updated PAE results while running on a GPU machine for this data set, that would work too! :) (You could just check in an updated selections-pae.csv file).

@bdubayah
Contributor

I think the NaNs are actually coming from the normalizing flow training failing because the pixel values aren't being scaled to [0, 1] (I've been training on a CPU most of the time and haven't had issues, so I don't think CPU mode itself is the problem). I talked about adding this to the PAE in the meeting yesterday, but after giving it some thought today, I think the best approach is to add a pixel normalization parameter to the image data loader. Also, I assume the experiment is being run on a larger set of images than just the ones in the sample_data dir? Are those available anywhere so I can make sure it runs?
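For illustration, a minimal sketch of the kind of loader-level normalization being proposed (the `normalize_pixels` parameter and `load_image` function are hypothetical, not DORA's actual API):

```python
import numpy as np
from PIL import Image

def load_image(path, normalize_pixels=True):
    """Load an image as float32, optionally scaling pixel values to [0, 1]."""
    raw = np.asarray(Image.open(path))
    data = raw.astype(np.float32)
    if normalize_pixels and np.issubdtype(raw.dtype, np.integer):
        # Scale integer pixel values (e.g. 0-255 for 8-bit images) into [0, 1]
        # so that downstream models such as the PAE's normalizing flow train stably.
        data /= np.iinfo(raw.dtype).max
    return data
```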

@wkiri
Collaborator

wkiri commented Sep 14, 2021

@bdubayah Yes, the normalization issue might be the culprit!

The files for this experiment are on the JPL servers. See config file here:
https://github.com/nasaharvest/dora/tree/master/exp/planetary_rover/

If you don't have JPL access, I can zip up the files and send them to you later today.

@bdubayah
Contributor

@wkiri I don't think I have JPL access, so it would be great if you could send them over! For what it's worth, the model converges for me even on the really small sample dataset (once I added normalization), but it would be nice to confirm on the bigger dataset too. I'll be able to push out the fix a bit later today.

@wkiri
Collaborator

wkiri commented Sep 15, 2021

@bdubayah Great, I just sent you an email with the (larger set of) image files.

bdubayah added a commit that referenced this issue Sep 15, 2021
- add pixel normalization to image loader
- add PAE config option to disable flow
- re-run planetary rover experiment
@bdubayah
Contributor

Hi @wkiri, I added the changes to the PAE and re-ran the experiment (see the most recent commit). My only concern is that I added an option to the flattened pixel values extractor to normalize pixels to [0, 1], and the MDRs for the algorithms decreased a little (https://github.com/nasaharvest/dora/blob/5cf124cea699d2ffc1d7c3d6156a25e667e7beb5/exp/planetary_rover/results/comparison_plot_combined.png). I'm not sure whether this is expected. I can move the normalization into the PAE instead; I was just thinking there might be some data types for which the user would not want values normalized to [0, 1].

@wkiri
Collaborator

wkiri commented Sep 15, 2021

@bdubayah Thanks! It is not surprising that the numeric scores would change for some algorithms (especially those that report reconstruction error, like PCA or DEMUD), but I am surprised that the order of selections has changed quite a bit. The MDRs have not only decreased; there is also much less performance separation between algorithms. The order has even changed for "random", which suggests to me that the differences may be due to the Python environment/packages rather than the normalization. This may be related to issue #44.

I ran with just the normalization change and got the same results for all non-PAE algorithms as without normalization. The PAE algorithm's performance improved significantly (and the scores are no longer NaNs).

I think you can proceed to open a PR and merge this fix. If anyone does not want pixel normalization (which doesn't affect most algorithms anyway), we can discuss or revert that global change later if needed.

@bdubayah
Contributor

bdubayah commented Sep 18, 2021

To do at this point:

  • Adjust the latent dim based on the number of features (see the sketch below)
  • Add more descriptive messages about the GPU (i.e., resolve the GPU warnings/messages); maybe @PaHorton's PyTorch implementation would resolve this?
  • Add a no-normalizing-flow option for the convolutional PAE (there are issues with loss propagation)
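For the first item, a minimal sketch of the intended behavior (the function name is hypothetical, not the actual DORA code):

```python
def choose_latent_dim(requested_latent_dim, n_features):
    # A latent space wider than the input adds parameters without adding
    # information, so cap the latent dimension at the number of features.
    return min(requested_latent_dim, n_features)
```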

@hannah-rae
Contributor Author

@bdubayah are you still working on the above tasks or is this ready to be closed?

@bdubayah
Contributor

@hannah-rae Still working on them!
