Colab demo doesn't work #210

Open
Rajat-Vishwa opened this issue Aug 29, 2024 · 2 comments

Comments

Rajat-Vishwa commented Aug 29, 2024

Running the Colab example initially hits #205 (COLMAP fails to execute).
#205 is solved by adding the following before installing COLMAP:

!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
!mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!dpkg -i cuda-repo-ubuntu2204-11-7-local_11.7.0-515.43.04-1_amd64.deb
!cp /var/cuda-repo-ubuntu2204-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
!apt-get update
!apt-get -y install cuda-11-7
!update-alternatives --set cuda /usr/local/cuda-11.7

This fixes COLMAP, and preprocessing then runs through, but the training step throws an error:

# @title { vertical-output: true }
%cd /content/neuralangelo
GROUP = "test_exp"
NAME = "lego"
!torchrun --nproc_per_node=1 train.py \
    --logdir=logs/{GROUP}/{NAME} \
    --show_pbar \
    --config=projects/neuralangelo/configs/custom/lego.yaml \
    --data.readjust.scale=0.5 \
    --max_iter=20000 \
    --validation_iter=99999999 \
    --model.object.sdf.encoding.coarse2fine.step=200 \
    --model.object.sdf.encoding.hashgrid.dict_size=19 \
    --optim.sched.warm_up_end=200 \
    --optim.sched.two_steps=[12000,16000]

Error:

[W829 13:19:57.835930806 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Training with 1 GPUs.
Using random seed 0
Make folder logs/test_exp/lego
* checkpoint:
   * save_epoch: 9999999999
   * save_iter: 20000
   * save_latest_iter: 9999999999
   * save_period: 9999999999
   * strict_resume: True
* cudnn:
   * benchmark: True
   * deterministic: False
* data:
   * name: dummy
   * num_images: None
   * num_workers: 4
   * preload: True
   * readjust:
      * center: [0.0, 0.0, 0.0]
      * scale: 0.5
   * root: datasets/lego_ds2
   * train:
      * batch_size: 2
      * image_size: [801, 801]
      * subset: None
   * type: projects.neuralangelo.data
   * use_multi_epoch_loader: True
   * val:
      * batch_size: 2
      * image_size: [300, 300]
      * max_viz_samples: 16
      * subset: 4
* image_save_iter: 9999999999
* inference_args:
* local_rank: 0
* logdir: logs/test_exp/lego
* logging_iter: 9999999999999
* max_epoch: 9999999999
* max_iter: 20000
* metrics_epoch: None
* metrics_iter: None
* model:
   * appear_embed:
      * dim: 8
      * enabled: False
   * background:
      * enabled: True
      * encoding:
         * levels: 10
         * type: fourier
      * encoding_view:
         * levels: 3
         * type: spherical
      * mlp:
         * activ: relu
         * activ_density: softplus
         * activ_density_params:
         * activ_params:
         * hidden_dim: 256
         * hidden_dim_rgb: 128
         * num_layers: 8
         * num_layers_rgb: 2
         * skip: [4]
         * skip_rgb: []
      * view_dep: True
      * white: False
   * object:
      * rgb:
         * encoding_view:
            * levels: 3
            * type: spherical
         * mlp:
            * activ: relu_
            * activ_params:
            * hidden_dim: 256
            * num_layers: 4
            * skip: []
            * weight_norm: True
         * mode: idr
      * s_var:
         * anneal_end: 0.1
         * init_val: 3.0
      * sdf:
         * encoding:
            * coarse2fine:
               * enabled: True
               * init_active_level: 4
               * step: 200
            * hashgrid:
               * dict_size: 19
               * dim: 8
               * max_logres: 11
               * min_logres: 5
               * range: [-2, 2]
            * levels: 16
            * type: hashgrid
         * gradient:
            * mode: numerical
            * taps: 4
         * mlp:
            * activ: softplus
            * activ_params:
               * beta: 100
            * geometric_init: True
            * hidden_dim: 256
            * inside_out: False
            * num_layers: 1
            * out_bias: 0.5
            * skip: []
            * weight_norm: True
   * render:
      * num_sample_hierarchy: 4
      * num_samples:
         * background: 32
         * coarse: 64
         * fine: 16
      * rand_rays: 512
      * stratified: True
   * type: projects.neuralangelo.model
* nvtx_profile: False
* optim:
   * fused_opt: False
   * params:
      * lr: 0.001
      * weight_decay: 0.01
   * sched:
      * gamma: 10.0
      * iteration_mode: True
      * step_size: 9999999999
      * two_steps: [12000, 16000]
      * type: two_steps_with_warmup
      * warm_up_end: 200
   * type: AdamW
* pretrained_weight: None
* source_filename: projects/neuralangelo/configs/custom/lego.yaml
* speed_benchmark: False
* test_data:
   * name: dummy
   * num_workers: 0
   * test:
      * batch_size: 1
      * is_lmdb: False
      * roots: None
   * type: imaginaire.datasets.images
* timeout_period: 9999999
* trainer:
   * amp_config:
      * backoff_factor: 0.5
      * enabled: False
      * growth_factor: 2.0
      * growth_interval: 2000
      * init_scale: 65536.0
   * ddp_config:
      * find_unused_parameters: False
      * static_graph: True
   * depth_vis_scale: 0.5
   * ema_config:
      * beta: 0.9999
      * enabled: False
      * load_ema_checkpoint: False
      * start_iteration: 0
   * grad_accum_iter: 1
   * image_to_tensorboard: False
   * init:
      * gain: None
      * type: none
   * loss_weight:
      * curvature: 0.0005
      * eikonal: 0.1
      * render: 1.0
   * type: projects.neuralangelo.trainer
* validation_iter: 99999999
* wandb_image_iter: 10000
* wandb_scalar_iter: 100
cudnn benchmark: True
cudnn deterministic: False
Setup trainer.
Using random seed 0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/neuralangelo/train.py", line 104, in <module>
[rank0]:     main()
[rank0]:   File "/content/neuralangelo/train.py", line 79, in main
[rank0]:     trainer = get_trainer(cfg, is_inference=False, seed=args.seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/utils/get_trainer.py", line 32, in get_trainer
[rank0]:     trainer = trainer_lib.Trainer(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/trainer.py", line 26, in __init__
[rank0]:     super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/projects/nerf/trainers/base.py", line 28, in __init__
[rank0]:     super().__init__(cfg, is_inference=is_inference, seed=seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/base.py", line 50, in __init__
[rank0]:     self.model = self.setup_model(cfg, seed=seed)
[rank0]:   File "/content/neuralangelo/imaginaire/trainers/base.py", line 116, in setup_model
[rank0]:     lib_model = importlib.import_module(cfg.model.type)
[rank0]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank0]:   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank0]:   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank0]:   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank0]:   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank0]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/model.py", line 21, in <module>
[rank0]:     from projects.neuralangelo.utils.modules import NeuralSDF, NeuralRGB, BackgroundNeRF
[rank0]:   File "/content/neuralangelo/projects/neuralangelo/utils/modules.py", line 16, in <module>
[rank0]:     import tinycudann as tcnn
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tinycudann/__init__.py", line 9, in <module>
[rank0]:     from tinycudann.modules import free_temporary_memory, NetworkWithInputEncoding, Network, Encoding
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tinycudann/modules.py", line 51, in <module>
[rank0]:     _C = importlib.import_module(f"tinycudann_bindings._{cc}_C")
[rank0]:   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank0]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank0]: ImportError: /usr/local/lib/python3.10/dist-packages/tinycudann_bindings/_75_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
E0829 13:20:04.141000 139155706921600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 31457) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-29_13:20:04
  host      : a8e8c22c1e57
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 31457)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
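
For anyone reading the log above: the line that matters is the `ImportError` from `tinycudann_bindings`, and the `2at4_ops5zeros` part of the mangled name reads as `at::_ops::zeros`, a function from PyTorch itself. Below is a minimal sketch for pulling the library path and symbol out of such a message and demangling it. The helper names `parse_undefined_symbol` and `demangle` are made up for illustration, and it assumes `c++filt` (from binutils) may or may not be installed.

```python
import re
import shutil
import subprocess

# Matches "<path>.so: undefined symbol: <mangled name>" as printed by the loader.
UNDEFINED_SYMBOL_RE = re.compile(
    r"(?P<lib>\S+\.so): undefined symbol: (?P<symbol>\S+)"
)


def parse_undefined_symbol(message):
    """Return (library path, mangled symbol), or None if the text does not match."""
    match = UNDEFINED_SYMBOL_RE.search(message)
    if match is None:
        return None
    return match.group("lib"), match.group("symbol")


def demangle(symbol):
    """Demangle with c++filt when it is on PATH; otherwise return the input unchanged."""
    cxxfilt = shutil.which("c++filt")
    if cxxfilt is None:
        return symbol
    result = subprocess.run(
        [cxxfilt, symbol], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or symbol


if __name__ == "__main__":
    # The ImportError line copied from the log above.
    message = (
        "ImportError: /usr/local/lib/python3.10/dist-packages/tinycudann_bindings/"
        "_75_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: "
        "_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10"
        "ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE"
    )
    lib, symbol = parse_undefined_symbol(message)
    print("library:", lib)
    print("symbol :", demangle(symbol))
```

A missing `at::` symbol usually suggests the compiled extension and the installed PyTorch do not match, though that is a guess from the symbol name and not something confirmed in this thread.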

I tried switching the CUDA version using:

!sudo update-alternatives --config cuda 
There are 3 choices for the alternative cuda (providing /usr/local/cuda).

  Selection    Path                  Priority   Status
------------------------------------------------------------
  0            /usr/local/cuda-12.2   122       auto mode
  1            /usr/local/cuda-11.7   117       manual mode
* 2            /usr/local/cuda-11.8   118       manual mode
  3            /usr/local/cuda-12.2   122       manual mode

Press <enter> to keep the current choice[*], or type selection number: 2

But it still doesn't work.
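
One thing worth noting: `update-alternatives` only repoints `/usr/local/cuda`; it does not rebuild a tinycudann binding that is already installed. Here is a rough sketch for checking what the environment looks like before reinstalling anything. The helper names are hypothetical, the "same major version" rule is only a heuristic, and `torch` is imported lazily so the helpers still run where PyTorch is missing.

```python
import os
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'X.Y' from the 'release X.Y' text printed by `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def active_toolkit(link="/usr/local/cuda"):
    """Return (resolved path, nvcc release or None) for the toolkit behind `link`."""
    resolved = os.path.realpath(link)
    nvcc = os.path.join(resolved, "bin", "nvcc")
    if not os.path.exists(nvcc):
        nvcc = shutil.which("nvcc")  # fall back to whatever is on PATH
    if nvcc is None:
        return resolved, None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return resolved, parse_nvcc_release(result.stdout)


def same_major(version_a, version_b):
    """Heuristic only: treat two CUDA versions as matching when their majors agree."""
    if not version_a or not version_b:
        return False
    return version_a.split(".")[0] == version_b.split(".")[0]


def report():
    """Print the PyTorch build info next to the active CUDA toolkit."""
    try:
        import torch  # lazy import: optional for the helpers above
    except ImportError:
        torch_version, torch_cuda = None, None
    else:
        torch_version, torch_cuda = torch.__version__, torch.version.cuda
    path, toolkit = active_toolkit()
    print("torch version     :", torch_version)
    print("torch built for   :", torch_cuda)
    print("/usr/local/cuda ->:", path)
    print("nvcc release      :", toolkit)
    print("same CUDA major   :", same_major(torch_cuda, toolkit))


if __name__ == "__main__":
    report()
```

If the versions line up and the import still fails, the binding was probably compiled against a different PyTorch build, in which case rebuilding tinycudann from source against the installed PyTorch would be the next thing to try.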

amrzv commented Sep 1, 2024

The Colab notebook is not accessible; it requires access permission.

@Rajat-Vishwa (Author)

> The Colab notebook is not accessible; it requires access permission.

It seems they took it down.
You can find a copy of the original demo notebook here.
