It looks like the GPU in Colab is not being engaged. I tried the A100, V100, T4 GPU, and TPU hardware settings in Colab.
Command:
python train.py spacetimeformer mnist --embed_method spatio-temporal --local_self_attn full --local_cross_attn full --global_self_attn full --global_cross_attn full --run_name mnist_spatiotemporal --context_points 10
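Before rerunning, a quick sanity check (illustrative only, not part of the original report) can confirm whether the Colab runtime actually exposes a CUDA device to the installed PyTorch build; if it does not, the runtime type or torch install is the problem rather than the training script:

```python
# Sanity check (illustrative, not from the original report): verify that the
# Colab runtime exposes a CUDA device to the installed PyTorch build.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # e.g. "Tesla T4" or "A100-SXM4-40GB", depending on the selected runtime
    print("device:", torch.cuda.get_device_name(0))
```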
Error trace:
2023-12-30 20:47:30.093968: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-30 20:47:30.094027: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-30 20:47:30.095405: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-30 20:47:31.265649: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Using default wandb log dir path of ./data/STF_LOG_DIR. This can be adjusted with the environment variable STF_LOG_DIR
Forecaster
L2: 1e-06
Linear Window: 0
Linear Shared Weights: False
RevIN: False
Decomposition: False
GlobalSelfAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
GlobalCrossAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
LocalSelfAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
LocalCrossAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
Using Embedding: spatio-temporal
Time Emb Dim: 6
Space Embedding: True
Time Embedding: True
Val Embedding: True
Given Embedding: True
Null Value: None
Pad Value: None
Reconstruction Dropout: Timesteps 0.05, Standard 0.1, Seq (max len = 5) 0.2, Skip All Drop 1.0
*** Spacetimeformer (v1.5) Summary: ***
Model Dim: 200
FF Dim: 800
Enc Layers: 3
Dec Layers: 3
Embed Dropout: 0.2
FF Dropout: 0.3
Attn Out Dropout: 0.0
Attn Matrix Dropout: 0.0
QKV Dropout: 0.0
L2 Coeff: 1e-06
Warmup Steps: 0
Normalization Scheme: batch
Attention Time Windows: 1
Shifted Time Windows: False
Position Emb Type: abs
Recon Loss Imp: 0.0
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./images/MNIST/raw/train-images-idx3-ubyte.gz
100% 9912422/9912422 [00:00<00:00, 199942825.48it/s]
Extracting ./images/MNIST/raw/train-images-idx3-ubyte.gz to ./images/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./images/MNIST/raw/train-labels-idx1-ubyte.gz
100% 28881/28881 [00:00<00:00, 149735097.43it/s]
Extracting ./images/MNIST/raw/train-labels-idx1-ubyte.gz to ./images/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./images/MNIST/raw/t10k-images-idx3-ubyte.gz
100% 1648877/1648877 [00:00<00:00, 43603948.10it/s]
Extracting ./images/MNIST/raw/t10k-images-idx3-ubyte.gz to ./images/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./images/MNIST/raw/t10k-labels-idx1-ubyte.gz
100% 4542/4542 [00:00<00:00, 32234397.24it/s]
Extracting ./images/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./images/MNIST/raw
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing Trainer(accelerator='dp') has been deprecated in v1.5 and will be removed in v1.7. Use Trainer(strategy='dp') instead.
rank_zero_deprecation(
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:597: UserWarning: 'dp' is not supported on CPUs, hence setting strategy='ddp'.
rank_zero_warn(f"{strategy_flag!r} is not supported on CPUs, hence setting strategy='ddp'.")
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:91: PossibleUserWarning: max_epochs was not set. Setting it to 1000 epochs. To train without an epoch limit, set max_epochs=-1.
rank_zero_warn(
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1823: PossibleUserWarning: GPU available but not used. Set accelerator and devices using Trainer(accelerator='gpu', devices=1).
rank_zero_warn(
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
Traceback (most recent call last):
File "/content/spacetimeformer/spacetimeformer/train.py", line 869, in
main(args)
File "/content/spacetimeformer/spacetimeformer/train.py", line 849, in main
trainer.fit(forecaster, datamodule=data_module)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1218, in _run
self.strategy.setup(self)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 172, in setup
self.configure_ddp()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 294, in configure_ddp
self.model = self._setup_model(LightningDistributedModule(self.model))
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 178, in _setup_model
return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 809, in init
self._ddp_init_helper(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1140, in _ddp_init_helper
self._passing_sync_batchnorm_handle(self.module)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 2072, in _passing_sync_batchnorm_handle
self._log_and_throw(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
raise err_type(err_msg)
ValueError: SyncBatchNorm layers only work with GPU modules
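For context (this commentary is not part of the original log), the chain of warnings suggests the Trainer never requested the GPU: 'dp' is rejected because training is running on CPU, Lightning falls back to 'ddp' with the gloo backend, and DistributedDataParallel then refuses to wrap a model containing SyncBatchNorm layers without a CUDA device. A minimal sketch of the configuration the warning itself recommends, assuming the PyTorch Lightning 1.6-era API shown in the trace and using forecaster/data_module as stand-ins for the objects train.py builds:

```python
# Hedged sketch, not the repository's actual fix: request the GPU explicitly,
# as the PossibleUserWarning suggests ("Trainer(accelerator='gpu', devices=1)").
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",   # run on the Colab CUDA device instead of the CPU fallback
    devices=1,           # a single GPU, so no 'ddp'/gloo/SyncBatchNorm path is hit
    max_epochs=100,      # explicit limit; silences the "max_epochs was not set" warning
)
# trainer.fit(forecaster, datamodule=data_module)  # forecaster/data_module as in train.py
```

In practice the same effect would have to come through whatever GPU option spacetimeformer's train.py exposes when it constructs its own Trainer; the exact flag is not visible in this log.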