ValueError: SyncBatchNorm layers only work with GPU modules #90

Open
LundinMachine opened this issue Dec 31, 2023 · 2 comments

@LundinMachine

It looks like the GPU in Colab is not being engaged. I tried the A100, V100, T4 GPU, and TPU hardware settings in Colab. Command:
python train.py spacetimeformer mnist --embed_method spatio-temporal --local_self_attn full --local_cross_attn full --global_self_attn full --global_cross_attn full --run_name mnist_spatiotemporal --context_points 10
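For reference, a minimal sanity check (assuming the standard PyTorch install in the same Colab runtime) that shows whether PyTorch can actually see a CUDA device before training starts:

```python
# Run in the same runtime before launching train.py.
# A CPU-only PyTorch build reports False/None here even when Colab has a GPU attached.
import torch

print(torch.__version__)          # e.g. "2.1.0+cpu" vs "2.1.0+cu118"
print(torch.cuda.is_available())  # must be True for the trainer to use the GPU
print(torch.version.cuda)         # None on a CPU-only build
```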

Error trace:
2023-12-30 20:47:30.093968: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-30 20:47:30.094027: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-30 20:47:30.095405: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-30 20:47:31.265649: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Using default wandb log dir path of ./data/STF_LOG_DIR. This can be adjusted with the environment variable STF_LOG_DIR
Forecaster
L2: 1e-06
Linear Window: 0
Linear Shared Weights: False
RevIN: False
Decomposition: False
GlobalSelfAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
GlobalCrossAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
LocalSelfAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
LocalCrossAttn: AttentionLayer(
(inner_attention): FullAttention(
(dropout): Dropout(p=0.0, inplace=False)
)
(query_projection): Linear(in_features=200, out_features=800, bias=True)
(key_projection): Linear(in_features=200, out_features=800, bias=True)
(value_projection): Linear(in_features=200, out_features=800, bias=True)
(out_projection): Linear(in_features=800, out_features=200, bias=True)
(dropout_qkv): Dropout(p=0.0, inplace=False)
)
Using Embedding: spatio-temporal
Time Emb Dim: 6
Space Embedding: True
Time Embedding: True
Val Embedding: True
Given Embedding: True
Null Value: None
Pad Value: None
Reconstruction Dropout: Timesteps 0.05, Standard 0.1, Seq (max len = 5) 0.2, Skip All Drop 1.0
*** Spacetimeformer (v1.5) Summary: ***
Model Dim: 200
FF Dim: 800
Enc Layers: 3
Dec Layers: 3
Embed Dropout: 0.2
FF Dropout: 0.3
Attn Out Dropout: 0.0
Attn Matrix Dropout: 0.0
QKV Dropout: 0.0
L2 Coeff: 1e-06
Warmup Steps: 0
Normalization Scheme: batch
Attention Time Windows: 1
Shifted Time Windows: False
Position Emb Type: abs
Recon Loss Imp: 0.0


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./images/MNIST/raw/train-images-idx3-ubyte.gz
100% 9912422/9912422 [00:00<00:00, 199942825.48it/s]
Extracting ./images/MNIST/raw/train-images-idx3-ubyte.gz to ./images/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./images/MNIST/raw/train-labels-idx1-ubyte.gz
100% 28881/28881 [00:00<00:00, 149735097.43it/s]
Extracting ./images/MNIST/raw/train-labels-idx1-ubyte.gz to ./images/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./images/MNIST/raw/t10k-images-idx3-ubyte.gz
100% 1648877/1648877 [00:00<00:00, 43603948.10it/s]
Extracting ./images/MNIST/raw/t10k-images-idx3-ubyte.gz to ./images/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./images/MNIST/raw/t10k-labels-idx1-ubyte.gz
100% 4542/4542 [00:00<00:00, 32234397.24it/s]
Extracting ./images/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./images/MNIST/raw

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing Trainer(accelerator='dp') has been deprecated in v1.5 and will be removed in v1.7. Use Trainer(strategy='dp') instead.
rank_zero_deprecation(
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:597: UserWarning: 'dp' is not supported on CPUs, hence setting strategy='ddp'.
rank_zero_warn(f"{strategy_flag!r} is not supported on CPUs, hence setting strategy='ddp'.")
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:91: PossibleUserWarning: max_epochs was not set. Setting it to 1000 epochs. To train without an epoch limit, set max_epochs=-1.
rank_zero_warn(
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py:1823: PossibleUserWarning: GPU available but not used. Set accelerator and devices using Trainer(accelerator='gpu', devices=1).
rank_zero_warn(
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=gloo
All distributed processes registered. Starting with 1 processes

Traceback (most recent call last):
File "/content/spacetimeformer/spacetimeformer/train.py", line 869, in
main(args)
File "/content/spacetimeformer/spacetimeformer/train.py", line 849, in main
trainer.fit(forecaster, datamodule=data_module)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1218, in _run
self.strategy.setup(self)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 172, in setup
self.configure_ddp()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 294, in configure_ddp
self.model = self._setup_model(LightningDistributedModule(self.model))
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 178, in _setup_model
return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 809, in init
self._ddp_init_helper(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1140, in _ddp_init_helper
self._passing_sync_batchnorm_handle(self.module)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 2072, in _passing_sync_batchnorm_handle
self._log_and_throw(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
raise err_type(err_msg)
ValueError: SyncBatchNorm layers only work with GPU modules
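The warnings above point at the cause: "'dp' is not supported on CPUs, hence setting strategy='ddp'" and "GPU available: True, used: False" mean Lightning falls back to wrapping the model with DistributedDataParallel on CPU (gloo backend) while the model still contains SyncBatchNorm layers, and DDP rejects that combination. A minimal sketch that reproduces the same ValueError outside the repo (assuming a single-process gloo group; the exact Trainer flags used by train.py are not reproduced here):

```python
# Standalone reproduction sketch: DDP wrapping a CPU module that contains SyncBatchNorm.
# Assumptions: single process, gloo backend, arbitrary layer sizes.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.SyncBatchNorm(8))

# With no CUDA device available, DDP keeps the module on CPU and raises:
# ValueError: SyncBatchNorm layers only work with GPU modules
DistributedDataParallel(model)
```

So the error is a symptom of the run landing on CPU, not of the model itself.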

@pdy265

pdy265 commented Apr 22, 2024

Have you solved this problem?

@KKindom

KKindom commented Jul 10, 2024

I ran into this problem too. I found that you must install the GPU version of PyTorch, and then add --gpus 0 at the end of the training command, like this: spacetimeformer mnist --embed_method spatio-temporal --local_self_attn full --local_cross_attn full --global_self_attn full --global_cross_attn full --run_name mnist_spatiotemporal --context_points 20 --gpus 0
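As a quick follow-up to the fix above, a small check (assuming a Colab GPU runtime is attached) to confirm the reinstalled PyTorch is really a CUDA build before rerunning with --gpus 0:

```python
# Post-install check before relaunching training.
import torch

print(torch.__version__)              # should be a +cuXXX build, not +cpu
assert torch.cuda.is_available(), "PyTorch still cannot see a CUDA device"
print(torch.cuda.get_device_name(0))  # e.g. the T4 / V100 / A100 selected in Colab
```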
