'assert (boxes1[:, 2:] >= boxes1[:, :2]).all()' happened when training #89

Swall0w · 2021-05-25T12:16:24Z

Thanks for your great work!!
When I enabled AMP training on detectron2, I ran into a problem with the boxes during training.

Changed

The difference from the original code is shown below.

SOLVER:
    STEPS: (210000, 250000)
    MAX_ITER: 270000
    AMP:
        ENABLED: true

Error

[05/24 20:54:12 d2.engine.hooks]: Total training time: 0:00:10 (0:00:00 on hooks)
[05/24 20:54:12 d2.utils.events]:  iter: 0    lr: N/A  max_mem: 5095M
Traceback (most recent call last):
  File "train_net.py", line 134, in <module>
    launch(
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 55, in launch
    mp.spawn(
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/train_net.py", line 128, in main
    return trainer.train()
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 332, in run_step
    loss_dict = self.model(data)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/detector.py", line 143, in forward
    loss_dict = self.criterion(output, targets)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 147, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 266, in forward
    cost_giou = -generalized_box_iou(out_bbox, tgt_bbox)
  File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/util/box_ops.py", line 51, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Decreasing the learning rate doesn't help, and this error occurs only with mixed-precision (AMP) training.
Do you have any suggestions for solving this problem?

Thank you.
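
One way to narrow this down, offered as a sketch and not a confirmed fix: the assertion requires x2 >= x1 and y2 >= y1 for every box, and any comparison involving NaN evaluates to False, so a single NaN coordinate is enough to trigger it. The helper below is hypothetical (it is not part of SparseR-CNN or detectron2) and assumes the boxes have been converted to a plain list of [x1, y1, x2, y2] rows, for example with `out_bbox.float().cpu().tolist()` on a tensor of shape [N, 4].

```python
import math


def find_invalid_boxes(boxes):
    """Return (index, reason) pairs for boxes that are non-finite or inverted.

    `boxes` is assumed to be a list of [x1, y1, x2, y2] rows
    (hypothetical input format, chosen for illustration).
    """
    problems = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if not all(math.isfinite(v) for v in (x1, y1, x2, y2)):
            # NaN or +/-inf, which can appear after an overflow in half precision
            problems.append((i, "non-finite coordinate"))
        elif x2 < x1 or y2 < y1:
            # Finite, but the corners are in the wrong order
            problems.append((i, "x2 < x1 or y2 < y1"))
    return problems
```

Calling this just before `generalized_box_iou` and printing the result would show whether the failing boxes are NaN/inf (which points toward a numerical overflow) or merely inverted.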

PeizeSun · 2021-05-31T02:13:52Z

Hi~
Can you try removing GIoU from both the matching and the loss, to see whether this error still occurs?

Swall0w · 2021-06-03T14:37:01Z

@PeizeSun
Thank you for your suggestion.
After commenting out the GIoU, I got a new error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/train_net.py", line 128, in main
    return trainer.train()
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
    self.run_step()
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 332, in run_step
    loss_dict = self.model(data)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/detector.py", line 143, in forward
    loss_dict = self.criterion(output, targets)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 147, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 274, in forward
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
  File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 274, in <listcomp>
    indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
  File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/scipy/optimize/_lsap.py", line 101, in linear_sum_assignment
    a, b = _lsap_module.calculate_assignment(cost_matrix.T)
ValueError: matrix contains invalid numeric entries

PeizeSun · 2021-06-05T10:23:40Z

Can you print out cost_matrix to see which entry is invalid?
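
A minimal sketch of such a check, assuming the cost matrix has been converted to a nested list of floats (for example `c[i].float().cpu().tolist()` for one of the 2-D slices passed to `linear_sum_assignment` in the traceback). The function name is hypothetical and the code only reports which entries are NaN or infinite; it does not attempt to fix them.

```python
import math


def find_invalid_entries(cost_matrix):
    """List (row, col, value) for every NaN or infinite entry.

    `cost_matrix` is assumed to be a nested list of floats
    (hypothetical input format, chosen for illustration).
    """
    bad = []
    for r, row in enumerate(cost_matrix):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((r, c, value))
    return bad
```

If every reported entry sits in the same rows, the corresponding predictions are the likely source; if the whole matrix is affected, one of the cost terms is probably non-finite for all pairs.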

shivamsnaik · 2022-01-22T12:51:30Z

I am getting the same issue when I try to run Sparse R-CNN with a learning rate of 0.02 on 8 GPUs. Did you find a solution to this problem?
It would be a great help if you did solve it, @Swall0w.
