When evaluating the trained model on the Inria dataset, the eval process hangs #16

Open
XiaoyuSun-hub opened this issue Dec 9, 2020 · 4 comments
@XiaoyuSun-hub

XiaoyuSun-hub commented Dec 9, 2020

Hi, I installed the environment on Ubuntu 18.04. I first ran the command

python main.py --config configs/config.inria_dataset_osm_aligned.unet_resnet101_pretrained

After training finished, I ran

python main.py --config configs/config.inria_dataset_osm_aligned.unet_resnet101_pretrained --mode eval

and the program hangs there with the following output:
INFO: Loading defaults from configs/config.defaults.inria_dataset_osm_aligned.json
INFO: Loading defaults from configs/config.defaults.json
INFO: Loading defaults from configs/loss_params.json
INFO: Loading defaults from configs/optim_params.json
INFO: Loading defaults from configs/polygonize_params.json
INFO: Loading defaults from configs/dataset_params.inria_dataset_osm_aligned.json
INFO: Loading defaults from configs/eval_params.inria_dataset.json
INFO: Loading defaults from configs/eval_params.defaults.json
INFO: Loading defaults from configs/backbone_params.unet_resnet101.json
GPU 0 -> Using data from /gimastorage/Xiaoyu/data/AerialImageDataset
INFO: annotations will be loaded from disk
# --- Start evaluating ---#
Saving eval outputs to /gimastorage/Xiaoyu/data/AerialImageDataset/eval_runs/inria_dataset_osm_aligned.unet_resnet101_pretrained | 2020-12-05 09:55:09
Loading best val checkpoint: /home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/runs/inria_dataset_osm_aligned.unet_resnet101_pretrained | 2020-12-05 09:55:09/checkpoints/checkpoint.best_val.epoch_000001.tar
Eval test: 0%| | 0/34 [00:00<?, ?it/s]Traceback (most recent call last):

It stays stuck like that; if I stop the process, it gives the following errors:
Process SpawnProcess-2:
Traceback (most recent call last):
File "/home/sunx/Polygonization-by-Frame-Field-Learning/main.py", line 387, in
Traceback (most recent call last):
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/sunx/Polygonization-by-Frame-Field-Learning/child_processes.py", line 75, in eval_process
evaluate(gpu, config, shared_dict, barrier, eval_ds, backbone)
File "/home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/evaluate.py", line 62, in evaluate
evaluator.evaluate(split_name, eval_ds)
File "/home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/evaluator.py", line 85, in evaluate
inference.inference_with_patching(self.config, self.model, tile_data)
File "/home/sunx/Polygonization-by-Frame-Field-Learning/frame_field_learning/inference.py", line 79, in inference_with_patching
assert len(tile_data["image"].shape) == 4 and tile_data["image"].shape[0] == 1,
AssertionError: When using inference with patching, tile_data should have a batch size of 1, with image's shape being (1, C, H, W), not torch.Size([6, 3, 725, 725])

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 26, in _wrap
sys.exit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
main() File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/util.py", line 334, in _exit_function
p.join()
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)

File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/sunx/Polygonization-by-Frame-Field-Learning/main.py", line 381, in main
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
launch_eval(args)
File "/home/sunx/Polygonization-by-Frame-Field-Learning/main.py", line 321, in launch_eval
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/sunx/Polygonization-by-Frame-Field-Learning/lydorn_utils/lydorn_utils/async_utils.py", line 8, in async_func_wrapper
if not out_queue.empty():
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/queues.py", line 123, in empty
return not self._poll()
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 924, in wait
selector.register(obj, selectors.EVENT_READ)
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/selectors.py", line 352, in register
key = super().register(fileobj, events, data)
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/selectors.py", line 244, in register
self._fd_to_key[key.fd] = key
KeyboardInterrupt
torch.multiprocessing.spawn(eval_process, nprocs=args.gpus, args=(config, shared_dict, barrier))
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 75, in join
ready = multiprocessing.connection.wait(
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/multiprocessing/connection.py", line 930, in wait
ready = selector.select(timeout)
File "/home/sunx/anaconda3/envs/frame_field1/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
Eval test: 0%| | 0/34 [13:02<?, ?it/s]

Process finished with exit code 130

I looked at the code of the inference file:
def inference_with_patching(config, model, tile_data):
    assert len(tile_data["image"].shape) == 4 and tile_data["image"].shape[0] == 1, \
        f"When using inference with patching, tile_data should have a batch size of 1, " \
        f"with image's shape being (1, C, H, W), not {tile_data['image'].shape}"

Here the assert requires the data to have a certain shape (a batch size of 1), which is different from what is actually passed in (a batch of 6 patches of size 725×725).
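To make the failure concrete, here is a minimal sketch of the same check (the helper name ensure_single_tile is mine, not the repository's): patched inference wants one tile of shape (1, C, H, W), while the eval loader delivered a batch of 6.

    import torch

    def ensure_single_tile(batch):
        # inference_with_patching processes one tile per call, i.e. image shape (1, C, H, W).
        # With an eval batch size > 1 the loader yields e.g. (6, 3, 725, 725) and the assert fires.
        image = batch["image"]
        if image.dim() != 4 or image.shape[0] != 1:
            raise ValueError(
                f"Patched inference needs a batch size of 1, got shape {tuple(image.shape)}; "
                f"re-run eval with --eval_batch_size 1 or split the batch into single tiles."
            )

    # Reproduces the failing case from the traceback above:
    # ensure_single_tile({"image": torch.zeros(6, 3, 725, 725)})  # raises ValueError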

I ran the eval command twice; the output above is from the second run, so there is no log about the patching step. On the first run, it patches the test data first.

Another thing I did was reduce the dataset size by changing the code inside inria_aerial.py:

CITY_METADATA_DICT = {
    "bellingham": {
        "fold": "test",
        "pixelsize": 0.3,
        "numbers": list([2, 3]),
        "mean": [0.3766195, 0.391402, 0.32659722],
        "std": [0.18134978, 0.16412577, 0.16369793],
    },
    "austin": {
        "fold": "train",
        "pixelsize": 0.3,
        "numbers": list(range(1, 2)),
        "mean": [0.39584444, 0.40599795, 0.38298687],
        "std": [0.17341954, 0.16856597, 0.16360443],
    },
}

@Dingyuan-Chen

I have run into the same problem as yours. Have you solved it?

@Aria918

Aria918 commented Mar 18, 2022

I also encountered this problem. Have you solved it?

@kriti115

I was able to overcome this issue by specifying the eval batch size when running main.py, like so:

   python main.py --config config_name --mode eval --eval_batch_size 1

But I immediately faced another problem, which says:

  RuntimeError: The size of tensor a (1024) must match the size of tensor b (299) at non-singleton dimension 3 in inference.py

I tried changing the patch_size in the config file to 299, but that leads to another error. I would be glad if someone who has come across this issue could shed some light on it.

Thank you.
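For what it's worth, the 1024-vs-299 mismatch looks like a disagreement between the patch grid and the tensor it is stitched back into. The snippet below is not the repository's inference.py, just a generic sliding-window sketch that clamps the last window to the image border so every patch has the same size; it may help when experimenting with patch_size values:

    import torch

    def patch_offsets(length, patch_size, stride):
        # Start offsets of a sliding window that always covers [0, length),
        # clamping the last window to the border so all patches are equal-sized.
        offsets = list(range(0, max(length - patch_size, 0) + 1, stride))
        if offsets[-1] + patch_size < length:
            offsets.append(length - patch_size)
        return offsets

    image = torch.zeros(1, 3, 725, 725)            # one tile, as the assertion above requires
    ys = patch_offsets(725, 512, 256)              # [0, 213]
    xs = patch_offsets(725, 512, 256)
    patches = [image[..., y:y + 512, x:x + 512] for y in ys for x in xs]
    assert all(p.shape[-2:] == (512, 512) for p in patches)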

@Shizw695

Shizw695 commented Nov 2, 2023

I found a solution. Set "num_workers": 1 in 'config.defaults.json'.
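If it helps others, the hang behaves like a stuck DataLoader worker. The snippet below is only a generic PyTorch illustration (a dummy dataset, not the repository's loader) of where num_workers comes in; 1, or 0 for fully in-process loading, is the conservative setting:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Dummy stand-in for the Inria eval tiles (725x725 RGB).
    dataset = TensorDataset(torch.zeros(4, 3, 725, 725))

    # num_workers=1 mirrors the suggested config change; batch_size=1 also keeps
    # patched inference happy (see the assertion discussed above).
    eval_loader = DataLoader(dataset, batch_size=1, num_workers=1, pin_memory=True)

    for (tile,) in eval_loader:
        print(tile.shape)  # torch.Size([1, 3, 725, 725])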
