treeDepth-19-07-distributedTraining-128-grayscale.log

Starting the distributed training at Sat 20 Jul 09:09:30 BST 2024
Command running in screen session distributed_train_0_20240720-090930 on gpu36
Command running in screen session distributed_train_1_20240720-090930 on gpu31
Command running in screen session distributed_train_2_20240720-090930 on gpu03
Command running in screen session distributed_train_3_20240720-090930 on gpu32
Command running in screen session distributed_train_4_20240720-090930 on gpu28
Command running in screen session distributed_train_5_20240720-090930 on gpu20
Command running in screen session distributed_train_6_20240720-090930 on gpu22
Command running in screen session distributed_train_7_20240720-090930 on gpu19
Log file path is /homes/tp4618/Documents/bitbucket/SuperGlueThesis/external/glue-factory/treeDepth-19-07-distributedTraining-128-grayscale.log
Distributed training completed at Sat 20 Jul 09:09:44 BST 2024
[07/20/2024 09:09:50 gluefactory INFO] Starting experiment sp+lg_20-07-treedepth-grayscale-64
[07/20/2024 09:10:01 gluefactory INFO] Starting experiment sp+lg_20-07-treedepth-grayscale-64
[07/20/2024 09:10:00 gluefactory INFO] Training in distributed mode with 8 GPUs
[07/20/2024 09:10:06 gluefactory INFO] Will fine-tune from weights of pretrain_lightglue
[07/20/2024 09:10:07 gluefactory INFO] Training in distributed mode with 8 GPUs
[07/20/2024 09:10:13 gluefactory INFO] Will fine-tune from weights of pretrain_lightglue
[07/20/2024 09:10:13 gluefactory INFO] Training in distributed mode with 8 GPUs
[07/20/2024 09:10:15 gluefactory INFO] Using device cuda:0
[07/20/2024 09:10:15 gluefactory.datasets.base_dataset INFO] Creating dataset TreeDepth
[07/20/2024 09:10:15 gluefactory.datasets.treedepth INFO] Initialized TreeDepth dataset with configuration: {'name': 'treedepth', 'num_workers': 1, 'train_batch_size': '???', 'val_batch_size': '???', 'test_batch_size': '???', 'shuffle_training': False, 'batch_size': 8, 'num_threads': 1, 'seed': 0, 'prefetch_factor': 2, 'data_dir': 'syntheticForestData/', 'depth_subpath': 'depthData/', 'image_subpath': 'imageData/', 'info_dir': 'fileLists', 'train_split': 'train_scenes_clean.txt', 'train_num_per_scene': 300, 'val_split': 'valid_scenes_clean.txt', 'val_num_per_scene': None, 'val_pairs': 'valid_pairs.txt', 'test_split': 'test_scenes_clean.txt', 'test_num_per_scene': None, 'test_pairs': None, 'views': 2, 'min_overlap': 0.1, 'max_overlap': 0.7, 'num_overlap_bins': 3, 'sort_by_overlap': False, 'triplet_enforce_overlap': False, 'read_depth': True, 'read_image': True, 'grayscale': False, 'preprocessing': {'resize': None, 'edge_divisible_by': None, 'side': 'long', 'interpolation': 'bilinear', 'align_corners': None, 'antialias': True, 'square_pad': False, 'add_padding_mask': False}, 'p_rotate': 0.0, 'reseed': False, 'load_features': {'do': False, 'path': 'exports/megadepth-undist-depth-r1024_SP-k2048-nms3/{scene}.h5', 'data_keys': None, 'device': None, 'trainable': False, 'add_data_path': True, 'collate': False, 'scale': ['keypoints', 'lines', 'orig_lines'], 'padding_fn': 'pad_local_features', 'padding_length': 2048, 'numeric_type': 'float32'}}
[07/20/2024 09:10:16 gluefactory.datasets.treedepth INFO] Sampling new items for train with seed 0.
[07/20/2024 09:10:17 gluefactory.datasets.treedepth INFO] Sampling new ite
[07/20/2024 09:10:24 gluefactory INFO] Parameters with scaled learning rate:
{}
[07/20/2024 09:10:24 gluefactory INFO] Training with mixed_precision=None
[07/20/2024 09:10:24 gluefactory INFO] Starting training with configuration:
data:
  name: treedepth
  train_split: train_scenes_clean.txt
  train_num_per_scene: 300
  val_split: valid_scenes_clean.txt
  val_pairs: valid_pairs.txt
  min_overlap: 0.1
  max_overlap: 0.7
  num_overlap_bins: 3
  read_depth: true
  read_image: true
  batch_size: 64
  num_workers: 4
  load_features:
    do: false
    path: exports/megadepth-undist-depth-r1024_SP-k2048-nms3/{scene}.h5
    padding_length: 2048
    padding_fn: pad_local_features
model:
  name: two_view_pipeline
  extractor:
    name: extractors.superpoint_open
    max_num_keypoints: 2048
    force_num_keypoints: true
    detection_threshold: -1
    nms_radius: 3
    trainable: false
  ground_truth:
    name: matchers.depth_matcher
    th_positive: 3
    th_negative: 5
    th_epi: 5
  matcher:
    name: matchers.lightglue
    filter_threshold: 0.1
    flash: false
    checkpointed: true
  allow_no_extract: true
train:
  seed: 0
  epochs: 40
  optimizer: adam
  opt_regexp: null
  optimizer_options: {}
  lr: 0.0001
  lr_schedule:
    type: exp
    start: 30
    exp_div_10: 10
    on_epoch: true
    factor: 1.0
    options: {}
  lr_scaling:
  - - 100
    - - dampingnet.const
  eval_every_iter: 500
  save_every_iter: 5000
  log_every_iter: 100
  log_grad_every_iter: null
  test_every_epoch: 1
  keep_last_checkpoints: 10
  load_experiment: sp+lg_densehomography
  median_metrics: []
  recall_metrics: {}
  pr_metrics: {}
  best_key: loss/total
  dataset_callback_fn: sample_new_items
  dataset_callback_on_val: false
  clip_grad: 1.0
  pr_curves: {}
  plot:
  - 5
  - gluefactory.visualization.visualize_batch.make_match_figures
  submodules: []
benchmarks: null
image_mode: null

[07/20/2024 09:10:24 gluefactory INFO] Starting epoch 0


loading pretrain lightglue, replacing model


/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/outputs/training/sp+lg_densehomography/checkpoint_39_61799.tar
loading pretrain lightglue, replacing model
Configuration fields in conf.model:
renaming old state dictionary keys
lines 410 of train.py with renaming keys may not work?
{'name': 'two_view_pipeline', 'extractor': {'name': 'extractors.superpoint_open', 'max_num_keypoints': 2048, 'force_num_keypoints': True, 'detection_threshold': -1, 'nms_radius': 3, 'trainable': False}, 'ground_truth': {'name': 'matchers.depth_matcher', 'th_positive': 3, 'th_negative': 5, 'th_epi': 5}, 'matcher': {'name': 'matchers.lightglue', 'filter_threshold': 0.1, 'flash': False, 'checkpointed': True}, 'allow_no_extract': True}
the rank is 3 and world size is 8
cuda:0
23 initpy data sets path gluefactory.datasets.treedepth
205 tools.py utils the classes are  [('TreeDepth', <class 'gluefactory.datasets.treedepth.TreeDepth'>)]
136 in treedepth
conf keys: dict_keys(['name', 'num_workers', 'train_batch_size', 'val_batch_size', 'test_batch_size', 'shuffle_training', 'batch_size', 'num_threads', 'seed', 'prefetch_factor', 'data_dir', 'depth_subpath', 'image_subpath', 'info_dir', 'train_split', 'train_num_per_scene', 'val_split', 'val_num_per_scene', 'val_pairs', 'test_split', 'test_num_per_scene', 'test_pairs', 'views', 'min_overlap', 'max_overlap', 'num_overlap_bins', 'sort_by_overlap', 'triplet_enforce_overlap', 'read_depth', 'read_image', 'grayscale', 'preprocessing', 'p_rotate', 'reseed', 'load_features'])
Loaded training dataset: treedepth
Using the same dataset for validation as for training.
calling get_data_loader on dataset object - dataset.get_data_loader('train', distributed=args.distributed, shuffle=False)


scene_lists_path: /homes/tp4618/Documents/bitbucket/SuperGlueThesis/external/glue-factory/gluefactory/datasets/tartanSceneLists(Full)


root and info /vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/data/syntheticForestData fileLists
!!!!!sample_new_items: num_per_scene: 300
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices
Not using the overlap matrices


scene_lists_path: /homes/tp4618/Documents/bitbucket/SuperGlueThesis/external/glue-factory/gluefactory/datasets/tartanSceneLists(Full)


root and info /vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/data/syntheticForestData fileLists
!!!!!sample_new_items: num_per_scene: None
205 tools.py utils the classes are  [('TwoViewPipeline', <class 'gluefactory.models.two_view_pipeline.TwoViewPipeline'>)]
ModuleNotFoundError: extractors.superpoint_open
205 tools.py utils the classes are  [('SuperPoint', <class 'gluefactory.models.extractors.superpoint_open.SuperPoint'>)]
None


Superpoint grayscale mode selected
ModuleNotFoundError: matchers.lightglue
205 tools.py utils the classes are  []


 if Path(conf.weights).exists() 368 lightglue loading weights: /homes/tp4618/Documents/bitbucket/SuperGlueThesis/external/glue-factory/outputs/training/pretrain_lightglue/superpoint_lightglue.pth
state dict lightlue 385 9 to replace keys in range
remvoing log assingments
ModuleNotFoundError: matchers.depth_matcher
205 tools.py utils the classes are  [('DepthMatcher', <class 'gluefactory.models.matchers.depth_matcher.DepthMatcher'>)]
initcp not none
args distibuted
device ids are [device(type='cuda', index=0)]
/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
[07/20/2024 09:10:48 gluefactory INFO] [E 0 | it 0] loss {total 8.629E+00, last 9.344E+00, assignment_nll 9.344E+00, nll_pos 1.730E+01, nll_neg 1.387E+00, num_matchable 8.428E+01, num_unmatchable 8.849E+02, confidence 4.062E-01, row_norm 4.628E-01}
debugging in data_to_log, iteraiton: 0
Data to be logged to training DataFrame: {'epoch': 0, 'iteration': 0, 'total': 8.628652572631836, 'last': 9.344147682189941, 'assignment_nll': 9.344147682189941, 'nll_pos': 17.300830841064453, 'nll_neg': 1.387465476989746, 'num_matchable': 84.28125, 'num_unmatchable': 884.921875, 'confidence': 0.40620216727256775, 'row_norm': 0.4627940356731415}
/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py:905: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  results_df = pd.concat([results_df, new_entry], ignore_index=True)
Data to be logged: {'epoch': 0, 'iteration': 0, 'match_recall': 0.00025607182371682594, 'match_precision': 0.013205019423105953, 'accuracy': 0.8352571979802613, 'average_precision': 8.174625686510227e-05, 'loss/total': 7.098547302416068, 'loss/last': 7.098547302416068, 'loss/assignment_nll': 7.098547302416068, 'loss/nll_pos': 13.237686495483912, 'loss/nll_neg': 0.9594081233013393, 'loss/num_matchable': 201.72711150520948, 'loss/num_unmatchable': 1079.2036133894924, 'loss/row_norm': 0.4093568108875137}
[07/20/2024 09:26:18 gluefactory ERROR] CUDA Out of Memory Error ???
[07/20/2024 09:26:18 gluefactory ERROR] Hostname: gpu03.doc.ic.ac.uk
[07/20/2024 09:26:18 gluefactory ERROR] Error occurred at: 2024-07-20 09:26:18
[07/20/2024 09:26:18 gluefactory ERROR] Unable to get GPU Memory Usage: 'allocated_bytes.all.current'
Traceback (most recent call last):
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py", line 1168, in <module>
    torch.multiprocessing.spawn(
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py", line 1083, in main_worker
    training(rank, conf, output_dir, args, results_df, train_df)
  File "/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py", line 917, in training
    filename = f"{conf.train.load_experiment}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl"
AttributeError: module 'datetime' has no attribute 'now'

/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py:905: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  results_df = pd.concat([results_df, new_entry], ignore_index=True)
Data to be logged: {'epoch': 0, 'iteration': 0, 'match_recall': 0.0002560694585063969, 'match_precision': 0.013205019423105953, 'accuracy': 0.8352570635229647, 'average_precision': 8.174625686510227e-05, 'loss/total': 7.0985433333888865, 'loss/last': 7.0985433333888865, 'loss/assignment_nll': 7.0985433333888865, 'loss/nll_pos': 13.237677355770312, 'loss/nll_neg': 0.9594093340512417, 'loss/num_matchable': 201.72777654622035, 'loss/num_unmatchable': 1079.2037242296608, 'loss/row_norm': 0.4093568108875137}
[07/20/2024 09:26:24 gluefactory ERROR] CUDA Out of Memory Error ???
[07/20/2024 09:26:24 gluefactory ERROR] Hostname: gpu22.doc.ic.ac.uk
[07/20/2024 09:26:24 gluefactory ERROR] Error occurred at: 2024-07-20 09:26:24
[07/20/2024 09:26:24 gluefactory ERROR] Unable to get GPU Memory Usage: 'allocated_bytes.all.current'
Traceback (most recent call last):
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py", line 1168, in <module>
    torch.multiprocessing.spawn(
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/homes/tp4618/Documents/bitbucket/miniconda3/envs/GlueFactory/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py", line 1083, in main_worker
    training(rank, conf, output_dir, args, results_df, train_df)
  File "/vol/bitbucket/tp4618/SuperGlueThesis/external/glue-factory/gluefactory/train.py", line 917, in training
    filename = f"{conf.train.load_experiment}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl"
AttributeError: module 'datetime' has no attribute 'now'
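
Note on the failure above: both tracebacks end in an AttributeError at train.py line 917, not in a CUDA allocation failure, so the "CUDA Out of Memory Error ???" label is likely printed by a catch-all handler. The message "module 'datetime' has no attribute 'now'" means the name `datetime` is bound to the module rather than the class at that point. The sketch below is one possible fix, assuming train.py uses `import datetime`; the helper name `make_results_filename` is hypothetical and does not come from the log.

```python
# Minimal sketch of a fix for the failing line (train.py:917 in the traceback).
# Assumption: train.py currently has `import datetime`, so `datetime.now()`
# looks up `now` on the module and raises AttributeError.
from datetime import datetime  # bind the class, which provides now()
from typing import Optional


def make_results_filename(experiment: str, now: Optional[datetime] = None) -> str:
    """Build '<experiment>_<YYYYmmdd_HHMMSS>.pkl'.

    Hypothetical helper for illustration; `now` is injectable so the
    result can be checked deterministically.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{experiment}_{stamp}.pkl"


if __name__ == "__main__":
    # Timestamp taken from the first error record in the log.
    print(make_results_filename("sp+lg_densehomography",
                                datetime(2024, 7, 20, 9, 26, 18)))
    # prints: sp+lg_densehomography_20240720_092618.pkl
```

If the `import datetime` form has to stay because other code relies on it, the equivalent call is `datetime.datetime.now()`.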