diff --git a/README.md b/README.md
index cc5201369..84edcc202 100644
--- a/README.md
+++ b/README.md
@@ -20,8 +20,7 @@
 * [2021.02.03] Support [EfficientNet-Lite](https://github.com/RangiLyu/EfficientNet-Lite) and [Rep-VGG](https://github.com/DingXiaoH/RepVGG) backbone. Please check the [config folder](config/). Download models in [Model Zoo](#model-zoo)
 * [2021.01.10] **NanoDet-g** with lower memory access cost, which designed for edge NPU or GPU, is now available!
-  Check [config/nanodet-g.yml](config/nanodet-g.yml) and download:
-  [COCO pre-trained model(Google Drive)](https://drive.google.com/file/d/10uW7oqZKw231l_tr4C1bJWkbCXgBf7av/view?usp=sharing) | [(BaiduDisk百度网盘)](https://pan.baidu.com/s/1IJLdtLBvmQVOmzzNY_Ci5A) code:otcd
+  Check [config/nanodet-g.yml](config/nanodet-g.yml) and download it from the [Model Zoo](#model-zoo).
 More...
@@ -93,9 +92,8 @@ Inference using [Alibaba's MNN framework](https://github.com/alibaba/MNN) is in
 ### Pytorch demo
 
 First, install requirements and setup NanoDet following installation guide. Then download COCO pretrain weight from here
-👉[COCO pretrain weight for torch>=1.6(Google Drive)](https://drive.google.com/file/d/1EhMqGozKfqEfw8y9ftbi1jhYu86XoW62/view?usp=sharing) | [(百度网盘)](https://pan.baidu.com/s/1LCnmj2Pqhv0tsDX__1j2gg) code:6au1
-👉[COCO pretrain weight for torch<=1.5(Google Drive)](https://drive.google.com/file/d/10h-0qLMCgYvWQvKULqbkLvmirFR-w9NN/view?usp=sharing) | [(百度云盘)](https://pan.baidu.com/s/1OTcPiajCcqKLg3Q0vwho3A) code:topw
+👉[COCO pretrain weight (Google Drive)](https://drive.google.com/file/d/1ZkYucuLusJrCb_i63Lid0kYyyLvEiGN3/view?usp=sharing)
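For readers who want to drive the model from Python rather than `demo/demo.py`, a minimal sketch using helpers that appear elsewhere in this diff (`load_config`, `build_model`, `load_model_weight`); the config and weight paths are placeholders:

```python
import torch
from nanodet.util import cfg, load_config, load_model_weight, Logger
from nanodet.model.arch import build_model

load_config(cfg, 'config/nanodet-m.yml')                       # placeholder config path
model = build_model(cfg.model)                                 # build detector from config
checkpoint = torch.load('nanodet_m.ckpt', map_location='cpu')  # placeholder weight path
load_model_weight(model, checkpoint, Logger(-1, cfg.save_dir)) # strips 'module.'/'model.' prefixes
model.eval()  # ready for inference on preprocessed image batches
```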
 
 * Inference images
 
@@ -141,13 +139,13 @@ Besides, We provide a notebook [here](./demo/demo-inference-with-pytorch.ipynb)
 2. Install pytorch
 
    ```shell script
-   conda install pytorch torchvision cudatoolkit=11.0 -c pytorch
+   conda install pytorch torchvision cudatoolkit=11.1 -c pytorch
    ```
 
 3. Install requirements
 
    ```shell script
-   pip install Cython termcolor numpy tensorboard pycocotools matplotlib pyaml opencv-python tqdm
+   pip install Cython termcolor numpy tensorboard pycocotools matplotlib pyaml opencv-python tqdm pytorch-lightning torchmetrics
    ```
 
 4. Setup NanoDet
@@ -166,14 +164,14 @@ NanoDet supports variety of backbones. Go to the [***config*** folder](config/)
 
 Model | Backbone |Resolution|COCO mAP| FLOPS |Params | Pre-train weight |
 :--------------------:|:------------------:|:--------:|:------:|:-----:|:-----:|:-----:|
-NanoDet-m | ShuffleNetV2 1.0x | 320*320 | 20.6 | 0.72B | 0.95M | [Download](https://drive.google.com/file/d/10h-0qLMCgYvWQvKULqbkLvmirFR-w9NN/view?usp=sharing) |
-NanoDet-m-416 | ShuffleNetV2 1.0x | 416*416 | 23.5 | 1.2B | 0.95M | [Download](https://drive.google.com/file/d/1h6TBy1tx4faIBKHnYeg0QwzFF6wlFBEd/view?usp=sharing)|
-NanoDet-t (***NEW***) | ShuffleNetV2 1.0x | 320*320 | 21.7 | 0.96B | 1.36M | [Download](https://drive.google.com/file/d/1O2iz-aaDiQHJNfocInpFrY8ZFMrT3M1r/view?usp=sharing) |
-NanoDet-g | Custom CSP Net | 416*416 | 22.9 | 4.2B | 3.81M | [Download](https://drive.google.com/file/d/10uW7oqZKw231l_tr4C1bJWkbCXgBf7av/view?usp=sharing)|
-NanoDet-EfficientLite | EfficientNet-Lite0 | 320*320 | 24.7 | 1.72B | 3.11M | [Download](https://drive.google.com/file/d/1u_t9L0jqjH858gCR-vpzWzu9FexQOSmJ/view?usp=sharing)|
-NanoDet-EfficientLite | EfficientNet-Lite1 | 416*416 | 30.3 | 4.06B | 4.01M | [Download](https://drive.google.com/file/d/1y9z7BToAZOQ1pKbOjNjf79YMuFuDTvfq/view?usp=sharing) |
-NanoDet-EfficientLite | EfficientNet-Lite2 | 512*512 | 32.6 | 7.12B | 4.71M | [Download](https://drive.google.com/file/d/1UMXJJxRkRzgTvN1iRKeDZqGpkLxK3X4K/view?usp=sharing) |
-NanoDet-RepVGG | RepVGG-A0 | 416*416 | 27.8 | 11.3B | 6.75M | [Download](https://drive.google.com/file/d/1bsT9Ksxws2O3g_IUuUwp0QwZcJlqJw3S/view?usp=sharing) |
+NanoDet-m | ShuffleNetV2 1.0x | 320*320 | 20.6 | 0.72B | 0.95M | [Download](https://drive.google.com/file/d/1ZkYucuLusJrCb_i63Lid0kYyyLvEiGN3/view?usp=sharing) |
+NanoDet-m-416 | ShuffleNetV2 1.0x | 416*416 | 23.5 | 1.2B | 0.95M | [Download](https://drive.google.com/file/d/1jY-Um2VDDEhuVhluP9lE70rG83eXQYhV/view?usp=sharing)|
+NanoDet-t (***NEW***) | ShuffleNetV2 1.0x | 320*320 | 21.7 | 0.96B | 1.36M | [Download](https://drive.google.com/file/d/1TqRGZeOKVCb98ehTaE0gJEuND6jxwaqN/view?usp=sharing) |
+NanoDet-g | Custom CSP Net | 416*416 | 22.9 | 4.2B | 3.81M | [Download](https://drive.google.com/file/d/1f2lH7Ae1AY04g20zTZY7JS_dKKP37hvE/view?usp=sharing)|
+NanoDet-EfficientLite | EfficientNet-Lite0 | 320*320 | 24.7 | 1.72B | 3.11M | [Download](https://drive.google.com/file/d/1Dj1nBFc78GHDI9Wn8b3X4MTiIV2el8qP/view?usp=sharing)|
+NanoDet-EfficientLite | EfficientNet-Lite1 | 416*416 | 30.3 | 4.06B | 4.01M | [Download](https://drive.google.com/file/d/1ernkb_XhnKMPdCBBtUEdwxIIBF6UVnXq/view?usp=sharing) |
+NanoDet-EfficientLite | EfficientNet-Lite2 | 512*512 | 32.6 | 7.12B | 4.71M | [Download](https://drive.google.com/file/d/11V20AxXe6bTHyw3aMkgsZVzLOB31seoc/view?usp=sharing) |
+NanoDet-RepVGG | RepVGG-A0 | 416*416 | 27.8 | 11.3B | 6.75M | [Download](https://drive.google.com/file/d/1nWZZ1qXb1HuIXwPSYpEyFHHqX05GaFer/view?usp=sharing) |
 
 ****
 
@@ -194,9 +192,9 @@ NanoDet-RepVGG | RepVGG-A0 | 416*416 | 27.8 | 11.3B | 6.75M |
 
    Change ***num_classes*** in ***model->arch->head***.
 
-   Change image path and annotation path in both ***data->train data->val***.
+   Change image path and annotation path in both ***data->train*** and ***data->val***.
 
-   Set gpu, workers and batch size in ***device*** to fit your device.
+   Set gpu ids, num workers and batch size in ***device*** to fit your device.
 
    Set ***total_epochs***, ***lr*** and ***lr_schedule*** according to your dataset and batchsize.
 
@@ -204,25 +202,33 @@ NanoDet-RepVGG | RepVGG-A0 | 416*416 | 27.8 | 11.3B | 6.75M |
 3. **Start training**
 
-   For single GPU, run
+   NanoDet now uses [pytorch lightning](https://github.com/PyTorchLightning/pytorch-lightning) for training.
+
+   For both single-GPU and multi-GPU training, run:
+
+   ```shell script
+   python tools/train.py CONFIG_FILE_PATH
+   ```
+
+   The old training script is deprecated and will be deleted in the next version. If you still want to use it,
+   follow this:
+
+   For single GPU, run
 
    ```shell script
-   python tools/train.py CONFIG_PATH
+   python tools/deprecated/train.py CONFIG_FILE_PATH
    ```
 
    For multi-GPU, NanoDet using distributed training. (Notice: Windows not support distributed training before pytorch1.7) Please run
 
    ```shell script
-   python -m torch.distributed.launch --nproc_per_node=GPU_NUM --master_port 29501 tools/train.py CONFIG_PATH
+   python -m torch.distributed.launch --nproc_per_node=GPU_NUM --master_port 29501 tools/deprecated/train.py CONFIG_FILE_PATH
    ```
 
-   **Experimental**:
-
-   Training with [pytorch lightning](https://github.com/PyTorchLightning/pytorch-lightning), no matter single or multi GPU just run:
-
-   ```shell script
-   python tools/train_pl.py CONFIG_PATH
-   ```
+
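Under the hood, the new `tools/train.py` (shown later in this diff) wraps the detector in a `TrainingTask` LightningModule and hands it to `pl.Trainer`. Condensed, and assuming `cfg` and the dataloaders are built as in that script, the flow is roughly:

```python
import pytorch_lightning as pl
from nanodet.trainer.task import TrainingTask
from nanodet.evaluator import build_evaluator

# cfg, val_dataset, train_dataloader, val_dataloader built as in tools/train.py
evaluator = build_evaluator(cfg, val_dataset)
task = TrainingTask(cfg, evaluator)          # LightningModule wrapping the model
trainer = pl.Trainer(default_root_dir=cfg.save_dir,
                     max_epochs=cfg.schedule.total_epochs,
                     gpus=cfg.device.gpu_ids,
                     accelerator='ddp')      # same backend for 1 or N GPUs
trainer.fit(task, train_dataloader, val_dataloader)
```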
 
 4. **Visualize Logs**
 
@@ -232,7 +238,7 @@ NanoDet-RepVGG | RepVGG-A0 | 416*416 | 27.8 | 11.3B | 6.75M |
 
    ```shell script
    cd <YOUR_SAVE_DIR>
-   tensorboard --logdir ./logs
+   tensorboard --logdir ./
    ```
 
 ****
diff --git a/docs/config_file_detail.md b/docs/config_file_detail.md
index 2d034492a..075036e13 100644
--- a/docs/config_file_detail.md
+++ b/docs/config_file_detail.md
@@ -15,7 +15,7 @@ Change save_dir to where you want to save logs and models. If path not exist, Na
 ```yaml
 model:
   arch:
-    name: xxx
+    name: OneStageDetector
     backbone: xxx
     fpn: xxx
     head: xxx
diff --git a/nanodet/evaluator/coco_detection.py b/nanodet/evaluator/coco_detection.py
index 806f4ebb7..6b744a63a 100644
--- a/nanodet/evaluator/coco_detection.py
+++ b/nanodet/evaluator/coco_detection.py
@@ -48,7 +48,7 @@ def results2json(self, results):
                 json_results.append(detection)
         return json_results
 
-    def evaluate(self, results, save_dir, epoch, logger, rank=-1):
+    def evaluate(self, results, save_dir, rank=-1):
         results_json = self.results2json(results)
         json_path = os.path.join(save_dir, 'results{}.json'.format(rank))
         json.dump(results_json, open(json_path, 'w'))
@@ -61,5 +61,4 @@ def evaluate(self, results, save_dir, epoch, logger, rank=-1):
         eval_results = {}
         for k, v in zip(self.metric_names, aps):
             eval_results[k] = v
-            logger.scalar_summary('Val_coco_bbox/' + k, 'val', v, epoch)
         return eval_results
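With this change, `evaluate()` no longer takes `epoch` or `logger`; callers log the returned metrics themselves (as the trainer and task changes below do). A hedged sketch of the new contract, with `results` keyed by image id as in `validation_epoch_end`:

```python
# results: {img_id: detections}, accumulated from predict()/validation_step()
eval_results = evaluator.evaluate(results, cfg.save_dir, rank=-1)  # epoch/logger args are gone
for k, v in eval_results.items():   # the caller now owns metric logging
    print('Val_metrics/{}: {}'.format(k, v))
```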
diff --git a/nanodet/trainer/task.py b/nanodet/trainer/task.py
index 43ae72e23..ca22c615c 100644
--- a/nanodet/trainer/task.py
+++ b/nanodet/trainer/task.py
@@ -15,6 +15,7 @@
 import copy
 import os
 import warnings
+import json
 import torch
 import logging
 from pytorch_lightning import LightningModule
@@ -27,25 +28,20 @@
 class TrainingTask(LightningModule):
     """
     Pytorch Lightning module of a general training task.
+    Including training, evaluating and testing.
+    Args:
+        cfg: Training configurations
+        evaluator: Evaluator for evaluating the model performance.
     """
 
-    def __init__(self, cfg, evaluator=None, logger=None):
-        """
-
-        Args:
-            cfg: Training configurations
-            evaluator:
-            logger:
-        """
+    def __init__(self, cfg, evaluator=None):
         super(TrainingTask, self).__init__()
         self.cfg = cfg
         self.model = build_model(cfg.model)
         self.evaluator = evaluator
-        self._logger = logger
         self.save_flag = -10
         self.log_style = 'NanoDet'  # Log style. Choose between 'NanoDet' or 'Lightning'
         # TODO: use callback to log
-        # TODO: remove _logger
         # TODO: batch eval
         # TODO: support old checkpoint
 
@@ -54,7 +50,7 @@ def forward(self, x):
         return x
 
     @torch.no_grad()
-    def predict(self, batch, batch_idx, dataloader_idx):
+    def predict(self, batch, batch_idx=None, dataloader_idx=None):
         preds = self.forward(batch['img'])
         results = self.model.head.post_process(preds, batch)
         return results
@@ -103,11 +99,17 @@ def validation_step(self, batch, batch_idx):
         return res
 
     def validation_epoch_end(self, validation_step_outputs):
+        """
+        Called at the end of the validation epoch with the outputs of all validation steps.
+        Evaluate the results and save the best model.
+        Args:
+            validation_step_outputs: A list of val outputs
+
+        """
         results = {}
         for res in validation_step_outputs:
             results.update(res)
-        eval_results = self.evaluator.evaluate(results, self.cfg.save_dir, self.current_epoch+1,
-                                               self._logger, rank=self.local_rank)
+        eval_results = self.evaluator.evaluate(results, self.cfg.save_dir, rank=self.local_rank)
         metric = eval_results[self.cfg.evaluator.save_key]
         # save best model
         if metric > self.save_flag:
@@ -125,9 +127,39 @@ def validation_epoch_end(self, validation_step_outputs):
             warnings.warn('Warning! Save_key is not in eval results! Only save model last!')
         if self.log_style == 'Lightning':
             for k, v in eval_results.items():
-                self.log('Val/' + k, v, on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
+                self.log('Val_metrics/' + k, v, on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
+        elif self.log_style == 'NanoDet':
+            for k, v in eval_results.items():
+                self.scalar_summary('Val_metrics/' + k, 'Val', v, self.current_epoch+1)
+
+    def test_step(self, batch, batch_idx):
+        dets = self.predict(batch, batch_idx)
+        res = {batch['img_info']['id'].cpu().numpy()[0]: dets}
+        return res
+
+    def test_epoch_end(self, test_step_outputs):
+        results = {}
+        for res in test_step_outputs:
+            results.update(res)
+        res_json = self.evaluator.results2json(results)
+        json_path = os.path.join(self.cfg.save_dir, 'results.json')
+        json.dump(res_json, open(json_path, 'w'))
+
+        if self.cfg.test_mode == 'val':
+            eval_results = self.evaluator.evaluate(results, self.cfg.save_dir, rank=self.local_rank)
+            txt_path = os.path.join(self.cfg.save_dir, "eval_results.txt")
+            with open(txt_path, "a") as f:
+                for k, v in eval_results.items():
+                    f.write("{}: {}\n".format(k, v))
 
     def configure_optimizers(self):
+        """
+        Prepare optimizer and learning-rate scheduler
+        to use in optimization.
+
+        Returns:
+            optimizer
+        """
         optimizer_cfg = copy.deepcopy(self.cfg.schedule.optimizer)
         name = optimizer_cfg.pop('name')
         build_optimizer = getattr(torch.optim, name)
@@ -153,6 +185,18 @@ def optimizer_step(self,
                        on_tpu=None,
                        using_native_amp=None,
                        using_lbfgs=None):
+        """
+        Performs a single optimization step (parameter update).
+        Args:
+            epoch: Current epoch
+            batch_idx: Index of current batch
+            optimizer: A PyTorch optimizer
+            optimizer_idx: If you used multiple optimizers, this indexes into that list.
+            optimizer_closure: Closure for all optimizers.
+            on_tpu: True if TPU backward is required.
+            using_native_amp: True if using native amp.
+            using_lbfgs: True if the matching optimizer is lbfgs.
+        """
         # warm up lr
         if self.trainer.global_step <= self.cfg.schedule.warmup.steps:
             if self.cfg.schedule.warmup.name == 'constant':
@@ -180,6 +224,15 @@ def get_progress_bar_dict(self):
         return items
 
     def scalar_summary(self, tag, phase, value, step):
+        """
+        Write Tensorboard scalar summary log.
+        Args:
+            tag: Name for the tag
+            phase: 'Train' or 'Val'
+            value: Value to record
+            step: Step value to record
+
+        """
         if self.local_rank < 1:
             self.logger.experiment.add_scalars(tag, {phase: value}, step)
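The `optimizer_step` hunk above is truncated after the `'constant'` warmup branch. For orientation, a standalone sketch of how such warmup factors are commonly computed; the `linear`/`exp` formulas here are an assumption for illustration, not taken verbatim from this diff:

```python
def warmup_lr(base_lr, step, warmup_steps, ratio, name='linear'):
    """Return the warmed-up learning rate for the current global step (illustrative)."""
    if name == 'constant':
        return base_lr * ratio                       # flat, scaled-down lr during warmup
    if name == 'linear':
        k = (1 - step / warmup_steps) * (1 - ratio)  # ramps from base_lr*ratio up to base_lr
        return base_lr * (1 - k)
    if name == 'exp':
        return base_lr * ratio ** (1 - step / warmup_steps)  # exponential ramp toward base_lr
    raise ValueError('unknown warmup schedule: {}'.format(name))

# e.g. warmup_lr(0.14, step=150, warmup_steps=300, ratio=0.1, name='linear') -> 0.077
```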
diff --git a/nanodet/trainer/trainer.py b/nanodet/trainer/trainer.py
index 1a76f6808..08b4ce645 100644
--- a/nanodet/trainer/trainer.py
+++ b/nanodet/trainer/trainer.py
@@ -145,7 +145,9 @@ def run(self, train_loader, val_loader, evaluator):
                 results, val_loss_dict = self.run_epoch(self.epoch, val_loader, mode='val')
                 for k, v in val_loss_dict.items():
                     self.logger.scalar_summary('Epoch_loss/' + k, 'val', v, epoch)
-                eval_results = evaluator.evaluate(results, self.cfg.save_dir, epoch, self.logger, rank=self.rank)
+                eval_results = evaluator.evaluate(results, self.cfg.save_dir, rank=self.rank)
+                for k, v in eval_results.items():
+                    self.logger.scalar_summary('Val_metrics/' + k, 'val', v, epoch)
                 if self.cfg.evaluator.save_key in eval_results:
                     metric = eval_results[self.cfg.evaluator.save_key]
                     if metric > save_flag:
diff --git a/nanodet/util/__init__.py b/nanodet/util/__init__.py
index f6be5366d..00aa83227 100644
--- a/nanodet/util/__init__.py
+++ b/nanodet/util/__init__.py
@@ -3,7 +3,7 @@
 from .logger import Logger, MovingAverage, AverageMeter
 from .data_parallel import DataParallel
 from .distributed_data_parallel import DDP
-from .check_point import load_model_weight, save_model
+from .check_point import load_model_weight, save_model, convert_old_model
 from .config import cfg, load_config
 from .box_transform import *
 from .util_mixins import NiceRepr
diff --git a/nanodet/util/check_point.py b/nanodet/util/check_point.py
index 4c9178630..cc441d7d3 100644
--- a/nanodet/util/check_point.py
+++ b/nanodet/util/check_point.py
@@ -1,11 +1,16 @@
 import torch
+import pytorch_lightning as pl
+from collections import OrderedDict
 from .rank_filter import rank_filter
 
+
 def load_model_weight(model, checkpoint, logger):
     state_dict = checkpoint['state_dict']
     # strip prefix of state_dict
     if list(state_dict.keys())[0].startswith('module.'):
         state_dict = {k[7:]: v for k, v in checkpoint['state_dict'].items()}
+    if list(state_dict.keys())[0].startswith('model.'):
+        state_dict = {k[6:]: v for k, v in checkpoint['state_dict'].items()}
 
     model_state_dict = model.module.state_dict() if hasattr(model, 'module') else model.state_dict()
@@ -35,3 +40,27 @@ def save_model(model, path, epoch, iter, optimizer=None):
         data['optimizer'] = optimizer.state_dict()
 
     torch.save(data, path)
+
+
+def convert_old_model(old_model_dict):
+    if 'pytorch-lightning_version' in old_model_dict:
+        raise ValueError('This model is not old format. No need to convert!')
+    version = pl.__version__
+    epoch = old_model_dict['epoch']
+    global_step = old_model_dict['iter']
+    state_dict = old_model_dict['state_dict']
+    new_state_dict = OrderedDict()
+    for name, value in state_dict.items():
+        new_state_dict['model.' + name] = value
+
+    new_checkpoint = {'epoch': epoch,
+                      'global_step': global_step,
+                      'pytorch-lightning_version': version,
+                      'state_dict': new_state_dict,
+                      'lr_schedulers': []}
+
+    if 'optimizer' in old_model_dict:
+        optimizer_states = [old_model_dict['optimizer']]
+        new_checkpoint['optimizer_states'] = optimizer_states
+
+    return new_checkpoint
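`convert_old_model` mainly re-keys the state dict so that plain `backbone.*`/`fpn.*`/`head.*` weights line up with the `model` attribute of `TrainingTask` (e.g. `backbone.conv1.weight` becomes `model.backbone.conv1.weight`) and stamps the Lightning metadata. A round-trip sketch; the file paths are placeholders:

```python
import torch
from nanodet.util import convert_old_model

old_ckpt = torch.load('workspace/model_best/model_best.pth', map_location='cpu')  # placeholder path
new_ckpt = convert_old_model(old_ckpt)  # adds 'pytorch-lightning_version', prefixes keys with 'model.'
torch.save(new_ckpt, 'workspace/model_best/model_best.ckpt')                      # placeholder path
```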
diff --git a/setup.py b/setup.py
index b70f7b756..2358d70f9 100644
--- a/setup.py
+++ b/setup.py
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 from setuptools import find_packages, setup
 
-__version__ = "0.2.1"
+__version__ = "0.3.0"
 
 if __name__ == '__main__':
     setup(
diff --git a/tools/convert_old_checkpoint.py b/tools/convert_old_checkpoint.py
new file mode 100644
index 000000000..1c64dddc6
--- /dev/null
+++ b/tools/convert_old_checkpoint.py
@@ -0,0 +1,42 @@
+# Copyright 2021 RangiLyu.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import argparse
+import torch
+
+from nanodet.util import convert_old_model
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+        description='Convert old .pth checkpoint to pytorch lightning .ckpt checkpoint.')
+    parser.add_argument('--file_path',
+                        type=str,
+                        help='Path to .pth checkpoint.')
+    parser.add_argument('--out_path',
+                        type=str,
+                        help='Path to .ckpt checkpoint.')
+    return parser.parse_args()
+
+
+if __name__ == '__main__':
+    args = parse_args()
+    file_path = args.file_path
+    out_path = args.out_path
+    old_check_point = torch.load(file_path)
+    new_check_point = convert_old_model(old_check_point)
+    torch.save(new_check_point, out_path)
+    print("Checkpoint saved to:", out_path)
diff --git a/tools/deprecated/test.py b/tools/deprecated/test.py
new file mode 100644
index 000000000..4555594d5
--- /dev/null
+++ b/tools/deprecated/test.py
@@ -0,0 +1,69 @@
+import os
+import torch
+import json
+import datetime
+import argparse
+import warnings
+
+from nanodet.util import mkdir, Logger, cfg, load_config
+from nanodet.trainer import build_trainer
+from nanodet.data.collate import collate_function
+from nanodet.data.dataset import build_dataset
+from nanodet.model.arch import build_model
+from nanodet.evaluator import build_evaluator
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--task', type=str, default='val', help='task to run, test or val')
+    parser.add_argument('--config', type=str, help='model config file(.yml) path')
+    parser.add_argument('--model', type=str, help='model weight file(.pth) path')
+    parser.add_argument('--save_result', action='store_true', default=True, help='save val results to txt')
+    args = parser.parse_args()
+    return args
+
+
+def main(args):
+    warnings.warn('Warning! Old testing code is deprecated and will be deleted '
+                  'in next version. Please use tools/test.py')
+    load_config(cfg, args.config)
+    local_rank = -1
+    torch.backends.cudnn.enabled = True
+    torch.backends.cudnn.benchmark = True
+    cfg.defrost()
+    timestr = datetime.datetime.now().__format__('%Y%m%d%H%M%S')
+    cfg.save_dir = os.path.join(cfg.save_dir, timestr)
+    cfg.freeze()
+    mkdir(local_rank, cfg.save_dir)
+    logger = Logger(local_rank, cfg.save_dir)
+
+    logger.log('Creating model...')
+    model = build_model(cfg.model)
+
+    logger.log('Setting up data...')
+    val_dataset = build_dataset(cfg.data.val, args.task)
+    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=1,
+                                                 pin_memory=True, collate_fn=collate_function, drop_last=True)
+    trainer = build_trainer(local_rank, cfg, model, logger)
+    cfg.schedule.update({'load_model': args.model})
+    trainer.load_model(cfg)
+    evaluator = build_evaluator(cfg, val_dataset)
+    logger.log('Starting testing...')
+    with torch.no_grad():
+        results, val_loss_dict = trainer.run_epoch(0, val_dataloader, mode=args.task)
+    if args.task == 'test':
+        res_json = evaluator.results2json(results)
+        json_path = os.path.join(cfg.save_dir, 'results{}.json'.format(timestr))
+        json.dump(res_json, open(json_path, 'w'))
+    elif args.task == 'val':
+        eval_results = evaluator.evaluate(results, cfg.save_dir, rank=local_rank)
+        if args.save_result:
+            txt_path = os.path.join(cfg.save_dir, "eval_results{}.txt".format(timestr))
+            with open(txt_path, "a") as f:
+                for k, v in eval_results.items():
+                    f.write("{}: {}\n".format(k, v))
+
+
+if __name__ == '__main__':
+    args = parse_args()
+    main(args)
diff --git a/tools/deprecated/train.py b/tools/deprecated/train.py
new file mode 100644
index 000000000..7394495fe
--- /dev/null
+++ b/tools/deprecated/train.py
@@ -0,0 +1,95 @@
+import os
+import torch
+import logging
+import warnings
+import argparse
+import numpy as np
+import torch.distributed as dist
+
+from nanodet.util import mkdir, Logger, cfg, load_config
+from nanodet.trainer import build_trainer
+from nanodet.data.collate import collate_function
+from nanodet.data.dataset import build_dataset
+from nanodet.model.arch import build_model
+from nanodet.evaluator import build_evaluator
+
+
+def parse_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('config', help='train config file path')
+    parser.add_argument('--local_rank', default=-1, type=int,
+                        help='node rank for distributed training')
+    parser.add_argument('--seed', type=int, default=None,
+                        help='random seed')
+    args = parser.parse_args()
+    return args
+
+
+def init_seeds(seed=0):
+    """
+    manually set a random seed for numpy, torch and cuda
+    :param seed: random seed
+    """
+    torch.manual_seed(seed)
+    np.random.seed(seed)
+    torch.cuda.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    if seed == 0:
+        torch.backends.cudnn.deterministic = True
+        torch.backends.cudnn.benchmark = False
+
+
+def main(args):
+    warnings.warn('Warning! Old training code is deprecated and will be deleted '
+                  'in next version. Please use tools/train.py')
+    load_config(cfg, args.config)
+    local_rank = int(args.local_rank)
+    torch.backends.cudnn.enabled = True
+    torch.backends.cudnn.benchmark = True
+    mkdir(local_rank, cfg.save_dir)
+    logger = Logger(local_rank, cfg.save_dir)
+    if args.seed is not None:
+        logger.log('Set random seed to {}'.format(args.seed))
+        init_seeds(args.seed)
+
+    logger.log('Creating model...')
+    model = build_model(cfg.model)
+
+    logger.log('Setting up data...')
+    train_dataset = build_dataset(cfg.data.train, 'train')
+    val_dataset = build_dataset(cfg.data.val, 'test')
+
+    if len(cfg.device.gpu_ids) > 1:
+        print('rank = ', local_rank)
+        num_gpus = torch.cuda.device_count()
+        torch.cuda.set_device(local_rank % num_gpus)
+        dist.init_process_group(backend='nccl')
+        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
+        train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.device.batchsize_per_gpu,
+                                                       num_workers=cfg.device.workers_per_gpu, pin_memory=True,
+                                                       collate_fn=collate_function, sampler=train_sampler,
+                                                       drop_last=True)
+    else:
+        train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.device.batchsize_per_gpu,
+                                                       shuffle=True, num_workers=cfg.device.workers_per_gpu,
+                                                       pin_memory=True, collate_fn=collate_function, drop_last=True)
+
+    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=1,
+                                                 pin_memory=True, collate_fn=collate_function, drop_last=True)
+
+    trainer = build_trainer(local_rank, cfg, model, logger)
+
+    if 'load_model' in cfg.schedule:
+        trainer.load_model(cfg)
+    if 'resume' in cfg.schedule:
+        trainer.resume(cfg)
+
+    evaluator = build_evaluator(cfg, val_dataset)
+
+    logger.log('Starting training...')
+    trainer.run(train_dataloader, val_dataloader, evaluator)
+
+
+if __name__ == '__main__':
+    args = parse_args()
+    main(args)
diff --git a/tools/export.py b/tools/export.py
index 9a614876a..28e5ba3a2 100644
--- a/tools/export.py
+++ b/tools/export.py
@@ -31,7 +31,7 @@ def parse_args():
     parser.add_argument('--model_path',
                         type=str,
                         default=None,
-                        help='Path to .pth model.')
+                        help='Path to .ckpt model.')
     parser.add_argument('--out_path',
                         type=str,
                         default='nanodet.onnx',
@@ -56,6 +56,6 @@ def parse_args():
     input_shape = tuple(map(int, input_shape.split(',')))
     assert len(input_shape) == 2
     if model_path is None:
-        model_path = os.path.join(cfg.save_dir, "model_best/model_best.pth")
+        model_path = os.path.join(cfg.save_dir, "model_best/model_best.ckpt")
     main(cfg, model_path, out_path, input_shape)
     print("Model saved to:", out_path)
diff --git a/tools/test.py b/tools/test.py
index 504bbd02f..03e0011c4 100644
--- a/tools/test.py
+++ b/tools/test.py
@@ -1,14 +1,29 @@
+# Copyright 2021 RangiLyu.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 import torch
 import json
 import datetime
 import argparse
+import warnings
+import pytorch_lightning as pl
 
-from nanodet.util import mkdir, Logger, cfg, load_config
-from nanodet.trainer import build_trainer
+from nanodet.util import mkdir, Logger, cfg, load_config, convert_old_model
 from nanodet.data.collate import collate_function
 from nanodet.data.dataset import build_dataset
-from nanodet.model.arch import build_model
+from nanodet.trainer.task import TrainingTask
 from nanodet.evaluator import build_evaluator
 
 
@@ -16,8 +31,7 @@ def parse_args():
     parser = argparse.ArgumentParser()
     parser.add_argument('--task', type=str, default='val', help='task to run, test or val')
     parser.add_argument('--config', type=str, help='model config file(.yml) path')
-    parser.add_argument('--model', type=str, help='model weight file(.pth) path')
-    parser.add_argument('--save_result', action='store_true', default=True, help='save val results to txt')
+    parser.add_argument('--model', type=str, help='checkpoint file(.ckpt) path')
     args = parser.parse_args()
     return args
 
@@ -30,35 +44,37 @@ def main(args):
     cfg.defrost()
     timestr = datetime.datetime.now().__format__('%Y%m%d%H%M%S')
     cfg.save_dir = os.path.join(cfg.save_dir, timestr)
-    cfg.freeze()
     mkdir(local_rank, cfg.save_dir)
     logger = Logger(local_rank, cfg.save_dir)
 
-    logger.log('Creating model...')
-    model = build_model(cfg.model)
+    assert args.task in ['val', 'test']
+    cfg.update({'test_mode': args.task})
 
     logger.log('Setting up data...')
     val_dataset = build_dataset(cfg.data.val, args.task)
-    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=1,
+    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False,
+                                                 num_workers=cfg.device.workers_per_gpu,
                                                  pin_memory=True, collate_fn=collate_function, drop_last=True)
-    trainer = build_trainer(local_rank, cfg, model, logger)
-    cfg.schedule.update({'load_model': args.model})
-    trainer.load_model(cfg)
     evaluator = build_evaluator(cfg, val_dataset)
+
+    logger.log('Creating model...')
+    task = TrainingTask(cfg, evaluator)
+
+    ckpt = torch.load(args.model)
+    if 'pytorch-lightning_version' not in ckpt:
+        warnings.warn('Warning! Old .pth checkpoint is deprecated. '
+                      'Convert the checkpoint with tools/convert_old_checkpoint.py ')
+        ckpt = convert_old_model(ckpt)
+    task.load_state_dict(ckpt['state_dict'])
+
+    trainer = pl.Trainer(default_root_dir=cfg.save_dir,
+                         gpus=cfg.device.gpu_ids,
+                         accelerator='ddp',
+                         log_every_n_steps=cfg.log.interval,
+                         num_sanity_val_steps=0,
+                         )
     logger.log('Starting testing...')
-    with torch.no_grad():
-        results, val_loss_dict = trainer.run_epoch(0, val_dataloader, mode=args.task)
-    if args.task == 'test':
-        res_json = evaluator.results2json(results)
-        json_path = os.path.join(cfg.save_dir, 'results{}.json'.format(timestr))
-        json.dump(res_json, open(json_path, 'w'))
-    elif args.task == 'val':
-        eval_results = evaluator.evaluate(results, cfg.save_dir, 0, logger, rank=local_rank)
-        if args.save_result:
-            txt_path = os.path.join(cfg.save_dir, "eval_results{}.txt".format(timestr))
-            with open(txt_path, "a") as f:
-                for k, v in eval_results.items():
-                    f.write("{}: {}\n".format(k, v))
+    trainer.test(task, val_dataloader)
 
 
 if __name__ == '__main__':
diff --git a/tools/train.py b/tools/train.py
index 6ab098b53..81eeebbf2 100644
--- a/tools/train.py
+++ b/tools/train.py
@@ -1,15 +1,29 @@
+# Copyright 2021 RangiLyu.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import os
 import torch
-import logging
 import argparse
 import numpy as np
-import torch.distributed as dist
+import warnings
+import pytorch_lightning as pl
+from pytorch_lightning.callbacks import ProgressBar
 
-from nanodet.util import mkdir, Logger, cfg, load_config
-from nanodet.trainer import build_trainer
+from nanodet.util import mkdir, Logger, cfg, load_config, convert_old_model
 from nanodet.data.collate import collate_function
 from nanodet.data.dataset import build_dataset
-from nanodet.model.arch import build_model
+from nanodet.trainer.task import TrainingTask
 from nanodet.evaluator import build_evaluator
 
 
@@ -24,20 +38,6 @@ def parse_args():
     return args
 
 
-def init_seeds(seed=0):
-    """
-    manually set a random seed for numpy, torch and cuda
-    :param seed: random seed
-    """
-    torch.manual_seed(seed)
-    np.random.seed(seed)
-    torch.cuda.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    if seed == 0:
-        torch.backends.cudnn.deterministic = True
-        torch.backends.cudnn.benchmark = False
-
-
 def main(args):
     load_config(cfg, args.config)
     local_rank = int(args.local_rank)
@@ -45,46 +45,50 @@ def main(args):
     torch.backends.cudnn.benchmark = True
     mkdir(local_rank, cfg.save_dir)
     logger = Logger(local_rank, cfg.save_dir)
+
     if args.seed is not None:
         logger.log('Set random seed to {}'.format(args.seed))
-        init_seeds(args.seed)
-
-    logger.log('Creating model...')
-    model = build_model(cfg.model)
+        pl.seed_everything(args.seed)
 
     logger.log('Setting up data...')
     train_dataset = build_dataset(cfg.data.train, 'train')
     val_dataset = build_dataset(cfg.data.val, 'test')
 
-    if len(cfg.device.gpu_ids) > 1:
-        print('rank = ', local_rank)
-        num_gpus = torch.cuda.device_count()
-        torch.cuda.set_device(local_rank % num_gpus)
-        dist.init_process_group(backend='nccl')
-        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
-        train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.device.batchsize_per_gpu,
-                                                       num_workers=cfg.device.workers_per_gpu, pin_memory=True,
-                                                       collate_fn=collate_function, sampler=train_sampler,
-                                                       drop_last=True)
-    else:
-        train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.device.batchsize_per_gpu,
-                                                       shuffle=True, num_workers=cfg.device.workers_per_gpu,
-                                                       pin_memory=True, collate_fn=collate_function, drop_last=True)
-
-    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=1,
+    evaluator = build_evaluator(cfg, val_dataset)
+
+    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.device.batchsize_per_gpu,
+                                                   shuffle=True, num_workers=cfg.device.workers_per_gpu,
+                                                   pin_memory=True, collate_fn=collate_function, drop_last=True)
+    # TODO: batch eval
+    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False,
+                                                 num_workers=cfg.device.workers_per_gpu,
                                                  pin_memory=True, collate_fn=collate_function, drop_last=True)
 
-    trainer = build_trainer(local_rank, cfg, model, logger)
+    logger.log('Creating model...')
+    task = TrainingTask(cfg, evaluator)
 
     if 'load_model' in cfg.schedule:
-        trainer.load_model(cfg)
-    if 'resume' in cfg.schedule:
-        trainer.resume(cfg)
-
-    evaluator = build_evaluator(cfg, val_dataset)
-
-    logger.log('Starting training...')
-    trainer.run(train_dataloader, val_dataloader, evaluator)
+        ckpt = torch.load(cfg.schedule.load_model)
+        if 'pytorch-lightning_version' not in ckpt:
+            warnings.warn('Warning! Old .pth checkpoint is deprecated. '
+                          'Convert the checkpoint with tools/convert_old_checkpoint.py ')
+            ckpt = convert_old_model(ckpt)
+        task.load_state_dict(ckpt['state_dict'])
+
+    model_resume_path = os.path.join(cfg.save_dir, 'model_last.ckpt') if 'resume' in cfg.schedule else None
+
+    trainer = pl.Trainer(default_root_dir=cfg.save_dir,
+                         max_epochs=cfg.schedule.total_epochs,
+                         gpus=cfg.device.gpu_ids,
+                         check_val_every_n_epoch=cfg.schedule.val_intervals,
+                         accelerator='ddp',
+                         log_every_n_steps=cfg.log.interval,
+                         num_sanity_val_steps=0,
+                         resume_from_checkpoint=model_resume_path,
+                         callbacks=[ProgressBar(refresh_rate=0)]  # disable tqdm bar
+                         )
+
+    trainer.fit(task, train_dataloader, val_dataloader)
 
 
 if __name__ == '__main__':
diff --git a/tools/train_pl.py b/tools/train_pl.py
deleted file mode 100644
index 246cca8d9..000000000
--- a/tools/train_pl.py
+++ /dev/null
@@ -1,104 +0,0 @@
-# Copyright 2021 RangiLyu.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import torch
-import argparse
-import numpy as np
-import pytorch_lightning as pl
-from pytorch_lightning.callbacks import ProgressBar
-
-from nanodet.util import mkdir, Logger, cfg, load_config
-from nanodet.data.collate import collate_function
-from nanodet.data.dataset import build_dataset
-from nanodet.trainer.task import TrainingTask
-from nanodet.evaluator import build_evaluator
-
-
-def parse_args():
-    parser = argparse.ArgumentParser()
-    parser.add_argument('config', help='train config file path')
-    parser.add_argument('--local_rank', default=-1, type=int,
-                        help='node rank for distributed training')
-    parser.add_argument('--seed', type=int, default=None,
-                        help='random seed')
-    args = parser.parse_args()
-    return args
-
-
-def init_seeds(seed=0):
-    """
-    manually set a random seed for numpy, torch and cuda
-    :param seed: random seed
-    """
-    torch.manual_seed(seed)
-    np.random.seed(seed)
-    torch.cuda.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    if seed == 0:
-        torch.backends.cudnn.deterministic = True
-        torch.backends.cudnn.benchmark = False
-
-
-def main(args):
-    load_config(cfg, args.config)
-    local_rank = int(args.local_rank)
-    torch.backends.cudnn.enabled = True
-    torch.backends.cudnn.benchmark = True
-    mkdir(local_rank, cfg.save_dir)
-    logger = Logger(local_rank, cfg.save_dir)
-    # TODO: replace with lightning random seed
-    if args.seed is not None:
-        logger.log('Set random seed to {}'.format(args.seed))
-        init_seeds(args.seed)
-
-    logger.log('Setting up data...')
-    train_dataset = build_dataset(cfg.data.train, 'train')
-    val_dataset = build_dataset(cfg.data.val, 'test')
-
-    evaluator = build_evaluator(cfg, val_dataset)
-
-    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=cfg.device.batchsize_per_gpu,
-                                                   shuffle=True, num_workers=cfg.device.workers_per_gpu,
-                                                   pin_memory=True, collate_fn=collate_function, drop_last=True)
-    # TODO: batch eval
-    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=1,
-                                                 pin_memory=True, collate_fn=collate_function, drop_last=True)
-
-    logger.log('Creating model...')
-    task = TrainingTask(cfg, evaluator, logger)
-
-    if 'load_model' in cfg.schedule:
-        ckpt = torch.load(cfg.schedule.load_model)
-        task.load_state_dict(ckpt['state_dict'])
-
-    model_resume_path = os.path.join(cfg.save_dir, 'model_last.ckpt') if 'resume' in cfg.schedule else None
-
-    trainer = pl.Trainer(default_root_dir=cfg.save_dir,
-                         max_epochs=cfg.schedule.total_epochs,
-                         gpus=cfg.device.gpu_ids,
-                         check_val_every_n_epoch=cfg.schedule.val_intervals,
-                         accelerator='ddp',
-                         log_every_n_steps=cfg.log.interval,
-                         num_sanity_val_steps=0,
-                         resume_from_checkpoint=model_resume_path,
-                         callbacks=[ProgressBar(refresh_rate=0)]  # disable tqdm bar
-                         )
-
-    trainer.fit(task, train_dataloader, val_dataloader)
-
-
-if __name__ == '__main__':
-    args = parse_args()
-    main(args)