diff --git a/README.md b/README.md
index 463cfc2..112b403 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,12 @@
 # Deformable Convolutional Networks
 
 
-The major contributors of this repository include [Yuwen Xiong](https://github.com/Orpine), [Haozhi Qi](https://github.com/Oh233), [Guodong Zhang](https://github.com/gd-zhang), [Yi Li](https://github.com/liyi14), [Jifeng Dai](https://github.com/daijifeng001), [Bin Xiao](https://github.com/leoxiaobin) and  [Yichen Wei](https://github.com/YichenWei).
+The major contributors to this repository include [Yuwen Xiong](https://github.com/Orpine), [Haozhi Qi](https://github.com/Oh233), [Guodong Zhang](https://github.com/gd-zhang), [Yi Li](https://github.com/liyi14), [Jifeng Dai](https://github.com/daijifeng001), [Bin Xiao](https://github.com/leoxiaobin), [Han Hu](https://github.com/ancientmooner) and [Yichen Wei](https://github.com/YichenWei).
 
-## Disclaimer
-
-This is an official implementation for [Deformable Convolutional Networks](https://arxiv.org/abs/1703.06211) (Deformable ConvNets). It is worth noticing that:
 
-  * The original implementation is based on our internal Caffe version on Windows. There are slight differences in the final accuracy and running time due to the plenty details in platform switch.
-  * The code is tested on official [MXNet@(commit 62ecb60)](https://github.com/dmlc/mxnet/tree/62ecb60) with the extra operators for Deformable ConvNets.
-  * We trained our model based on the ImageNet pre-trained [ResNet-v1-101](https://github.com/KaimingHe/deep-residual-networks) using a [model converter](https://github.com/dmlc/mxnet/tree/430ea7bfbbda67d993996d81c7fd44d3a20ef846/tools/caffe_converter). The converted model produces slightly lower accuracy (Top-1 Error on ImageNet val: 24.0% v.s. 23.6%).
-  * By now it only contains Deformable ConvNets with R-FCN. Deformable ConvNets with DeepLab will be released soon.
-  * This repository used code from [MXNet rcnn example](https://github.com/dmlc/mxnet/tree/master/example/rcnn) and [mx-rfcn](https://github.com/giorking/mx-rfcn).
 
 ## Introduction
 
-
 **Deformable ConvNets** is initially described in an [arxiv tech report](https://arxiv.org/abs/1703.06211).
 
 **R-FCN** is initially described in a [NIPS 2016 paper](https://arxiv.org/abs/1605.06409).
@@ -25,7 +16,15 @@ This is an official implementation for [Deformable Convolutional Networks](https
 <img src='demo/deformable_conv_demo2.png' width='800'>
 <img src='demo/deformable_psroipooling_demo.png' width='800'>
 
+## Disclaimer
+
+This is an official implementation of [Deformable Convolutional Networks](https://arxiv.org/abs/1703.06211) (Deformable ConvNets) based on MXNet. It is worth noting that:
 
+  * The original implementation is based on our internal Caffe version on Windows. There are slight differences in the final accuracy and running time due to the many details involved in switching platforms.
+  * The code is tested on official [MXNet@(commit 62ecb60)](https://github.com/dmlc/mxnet/tree/62ecb60) with the extra operators for Deformable ConvNets.
+  * We trained our model based on the ImageNet pre-trained [ResNet-v1-101](https://github.com/KaimingHe/deep-residual-networks) using a [model converter](https://github.com/dmlc/mxnet/tree/430ea7bfbbda67d993996d81c7fd44d3a20ef846/tools/caffe_converter). The converted model produces slightly lower accuracy (Top-1 Error on ImageNet val: 24.0% vs. 23.6%).
+  * This repository uses code from [MXNet rcnn example](https://github.com/dmlc/mxnet/tree/master/example/rcnn) and [mx-rfcn](https://github.com/giorking/mx-rfcn).
+  
 ## License
 
 © Microsoft, 2017. Licensed under an Apache-2.0 license.
@@ -61,21 +60,34 @@ If you find Deformable ConvNets useful in your research, please consider citing:
 |---------------------------------|---------------|---------------|------|---------|---------|-------|-------|-------|
 | <sub>R-FCN, ResNet-v1-101 </sub>           | <sub>coco trainval</sub> | <sub>coco test-dev</sub> | 32.1 | 54.3    |   33.8  | 12.8  | 34.9  | 46.1  | 
 | <sub>Deformable R-FCN, ResNet-v1-101</sub> | <sub>coco trainval</sub> | <sub>coco test-dev</sub> | 35.7 | 56.8    | 38.3    | 15.2  | 38.8  | 51.5  |
+| <sub>Faster R-CNN (2fc), ResNet-v1-101 </sub>           | <sub>coco trainval</sub> | <sub>coco test-dev</sub> | 30.3 | 52.1    |   31.4  | 9.9  | 32.2  | 47.4  | 
+| <sub>Deformable Faster R-CNN (2fc), <br/>ResNet-v1-101</sub> | <sub>coco trainval</sub> | <sub>coco test-dev</sub> | 35.0 | 55.0    | 38.3    | 14.3  | 37.7  | 52.0  |
+
+
+
+|                                   | training data              | testing data   | mIoU | time  |
+|-----------------------------------|----------------------------|----------------|------|-------|
+| DeepLab, ResNet-v1-101            | Cityscapes train           | Cityscapes val | 70.3 | 0.51s |
+| Deformable DeepLab, ResNet-v1-101 | Cityscapes train           | Cityscapes val | 75.2 | 0.52s |
+| DeepLab, ResNet-v1-101            | VOC 12 train (augmented) | VOC 12 val   | 70.7 | 0.08s |
+| Deformable DeepLab, ResNet-v1-101 | VOC 12 train (augmented) | VOC 12 val   | 75.9 | 0.08s |
 
 
 *Running time is counted on a single Maxwell Titan X GPU (mini-batch size is 1 in inference).*
 
 ## Requirements: Software
 
-1. MXNet from [offical repository](https://github.com/dmlc/mxnet). We tested our code on [MXNet@(commit 62ecb60)](https://github.com/dmlc/mxnet/tree/62ecb60). Due to the rapid development of MXNet, it is recommended to checkout this version if you have any problems. We may maintain this repository periodically if MXNet adds important feature in future release.
+1. MXNet from [the official repository](https://github.com/dmlc/mxnet). We tested our code on [MXNet@(commit 62ecb60)](https://github.com/dmlc/mxnet/tree/62ecb60). Due to the rapid development of MXNet, we recommend checking out this version if you encounter any issues. We may update this repository periodically if MXNet adds important features in future releases.
+
+2. Python 2.7. We recommend using Anaconda2.
 
-2. Python packages might missing: cython, opencv-python >= 3.2.0, easydict. If `pip` is set up on your system, those packages should be able to be fetched and installed by running
+3. Some Python packages might be missing: cython, opencv-python >= 3.2.0, easydict. If `pip` is set up on your system, you can install them by running
 	```
 	pip install Cython
 	pip install opencv-python==3.2.0.6
 	pip install easydict==1.6
 	```
-3. For Windows users, Visual Studio 2015 is needed to compile cython module.
+4. For Windows users, Visual Studio 2015 is needed to compile the Cython module.
 
 
 ## Requirements: Hardware
@@ -91,18 +103,26 @@ git clone https://github.com/msracver/Deformable-ConvNets.git
 2. For Windows users, run ``cmd .\init.bat``. For Linux user, run `sh ./init.sh`. The scripts will build cython module automatically and create some folders.
 3. Copy operators in `./rfcn/operator_cxx` to `$(YOUR_MXNET_FOLDER)/src/operator/contrib` and recompile MXNet.
 4. Please install MXNet following the official guide of MXNet. For advanced users, you may put your Python packge into `./external/mxnet/$(YOUR_MXNET_PACKAGE)`, and modify `MXNET_VERSION` in `./experiments/rfcn/cfgs/*.yaml` to `$(YOUR_MXNET_PACKAGE)`. Thus you can switch among different versions of MXNet quickly.
+5. For DeepLab, we use the augmented VOC 2012 dataset. The augmented annotations are provided by the [SBD](http://home.bharathh.info/pubs/codes/SBD/download.html) dataset. For convenience, we provide the converted PNG annotations and the lists of train/val images; please download them from [OneDrive](https://1drv.ms/u/s!Am-5JzdW2XHzhqMRhVImMI1jRrsxDg).
 
+## Demo & Deformable Model
 
-## Demo
+We provide trained deformable convnet models, including the deformable R-FCN & Faster R-CNN models trained on COCO trainval, and the deformable DeepLab model trained on Cityscapes train.
 
-1. To use the demo with our trained model (on COCO trainval), please download the model manually from [OneDrive](https://1drv.ms/u/s!AoN7vygOjLIQqmE7XqFVLbeZDfVN), and put it under folder `model/`.
+1. To use the demo with our pre-trained deformable models, please download them manually from [OneDrive](https://1drv.ms/u/s!Am-5JzdW2XHzhqMSjehIcCgAhvEAHw), and put them under the `model/` folder.
 
 	Make sure it looks like this:
 	```
 	./model/rfcn_dcn_coco-0000.params
 	./model/rfcn_coco-0000.params
+	./model/rcnn_dcn_coco-0000.params
+	./model/rcnn_coco-0000.params
+	./model/deeplab_dcn_cityscapes-0000.params
+	./model/deeplab_cityscapes-0000.params
+	./model/deform_conv-0000.params
+	./model/deform_psroi-0000.params
 	```
-2. To run the demo, run
+2. To run the R-FCN demo, run
 	```
 	python ./rfcn/demo.py
 	```
@@ -110,15 +130,25 @@ git clone https://github.com/msracver/Deformable-ConvNets.git
 	```
 	python ./rfcn/demo.py --rfcn_only
 	```
-	
-
-
-We will release the visualizaiton tool which visualizes the deformation effects soon.
+3. To run the DeepLab demo, run
+	```
+	python ./deeplab/demo.py
+	```
+	By default it will run Deformable DeepLab and give several prediction results. To run DeepLab, use
+	```
+	python ./deeplab/demo.py --deeplab_only
+	```
+4. To visualize the offset of deformable convolution and deformable psroipooling, run
+	```
+	python ./rfcn/deform_conv_demo.py
+	python ./rfcn/deform_psroi_demo.py
+	```
 
 
 ## Preparation for Training & Testing
 
-1. Please download COCO and VOC 2007+2012 dataset, and make sure it looks like this:
+For R-FCN/Faster R-CNN\:
+1. Please download the COCO and VOC 2007+2012 datasets, and make sure they look like this:
 
 	```
 	./data/coco/
@@ -131,10 +161,30 @@ We will release the visualizaiton tool which visualizes the deformation effects
 	./model/pretrained_model/resnet_v1_101-0000.params
 	```
 
+For DeepLab\:
+1. Please download the Cityscapes and VOC 2012 datasets, and make sure they look like this:
+
+	```
+	./data/cityscapes/
+	./data/VOCdevkit/VOC2012/
+	```
+2. Please download the augmented VOC 2012 annotations/image lists, and put the augmented annotations and the augmented train/val lists into:
+
+	```
+	./data/VOCdevkit/VOC2012/SegmentationClass/
+	./data/VOCdevkit/VOC2012/ImageSets/Main/
+	```
+   respectively.
+   
+3. Please download the ImageNet-pretrained ResNet-v1-101 model manually from [OneDrive](https://1drv.ms/u/s!Am-5JzdW2XHzhqMEtxf1Ciym8uZ8sg), and put it under the folder `./model`. Make sure it looks like this:
+	```
+	./model/pretrained_model/resnet_v1_101-0000.params
+	```
 ## Usage
 
-1. All of our experiment settings (GPU #, dataset, etc.) are kept in yaml files at folder `./experiments/rfcn/cfgs`.
-2. Four config files have been provided so far, namely, R-FCN for COCO/VOC and Deformable R-FCN for COCO/VOC, respectively. We use 8 and 4 GPUs to train models on COCO and on VOC, respectively.
+1. All of our experiment settings (GPU #, dataset, etc.) are kept in YAML config files in the folders `./experiments/rfcn/cfgs`, `./experiments/faster_rcnn/cfgs` and `./experiments/deeplab/cfgs/`.
+2. Config files have been provided so far for R-FCN on COCO/VOC, Deformable R-FCN on COCO/VOC, Faster R-CNN (2fc) on COCO/VOC, Deformable Faster R-CNN (2fc) on COCO/VOC, DeepLab on Cityscapes/VOC and Deformable DeepLab on Cityscapes/VOC. For R-FCN, we use 8 and 4 GPUs to train models on COCO and on VOC, respectively. For DeepLab, we use 4 GPUs for all experiments.
+
 3. To perform experiments, run the python scripts with the corresponding config file as input. For example, to train and test deformable convnets on COCO with ResNet-v1-101, use the following command
     ```
     python experiments\rfcn\rfcn_end2end_train_test.py --cfg experiments\rfcn\cfgs\resnet_v1_101_coco_trainval_rfcn_dcn_end2end_ohem.yaml
@@ -144,11 +194,35 @@ We will release the visualizaiton tool which visualizes the deformation effects
 
 ## Misc.
 
-MXNet build without CuDNN is recommended.
-
 Code has been tested under:
 
 - Ubuntu 14.04 with a Maxwell Titan X GPU and Intel Xeon CPU E5-2620 v2 @ 2.10GHz
 - Windows Server 2012 R2 with 8 K40 GPUs and Intel Xeon CPU E5-2650 v2 @ 2.60GHz
 - Windows Server 2012 R2 with 4 Pascal Titan X GPUs and Intel Xeon CPU E5-2650 v4 @ 2.30GHz
 
+## FAQ
+
+Q: It says `AttributeError: 'module' object has no attribute 'DeformableConvolution'`.
+
+A: This is because either
+ - you forgot to copy the operators to your MXNet folder
+ - or you copied them to the wrong path
+ - or you forgot to re-compile
+ - or you installed the wrong MXNet
+
+    Please print `mxnet.__path__` to make sure you are using the correct MXNet.
+
+<br/><br/>
+Q: I encounter a `segmentation fault` at the beginning.
+
+A: A compatibility issue has been identified between MXNet and opencv-python 3.0+. We suggest that you always `import cv2` before `import mxnet` in the entry script.
+
+<br/><br/>
+Q: I find the training speed becomes slower when training for a long time.
+
+A: MXNet on Windows has been identified as having this problem, so we recommend running this program on Linux. If you encounter this problem, you can also stop the training process and resume it to regain the training speed.
+
+<br/><br/>
+Q: Can you share your Caffe implementation?
+
+A: We do not plan to release our Caffe code, for several reasons (the code is based on an old, internal Caffe; porting to public Caffe needs extra work; time limits; etc.). Since the current MXNet convolution implementation is very similar to Caffe's (almost the same), it is easy to port to Caffe yourself, and the core CUDA code can be kept unchanged. Anyone who wishes to do so is welcome to make a pull request.
diff --git a/deeplab/_init_paths.py b/deeplab/_init_paths.py
new file mode 100644
index 0000000..5e9b023
--- /dev/null
+++ b/deeplab/_init_paths.py
@@ -0,0 +1,19 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import os.path as osp
+import sys
+
+def add_path(path):
+    if path not in sys.path:
+        sys.path.insert(0, path)
+
+this_dir = osp.dirname(__file__)
+
+lib_path = osp.join(this_dir, '..', 'lib')
+add_path(lib_path)
diff --git a/deeplab/config/__init__.py b/deeplab/config/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/deeplab/config/config.py b/deeplab/config/config.py
new file mode 100644
index 0000000..cae1c8d
--- /dev/null
+++ b/deeplab/config/config.py
@@ -0,0 +1,96 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import yaml
+import numpy as np
+from easydict import EasyDict as edict
+
+config = edict()
+
+config.MXNET_VERSION = ''
+config.output_path = ''
+config.symbol = ''
+config.gpus = ''
+config.CLASS_AGNOSTIC = True
+config.SCALES = [(360, 600)]  # first is scale (the shorter side); second is max size
+
+# default training
+config.default = edict()
+config.default.frequent = 1000
+config.default.kvstore = 'device'
+
+# network related params
+config.network = edict()
+config.network.pretrained = '../model/pretrained_model/resnet_v1_101'
+config.network.pretrained_epoch = 0
+config.network.PIXEL_MEANS = np.array([103.06, 115.90, 123.15])
+config.network.IMAGE_STRIDE = 0
+config.network.FIXED_PARAMS = ['conv1', 'bn_conv1', 'res2', 'bn2', 'gamma', 'beta']
+
+# dataset related params
+config.dataset = edict()
+config.dataset.dataset = 'cityscapes'
+config.dataset.image_set = 'leftImg8bit_train'
+config.dataset.test_image_set = 'leftImg8bit_val'
+config.dataset.root_path = '../data'
+config.dataset.dataset_path = '../data/cityscapes'
+config.dataset.NUM_CLASSES = 19
+config.dataset.annotation_prefix = 'gtFine'
+
+config.TRAIN = edict()
+config.TRAIN.lr = 0
+config.TRAIN.lr_step = ''
+config.TRAIN.warmup = False
+config.TRAIN.warmup_lr = 0
+config.TRAIN.warmup_step = 0
+config.TRAIN.momentum = 0.9
+config.TRAIN.wd = 0.0005
+config.TRAIN.begin_epoch = 0
+config.TRAIN.end_epoch = 0
+config.TRAIN.model_prefix = 'deeplab'
+
+# whether resume training
+config.TRAIN.RESUME = False
+# whether flip image
+config.TRAIN.FLIP = True
+# whether shuffle image
+config.TRAIN.SHUFFLE = True
+# whether use OHEM
+config.TRAIN.ENABLE_OHEM = False
+# number of images per device, 2 for rcnn, 1 for rpn and e2e
+config.TRAIN.BATCH_IMAGES = 1
+
+config.TEST = edict()
+# number of images per device
+config.TEST.BATCH_IMAGES = 1
+
+# Test Model Epoch
+config.TEST.test_epoch = 0
+
+def update_config(config_file):
+    exp_config = None
+    with open(config_file) as f:
+        exp_config = edict(yaml.safe_load(f))
+        for k, v in exp_config.items():
+            if k in config:
+                if isinstance(v, dict):
+                    if k == 'TRAIN':
+                        if 'BBOX_WEIGHTS' in v:
+                            v['BBOX_WEIGHTS'] = np.array(v['BBOX_WEIGHTS'])
+                    elif k == 'network':
+                        if 'PIXEL_MEANS' in v:
+                            v['PIXEL_MEANS'] = np.array(v['PIXEL_MEANS'])
+                    for vk, vv in v.items():
+                        config[k][vk] = vv
+                else:
+                    if k == 'SCALES':
+                        config[k][0] = (tuple(v))
+                    else:
+                        config[k] = v
+            else:
+                raise ValueError("key '%s' must exist in config.py" % k)
diff --git a/deeplab/core/DataParallelExecutorGroup.py b/deeplab/core/DataParallelExecutorGroup.py
new file mode 100644
index 0000000..15c8469
--- /dev/null
+++ b/deeplab/core/DataParallelExecutorGroup.py
@@ -0,0 +1,603 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import logging
+import numpy as np
+
+from mxnet import context as ctx
+from mxnet import ndarray as nd
+from mxnet.io import DataDesc
+from mxnet.executor_manager import _split_input_slice
+
+def _load_general(data, targets, major_axis):
+    """Load a list of arrays into a list of arrays specified by slices"""
+    for d_src, d_targets in zip(data, targets):
+        if isinstance(d_targets, nd.NDArray):
+            d_src.copyto(d_targets)
+        elif isinstance(d_src, (list, tuple)):
+            for src, dst in zip(d_src, d_targets):
+                src.copyto(dst)
+        else:
+            raise NotImplementedError
+
+
+def _load_data(batch, targets, major_axis):
+    """Load data into sliced arrays"""
+    _load_general(batch.data, targets, major_axis)
+
+
+def _load_label(batch, targets, major_axis):
+    """Load label into sliced arrays"""
+    _load_general(batch.label, targets, major_axis)
+
+
+def _merge_multi_context(outputs, major_axis):
+    """Merge outputs that live on multiple contexts into one, so that they look
+    as if they live on a single context.
+    """
+    rets = []
+    for tensors, axis in zip(outputs, major_axis):
+        if axis >= 0:
+            rets.append(nd.concatenate(tensors, axis=axis, always_copy=False))
+        else:
+            # a negative axis means there is no batch_size axis, and all the
+            # results should be the same on each device. We simply take the
+            # first one, without checking that they are actually the same
+            rets.append(tensors[0])
+    return rets
+
+
+
+class DataParallelExecutorGroup(object):
+    """DataParallelExecutorGroup is a group of executors that lives on a group of devices.
+    This is a helper class used to implement data parallelization. Each mini-batch will
+    be split and run on the devices.
+
+    Parameters
+    ----------
+    symbol : Symbol
+        The common symbolic computation graph for all executors.
+    contexts : list
+        A list of contexts.
+    workload : list
+        If not `None`, a list of numbers that specify the workload to be assigned
+        to each context. A larger number indicates a heavier workload.
+    data_shapes : list
+        One entry per context; each entry is a list of (name, shape) tuples for the shapes of data.
+        The order within an entry should match the order in which the `DataIter` provides the data.
+    label_shapes : list
+        One entry per context; each entry is a list of (name, shape) tuples for the shapes of label.
+        The order within an entry should match the order in which the `DataIter` provides the label.
+    param_names : list
+        A list of strings, indicating the names of parameters (e.g. weights, filters, etc.)
+        in the computation graph.
+    for_training : bool
+        Indicates whether the executors should be bound for training. When not training,
+        the memory for gradients will not be allocated.
+    inputs_need_grad : bool
+        Indicate whether the gradients for the input data should be computed. This is currently
+        not used. It will be useful for implementing composition of modules.
+    shared_group : DataParallelExecutorGroup
+        Default is `None`. This is used in bucketing. When not `None`, it should be an executor
+        group corresponding to a different bucket. In other words, it will correspond to a different
+        symbol but with the same set of parameters (e.g. unrolled RNNs with different lengths).
+        In this case, much of the memory will be shared.
+    logger : Logger
+        Default is `logging`.
+    fixed_param_names: list of str
+        Indicate parameters to be fixed during training. Parameters in this list will not allocate
+        space for gradient, nor do gradient calculation.
+    grad_req : str, list of str, dict of str to str
+        Requirement for gradient accumulation. Can be 'write', 'add', or 'null'
+        (default to 'write').
+        Can be specified globally (str) or for each argument (list, dict).
+    """
+    def __init__(self, symbol, contexts, workload, data_shapes, label_shapes, param_names,
+                 for_training, inputs_need_grad, shared_group=None, logger=logging,
+                 fixed_param_names=None, grad_req='write', state_names=None):
+        self.param_names = param_names
+        self.arg_names = symbol.list_arguments()
+        self.aux_names = symbol.list_auxiliary_states()
+
+        self.symbol = symbol
+        self.contexts = contexts
+        self.workload = workload
+
+        self.for_training = for_training
+        self.inputs_need_grad = inputs_need_grad
+
+        self.logger = logger
+        # In the future we should have a better way to profile memory per device (haibin)
+        # self._total_exec_bytes = 0
+        self.fixed_param_names = fixed_param_names
+        if self.fixed_param_names is None:
+            self.fixed_param_names = []
+
+        self.state_names = state_names
+        if self.state_names is None:
+            self.state_names = []
+
+        if not for_training:
+            grad_req = 'null'
+
+        # data_shapes = [x if isinstance(x, DataDesc) else DataDesc(*x) for x in data_shapes]
+        # if label_shapes is not None:
+        #     label_shapes = [x if isinstance(x, DataDesc) else DataDesc(*x) for x in label_shapes]
+
+        data_names = [x.name for x in data_shapes[0]]
+
+        if isinstance(grad_req, str):
+            self.grad_req = {}
+            for k in self.arg_names:
+                if k in self.param_names:
+                    self.grad_req[k] = 'null' if k in self.fixed_param_names else grad_req
+                elif k in data_names:
+                    self.grad_req[k] = grad_req if self.inputs_need_grad else 'null'
+                else:
+                    self.grad_req[k] = 'null'
+        elif isinstance(grad_req, (list, tuple)):
+            assert len(grad_req) == len(self.arg_names)
+            self.grad_req = dict(zip(self.arg_names, grad_req))
+        elif isinstance(grad_req, dict):
+            self.grad_req = {}
+            for k in self.arg_names:
+                if k in self.param_names:
+                    self.grad_req[k] = 'null' if k in self.fixed_param_names else 'write'
+                elif k in data_names:
+                    self.grad_req[k] = 'write' if self.inputs_need_grad else 'null'
+                else:
+                    self.grad_req[k] = 'null'
+            self.grad_req.update(grad_req)
+        else:
+            raise ValueError("grad_req must be one of str, list, tuple, or dict.")
+
+        if shared_group is not None:
+            self.shared_data_arrays = shared_group.shared_data_arrays
+        else:
+            self.shared_data_arrays = [{} for _ in contexts]
+
+        # initialize some instance variables
+        self.batch_size = len(data_shapes)
+        self.slices = None
+        self.execs = []
+        self._default_execs = None
+        self.data_arrays = None
+        self.label_arrays = None
+        self.param_arrays = None
+        self.state_arrays = None
+        self.grad_arrays = None
+        self.aux_arrays = None
+        self.input_grad_arrays = None
+
+        self.data_shapes = None
+        self.label_shapes = None
+        self.data_layouts = None
+        self.label_layouts = None
+        self.output_layouts = [DataDesc.get_batch_axis(self.symbol[name].attr('__layout__'))
+                               for name in self.symbol.list_outputs()]
+        self.bind_exec(data_shapes, label_shapes, shared_group)
+
+    def decide_slices(self, data_shapes):
+        """Decide the slices for each context according to the workload.
+
+        Parameters
+        ----------
+        data_shapes : list
+            list of (name, shape) specifying the shapes for the input data or label.
+        """
+        assert len(data_shapes) > 0
+        major_axis = [DataDesc.get_batch_axis(x.layout) for x in data_shapes]
+
+        for (name, shape), axis in zip(data_shapes, major_axis):
+            if axis == -1:
+                continue
+
+            batch_size = shape[axis]
+            if self.batch_size is not None:
+                assert batch_size == self.batch_size, ("all data must have the same batch size: "
+                                                       + ("batch_size = %d, but " % self.batch_size)
+                                                       + ("%s has shape %s" % (name, shape)))
+            else:
+                self.batch_size = batch_size
+                self.slices = _split_input_slice(self.batch_size, self.workload)
+
+        return major_axis
+
+    def _collect_arrays(self):
+        """Collect internal arrays from executors."""
+        # convenient data structures
+        # self.data_arrays = [[(self.slices[i], e.arg_dict[name]) for i, e in enumerate(self.execs)]
+        #                     for name, _ in self.data_shapes]
+        self.data_arrays = [[e.arg_dict[name] for name, _ in self.data_shapes[0]] for e in self.execs]
+
+        self.state_arrays = [[e.arg_dict[name] for e in self.execs]
+                             for name in self.state_names]
+
+        if self.label_shapes is not None:
+            # self.label_arrays = [[(self.slices[i], e.arg_dict[name])
+            #                       for i, e in enumerate(self.execs)]
+            #                      for name, _ in self.label_shapes]
+            self.label_arrays = [[e.arg_dict[name] for name, _ in self.label_shapes[0]] for e in self.execs]
+        else:
+            self.label_arrays = None
+
+        self.param_arrays = [[exec_.arg_arrays[i] for exec_ in self.execs]
+                             for i, name in enumerate(self.arg_names)
+                             if name in self.param_names]
+        if self.for_training:
+            self.grad_arrays = [[exec_.grad_arrays[i] for exec_ in self.execs]
+                                for i, name in enumerate(self.arg_names)
+                                if name in self.param_names]
+        else:
+            self.grad_arrays = None
+
+        data_names = [x[0] for x in self.data_shapes[0]]
+        if self.inputs_need_grad:
+            self.input_grad_arrays = [[exec_.grad_arrays[i] for exec_ in self.execs]
+                                      for i, name in enumerate(self.arg_names)
+                                      if name in data_names]
+        else:
+            self.input_grad_arrays = None
+
+        self.aux_arrays = [[exec_.aux_arrays[i] for exec_ in self.execs]
+                           for i in range(len(self.aux_names))]
+
+    def bind_exec(self, data_shapes, label_shapes, shared_group=None, reshape=False):
+        """Bind executors on their respective devices.
+
+        Parameters
+        ----------
+        data_shapes : list
+        label_shapes : list
+        shared_group : DataParallelExecutorGroup
+        reshape : bool
+        """
+        assert reshape or not self.execs
+        # self.batch_size = None
+
+        # calculate workload and bind executors
+        # self.data_layouts = self.decide_slices(data_shapes)
+        # if label_shapes is not None:
+        #     # call it to make sure labels has the same batch size as data
+        #     self.label_layouts = self.decide_slices(label_shapes)
+
+        for i in range(len(self.contexts)):
+            # data_shapes_i = self._sliced_shape(data_shapes, i, self.data_layouts)
+            data_shapes_i = data_shapes[i]
+            if label_shapes is not None:
+                label_shapes_i = label_shapes[i]
+                # label_shapes_i = self._sliced_shape(label_shapes, i, self.label_layouts)
+            else:
+                label_shapes_i = []
+
+            if reshape:
+                self.execs[i] = self._default_execs[i].reshape(
+                    allow_up_sizing=True, **dict(data_shapes_i + label_shapes_i))
+            else:
+                self.execs.append(self._bind_ith_exec(i, data_shapes_i, label_shapes_i,
+                                                      shared_group))
+
+        self.data_shapes = data_shapes
+        self.label_shapes = label_shapes
+        self._collect_arrays()
+
+    def reshape(self, data_shapes, label_shapes):
+        """Reshape executors.
+
+        Parameters
+        ----------
+        data_shapes : list
+        label_shapes : list
+        """
+        if self._default_execs is None:
+            self._default_execs = [i for i in self.execs]
+        for i in range(len(self.contexts)):
+            self.execs[i] = self._default_execs[i].reshape(
+                allow_up_sizing=True, **dict(data_shapes[i] + (label_shapes[i] if label_shapes is not None else []))
+            )
+        self.data_shapes = data_shapes
+        self.label_shapes = label_shapes
+        self._collect_arrays()
+
+
+    def set_params(self, arg_params, aux_params):
+        """Assign, i.e. copy parameters to all the executors.
+
+        Parameters
+        ----------
+        arg_params : dict
+            A dictionary of name to `NDArray` parameter mapping.
+        aux_params : dict
+            A dictionary of name to `NDArray` auxiliary variable mapping.
+        """
+        for exec_ in self.execs:
+            exec_.copy_params_from(arg_params, aux_params)
+
+    def get_params(self, arg_params, aux_params):
+        """ Copy data from each executor to `arg_params` and `aux_params`.
+
+        Parameters
+        ----------
+        arg_params : dict of str to NDArray
+            target parameter arrays, keyed by parameter name
+        aux_params : dict of str to NDArray
+            target aux arrays, keyed by aux state name
+
+        Notes
+        -----
+        - This function updates the NDArrays in arg_params and aux_params in place.
+        """
+        for name, block in zip(self.param_names, self.param_arrays):
+            weight = sum(w.copyto(ctx.cpu()) for w in block) / len(block)
+            weight.astype(arg_params[name].dtype).copyto(arg_params[name])
+        for name, block in zip(self.aux_names, self.aux_arrays):
+            weight = sum(w.copyto(ctx.cpu()) for w in block) / len(block)
+            weight.astype(aux_params[name].dtype).copyto(aux_params[name])
+
+    def forward(self, data_batch, is_train=None):
+        """Split `data_batch` according to workload and run forward on each device.
+
+        Parameters
+        ----------
+        data_batch : DataBatch
+            Or could be any object implementing similar interface.
+        is_train : bool
+            The hint for the backend, indicating whether we are in the training phase.
+            Default is `None`, then the value `self.for_training` will be used.
+        Returns
+        -------
+        None
+        """
+        _load_data(data_batch, self.data_arrays, self.data_layouts)
+        if is_train is None:
+            is_train = self.for_training
+
+        if self.label_arrays is not None:
+            assert not is_train or data_batch.label
+            if data_batch.label:
+                _load_label(data_batch, self.label_arrays, self.label_layouts)
+
+        for exec_ in self.execs:
+            exec_.forward(is_train=is_train)
+
+    def get_outputs(self, merge_multi_context=True):
+        """Get outputs of the previous forward computation.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the outputs
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look like from a single
+            executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        outputs = [[exec_.outputs[i] for exec_ in self.execs]
+                   for i in range(len(self.execs[0].outputs))]
+        if merge_multi_context:
+            outputs = _merge_multi_context(outputs, self.output_layouts)
+        return outputs
+
+    def get_states(self, merge_multi_context=True):
+        """Get states from all devices
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the states
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look like from a single
+            executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert not merge_multi_context, \
+            "merge_multi_context=True is not supported for get_states yet."
+        return self.state_arrays
+
+    def set_states(self, states=None, value=None):
+        """Set value for states. Only one of states & value can be specified.
+
+        Parameters
+        ----------
+        states : list of list of NDArrays
+            source states arrays formatted like [[state1_dev1, state1_dev2],
+            [state2_dev1, state2_dev2]].
+        value : number
+            a single scalar value for all state arrays.
+        """
+        if states is not None:
+            assert value is None, "Only one of states & value can be specified."
+            _load_general(states, self.state_arrays, (0,)*len(states))
+        else:
+            assert value is not None, "At least one of states & value must be specified."
+            assert states is None, "Only one of states & value can be specified."
+            for d_dst in self.state_arrays:
+                for dst in d_dst:
+                    dst[:] = value
+
+    def get_input_grads(self, merge_multi_context=True):
+        """Get the gradients with respect to the inputs of the module.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the outputs
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look like from a single
+            executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[grad1, grad2]`. Otherwise, it
+        is like `[[grad1_dev1, grad1_dev2], [grad2_dev1, grad2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.inputs_need_grad
+        if merge_multi_context:
+            return _merge_multi_context(self.input_grad_arrays, self.data_layouts)
+        return self.input_grad_arrays
+
+    def backward(self, out_grads=None):
+        """Run backward on all devices. A backward should be called after
+        a call to the forward function. Backward cannot be called unless
+        `self.for_training` is `True`.
+
+        Parameters
+        ----------
+        out_grads : NDArray or list of NDArray, optional
+            Gradient on the outputs to be propagated back.
+            This parameter is only needed when bind is called
+            on outputs that are not a loss function.
+        """
+        assert self.for_training, 're-bind with for_training=True to run backward'
+        if out_grads is None:
+            out_grads = []
+
+        # NOTE: `out_grads` is not sliced per device here; every executor gets an empty list.
+        for i, exec_ in enumerate(self.execs):
+            out_grads_slice = []
+            exec_.backward(out_grads=out_grads_slice)
+
+    def update_metric(self, eval_metric, labels):
+        """Accumulate the performance according to `eval_metric` on all devices.
+
+        Parameters
+        ----------
+        eval_metric : EvalMetric
+            The metric used for evaluation.
+        labels : list of NDArray
+            Typically comes from `label` of a `DataBatch`.
+        """
+        for texec, dev_labels in zip(self.execs, labels):
+            eval_metric.update(dev_labels, texec.outputs)
+
+    def _bind_ith_exec(self, i, data_shapes, label_shapes, shared_group):
+        """Internal utility function to bind the i-th executor.
+        """
+        shared_exec = None if shared_group is None else shared_group.execs[i]
+        context = self.contexts[i]
+        shared_data_arrays = self.shared_data_arrays[i]
+
+        input_shapes = dict(data_shapes)
+        if label_shapes is not None:
+            input_shapes.update(dict(label_shapes))
+
+        arg_shapes, _, aux_shapes = self.symbol.infer_shape(**input_shapes)
+        assert arg_shapes is not None, "shape inference failed"
+
+        input_types = {x.name: x.dtype for x in data_shapes}
+        if label_shapes is not None:
+            input_types.update({x.name: x.dtype for x in label_shapes})
+        arg_types, _, aux_types = self.symbol.infer_type(**input_types)
+        assert arg_types is not None, "type inference failed"
+
+        arg_arrays = []
+        grad_arrays = {} if self.for_training else None
+
+        def _get_or_reshape(name, shared_data_arrays, arg_shape, arg_type, context, logger):
+            """Internal helper to get a memory block or re-use by re-shaping"""
+            if name in shared_data_arrays:
+                arg_arr = shared_data_arrays[name]
+
+                if np.prod(arg_arr.shape) >= np.prod(arg_shape):
+                    # nice, we can directly re-use this data blob
+                    assert arg_arr.dtype == arg_type
+                    arg_arr = arg_arr.reshape(arg_shape)
+                else:
+                    logger.warning(('bucketing: data "%s" has a shape %s' % (name, arg_shape)) +
+                                   (', which is larger than already allocated ') +
+                                   ('shape %s' % (arg_arr.shape,)) +
+                                   ('. Need to re-allocate. Consider putting ') +
+                                   ('default_bucket_key to') +
+                                   (' be the bucket taking the largest input for better ') +
+                                   ('memory sharing.'))
+                    arg_arr = nd.zeros(arg_shape, context, dtype=arg_type)
+
+                    # replace existing shared array because the new one is bigger
+                    shared_data_arrays[name] = arg_arr
+            else:
+                arg_arr = nd.zeros(arg_shape, context, dtype=arg_type)
+                shared_data_arrays[name] = arg_arr
+
+            return arg_arr
+
+        # create or borrow arguments and gradients
+        for j in range(len(self.arg_names)):
+            name = self.arg_names[j]
+            if name in self.param_names: # model parameters
+                if shared_exec is None:
+                    arg_arr = nd.zeros(arg_shapes[j], context, dtype=arg_types[j])
+                    if self.grad_req[name] != 'null':
+                        grad_arr = nd.zeros(arg_shapes[j], context, dtype=arg_types[j])
+                        grad_arrays[name] = grad_arr
+                else:
+                    arg_arr = shared_exec.arg_dict[name]
+                    assert arg_arr.shape == arg_shapes[j]
+                    assert arg_arr.dtype == arg_types[j]
+                    if self.grad_req[name] != 'null':
+                        grad_arrays[name] = shared_exec.grad_dict[name]
+            else: # data, label, or states
+                arg_arr = _get_or_reshape(name, shared_data_arrays, arg_shapes[j], arg_types[j],
+                                          context, self.logger)
+
+                # data might also need grad if inputs_need_grad is True
+                if self.grad_req[name] != 'null':
+                    grad_arrays[name] = _get_or_reshape('grad of ' + name, shared_data_arrays,
+                                                        arg_shapes[j], arg_types[j], context,
+                                                        self.logger)
+
+            arg_arrays.append(arg_arr)
+
+        # create or borrow aux variables
+        if shared_exec is None:
+            aux_arrays = [nd.zeros(s, context, dtype=t) for s, t in zip(aux_shapes, aux_types)]
+        else:
+            for j, arr in enumerate(shared_exec.aux_arrays):
+                assert aux_shapes[j] == arr.shape
+                assert aux_types[j] == arr.dtype
+            aux_arrays = shared_exec.aux_arrays[:]
+
+        executor = self.symbol.bind(ctx=context, args=arg_arrays,
+                                    args_grad=grad_arrays, aux_states=aux_arrays,
+                                    grad_req=self.grad_req, shared_exec=shared_exec)
+        # Get the total bytes allocated for this executor
+        # self._total_exec_bytes += int(executor.debug_str().split('\n')[-3].split()[1])
+        return executor
+
+    def _sliced_shape(self, shapes, i, major_axis):
+        """Get the sliced shapes for the i-th executor.
+
+        Parameters
+        ----------
+        shapes : list of (str, tuple)
+            The original (name, shape) pairs.
+        i : int
+            Which executor we are dealing with.
+        """
+        sliced_shapes = []
+        for desc, axis in zip(shapes, major_axis):
+            shape = list(desc.shape)
+            if axis >= 0:
+                shape[axis] = self.slices[i].stop - self.slices[i].start
+            sliced_shapes.append(DataDesc(desc.name, tuple(shape), desc.dtype, desc.layout))
+        return sliced_shapes
+
+    def install_monitor(self, mon):
+        """Install monitor on all executors"""
+        for exe in self.execs:
+            mon.install(exe)
diff --git a/deeplab/core/__init__.py b/deeplab/core/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/deeplab/core/callback.py b/deeplab/core/callback.py
new file mode 100644
index 0000000..f970d5e
--- /dev/null
+++ b/deeplab/core/callback.py
@@ -0,0 +1,45 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import time
+import logging
+import mxnet as mx
+
+class Speedometer(object):
+    def __init__(self, batch_size, frequent=50):
+        self.batch_size = batch_size
+        self.frequent = frequent
+        self.init = False
+        self.tic = 0
+        self.last_count = 0
+
+    def __call__(self, param):
+        """Callback to show speed."""
+        count = param.nbatch
+        if self.last_count > count:
+            self.init = False
+        self.last_count = count
+
+        if self.init:
+            if count % self.frequent == 0:
+                speed = self.frequent * self.batch_size / (time.time() - self.tic)
+                s = ''
+                if param.eval_metric is not None:
+                    name, value = param.eval_metric.get()
+                    s = "Epoch[%d] Batch [%d]\tSpeed: %.2f samples/sec\tTrain-" % (param.epoch, count, speed)
+                    for n, v in zip(name, value):
+                        s += "%s=%f,\t" % (n, v)
+                else:
+                    s = "Iter[%d] Batch [%d]\tSpeed: %.2f samples/sec" % (param.epoch, count, speed)
+
+                logging.info(s)
+                print(s)
+                self.tic = time.time()
+        else:
+            self.init = True
+            self.tic = time.time()
diff --git a/deeplab/core/loader.py b/deeplab/core/loader.py
new file mode 100644
index 0000000..c796fae
--- /dev/null
+++ b/deeplab/core/loader.py
@@ -0,0 +1,266 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import numpy as np
+import mxnet as mx
+import random
+import math
+
+from mxnet.executor_manager import _split_input_slice
+from utils.image import tensor_vstack
+from segmentation.segmentation import get_segmentation_train_batch, get_segmentation_test_batch
+from PIL import Image
+from multiprocessing import Pool
+
+class TestDataLoader(mx.io.DataIter):
+    def __init__(self, segdb, config, batch_size=1, shuffle=False):
+        super(TestDataLoader, self).__init__()
+
+        # save parameters as properties
+        self.segdb = segdb
+        self.batch_size = batch_size
+        self.shuffle = shuffle
+        self.config = config
+
+        # infer properties from segdb
+        self.size = len(self.segdb)
+        self.index = np.arange(self.size)
+
+        # decide data and label names (only for training)
+        self.data_name = ['data']
+        self.label_name = None
+
+        # status variable for synchronization between get_data and get_label
+        self.cur = 0
+        self.data = None
+        self.label = []
+        self.im_info = None
+
+        # get first batch to fill in provide_data and provide_label
+        self.reset()
+        self.get_batch()
+
+    @property
+    def provide_data(self):
+        return [[(k, v.shape) for k, v in zip(self.data_name, self.data[i])] for i in range(len(self.data))]
+
+    @property
+    def provide_label(self):
+        return [None for i in range(len(self.data))]
+
+    @property
+    def provide_data_single(self):
+        return [(k, v.shape) for k, v in zip(self.data_name, self.data[0])]
+
+    @property
+    def provide_label_single(self):
+        return None
+
+    def reset(self):
+        self.cur = 0
+        if self.shuffle:
+            np.random.shuffle(self.index)
+
+    def iter_next(self):
+        return self.cur < self.size
+
+    def next(self):
+        if self.iter_next():
+            self.get_batch()
+            self.cur += self.batch_size
+            return mx.io.DataBatch(data=self.data, label=self.label,
+                                   pad=self.getpad(), index=self.getindex(),
+                                   provide_data=self.provide_data, provide_label=self.provide_label)
+        else:
+            raise StopIteration
+
+    def getindex(self):
+        return self.cur // self.batch_size
+
+    def getpad(self):
+        if self.cur + self.batch_size > self.size:
+            return self.cur + self.batch_size - self.size
+        else:
+            return 0
+
+    def get_batch(self):
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        segdb = [self.segdb[self.index[i]] for i in range(cur_from, cur_to)]
+
+        data, label, im_info = get_segmentation_test_batch(segdb, self.config)
+
+        self.data = [[mx.nd.array(data[i][name]) for name in self.data_name] for i in range(len(data))]
+        self.im_info = im_info
+
+class TrainDataLoader(mx.io.DataIter):
+    def __init__(self, sym, segdb, config, batch_size=1, crop_height = 768, crop_width = 1024, shuffle=False, ctx=None, work_load_list=None):
+        """
+        This Iter will provide seg data to Deeplab network
+        :param sym: to infer shape
+        :param segdb: must be preprocessed
+        :param config: config file
+        :param batch_size: must divide BATCH_SIZE(128)
+        :param crop_height: the height of cropped image
+        :param crop_width: the width of cropped image
+        :param shuffle: bool
+        :param ctx: list of contexts
+        :param work_load_list: list of work load
+        :return: DataLoader
+        """
+        super(TrainDataLoader, self).__init__()
+
+        # save parameters as properties
+        self.sym = sym
+        self.segdb = segdb
+        self.config = config
+        self.batch_size = batch_size
+        if self.config.TRAIN.ENABLE_CROP:
+            self.crop_height = crop_height
+            self.crop_width = crop_width
+        else:
+            self.crop_height = None
+            self.crop_width = None
+
+        self.shuffle = shuffle
+        self.ctx = ctx
+
+        if self.ctx is None:
+            self.ctx = [mx.cpu()]
+        self.work_load_list = work_load_list
+
+        # infer properties from segdb
+        self.size = len(segdb)
+        self.index = np.arange(self.size)
+
+        # decide data and label names
+        self.data_name = ['data']
+        self.label_name = ['label']
+
+        # status variable for synchronization between get_data and get_label
+        self.cur = 0
+        self.batch = None
+        self.data = None
+        self.label = None
+
+        # init multi-process pool
+        self.pool = Pool(processes = len(self.ctx))
+
+        # get first batch to fill in provide_data and provide_label
+        self.reset()
+        self.get_batch_parallel()
+        random.seed()
+
+    @property
+    def provide_data(self):
+        return [[(k, v.shape) for k, v in zip(self.data_name, self.data[i])] for i in range(len(self.data))]
+
+    @property
+    def provide_label(self):
+        return [[(k, v.shape) for k, v in zip(self.label_name, self.label[i])] for i in range(len(self.data))]
+
+    @property
+    def provide_data_single(self):
+        return [(k, v.shape) for k, v in zip(self.data_name, self.data[0])]
+
+    @property
+    def provide_label_single(self):
+        return [(k, v.shape) for k, v in zip(self.label_name, self.label[0])]
+
+    def reset(self):
+        self.cur = 0
+        if self.shuffle:
+            np.random.shuffle(self.index)
+
+    def iter_next(self):
+        return self.cur + self.batch_size <= self.size
+
+    def next(self):
+        if self.iter_next():
+            self.get_batch_parallel()
+            self.cur += self.batch_size
+            return mx.io.DataBatch(data=self.data, label=self.label,
+                                   pad=self.getpad(), index=self.getindex(),
+                                   provide_data=self.provide_data, provide_label=self.provide_label)
+        else:
+            raise StopIteration
+
+    def getindex(self):
+        return self.cur // self.batch_size
+
+    def getpad(self):
+        if self.cur + self.batch_size > self.size:
+            return self.cur + self.batch_size - self.size
+        else:
+            return 0
+
+    def infer_shape(self, max_data_shape=None, max_label_shape=None):
+        """ Return maximum data and label shape for single gpu """
+        if max_data_shape is None:
+            max_data_shape = []
+        if max_label_shape is None:
+            max_label_shape = []
+
+        max_shapes = dict(max_data_shape + max_label_shape)
+        _, label_shape, _ = self.sym.infer_shape(**max_shapes)
+        label_shape = [(self.label_name[0], label_shape)]
+        return max_data_shape, label_shape
+
+    def get_batch_parallel(self):
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        segdb = [self.segdb[self.index[i]] for i in range(cur_from, cur_to)]
+
+        # decide multi device slice
+        work_load_list = self.work_load_list
+        ctx = self.ctx
+        if work_load_list is None:
+            work_load_list = [1] * len(ctx)
+        assert isinstance(work_load_list, list) and len(work_load_list) == len(ctx), \
+            "Invalid settings for work load. "
+        slices = _split_input_slice(self.batch_size, work_load_list)
+
+        multiprocess_results = []
+
+        for idx, islice in enumerate(slices):
+            isegdb = [segdb[i] for i in range(islice.start, islice.stop)]
+            multiprocess_results.append(self.pool.apply_async(parfetch, (self.config, self.crop_width, self.crop_height, isegdb)))
+
+        rst = [multiprocess_result.get() for multiprocess_result in multiprocess_results]
+
+        all_data = [_['data'] for _ in rst]
+        all_label = [_['label'] for _ in rst]
+        self.data = [[mx.nd.array(data[key]) for key in self.data_name] for data in all_data]
+        self.label = [[mx.nd.array(label[key]) for key in self.label_name] for label in all_label]
+
+def parfetch(config, crop_width, crop_height, isegdb):
+    # get training data for multigpu
+    data, label = get_segmentation_train_batch(isegdb, config)
+    if config.TRAIN.ENABLE_CROP:
+        data_internal = data['data']
+        label_internal = label['label']
+
+        sx = math.floor(random.random() * (data_internal.shape[3] - crop_width + 1))
+        sy = math.floor(random.random() * (data_internal.shape[2] - crop_height + 1))
+        sx = int(sx)
+        sy = int(sy)
+        assert(sx >= 0 and sx < data_internal.shape[3] - crop_width + 1)
+        assert(sy >= 0 and sy < data_internal.shape[2] - crop_height + 1)
+
+        ex = int(sx + crop_width - 1)
+        ey = int(sy + crop_height - 1)
+
+        data_internal = data_internal[:, :, sy : ey + 1, sx : ex + 1]
+        label_internal = label_internal[:, :, sy : ey + 1, sx : ex + 1]
+
+        data['data'] = data_internal
+        label['label'] = label_internal
+        assert (data['data'].shape[2] == crop_height) and (data['data'].shape[3] == crop_width)
+        assert (label['label'].shape[2] == crop_height) and (label['label'].shape[3] == crop_width)
+
+    return {'data': data, 'label': label}
diff --git a/deeplab/core/metric.py b/deeplab/core/metric.py
new file mode 100644
index 0000000..b3eb4b8
--- /dev/null
+++ b/deeplab/core/metric.py
@@ -0,0 +1,39 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import mxnet as mx
+import numpy as np
+
+class FCNLogLossMetric(mx.metric.EvalMetric):
+    def __init__(self, show_interval):
+        super(FCNLogLossMetric, self).__init__('FCNLogLoss')
+        self.show_interval = show_interval
+        self.sum_metric = 0
+        self.num_inst = 0
+
+    def update(self, labels, preds):
+        pred = preds[0]
+        label = labels[0]
+
+        # label (b, p)
+        label = label.asnumpy().astype('int32').reshape((-1))
+        # pred (b, c, p) or (b, c, h, w) --> (b, p, c) --> (b*p, c)
+        pred = pred.asnumpy().reshape((pred.shape[0], pred.shape[1], -1)).transpose((0, 2, 1))
+        pred = pred.reshape((label.shape[0], -1))
+
+        # filter with keep_inds
+        keep_inds = np.where(label != 255)[0]
+        label = label[keep_inds]
+        cls = pred[keep_inds, label]
+
+        cls += 1e-14
+        cls_loss = -1 * np.log(cls)
+        cls_loss = np.sum(cls_loss)
+
+        self.sum_metric += cls_loss
+        self.num_inst += label.shape[0]
diff --git a/deeplab/core/module.py b/deeplab/core/module.py
new file mode 100644
index 0000000..8eff831
--- /dev/null
+++ b/deeplab/core/module.py
@@ -0,0 +1,1069 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+"""A `MutableModule` implements the `BaseModule` API and allows the input shape
+to vary across training iterations. If shapes vary, executors will rebind,
+using shared arrays from the initial module bound with the maximum shape.
+"""
+
+import time
+import logging
+import warnings
+
+from mxnet import context as ctx
+from mxnet.initializer import Uniform, InitDesc
+from mxnet.module.base_module import BaseModule, _check_input_names, _parse_data_desc, _as_list
+from mxnet.model import _create_kvstore, _initialize_kvstore, _update_params, _update_params_on_kvstore, load_checkpoint, BatchEndParam
+from mxnet import metric
+
+from .DataParallelExecutorGroup import DataParallelExecutorGroup
+from mxnet import ndarray as nd
+from mxnet import optimizer as opt
+
+
+class Module(BaseModule):
+    """Module is a basic module that wraps a `Symbol`. It is functionally the same
+    as the `FeedForward` model, except under the module API.
+
+    Parameters
+    ----------
+    symbol : Symbol
+    data_names : list of str
+        Default is `('data',)` for a typical model used in image classification.
+    label_names : list of str
+        Default is `('softmax_label',)` for a typical model used in image
+        classification.
+    logger : Logger
+        Default is `logging`.
+    context : Context or list of Context
+        Default is `cpu()`.
+    work_load_list : list of number
+        Default `None`, indicating uniform workload.
+    fixed_param_names: list of str
+        Default `None`, indicating no network parameters are fixed.
+    state_names : list of str
+        States are similar to data and label, but are not provided by the data iterator.
+        Instead they are initialized to 0 and can be set by `set_states()`.
+    """
+    def __init__(self, symbol, data_names=('data',), label_names=('softmax_label',),
+                 logger=logging, context=ctx.cpu(), work_load_list=None,
+                 fixed_param_names=None, state_names=None):
+        super(Module, self).__init__(logger=logger)
+
+        if isinstance(context, ctx.Context):
+            context = [context]
+        self._context = context
+        if work_load_list is None:
+            work_load_list = [1] * len(self._context)
+        assert len(work_load_list) == len(self._context)
+        self._work_load_list = work_load_list
+
+        self._symbol = symbol
+
+        data_names = list(data_names) if data_names is not None else []
+        label_names = list(label_names) if label_names is not None else []
+        state_names = list(state_names) if state_names is not None else []
+        fixed_param_names = list(fixed_param_names) if fixed_param_names is not None else []
+
+        _check_input_names(symbol, data_names, "data", True)
+        _check_input_names(symbol, label_names, "label", False)
+        _check_input_names(symbol, state_names, "state", True)
+        _check_input_names(symbol, fixed_param_names, "fixed_param", True)
+
+        arg_names = symbol.list_arguments()
+        input_names = data_names + label_names + state_names
+        self._param_names = [x for x in arg_names if x not in input_names]
+        self._fixed_param_names = fixed_param_names
+        self._aux_names = symbol.list_auxiliary_states()
+        self._data_names = data_names
+        self._label_names = label_names
+        self._state_names = state_names
+        self._output_names = symbol.list_outputs()
+
+        self._arg_params = None
+        self._aux_params = None
+        self._params_dirty = False
+
+        self._optimizer = None
+        self._kvstore = None
+        self._update_on_kvstore = None
+        self._updater = None
+        self._preload_opt_states = None
+        self._grad_req = None
+
+        self._exec_group = None
+        self._data_shapes = None
+        self._label_shapes = None
+
+    @staticmethod
+    def load(prefix, epoch, load_optimizer_states=False, **kwargs):
+        """Create a model from a previously saved checkpoint.
+
+        Parameters
+        ----------
+        prefix : str
+            path prefix of saved model files. You should have
+            "prefix-symbol.json", "prefix-xxxx.params", and
+            optionally "prefix-xxxx.states", where xxxx is the
+            epoch number.
+        epoch : int
+            epoch to load.
+        load_optimizer_states : bool
+            whether to load optimizer states. Checkpoint needs
+            to have been made with save_optimizer_states=True.
+        data_names : list of str
+            Default is `('data',)` for a typical model used in image classification.
+        label_names : list of str
+            Default is `('softmax_label',)` for a typical model used in image
+            classification.
+        logger : Logger
+            Default is `logging`.
+        context : Context or list of Context
+            Default is `cpu()`.
+        work_load_list : list of number
+            Default `None`, indicating uniform workload.
+        fixed_param_names: list of str
+            Default `None`, indicating no network parameters are fixed.
+        """
+        sym, args, auxs = load_checkpoint(prefix, epoch)
+        mod = Module(symbol=sym, **kwargs)
+        mod._arg_params = args
+        mod._aux_params = auxs
+        mod.params_initialized = True
+        if load_optimizer_states:
+            mod._preload_opt_states = '%s-%04d.states'%(prefix, epoch)
+        return mod
+
+    def save_checkpoint(self, prefix, epoch, save_optimizer_states=False):
+        """Save current progress to checkpoint.
+        Use mx.callback.module_checkpoint as epoch_end_callback to save during training.
+
+        Parameters
+        ----------
+        prefix : str
+            The file prefix to checkpoint to
+        epoch : int
+            The current epoch number
+        save_optimizer_states : bool
+            Whether to save optimizer states to continue training
+        """
+        self._symbol.save('%s-symbol.json'%prefix)
+        param_name = '%s-%04d.params' % (prefix, epoch)
+        self.save_params(param_name)
+        logging.info('Saved checkpoint to \"%s\"', param_name)
+        if save_optimizer_states:
+            state_name = '%s-%04d.states' % (prefix, epoch)
+            self.save_optimizer_states(state_name)
+            logging.info('Saved optimizer state to \"%s\"', state_name)
+
+    def _reset_bind(self):
+        """Internal function to reset binded state."""
+        self.binded = False
+        self._exec_group = None
+        self._data_shapes = None
+        self._label_shapes = None
+
+    @property
+    def data_names(self):
+        """A list of names for data required by this module."""
+        return self._data_names
+
+    @property
+    def label_names(self):
+        """A list of names for labels required by this module."""
+        return self._label_names
+
+    @property
+    def output_names(self):
+        """A list of names for the outputs of this module."""
+        return self._output_names
+
+    @property
+    def data_shapes(self):
+        """Get data shapes.
+        Returns
+        -------
+        A list of `(name, shape)` pairs.
+        """
+        assert self.binded
+        return self._data_shapes
+
+    @property
+    def label_shapes(self):
+        """Get label shapes.
+        Returns
+        -------
+        A list of `(name, shape)` pairs. The return value could be `None` if
+        the module does not need labels, or if the module is not bound for
+        training (in this case, label information is not available).
+        """
+        assert self.binded
+        return self._label_shapes
+
+    @property
+    def output_shapes(self):
+        """Get output shapes.
+        Returns
+        -------
+        A list of `(name, shape)` pairs.
+        """
+        assert self.binded
+        return self._exec_group.get_output_shapes()
+
+    def get_params(self):
+        """Get current parameters.
+        Returns
+        -------
+        `(arg_params, aux_params)`, each a dictionary mapping parameter names to
+        their values (`NDArray`).
+        """
+        assert self.binded and self.params_initialized
+
+        if self._params_dirty:
+            self._sync_params_from_devices()
+        return (self._arg_params, self._aux_params)
+
+    def init_params(self, initializer=Uniform(0.01), arg_params=None, aux_params=None,
+                    allow_missing=False, force_init=False):
+        """Initialize the parameters and auxiliary states.
+
+        Parameters
+        ----------
+        initializer : Initializer
+            Called to initialize parameters if needed.
+        arg_params : dict
+            If not None, should be a dictionary of existing arg_params. Values
+            found in it are copied instead of calling the initializer.
+        aux_params : dict
+            If not None, should be a dictionary of existing aux_params. Values
+            found in it are copied instead of calling the initializer.
+        allow_missing : bool
+            If true, params could contain missing values, and the initializer will be
+            called to fill those missing params.
+        force_init : bool
+            If true, will force re-initialize even if already initialized.
+        """
+        if self.params_initialized and not force_init:
+            warnings.warn("Parameters already initialized and force_init=False. "
+                          "init_params call ignored.", stacklevel=2)
+            return
+        assert self.binded, 'call bind before initializing the parameters'
+
+        def _impl(name, arr, cache):
+            """Internal helper for parameter initialization"""
+            if cache is not None:
+                if name in cache:
+                    cache_arr = cache[name]
+
+                    # just in case the cached array is just the target itself
+                    if cache_arr is not arr:
+                        cache_arr.copyto(arr)
+                else:
+                    if not allow_missing:
+                        raise RuntimeError("%s is not present" % name)
+                    if initializer is not None:
+                        initializer(name, arr)
+            else:
+                initializer(name, arr)
+
+        attrs = self._symbol.attr_dict()
+        for name, arr in self._arg_params.items():
+            desc = InitDesc(name, attrs.get(name, None))
+            _impl(desc, arr, arg_params)
+
+        for name, arr in self._aux_params.items():
+            desc = InitDesc(name, attrs.get(name, None))
+            _impl(desc, arr, aux_params)
+
+        self.params_initialized = True
+        self._params_dirty = False
+
+        # copy the initialized parameters to devices
+        self._exec_group.set_params(self._arg_params, self._aux_params)
+
+    def set_params(self, arg_params, aux_params, allow_missing=False, force_init=True):
+        """Assign parameter and aux state values.
+
+        Parameters
+        ----------
+        arg_params : dict
+            Dictionary of name to value (`NDArray`) mapping.
+        aux_params : dict
+            Dictionary of name to value (`NDArray`) mapping.
+        allow_missing : bool
+            If true, params could contain missing values, and the initializer will be
+            called to fill those missing params.
+        force_init : bool
+            If true, will force re-initialize even if already initialized.
+
+        Examples
+        --------
+        An example of setting module parameters::
+            >>> sym, arg_params, aux_params = \
+            ...     mx.model.load_checkpoint(model_prefix, n_epoch_load)
+            >>> mod.set_params(arg_params=arg_params, aux_params=aux_params)
+        """
+        if not allow_missing:
+            self.init_params(initializer=None, arg_params=arg_params, aux_params=aux_params,
+                             allow_missing=allow_missing, force_init=force_init)
+            return
+
+        if self.params_initialized and not force_init:
+            warnings.warn("Parameters already initialized and force_init=False. "
+                          "set_params call ignored.", stacklevel=2)
+            return
+
+        self._exec_group.set_params(arg_params, aux_params)
+
+        # because we didn't update self._arg_params, they are dirty now.
+        self._params_dirty = True
+        self.params_initialized = True
+
+    def bind(self, data_shapes, label_shapes=None, for_training=True,
+             inputs_need_grad=False, force_rebind=False, shared_module=None,
+             grad_req='write'):
+        """Bind the symbols to construct executors. This is necessary before one
+        can perform computation with the module.
+
+        Parameters
+        ----------
+        data_shapes : list of list of (str, tuple)
+            One list of `(name, shape)` pairs per device.
+            Typically is `data_iter.provide_data`.
+        label_shapes : list of list of (str, tuple)
+            One list of `(name, shape)` pairs per device.
+            Typically is `data_iter.provide_label`.
+        for_training : bool
+            Default is `True`. Whether the executors should be bound for training.
+        inputs_need_grad : bool
+            Default is `False`. Whether the gradients to the input data need to be computed.
+            Typically this is not needed, but it may be when implementing composition
+            of modules.
+        force_rebind : bool
+            Default is `False`. This function does nothing if the executors are already
+            bound. When `True`, the executors are forced to rebind.
+        shared_module : Module
+            Default is `None`. This is used in bucketing. When not `None`, the shared module
+            essentially corresponds to a different bucket -- a module with a different symbol
+            but with the same sets of parameters (e.g. unrolled RNNs with different lengths).
+        grad_req : str
+            Default is `'write'`. Gradient requirement forwarded to the executor group.
+        """
+        # force rebinding is typically used when one wants to switch from
+        # the training phase to the prediction phase.
+        if force_rebind:
+            self._reset_bind()
+
+        if self.binded:
+            self.logger.warning('Already binded, ignoring bind()')
+            return
+
+        self.for_training = for_training
+        self.inputs_need_grad = inputs_need_grad
+        self.binded = True
+        self._grad_req = grad_req
+
+        if not for_training:
+            assert not inputs_need_grad
+        # label_shapes is deliberately not asserted to be non-None when training:
+        # some modules do not contain a loss function that consumes the labels.
+
+        # data_shapes and label_shapes carry one entry per device, so each
+        # (data_shape, label_shape) pair is parsed separately.
+        # label_shapes defaults to None; expand it so that zip() below does not fail.
+        if label_shapes is None:
+            label_shapes = [None] * len(data_shapes)
+        self._data_shapes, self._label_shapes = zip(*[_parse_data_desc(self.data_names, self.label_names, data_shape, label_shape)
+                                                      for data_shape, label_shape in zip(data_shapes, label_shapes)])
+        if self._label_shapes.count(None) == len(self._label_shapes):
+            self._label_shapes = None
+
+        if shared_module is not None:
+            assert isinstance(shared_module, Module) and \
+                    shared_module.binded and shared_module.params_initialized
+            shared_group = shared_module._exec_group
+        else:
+            shared_group = None
+
+        self._exec_group = DataParallelExecutorGroup(self._symbol, self._context,
+                                                     self._work_load_list, self._data_shapes,
+                                                     self._label_shapes, self._param_names,
+                                                     for_training, inputs_need_grad,
+                                                     shared_group, logger=self.logger,
+                                                     fixed_param_names=self._fixed_param_names,
+                                                     grad_req=grad_req,
+                                                     state_names=self._state_names)
+        # self._total_exec_bytes = self._exec_group._total_exec_bytes
+        if shared_module is not None:
+            self.params_initialized = True
+            self._arg_params = shared_module._arg_params
+            self._aux_params = shared_module._aux_params
+        elif self.params_initialized:
+            # if the parameters are already initialized, we are re-binding
+            # so automatically copy the already initialized params
+            self._exec_group.set_params(self._arg_params, self._aux_params)
+        else:
+            assert self._arg_params is None and self._aux_params is None
+            param_arrays = [
+                nd.zeros(x[0].shape, dtype=x[0].dtype)
+                for x in self._exec_group.param_arrays
+            ]
+            self._arg_params = {name:arr for name, arr in zip(self._param_names, param_arrays)}
+
+            aux_arrays = [
+                nd.zeros(x[0].shape, dtype=x[0].dtype)
+                for x in self._exec_group.aux_arrays
+            ]
+            self._aux_params = {name:arr for name, arr in zip(self._aux_names, aux_arrays)}
+
+        if shared_module is not None and shared_module.optimizer_initialized:
+            self.borrow_optimizer(shared_module)
+
+
+    def reshape(self, data_shapes, label_shapes=None):
+        """Reshape the module for new input shapes.
+
+        Parameters
+        ----------
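+        data_shapes : list of list of (str, tuple)
+            One list of `(name, shape)` pairs per device.
+            Typically is `data_iter.provide_data`.
+        label_shapes : list of list of (str, tuple)
+            One list of `(name, shape)` pairs per device.
+            Typically is `data_iter.provide_label`.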
+        """
+        assert self.binded
+        # data_shapes and label_shapes carry one entry per device, so each
+        # (data_shape, label_shape) pair is parsed separately.
+        # label_shapes defaults to None; expand it so that zip() below does not fail.
+        if label_shapes is None:
+            label_shapes = [None] * len(data_shapes)
+        self._data_shapes, self._label_shapes = zip(*[_parse_data_desc(self.data_names, self.label_names, data_shape, label_shape)
+                                                      for data_shape, label_shape in zip(data_shapes, label_shapes)])
+
+        self._exec_group.reshape(self._data_shapes, self._label_shapes)
+
+
+    def init_optimizer(self, kvstore='local', optimizer='sgd',
+                       optimizer_params=(('learning_rate', 0.01),), force_init=False):
+        """Install and initialize optimizers.
+
+        Parameters
+        ----------
+        kvstore : str or KVStore
+            Default `'local'`.
+        optimizer : str or Optimizer
+            Default `'sgd'`
+        optimizer_params : dict
+            Default `(('learning_rate', 0.01),)`. The default value is not a dictionary,
+            just to avoid pylint warning of dangerous default values.
+        force_init : bool
+            Default `False`, indicating whether we should force re-initializing the
+            optimizer in the case an optimizer is already installed.
+        """
+        assert self.binded and self.params_initialized
+
+        if self.optimizer_initialized and not force_init:
+            self.logger.warning('optimizer already initialized, ignoring...')
+            return
+
+        (kvstore, update_on_kvstore) = \
+                _create_kvstore(kvstore, len(self._context), self._arg_params)
+
+        batch_size = self._exec_group.batch_size
+        if kvstore and 'dist' in kvstore.type and '_sync' in kvstore.type:
+            batch_size *= kvstore.num_workers
+        rescale_grad = 1.0/batch_size
+
+        if isinstance(optimizer, str):
+            idx2name = {}
+            if update_on_kvstore:
+                idx2name.update(enumerate(self._exec_group.param_names))
+            else:
+                for k in range(len(self._context)):
+                    idx2name.update({i*len(self._context)+k: n
+                                     for i, n in enumerate(self._exec_group.param_names)})
+            optimizer_params = dict(optimizer_params)
+            if 'rescale_grad' not in optimizer_params:
+                optimizer_params['rescale_grad'] = rescale_grad
+            optimizer = opt.create(optimizer,
+                                   sym=self.symbol, param_idx2name=idx2name,
+                                   **optimizer_params)
+        else:
+            assert isinstance(optimizer, opt.Optimizer)
+            if optimizer.rescale_grad != rescale_grad:
+                #pylint: disable=no-member
+                warnings.warn(
+                    "Optimizer created manually outside Module but rescale_grad " +
+                    "is not normalized to 1.0/batch_size/num_workers (%s vs. %s). "%(
+                        optimizer.rescale_grad, rescale_grad) +
+                    "Is this intended?", stacklevel=2)
+
+        self._optimizer = optimizer
+        self._kvstore = kvstore
+        self._update_on_kvstore = update_on_kvstore
+        self._updater = None
+
+        if kvstore:
+            # copy initialized local parameters to kvstore
+            _initialize_kvstore(kvstore=kvstore,
+                                param_arrays=self._exec_group.param_arrays,
+                                arg_params=self._arg_params,
+                                param_names=self._param_names,
+                                update_on_kvstore=update_on_kvstore)
+        if update_on_kvstore:
+            kvstore.set_optimizer(self._optimizer)
+        else:
+            self._updater = opt.get_updater(optimizer)
+
+        self.optimizer_initialized = True
+
+        if self._preload_opt_states is not None:
+            self.load_optimizer_states(self._preload_opt_states)
+            self._preload_opt_states = None
+
+    def borrow_optimizer(self, shared_module):
+        """Borrow optimizer from a shared module. Used in bucketing, where exactly the same
+        optimizer (esp. kvstore) is used.
+
+        Parameters
+        ----------
+        shared_module : Module
+        """
+        assert shared_module.optimizer_initialized
+        self._optimizer = shared_module._optimizer
+        self._kvstore = shared_module._kvstore
+        self._update_on_kvstore = shared_module._update_on_kvstore
+        self._updater = shared_module._updater
+        self.optimizer_initialized = True
+
+    def forward(self, data_batch, is_train=None):
+        """Forward computation.
+
+        Parameters
+        ----------
+        data_batch : DataBatch
+            Could be anything with similar API implemented.
+        is_train : bool
+            Default is `None`, which means `is_train` takes the value of `self.for_training`.
+        """
+        assert self.binded and self.params_initialized
+        self._exec_group.forward(data_batch, is_train)
+
+    def backward(self, out_grads=None):
+        """Backward computation.
+
+        Parameters
+        ----------
+        out_grads : NDArray or list of NDArray, optional
+            Gradient on the outputs to be propagated back.
+            This parameter is only needed when bind is called
+            on outputs that are not a loss function.
+        """
+        assert self.binded and self.params_initialized
+        self._exec_group.backward(out_grads=out_grads)
+
+    def update(self):
+        """Update parameters according to the installed optimizer and the gradients computed
+        in the previous forward-backward batch.
+        """
+        assert self.binded and self.params_initialized and self.optimizer_initialized
+
+        self._params_dirty = True
+        if self._update_on_kvstore:
+            _update_params_on_kvstore(self._exec_group.param_arrays,
+                                      self._exec_group.grad_arrays,
+                                      self._kvstore)
+        else:
+            _update_params(self._exec_group.param_arrays,
+                           self._exec_group.grad_arrays,
+                           updater=self._updater,
+                           num_device=len(self._context),
+                           kvstore=self._kvstore)
+
+    def get_outputs(self, merge_multi_context=True):
+        """Get outputs of the previous forward computation.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the outputs
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look as if they came from
+            a single executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.binded and self.params_initialized
+        return self._exec_group.get_outputs(merge_multi_context=merge_multi_context)
+
+    def get_input_grads(self, merge_multi_context=True):
+        """Get the gradients with respect to the inputs of the module.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the gradients
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look as if they came from
+            a single executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[grad1, grad2]`. Otherwise, it
+        is like `[[grad1_dev1, grad1_dev2], [grad2_dev1, grad2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.binded and self.params_initialized and self.inputs_need_grad
+        return self._exec_group.get_input_grads(merge_multi_context=merge_multi_context)
+
+    def get_states(self, merge_multi_context=True):
+        """Get states from all devices
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the states
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look as if they came from
+            a single executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.binded and self.params_initialized
+        return self._exec_group.get_states(merge_multi_context=merge_multi_context)
+
+    def set_states(self, states=None, value=None):
+        """Set value for states. Only one of states & value can be specified.
+
+        Parameters
+        ----------
+        states : list of list of NDArrays
+            source states arrays formatted like [[state1_dev1, state1_dev2],
+            [state2_dev1, state2_dev2]].
+        value : number
+            a single scalar value for all state arrays.
+        """
+        assert self.binded and self.params_initialized
+        self._exec_group.set_states(states, value)
+
+    def update_metric(self, eval_metric, labels):
+        """Evaluate and accumulate evaluation metric on outputs of the last forward computation.
+
+        Parameters
+        ----------
+        eval_metric : EvalMetric
+        labels : list of NDArray
+            Typically `data_batch.label`.
+        """
+        self._exec_group.update_metric(eval_metric, labels)
+
+    def _sync_params_from_devices(self):
+        """Synchronize parameters from devices to CPU. This function should be called after
+        calling `update` that updates the parameters on the devices, before one can read the
+        latest parameters from `self._arg_params` and `self._aux_params`.
+        """
+        self._exec_group.get_params(self._arg_params, self._aux_params)
+        self._params_dirty = False
+
+    def save_optimizer_states(self, fname):
+        """Save optimizer (updater) state to file
+
+        Parameters
+        ----------
+        fname : str
+            Path to output states file.
+        """
+        assert self.optimizer_initialized
+
+        if self._update_on_kvstore:
+            self._kvstore.save_optimizer_states(fname)
+        else:
+            with open(fname, 'wb') as fout:
+                fout.write(self._updater.get_states())
+
+    def load_optimizer_states(self, fname):
+        """Load optimizer (updater) state from file
+
+        Parameters
+        ----------
+        fname : str
+            Path to input states file.
+        """
+        assert self.optimizer_initialized
+
+        if self._update_on_kvstore:
+            self._kvstore.load_optimizer_states(fname)
+        else:
+            # use a context manager so the file handle is closed after reading
+            with open(fname, 'rb') as fin:
+                self._updater.set_states(fin.read())
+
+    def install_monitor(self, mon):
+        """ Install monitor on all executors """
+        assert self.binded
+        self._exec_group.install_monitor(mon)
+
+
+
+
+class MutableModule(BaseModule):
+    """A mutable module is a module that supports variable input data.
+
+    Parameters
+    ----------
+    symbol : Symbol
+    data_names : list of str
+    label_names : list of str
+    logger : Logger
+    context : Context or list of Context
+    work_load_list : list of number
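+    max_data_shapes : list of list of (name, shape) tuple
+        Maximum shapes of the data inputs whose shapes vary. Only the first
+        inner list is read when binding.
+    max_label_shapes : list of list of (name, shape) tuple
+        Maximum shapes of the label inputs whose shapes vary. Only the first
+        inner list is read when binding.
+    fixed_param_prefix : list of str
+        Parameters whose names contain any of these strings are kept fixed.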
+    """
+    def __init__(self, symbol, data_names, label_names,
+                 logger=logging, context=ctx.cpu(), work_load_list=None,
+                 max_data_shapes=None, max_label_shapes=None, fixed_param_prefix=None):
+        super(MutableModule, self).__init__(logger=logger)
+        self._symbol = symbol
+        self._data_names = data_names
+        self._label_names = label_names
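+        # accept a single Context as well as a list; bind() relies on len(self._context)
+        if isinstance(context, ctx.Context):
+            context = [context]
+        self._context = context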
+        self._work_load_list = work_load_list
+
+        self._curr_module = None
+        self._max_data_shapes = max_data_shapes
+        self._max_label_shapes = max_label_shapes
+        self._fixed_param_prefix = fixed_param_prefix
+
+        fixed_param_names = list()
+        if fixed_param_prefix is not None:
+            for name in self._symbol.list_arguments():
+                for prefix in self._fixed_param_prefix:
+                    if prefix in name:
+                        fixed_param_names.append(name)
+                        # stop at the first match so a name is not added twice
+                        break
+        self._fixed_param_names = fixed_param_names
+        self._preload_opt_states = None
+
+    def _reset_bind(self):
+        self.binded = False
+        self._curr_module = None
+
+    @property
+    def data_names(self):
+        return self._data_names
+
+    @property
+    def output_names(self):
+        return self._symbol.list_outputs()
+
+    @property
+    def data_shapes(self):
+        assert self.binded
+        return self._curr_module.data_shapes
+
+    @property
+    def label_shapes(self):
+        assert self.binded
+        return self._curr_module.label_shapes
+
+    @property
+    def output_shapes(self):
+        assert self.binded
+        return self._curr_module.output_shapes
+
+    def get_params(self):
+        assert self.binded and self.params_initialized
+        return self._curr_module.get_params()
+
+    def init_params(self, initializer=Uniform(0.01), arg_params=None, aux_params=None,
+                    allow_missing=False, force_init=False):
+        if self.params_initialized and not force_init:
+            return
+        assert self.binded, 'call bind before initializing the parameters'
+        self._curr_module.init_params(initializer=initializer, arg_params=arg_params,
+                                      aux_params=aux_params, allow_missing=allow_missing,
+                                      force_init=force_init)
+        self.params_initialized = True
+
+    def bind(self, data_shapes, label_shapes=None, for_training=True,
+             inputs_need_grad=False, force_rebind=False, shared_module=None, grad_req='write'):
+        # in case we already initialized params, keep it
+        if self.params_initialized:
+            arg_params, aux_params = self.get_params()
+
+        # force rebinding is typically used when one wants to switch from
+        # the training phase to the prediction phase.
+        if force_rebind:
+            self._reset_bind()
+
+        if self.binded:
+            self.logger.warning('Already binded, ignoring bind()')
+            return
+
+        assert shared_module is None, 'shared_module for MutableModule is not supported'
+
+        self.for_training = for_training
+        self.inputs_need_grad = inputs_need_grad
+        self.binded = True
+
+        max_shapes_dict = dict()
+        if self._max_data_shapes is not None:
+            max_shapes_dict.update(dict(self._max_data_shapes[0]))
+        if self._max_label_shapes is not None:
+            max_shapes_dict.update(dict(self._max_label_shapes[0]))
+
+        max_data_shapes = list()
+        for name, shape in data_shapes[0]:
+            if name in max_shapes_dict:
+                max_data_shapes.append((name, max_shapes_dict[name]))
+            else:
+                max_data_shapes.append((name, shape))
+
+        max_label_shapes = list()
+        # label_shapes may be None (the default) or a per-device list of None
+        if label_shapes is not None and label_shapes.count(None) != len(label_shapes):
+            for name, shape in label_shapes[0]:
+                if name in max_shapes_dict:
+                    max_label_shapes.append((name, max_shapes_dict[name]))
+                else:
+                    max_label_shapes.append((name, shape))
+
+        if len(max_label_shapes) == 0:
+            max_label_shapes = None
+
+        module = Module(self._symbol, self._data_names, self._label_names, logger=self.logger,
+                        context=self._context, work_load_list=self._work_load_list,
+                        fixed_param_names=self._fixed_param_names)
+        # bind with one (max) shape list per device and forward grad_req to the inner module
+        module.bind([max_data_shapes for _ in range(len(self._context))],
+                    [max_label_shapes for _ in range(len(self._context))],
+                    for_training, inputs_need_grad, force_rebind=False, shared_module=None,
+                    grad_req=grad_req)
+        self._curr_module = module
+
+        # copy back saved params, if already initialized
+        if self.params_initialized:
+            self.set_params(arg_params, aux_params)
+
+    def save_checkpoint(self, prefix, epoch, save_optimizer_states=False):
+        """Save current progress to checkpoint.
+        Use mx.callback.module_checkpoint as epoch_end_callback to save during training.
+
+        Parameters
+        ----------
+        prefix : str
+            The file prefix to checkpoint to
+        epoch : int
+            The current epoch number
+        save_optimizer_states : bool
+            Whether to save optimizer states so that training can be resumed later.
+        """
+        self._curr_module.save_checkpoint(prefix, epoch, save_optimizer_states)
+
+    def init_optimizer(self, kvstore='local', optimizer='sgd',
+                       optimizer_params=(('learning_rate', 0.01),), force_init=False):
+        assert self.binded and self.params_initialized
+        if self.optimizer_initialized and not force_init:
+            self.logger.warning('optimizer already initialized, ignoring.')
+            return
+
+        self._curr_module._preload_opt_states = self._preload_opt_states
+        self._curr_module.init_optimizer(kvstore, optimizer, optimizer_params,
+                                         force_init=force_init)
+        self.optimizer_initialized = True
+
+    def fit(self, train_data, eval_data=None, eval_metric='acc',
+            epoch_end_callback=None, batch_end_callback=None, kvstore='local',
+            optimizer='sgd', optimizer_params=(('learning_rate', 0.01),),
+            eval_end_callback=None,
+            eval_batch_end_callback=None, initializer=Uniform(0.01),
+            arg_params=None, aux_params=None, allow_missing=False,
+            force_rebind=False, force_init=False, begin_epoch=0, num_epoch=None,
+            validation_metric=None, monitor=None, prefix=None):
+        """Train the module parameters.
+
+        Parameters
+        ----------
+        train_data : DataIter
+        eval_data : DataIter
+            If not `None`, used as the validation set; performance is evaluated on it
+            after each epoch.
+        eval_metric : str or EvalMetric
+            Default `'acc'`. The performance measure used to display during training.
+        epoch_end_callback : function or list of function
+            Each callback will be called with the current `epoch`, `symbol`, `arg_params`
+            and `aux_params`.
+        batch_end_callback : function or list of function
+            Each callback will be called with a `BatchEndParam`.
+        kvstore : str or KVStore
+            Default `'local'`.
+        optimizer : str or Optimizer
+            Default `'sgd'`
+        optimizer_params : dict
+            Default `(('learning_rate', 0.01),)`. The parameters for the optimizer constructor.
+            The default value is not a `dict`, just to avoid pylint warning on dangerous
+            default values.
+        eval_end_callback : function or list of function
+            These will be called at the end of each full evaluation, with the metrics over
+            the entire evaluation set.
+        eval_batch_end_callback : function or list of function
+            These will be called at the end of each minibatch during evaluation
+        initializer : Initializer
+            Will be called to initialize the module parameters if not already initialized.
+        arg_params : dict
+            Default `None`, if not `None`, should be existing parameters from a trained
+            model or loaded from a checkpoint (previously saved model). In this case,
+            the value here will be used to initialize the module parameters, unless they
+            are already initialized by the user via a call to `init_params` or `fit`.
+            `arg_params` takes priority over `initializer`.
+        aux_params : dict
+            Default `None`. Similar to `arg_params`, except for auxiliary states.
+        allow_missing : bool
+            Default `False`. Indicate whether we allow missing parameters when `arg_params`
+            and `aux_params` are not `None`. If this is `True`, then the missing parameters
+            will be initialized via the `initializer`.
+        force_rebind : bool
+            Default `False`. Whether to force rebinding the executors if already bound.
+        force_init : bool
+            Default `False`. Indicate whether we should force initialization even if the
+            parameters are already initialized.
+        begin_epoch : int
+            Default `0`. Indicate the starting epoch. Usually, if we are resuming from a
+            checkpoint saved at a previous training phase at epoch N, then we should specify
+            this value as N+1.
+        num_epoch : int
+            Number of epochs to run training.
+
+        Examples
+        --------
+        An example of using fit for training::
+            >>> #Assume training dataIter and validation dataIter are ready
+            >>> mod.fit(train_data=train_dataiter, eval_data=val_dataiter,
+                        optimizer_params={'learning_rate':0.01, 'momentum': 0.9},
+                        num_epoch=10)
+        """
+        assert num_epoch is not None, 'please specify number of epochs'
+
+        self.bind(data_shapes=train_data.provide_data, label_shapes=train_data.provide_label,
+                  for_training=True, force_rebind=force_rebind)
+        if monitor is not None:
+            self.install_monitor(monitor)
+        self.init_params(initializer=initializer, arg_params=arg_params, aux_params=aux_params,
+                         allow_missing=allow_missing, force_init=force_init)
+        self.init_optimizer(kvstore=kvstore, optimizer=optimizer,
+                            optimizer_params=optimizer_params)
+
+        if validation_metric is None:
+            validation_metric = eval_metric
+        if not isinstance(eval_metric, metric.EvalMetric):
+            eval_metric = metric.create(eval_metric)
+
+        ################################################################################
+        # training loop
+        ################################################################################
+        for epoch in range(begin_epoch, num_epoch):
+            tic = time.time()
+            eval_metric.reset()
+            for nbatch, data_batch in enumerate(train_data):
+                if monitor is not None:
+                    monitor.tic()
+                self.forward_backward(data_batch)
+                self.update()
+                self.update_metric(eval_metric, data_batch.label)
+
+                if monitor is not None:
+                    monitor.toc_print()
+
+                if batch_end_callback is not None:
+                    batch_end_params = BatchEndParam(epoch=epoch, nbatch=nbatch,
+                                                     eval_metric=eval_metric,
+                                                     locals=locals())
+                    for callback in _as_list(batch_end_callback):
+                        callback(batch_end_params)
+
+            # one epoch of training is finished
+            for name, val in eval_metric.get_name_value():
+                self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)
+            toc = time.time()
+            self.logger.info('Epoch[%d] Time cost=%.3f', epoch, (toc-tic))
+
+            # sync aux params across devices
+            arg_params, aux_params = self.get_params()
+            self.set_params(arg_params, aux_params)
+
+            if epoch_end_callback is not None:
+                for callback in _as_list(epoch_end_callback):
+                    callback(epoch, self.symbol, arg_params, aux_params)
+
+            #----------------------------------------
+            # evaluation on validation set
+            if eval_data:
+                res = self.score(eval_data, validation_metric,
+                                 score_end_callback=eval_end_callback,
+                                 batch_end_callback=eval_batch_end_callback, epoch=epoch)
+                #TODO: pull this into default
+                for name, val in res:
+                    self.logger.info('Epoch[%d] Validation-%s=%f', epoch, name, val)
+
+            # end of 1 epoch, reset the data-iter for another epoch
+            train_data.reset()
+
+
+    def forward(self, data_batch, is_train=None):
+        assert self.binded and self.params_initialized
+
+        # get current_shapes
+        if self._curr_module.label_shapes is not None:
+            current_shapes = [dict(self._curr_module.data_shapes[i] + self._curr_module.label_shapes[i]) for i in xrange(len(self._context))]
+        else:
+            current_shapes = [dict(self._curr_module.data_shapes[i]) for i in xrange(len(self._context))]
+
+        # get input_shapes
+        if is_train:
+            input_shapes = [dict(data_batch.provide_data[i] + data_batch.provide_label[i]) for i in xrange(len(self._context))]
+        else:
+            input_shapes = [dict(data_batch.provide_data[i]) for i in xrange(len(data_batch.provide_data))]
+
+        # decide if shape changed
+        shape_changed = len(current_shapes) != len(input_shapes)
+        for pre, cur in zip(current_shapes, input_shapes):
+            for k, v in pre.items():
+                if v != cur[k]:
+                    shape_changed = True
+
+        if shape_changed:
+            # self._curr_module.reshape(data_batch.provide_data, data_batch.provide_label)
+            module = Module(self._symbol, self._data_names, self._label_names,
+                            logger=self.logger, context=[self._context[i] for i in xrange(len(data_batch.provide_data))],
+                            work_load_list=self._work_load_list,
+                            fixed_param_names=self._fixed_param_names)
+            module.bind(data_batch.provide_data, data_batch.provide_label, self._curr_module.for_training,
+                        self._curr_module.inputs_need_grad, force_rebind=False,
+                        shared_module=self._curr_module)
+            self._curr_module = module
+
+        self._curr_module.forward(data_batch, is_train=is_train)
+
+    def backward(self, out_grads=None):
+        assert self.binded and self.params_initialized
+        self._curr_module.backward(out_grads=out_grads)
+
+    def update(self):
+        assert self.binded and self.params_initialized and self.optimizer_initialized
+        self._curr_module.update()
+
+    def get_outputs(self, merge_multi_context=True):
+        assert self.binded and self.params_initialized
+        return self._curr_module.get_outputs(merge_multi_context=merge_multi_context)
+    def get_input_grads(self, merge_multi_context=True):
+        assert self.binded and self.params_initialized and self.inputs_need_grad
+        return self._curr_module.get_input_grads(merge_multi_context=merge_multi_context)
+
+    def update_metric(self, eval_metric, labels):
+        assert self.binded and self.params_initialized
+        self._curr_module.update_metric(eval_metric, labels)
+
+    def install_monitor(self, mon):
+        """ Install monitor on all executors """
+        assert self.binded
+        self._curr_module.install_monitor(mon)
diff --git a/deeplab/core/tester.py b/deeplab/core/tester.py
new file mode 100644
index 0000000..7309cac
--- /dev/null
+++ b/deeplab/core/tester.py
@@ -0,0 +1,123 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import cPickle
+import os
+import time
+import mxnet as mx
+import numpy as np
+
+from PIL import Image
+from module import MutableModule
+from config.config import config
+from utils import image
+from utils.PrefetchingIter import PrefetchingIter
+
+
+class Predictor(object):
+    def __init__(self, symbol, data_names, label_names,
+                 context=mx.cpu(), max_data_shapes=None,
+                 provide_data=None, provide_label=None,
+                 arg_params=None, aux_params=None):
+        self._mod = MutableModule(symbol, data_names, label_names,
+                                  context=context, max_data_shapes=max_data_shapes)
+        self._mod.bind(provide_data, provide_label, for_training=False)
+        self._mod.init_params(arg_params=arg_params, aux_params=aux_params)
+
+    def predict(self, data_batch):
+        self._mod.forward(data_batch)
+        # [dict(zip(self._mod.output_names, _)) for _ in zip(*self._mod.get_outputs(merge_multi_context=False))]
+        return [dict(zip(self._mod.output_names, _)) for _ in zip(*self._mod.get_outputs(merge_multi_context=False))]
+
+def pred_eval(predictor, test_data, imdb, vis=False, ignore_cache=None, logger=None):
+    """
+    wrapper for calculating offline validation for faster data analysis
+    in this example, all thresholds are set by hand
+    :param predictor: Predictor
+    :param test_data: data iterator, must not be shuffled
+    :param imdb: image database
+    :param vis: controls visualization
+    :param ignore_cache: ignore the saved cache file
+    :param logger: the logger instance
+    :return:
+    """
+    res_file = os.path.join(imdb.result_path, imdb.name + '_segmentations.pkl')
+    if os.path.exists(res_file) and not ignore_cache:
+        with open(res_file, 'rb') as fid:
+            evaluation_results = cPickle.load(fid)
+        print 'evaluate segmentation: \n'
+        if logger:
+            logger.info('evaluate segmentation: \n')
+
+        meanIU = evaluation_results['meanIU']
+        IU_array = evaluation_results['IU_array']
+        print 'IU_array:\n'
+        if logger:
+            logger.info('IU_array:\n')
+        for i in range(len(IU_array)):
+            print '%.5f'%IU_array[i]
+            if logger:
+                logger.info('%.5f'%IU_array[i])
+        print 'meanIU:%.5f'%meanIU
+        if logger:
+            logger.info( 'meanIU:%.5f'%meanIU)
+        return
+
+    assert vis or not test_data.shuffle
+    if not isinstance(test_data, PrefetchingIter):
+        test_data = PrefetchingIter(test_data)
+
+    num_images = imdb.num_images
+    all_segmentation_result = [[] for _ in xrange(num_images)]
+    idx = 0
+
+    data_time, net_time, post_time = 0.0, 0.0, 0.0
+    t = time.time()
+    for data_batch in test_data:
+        t1 = time.time() - t
+        t = time.time()
+        output_all = predictor.predict(data_batch)
+        output_all = [mx.ndarray.argmax(output['softmax_output'], axis=1).asnumpy() for output in output_all]
+        t2 = time.time() - t
+        t = time.time()
+
+        all_segmentation_result[idx: idx+test_data.batch_size] = [output.astype('int8') for output in output_all]
+
+        idx += test_data.batch_size
+        t3 = time.time() - t
+        t = time.time()
+
+        data_time += t1
+        net_time += t2
+        post_time += t3
+        print 'testing {}/{} data {:.4f}s net {:.4f}s post {:.4f}s'.format(idx, imdb.num_images, data_time / idx * test_data.batch_size, net_time / idx * test_data.batch_size, post_time / idx * test_data.batch_size)
+        if logger:
+            logger.info('testing {}/{} data {:.4f}s net {:.4f}s post {:.4f}s'.format(idx, imdb.num_images, data_time / idx * test_data.batch_size, net_time / idx * test_data.batch_size, post_time / idx * test_data.batch_size))
+
+    evaluation_results = imdb.evaluate_segmentations(all_segmentation_result)
+
+    if not os.path.exists(res_file) or ignore_cache:
+        with open(res_file, 'wb') as f:
+            cPickle.dump(evaluation_results, f, protocol=cPickle.HIGHEST_PROTOCOL)
+
+    print 'evaluate segmentation: \n'
+    if logger:
+        logger.info('evaluate segmentation: \n')
+
+    meanIU = evaluation_results['meanIU']
+    IU_array = evaluation_results['IU_array']
+    print 'IU_array:\n'
+    if logger:
+        logger.info('IU_array:\n')
+    for i in range(len(IU_array)):
+        print '%.5f'%IU_array[i]
+        if logger:
+            logger.info('%.5f'%IU_array[i])
+    print 'meanIU:%.5f'%meanIU
+    if logger:
+        logger.info( 'meanIU:%.5f'%meanIU)
diff --git a/deeplab/demo.py b/deeplab/demo.py
new file mode 100644
index 0000000..9ce80b3
--- /dev/null
+++ b/deeplab/demo.py
@@ -0,0 +1,165 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
+import _init_paths
+
+import argparse
+import os
+import sys
+import logging
+import pprint
+import cv2
+from config.config import config, update_config
+from utils.image import resize, transform
+from PIL import Image
+import numpy as np
+
+# get config
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+cur_path = os.path.abspath(os.path.dirname(__file__))
+update_config(cur_path + '/../experiments/deeplab/cfgs/deeplab_cityscapes_demo.yaml')
+
+sys.path.insert(0, os.path.join(cur_path, '../external/mxnet', config.MXNET_VERSION))
+import mxnet as mx
+from core.tester import pred_eval, Predictor
+from symbols import *
+from utils.load_model import load_param
+from utils.tictoc import tic, toc
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Show Deformable ConvNets demo')
+    # general
+    parser.add_argument('--deeplab_only', help='whether to use DeepLab only (w/o Deformable ConvNets)', default=False, action='store_true')
+
+    args = parser.parse_args()
+    return args
+
+args = parse_args()
+
+def getpallete(num_cls):
+    """
+    this function is to get the colormap for visualizing the segmentation mask
+    :param num_cls: the number of visualized classes
+    :return: the pallete
+    """
+    n = num_cls
+    pallete_raw = np.zeros((n, 3)).astype('uint8')
+    pallete = np.zeros((n, 3)).astype('uint8')
+
+    pallete_raw[6, :] =  [111,  74,   0]
+    pallete_raw[7, :] =  [ 81,   0,  81]
+    pallete_raw[8, :] =  [128,  64, 128]
+    pallete_raw[9, :] =  [244,  35, 232]
+    pallete_raw[10, :] =  [250, 170, 160]
+    pallete_raw[11, :] = [230, 150, 140]
+    pallete_raw[12, :] = [ 70,  70,  70]
+    pallete_raw[13, :] = [102, 102, 156]
+    pallete_raw[14, :] = [190, 153, 153]
+    pallete_raw[15, :] = [180, 165, 180]
+    pallete_raw[16, :] = [150, 100, 100]
+    pallete_raw[17, :] = [150, 120,  90]
+    pallete_raw[18, :] = [153, 153, 153]
+    pallete_raw[19, :] = [153, 153, 153]
+    pallete_raw[20, :] = [250, 170,  30]
+    pallete_raw[21, :] = [220, 220,   0]
+    pallete_raw[22, :] = [107, 142,  35]
+    pallete_raw[23, :] = [152, 251, 152]
+    pallete_raw[24, :] = [ 70, 130, 180]
+    pallete_raw[25, :] = [220,  20,  60]
+    pallete_raw[26, :] = [255,   0,   0]
+    pallete_raw[27, :] = [  0,   0, 142]
+    pallete_raw[28, :] = [  0,   0,  70]
+    pallete_raw[29, :] = [  0,  60, 100]
+    pallete_raw[30, :] = [  0,   0,  90]
+    pallete_raw[31, :] = [  0,   0, 110]
+    pallete_raw[32, :] = [  0,  80, 100]
+    pallete_raw[33, :] = [  0,   0, 230]
+    pallete_raw[34, :] = [119,  11,  32]
+
+    train2regular = [7, 8, 11, 12, 13, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 31, 32, 33]
+
+    for i in range(len(train2regular)):
+        pallete[i, :] = pallete_raw[train2regular[i]+1, :]
+
+    pallete = pallete.reshape(-1)
+
+    return pallete
+
+def main():
+    # get symbol
+    pprint.pprint(config)
+    config.symbol = 'resnet_v1_101_deeplab_dcn' if not args.deeplab_only else 'resnet_v1_101_deeplab'
+    sym_instance = eval(config.symbol + '.' + config.symbol)()
+    sym = sym_instance.get_symbol(config, is_train=False)
+
+    # set up class names
+    num_classes = 19
+
+    # load demo data
+    image_names = ['frankfurt_000001_073088_leftImg8bit.png', 'lindau_000024_000019_leftImg8bit.png']
+    data = []
+    for im_name in image_names:
+        assert os.path.exists(cur_path + '/../demo/' + im_name), ('{} does not exist'.format('../demo/' + im_name))
+        im = cv2.imread(cur_path + '/../demo/' + im_name, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)
+        target_size = config.SCALES[0][0]
+        max_size = config.SCALES[0][1]
+        im, im_scale = resize(im, target_size, max_size, stride=config.network.IMAGE_STRIDE)
+        im_tensor = transform(im, config.network.PIXEL_MEANS)
+        im_info = np.array([[im_tensor.shape[2], im_tensor.shape[3], im_scale]], dtype=np.float32)
+        data.append({'data': im_tensor, 'im_info': im_info})
+
+
+    # get predictor
+    data_names = ['data']
+    label_names = ['softmax_label']
+    data = [[mx.nd.array(data[i][name]) for name in data_names] for i in xrange(len(data))]
+    max_data_shape = [[('data', (1, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]]
+    provide_data = [[(k, v.shape) for k, v in zip(data_names, data[i])] for i in xrange(len(data))]
+    provide_label = [None for i in xrange(len(data))]
+    arg_params, aux_params = load_param(cur_path + '/../model/' + ('deeplab_dcn_cityscapes' if not args.deeplab_only else 'deeplab_cityscapes'), 0, process=True)
+    predictor = Predictor(sym, data_names, label_names,
+                          context=[mx.gpu(0)], max_data_shapes=max_data_shape,
+                          provide_data=provide_data, provide_label=provide_label,
+                          arg_params=arg_params, aux_params=aux_params)
+
+    # warm up
+    for j in xrange(2):
+        data_batch = mx.io.DataBatch(data=[data[0]], label=[], pad=0, index=0,
+                                     provide_data=[[(k, v.shape) for k, v in zip(data_names, data[0])]],
+                                     provide_label=[None])
+        output_all = predictor.predict(data_batch)
+        output_all = [mx.ndarray.argmax(output['softmax_output'], axis=1).asnumpy() for output in output_all]
+
+    # test
+    for idx, im_name in enumerate(image_names):
+        data_batch = mx.io.DataBatch(data=[data[idx]], label=[], pad=0, index=idx,
+                                     provide_data=[[(k, v.shape) for k, v in zip(data_names, data[idx])]],
+                                     provide_label=[None])
+
+        tic()
+        output_all = predictor.predict(data_batch)
+        output_all = [mx.ndarray.argmax(output['softmax_output'], axis=1).asnumpy() for output in output_all]
+        pallete = getpallete(256)
+
+        segmentation_result = np.uint8(np.squeeze(output_all))
+        segmentation_result = Image.fromarray(segmentation_result)
+        segmentation_result.putpalette(pallete)
+        print 'testing {} {:.4f}s'.format(im_name, toc())
+        pure_im_name, ext_im_name = os.path.splitext(im_name)
+        segmentation_result.save(cur_path + '/../demo/seg_' + pure_im_name + '.png')
+        # visualize
+        im_raw = cv2.imread(cur_path + '/../demo/' + im_name)
+        seg_res = cv2.imread(cur_path + '/../demo/seg_' + pure_im_name + '.png')
+        cv2.imshow('Raw Image', im_raw)
+        cv2.imshow('segmentation_result', seg_res)
+        cv2.waitKey(0)
+    print 'done'
+
+if __name__ == '__main__':
+    main()
diff --git a/deeplab/function/__init__.py b/deeplab/function/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/deeplab/function/reeval.py b/deeplab/function/reeval.py
new file mode 100644
index 0000000..0fdd77f
--- /dev/null
+++ b/deeplab/function/reeval.py
@@ -0,0 +1,54 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import argparse
+import cPickle
+import os
+import mxnet as mx
+
+from config.config import config, generate_config
+from dataset import *
+
+
+def reeval(args):
+    # load imdb
+    imdb = eval(args.dataset)(args.image_set, args.root_path, args.dataset_path)
+
+    # load detection results
+    cache_file = os.path.join(imdb.cache_path, imdb.name, 'detections.pkl')
+    with open(cache_file, 'rb') as f:
+        detections = cPickle.load(f)
+
+    # eval
+    imdb.evaluate_detections(detections)
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='imdb test')
+    # general
+    parser.add_argument('--network', help='network name', default=default.network, type=str)
+    parser.add_argument('--dataset', help='dataset name', default=default.dataset, type=str)
+    args, rest = parser.parse_known_args()
+    generate_config(args.network, args.dataset)
+    parser.add_argument('--image_set', help='image_set name', default=default.image_set, type=str)
+    parser.add_argument('--root_path', help='output data folder', default=default.root_path, type=str)
+    parser.add_argument('--dataset_path', help='dataset path', default=default.dataset_path, type=str)
+    # other
+    parser.add_argument('--no_shuffle', help='disable random shuffle', action='store_true')
+    args = parser.parse_args()
+    return args
+
+
+def main():
+    args = parse_args()
+    print 'Called with argument:', args
+    reeval(args)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/deeplab/function/test_deeplab.py b/deeplab/function/test_deeplab.py
new file mode 100644
index 0000000..0853c41
--- /dev/null
+++ b/deeplab/function/test_deeplab.py
@@ -0,0 +1,78 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Zheng Zhang
+# --------------------------------------------------------
+
+import argparse
+import pprint
+import logging
+import time
+import os
+import mxnet as mx
+
+from config.config import config, generate_config, update_config
+from config.dataset_conf import dataset
+from config.network_conf import network
+from symbols import *
+from dataset import *
+from core.loader import TestDataLoader
+from core.tester import Predictor, pred_eval
+from utils.load_model import load_param
+
+def test_deeplab(network, dataset, image_set, root_path, dataset_path,
+              ctx, prefix, epoch,
+              vis, logger=None, output_path=None):
+    if not logger:
+        assert False, 'require a logger'
+
+    # print config
+    pprint.pprint(config)
+    logger.info('testing config:{}\n'.format(pprint.pformat(config)))
+
+    # load symbol and testing data
+    sym = eval('get_' + network + '_test')(num_classes=config.dataset.NUM_CLASSES)
+    imdb = eval(dataset)(image_set, root_path, dataset_path, result_path=output_path)
+    segdb = imdb.gt_segdb()
+
+    # get test data iter
+    test_data = TestDataLoader(segdb, batch_size=len(ctx))
+
+    # load model
+    # arg_params, aux_params = load_param(prefix, epoch, convert=True, ctx=ctx, process=True)
+    arg_params, aux_params = load_param(prefix, epoch, process=True)
+
+    # infer shape
+    data_shape_dict = dict(test_data.provide_data_single)
+    arg_shape, _, aux_shape = sym.infer_shape(**data_shape_dict)
+    arg_shape_dict = dict(zip(sym.list_arguments(), arg_shape))
+    aux_shape_dict = dict(zip(sym.list_auxiliary_states(), aux_shape))
+
+    # check parameters
+    for k in sym.list_arguments():
+        if k in data_shape_dict or k in ['softmax_label']:
+            continue
+        assert k in arg_params, k + ' not initialized'
+        assert arg_params[k].shape == arg_shape_dict[k], \
+            'shape inconsistent for ' + k + ' inferred ' + str(arg_shape_dict[k]) + ' provided ' + str(arg_params[k].shape)
+    for k in sym.list_auxiliary_states():
+        assert k in aux_params, k + ' not initialized'
+        assert aux_params[k].shape == aux_shape_dict[k], \
+            'shape inconsistent for ' + k + ' inferred ' + str(aux_shape_dict[k]) + ' provided ' + str(aux_params[k].shape)
+
+    # decide maximum shape
+    data_names = [k[0] for k in test_data.provide_data_single]
+    label_names = ['softmax_label']
+    max_data_shape = [[('data', (1, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]]
+
+    # create predictor
+    predictor = Predictor(sym, data_names, label_names,
+                          context=ctx, max_data_shapes=max_data_shape,
+                          provide_data=test_data.provide_data, provide_label=test_data.provide_label,
+                          arg_params=arg_params, aux_params=aux_params)
+
+    # start segmentation evaluation
+    pred_eval(predictor, test_data, imdb, vis=vis, logger=logger)
+
diff --git a/deeplab/symbols/__init__.py b/deeplab/symbols/__init__.py
new file mode 100644
index 0000000..54c71c0
--- /dev/null
+++ b/deeplab/symbols/__init__.py
@@ -0,0 +1,2 @@
+import resnet_v1_101_deeplab
+import resnet_v1_101_deeplab_dcn
diff --git a/deeplab/symbols/resnet_v1_101_deeplab.py b/deeplab/symbols/resnet_v1_101_deeplab.py
new file mode 100644
index 0000000..fc41aee
--- /dev/null
+++ b/deeplab/symbols/resnet_v1_101_deeplab.py
@@ -0,0 +1,828 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
+import cPickle
+import mxnet as mx
+from utils.symbol import Symbol
+
+class resnet_v1_101_deeplab(Symbol):
+    def __init__(self):
+        """
+        Use __init__ to define the parameters the network needs
+        """
+        self.eps = 1e-5
+        self.use_global_stats = True
+        self.workspace = 4096
+        self.units = (3, 4, 23, 3) # residual units per stage for ResNet-101
+        self.filter_list = [256, 512, 1024, 2048]
+
+    def get_resnet_conv(self, data):
+        conv1 = mx.symbol.Convolution(name='conv1', data=data, num_filter=64, pad=(3, 3), kernel=(7, 7), stride=(2, 2),
+                                      no_bias=True)
+        bn_conv1 = mx.symbol.BatchNorm(name='bn_conv1', data=conv1, use_global_stats=True, fix_gamma=False, eps = self.eps)
+        scale_conv1 = bn_conv1
+        conv1_relu = mx.symbol.Activation(name='conv1_relu', data=scale_conv1, act_type='relu')
+        pool1 = mx.symbol.Pooling(name='pool1', data=conv1_relu, pooling_convention='full', pad=(0, 0), kernel=(3, 3),
+                                  stride=(2, 2), pool_type='max')
+        res2a_branch1 = mx.symbol.Convolution(name='res2a_branch1', data=pool1, num_filter=256, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch1 = mx.symbol.BatchNorm(name='bn2a_branch1', data=res2a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps = self.eps)
+        scale2a_branch1 = bn2a_branch1
+        res2a_branch2a = mx.symbol.Convolution(name='res2a_branch2a', data=pool1, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2a = mx.symbol.BatchNorm(name='bn2a_branch2a', data=res2a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2a_branch2a = bn2a_branch2a
+        res2a_branch2a_relu = mx.symbol.Activation(name='res2a_branch2a_relu', data=scale2a_branch2a, act_type='relu')
+        res2a_branch2b = mx.symbol.Convolution(name='res2a_branch2b', data=res2a_branch2a_relu, num_filter=64,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2a_branch2b = mx.symbol.BatchNorm(name='bn2a_branch2b', data=res2a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2a_branch2b = bn2a_branch2b
+        res2a_branch2b_relu = mx.symbol.Activation(name='res2a_branch2b_relu', data=scale2a_branch2b, act_type='relu')
+        res2a_branch2c = mx.symbol.Convolution(name='res2a_branch2c', data=res2a_branch2b_relu, num_filter=256,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2c = mx.symbol.BatchNorm(name='bn2a_branch2c', data=res2a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2a_branch2c = bn2a_branch2c
+        res2a = mx.symbol.broadcast_add(name='res2a', *[scale2a_branch1, scale2a_branch2c])
+        res2a_relu = mx.symbol.Activation(name='res2a_relu', data=res2a, act_type='relu')
+        res2b_branch2a = mx.symbol.Convolution(name='res2b_branch2a', data=res2a_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2a = mx.symbol.BatchNorm(name='bn2b_branch2a', data=res2b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2b_branch2a = bn2b_branch2a
+        res2b_branch2a_relu = mx.symbol.Activation(name='res2b_branch2a_relu', data=scale2b_branch2a, act_type='relu')
+        res2b_branch2b = mx.symbol.Convolution(name='res2b_branch2b', data=res2b_branch2a_relu, num_filter=64,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2b_branch2b = mx.symbol.BatchNorm(name='bn2b_branch2b', data=res2b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2b_branch2b = bn2b_branch2b
+        res2b_branch2b_relu = mx.symbol.Activation(name='res2b_branch2b_relu', data=scale2b_branch2b, act_type='relu')
+        res2b_branch2c = mx.symbol.Convolution(name='res2b_branch2c', data=res2b_branch2b_relu, num_filter=256,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2c = mx.symbol.BatchNorm(name='bn2b_branch2c', data=res2b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2b_branch2c = bn2b_branch2c
+        res2b = mx.symbol.broadcast_add(name='res2b', *[res2a_relu, scale2b_branch2c])
+        res2b_relu = mx.symbol.Activation(name='res2b_relu', data=res2b, act_type='relu')
+        res2c_branch2a = mx.symbol.Convolution(name='res2c_branch2a', data=res2b_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2a = mx.symbol.BatchNorm(name='bn2c_branch2a', data=res2c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2c_branch2a = bn2c_branch2a
+        res2c_branch2a_relu = mx.symbol.Activation(name='res2c_branch2a_relu', data=scale2c_branch2a, act_type='relu')
+        res2c_branch2b = mx.symbol.Convolution(name='res2c_branch2b', data=res2c_branch2a_relu, num_filter=64,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2c_branch2b = mx.symbol.BatchNorm(name='bn2c_branch2b', data=res2c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2c_branch2b = bn2c_branch2b
+        res2c_branch2b_relu = mx.symbol.Activation(name='res2c_branch2b_relu', data=scale2c_branch2b, act_type='relu')
+        res2c_branch2c = mx.symbol.Convolution(name='res2c_branch2c', data=res2c_branch2b_relu, num_filter=256,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2c = mx.symbol.BatchNorm(name='bn2c_branch2c', data=res2c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2c_branch2c = bn2c_branch2c
+        res2c = mx.symbol.broadcast_add(name='res2c', *[res2b_relu, scale2c_branch2c])
+        res2c_relu = mx.symbol.Activation(name='res2c_relu', data=res2c, act_type='relu')
+        res3a_branch1 = mx.symbol.Convolution(name='res3a_branch1', data=res2c_relu, num_filter=512, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch1 = mx.symbol.BatchNorm(name='bn3a_branch1', data=res3a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps = self.eps)
+        scale3a_branch1 = bn3a_branch1
+        res3a_branch2a = mx.symbol.Convolution(name='res3a_branch2a', data=res2c_relu, num_filter=128, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch2a = mx.symbol.BatchNorm(name='bn3a_branch2a', data=res3a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale3a_branch2a = bn3a_branch2a
+        res3a_branch2a_relu = mx.symbol.Activation(name='res3a_branch2a_relu', data=scale3a_branch2a, act_type='relu')
+        res3a_branch2b = mx.symbol.Convolution(name='res3a_branch2b', data=res3a_branch2a_relu, num_filter=128,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3a_branch2b = mx.symbol.BatchNorm(name='bn3a_branch2b', data=res3a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale3a_branch2b = bn3a_branch2b
+        res3a_branch2b_relu = mx.symbol.Activation(name='res3a_branch2b_relu', data=scale3a_branch2b, act_type='relu')
+        res3a_branch2c = mx.symbol.Convolution(name='res3a_branch2c', data=res3a_branch2b_relu, num_filter=512,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3a_branch2c = mx.symbol.BatchNorm(name='bn3a_branch2c', data=res3a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale3a_branch2c = bn3a_branch2c
+        res3a = mx.symbol.broadcast_add(name='res3a', *[scale3a_branch1, scale3a_branch2c])
+        res3a_relu = mx.symbol.Activation(name='res3a_relu', data=res3a, act_type='relu')
+        res3b1_branch2a = mx.symbol.Convolution(name='res3b1_branch2a', data=res3a_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2a = mx.symbol.BatchNorm(name='bn3b1_branch2a', data=res3b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b1_branch2a = bn3b1_branch2a
+        res3b1_branch2a_relu = mx.symbol.Activation(name='res3b1_branch2a_relu', data=scale3b1_branch2a,
+                                                    act_type='relu')
+        res3b1_branch2b = mx.symbol.Convolution(name='res3b1_branch2b', data=res3b1_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b1_branch2b = mx.symbol.BatchNorm(name='bn3b1_branch2b', data=res3b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b1_branch2b = bn3b1_branch2b
+        res3b1_branch2b_relu = mx.symbol.Activation(name='res3b1_branch2b_relu', data=scale3b1_branch2b,
+                                                    act_type='relu')
+        res3b1_branch2c = mx.symbol.Convolution(name='res3b1_branch2c', data=res3b1_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2c = mx.symbol.BatchNorm(name='bn3b1_branch2c', data=res3b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b1_branch2c = bn3b1_branch2c
+        res3b1 = mx.symbol.broadcast_add(name='res3b1', *[res3a_relu, scale3b1_branch2c])
+        res3b1_relu = mx.symbol.Activation(name='res3b1_relu', data=res3b1, act_type='relu')
+        res3b2_branch2a = mx.symbol.Convolution(name='res3b2_branch2a', data=res3b1_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2a = mx.symbol.BatchNorm(name='bn3b2_branch2a', data=res3b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b2_branch2a = bn3b2_branch2a
+        res3b2_branch2a_relu = mx.symbol.Activation(name='res3b2_branch2a_relu', data=scale3b2_branch2a,
+                                                    act_type='relu')
+        res3b2_branch2b = mx.symbol.Convolution(name='res3b2_branch2b', data=res3b2_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b2_branch2b = mx.symbol.BatchNorm(name='bn3b2_branch2b', data=res3b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b2_branch2b = bn3b2_branch2b
+        res3b2_branch2b_relu = mx.symbol.Activation(name='res3b2_branch2b_relu', data=scale3b2_branch2b,
+                                                    act_type='relu')
+        res3b2_branch2c = mx.symbol.Convolution(name='res3b2_branch2c', data=res3b2_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2c = mx.symbol.BatchNorm(name='bn3b2_branch2c', data=res3b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b2_branch2c = bn3b2_branch2c
+        res3b2 = mx.symbol.broadcast_add(name='res3b2', *[res3b1_relu, scale3b2_branch2c])
+        res3b2_relu = mx.symbol.Activation(name='res3b2_relu', data=res3b2, act_type='relu')
+        res3b3_branch2a = mx.symbol.Convolution(name='res3b3_branch2a', data=res3b2_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2a = mx.symbol.BatchNorm(name='bn3b3_branch2a', data=res3b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b3_branch2a = bn3b3_branch2a
+        res3b3_branch2a_relu = mx.symbol.Activation(name='res3b3_branch2a_relu', data=scale3b3_branch2a,
+                                                    act_type='relu')
+        res3b3_branch2b = mx.symbol.Convolution(name='res3b3_branch2b', data=res3b3_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b3_branch2b = mx.symbol.BatchNorm(name='bn3b3_branch2b', data=res3b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b3_branch2b = bn3b3_branch2b
+        res3b3_branch2b_relu = mx.symbol.Activation(name='res3b3_branch2b_relu', data=scale3b3_branch2b,
+                                                    act_type='relu')
+        res3b3_branch2c = mx.symbol.Convolution(name='res3b3_branch2c', data=res3b3_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2c = mx.symbol.BatchNorm(name='bn3b3_branch2c', data=res3b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b3_branch2c = bn3b3_branch2c
+        res3b3 = mx.symbol.broadcast_add(name='res3b3', *[res3b2_relu, scale3b3_branch2c])
+        res3b3_relu = mx.symbol.Activation(name='res3b3_relu', data=res3b3, act_type='relu')
+        res4a_branch1 = mx.symbol.Convolution(name='res4a_branch1', data=res3b3_relu, num_filter=1024, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch1 = mx.symbol.BatchNorm(name='bn4a_branch1', data=res4a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps = self.eps)
+        scale4a_branch1 = bn4a_branch1
+        res4a_branch2a = mx.symbol.Convolution(name='res4a_branch2a', data=res3b3_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch2a = mx.symbol.BatchNorm(name='bn4a_branch2a', data=res4a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale4a_branch2a = bn4a_branch2a
+        res4a_branch2a_relu = mx.symbol.Activation(name='res4a_branch2a_relu', data=scale4a_branch2a, act_type='relu')
+        res4a_branch2b = mx.symbol.Convolution(name='res4a_branch2b', data=res4a_branch2a_relu, num_filter=256,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4a_branch2b = mx.symbol.BatchNorm(name='bn4a_branch2b', data=res4a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale4a_branch2b = bn4a_branch2b
+        res4a_branch2b_relu = mx.symbol.Activation(name='res4a_branch2b_relu', data=scale4a_branch2b, act_type='relu')
+        res4a_branch2c = mx.symbol.Convolution(name='res4a_branch2c', data=res4a_branch2b_relu, num_filter=1024,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4a_branch2c = mx.symbol.BatchNorm(name='bn4a_branch2c', data=res4a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale4a_branch2c = bn4a_branch2c
+        res4a = mx.symbol.broadcast_add(name='res4a', *[scale4a_branch1, scale4a_branch2c])
+        res4a_relu = mx.symbol.Activation(name='res4a_relu', data=res4a, act_type='relu')
+        res4b1_branch2a = mx.symbol.Convolution(name='res4b1_branch2a', data=res4a_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2a = mx.symbol.BatchNorm(name='bn4b1_branch2a', data=res4b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b1_branch2a = bn4b1_branch2a
+        res4b1_branch2a_relu = mx.symbol.Activation(name='res4b1_branch2a_relu', data=scale4b1_branch2a,
+                                                    act_type='relu')
+        res4b1_branch2b = mx.symbol.Convolution(name='res4b1_branch2b', data=res4b1_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b1_branch2b = mx.symbol.BatchNorm(name='bn4b1_branch2b', data=res4b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b1_branch2b = bn4b1_branch2b
+        res4b1_branch2b_relu = mx.symbol.Activation(name='res4b1_branch2b_relu', data=scale4b1_branch2b,
+                                                    act_type='relu')
+        res4b1_branch2c = mx.symbol.Convolution(name='res4b1_branch2c', data=res4b1_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2c = mx.symbol.BatchNorm(name='bn4b1_branch2c', data=res4b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b1_branch2c = bn4b1_branch2c
+        res4b1 = mx.symbol.broadcast_add(name='res4b1', *[res4a_relu, scale4b1_branch2c])
+        res4b1_relu = mx.symbol.Activation(name='res4b1_relu', data=res4b1, act_type='relu')
+        res4b2_branch2a = mx.symbol.Convolution(name='res4b2_branch2a', data=res4b1_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2a = mx.symbol.BatchNorm(name='bn4b2_branch2a', data=res4b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b2_branch2a = bn4b2_branch2a
+        res4b2_branch2a_relu = mx.symbol.Activation(name='res4b2_branch2a_relu', data=scale4b2_branch2a,
+                                                    act_type='relu')
+        res4b2_branch2b = mx.symbol.Convolution(name='res4b2_branch2b', data=res4b2_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b2_branch2b = mx.symbol.BatchNorm(name='bn4b2_branch2b', data=res4b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b2_branch2b = bn4b2_branch2b
+        res4b2_branch2b_relu = mx.symbol.Activation(name='res4b2_branch2b_relu', data=scale4b2_branch2b,
+                                                    act_type='relu')
+        res4b2_branch2c = mx.symbol.Convolution(name='res4b2_branch2c', data=res4b2_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2c = mx.symbol.BatchNorm(name='bn4b2_branch2c', data=res4b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b2_branch2c = bn4b2_branch2c
+        res4b2 = mx.symbol.broadcast_add(name='res4b2', *[res4b1_relu, scale4b2_branch2c])
+        res4b2_relu = mx.symbol.Activation(name='res4b2_relu', data=res4b2, act_type='relu')
+        res4b3_branch2a = mx.symbol.Convolution(name='res4b3_branch2a', data=res4b2_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2a = mx.symbol.BatchNorm(name='bn4b3_branch2a', data=res4b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b3_branch2a = bn4b3_branch2a
+        res4b3_branch2a_relu = mx.symbol.Activation(name='res4b3_branch2a_relu', data=scale4b3_branch2a,
+                                                    act_type='relu')
+        res4b3_branch2b = mx.symbol.Convolution(name='res4b3_branch2b', data=res4b3_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b3_branch2b = mx.symbol.BatchNorm(name='bn4b3_branch2b', data=res4b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b3_branch2b = bn4b3_branch2b
+        res4b3_branch2b_relu = mx.symbol.Activation(name='res4b3_branch2b_relu', data=scale4b3_branch2b,
+                                                    act_type='relu')
+        res4b3_branch2c = mx.symbol.Convolution(name='res4b3_branch2c', data=res4b3_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2c = mx.symbol.BatchNorm(name='bn4b3_branch2c', data=res4b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b3_branch2c = bn4b3_branch2c
+        res4b3 = mx.symbol.broadcast_add(name='res4b3', *[res4b2_relu, scale4b3_branch2c])
+        res4b3_relu = mx.symbol.Activation(name='res4b3_relu', data=res4b3, act_type='relu')
+        res4b4_branch2a = mx.symbol.Convolution(name='res4b4_branch2a', data=res4b3_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2a = mx.symbol.BatchNorm(name='bn4b4_branch2a', data=res4b4_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b4_branch2a = bn4b4_branch2a
+        res4b4_branch2a_relu = mx.symbol.Activation(name='res4b4_branch2a_relu', data=scale4b4_branch2a,
+                                                    act_type='relu')
+        res4b4_branch2b = mx.symbol.Convolution(name='res4b4_branch2b', data=res4b4_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b4_branch2b = mx.symbol.BatchNorm(name='bn4b4_branch2b', data=res4b4_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b4_branch2b = bn4b4_branch2b
+        res4b4_branch2b_relu = mx.symbol.Activation(name='res4b4_branch2b_relu', data=scale4b4_branch2b,
+                                                    act_type='relu')
+        res4b4_branch2c = mx.symbol.Convolution(name='res4b4_branch2c', data=res4b4_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2c = mx.symbol.BatchNorm(name='bn4b4_branch2c', data=res4b4_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b4_branch2c = bn4b4_branch2c
+        res4b4 = mx.symbol.broadcast_add(name='res4b4', *[res4b3_relu, scale4b4_branch2c])
+        res4b4_relu = mx.symbol.Activation(name='res4b4_relu', data=res4b4, act_type='relu')
+        res4b5_branch2a = mx.symbol.Convolution(name='res4b5_branch2a', data=res4b4_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2a = mx.symbol.BatchNorm(name='bn4b5_branch2a', data=res4b5_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b5_branch2a = bn4b5_branch2a
+        res4b5_branch2a_relu = mx.symbol.Activation(name='res4b5_branch2a_relu', data=scale4b5_branch2a,
+                                                    act_type='relu')
+        res4b5_branch2b = mx.symbol.Convolution(name='res4b5_branch2b', data=res4b5_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b5_branch2b = mx.symbol.BatchNorm(name='bn4b5_branch2b', data=res4b5_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b5_branch2b = bn4b5_branch2b
+        res4b5_branch2b_relu = mx.symbol.Activation(name='res4b5_branch2b_relu', data=scale4b5_branch2b,
+                                                    act_type='relu')
+        res4b5_branch2c = mx.symbol.Convolution(name='res4b5_branch2c', data=res4b5_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2c = mx.symbol.BatchNorm(name='bn4b5_branch2c', data=res4b5_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b5_branch2c = bn4b5_branch2c
+        res4b5 = mx.symbol.broadcast_add(name='res4b5', *[res4b4_relu, scale4b5_branch2c])
+        res4b5_relu = mx.symbol.Activation(name='res4b5_relu', data=res4b5, act_type='relu')
+        res4b6_branch2a = mx.symbol.Convolution(name='res4b6_branch2a', data=res4b5_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2a = mx.symbol.BatchNorm(name='bn4b6_branch2a', data=res4b6_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b6_branch2a = bn4b6_branch2a
+        res4b6_branch2a_relu = mx.symbol.Activation(name='res4b6_branch2a_relu', data=scale4b6_branch2a,
+                                                    act_type='relu')
+        res4b6_branch2b = mx.symbol.Convolution(name='res4b6_branch2b', data=res4b6_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b6_branch2b = mx.symbol.BatchNorm(name='bn4b6_branch2b', data=res4b6_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b6_branch2b = bn4b6_branch2b
+        res4b6_branch2b_relu = mx.symbol.Activation(name='res4b6_branch2b_relu', data=scale4b6_branch2b,
+                                                    act_type='relu')
+        res4b6_branch2c = mx.symbol.Convolution(name='res4b6_branch2c', data=res4b6_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2c = mx.symbol.BatchNorm(name='bn4b6_branch2c', data=res4b6_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b6_branch2c = bn4b6_branch2c
+        res4b6 = mx.symbol.broadcast_add(name='res4b6', *[res4b5_relu, scale4b6_branch2c])
+        res4b6_relu = mx.symbol.Activation(name='res4b6_relu', data=res4b6, act_type='relu')
+        res4b7_branch2a = mx.symbol.Convolution(name='res4b7_branch2a', data=res4b6_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2a = mx.symbol.BatchNorm(name='bn4b7_branch2a', data=res4b7_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b7_branch2a = bn4b7_branch2a
+        res4b7_branch2a_relu = mx.symbol.Activation(name='res4b7_branch2a_relu', data=scale4b7_branch2a,
+                                                    act_type='relu')
+        res4b7_branch2b = mx.symbol.Convolution(name='res4b7_branch2b', data=res4b7_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b7_branch2b = mx.symbol.BatchNorm(name='bn4b7_branch2b', data=res4b7_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b7_branch2b = bn4b7_branch2b
+        res4b7_branch2b_relu = mx.symbol.Activation(name='res4b7_branch2b_relu', data=scale4b7_branch2b,
+                                                    act_type='relu')
+        res4b7_branch2c = mx.symbol.Convolution(name='res4b7_branch2c', data=res4b7_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2c = mx.symbol.BatchNorm(name='bn4b7_branch2c', data=res4b7_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b7_branch2c = bn4b7_branch2c
+        res4b7 = mx.symbol.broadcast_add(name='res4b7', *[res4b6_relu, scale4b7_branch2c])
+        res4b7_relu = mx.symbol.Activation(name='res4b7_relu', data=res4b7, act_type='relu')
+        res4b8_branch2a = mx.symbol.Convolution(name='res4b8_branch2a', data=res4b7_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2a = mx.symbol.BatchNorm(name='bn4b8_branch2a', data=res4b8_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b8_branch2a = bn4b8_branch2a
+        res4b8_branch2a_relu = mx.symbol.Activation(name='res4b8_branch2a_relu', data=scale4b8_branch2a,
+                                                    act_type='relu')
+        res4b8_branch2b = mx.symbol.Convolution(name='res4b8_branch2b', data=res4b8_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b8_branch2b = mx.symbol.BatchNorm(name='bn4b8_branch2b', data=res4b8_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b8_branch2b = bn4b8_branch2b
+        res4b8_branch2b_relu = mx.symbol.Activation(name='res4b8_branch2b_relu', data=scale4b8_branch2b,
+                                                    act_type='relu')
+        res4b8_branch2c = mx.symbol.Convolution(name='res4b8_branch2c', data=res4b8_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2c = mx.symbol.BatchNorm(name='bn4b8_branch2c', data=res4b8_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b8_branch2c = bn4b8_branch2c
+        res4b8 = mx.symbol.broadcast_add(name='res4b8', *[res4b7_relu, scale4b8_branch2c])
+        res4b8_relu = mx.symbol.Activation(name='res4b8_relu', data=res4b8, act_type='relu')
+        res4b9_branch2a = mx.symbol.Convolution(name='res4b9_branch2a', data=res4b8_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2a = mx.symbol.BatchNorm(name='bn4b9_branch2a', data=res4b9_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b9_branch2a = bn4b9_branch2a
+        res4b9_branch2a_relu = mx.symbol.Activation(name='res4b9_branch2a_relu', data=scale4b9_branch2a,
+                                                    act_type='relu')
+        res4b9_branch2b = mx.symbol.Convolution(name='res4b9_branch2b', data=res4b9_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b9_branch2b = mx.symbol.BatchNorm(name='bn4b9_branch2b', data=res4b9_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b9_branch2b = bn4b9_branch2b
+        res4b9_branch2b_relu = mx.symbol.Activation(name='res4b9_branch2b_relu', data=scale4b9_branch2b,
+                                                    act_type='relu')
+        res4b9_branch2c = mx.symbol.Convolution(name='res4b9_branch2c', data=res4b9_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2c = mx.symbol.BatchNorm(name='bn4b9_branch2c', data=res4b9_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b9_branch2c = bn4b9_branch2c
+        res4b9 = mx.symbol.broadcast_add(name='res4b9', *[res4b8_relu, scale4b9_branch2c])
+        res4b9_relu = mx.symbol.Activation(name='res4b9_relu', data=res4b9, act_type='relu')
+        res4b10_branch2a = mx.symbol.Convolution(name='res4b10_branch2a', data=res4b9_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2a = mx.symbol.BatchNorm(name='bn4b10_branch2a', data=res4b10_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b10_branch2a = bn4b10_branch2a
+        res4b10_branch2a_relu = mx.symbol.Activation(name='res4b10_branch2a_relu', data=scale4b10_branch2a,
+                                                     act_type='relu')
+        res4b10_branch2b = mx.symbol.Convolution(name='res4b10_branch2b', data=res4b10_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b10_branch2b = mx.symbol.BatchNorm(name='bn4b10_branch2b', data=res4b10_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b10_branch2b = bn4b10_branch2b
+        res4b10_branch2b_relu = mx.symbol.Activation(name='res4b10_branch2b_relu', data=scale4b10_branch2b,
+                                                     act_type='relu')
+        res4b10_branch2c = mx.symbol.Convolution(name='res4b10_branch2c', data=res4b10_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2c = mx.symbol.BatchNorm(name='bn4b10_branch2c', data=res4b10_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b10_branch2c = bn4b10_branch2c
+        res4b10 = mx.symbol.broadcast_add(name='res4b10', *[res4b9_relu, scale4b10_branch2c])
+        res4b10_relu = mx.symbol.Activation(name='res4b10_relu', data=res4b10, act_type='relu')
+        res4b11_branch2a = mx.symbol.Convolution(name='res4b11_branch2a', data=res4b10_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2a = mx.symbol.BatchNorm(name='bn4b11_branch2a', data=res4b11_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b11_branch2a = bn4b11_branch2a
+        res4b11_branch2a_relu = mx.symbol.Activation(name='res4b11_branch2a_relu', data=scale4b11_branch2a,
+                                                     act_type='relu')
+        res4b11_branch2b = mx.symbol.Convolution(name='res4b11_branch2b', data=res4b11_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b11_branch2b = mx.symbol.BatchNorm(name='bn4b11_branch2b', data=res4b11_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b11_branch2b = bn4b11_branch2b
+        res4b11_branch2b_relu = mx.symbol.Activation(name='res4b11_branch2b_relu', data=scale4b11_branch2b,
+                                                     act_type='relu')
+        res4b11_branch2c = mx.symbol.Convolution(name='res4b11_branch2c', data=res4b11_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2c = mx.symbol.BatchNorm(name='bn4b11_branch2c', data=res4b11_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b11_branch2c = bn4b11_branch2c
+        res4b11 = mx.symbol.broadcast_add(name='res4b11', *[res4b10_relu, scale4b11_branch2c])
+        res4b11_relu = mx.symbol.Activation(name='res4b11_relu', data=res4b11, act_type='relu')
+        res4b12_branch2a = mx.symbol.Convolution(name='res4b12_branch2a', data=res4b11_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2a = mx.symbol.BatchNorm(name='bn4b12_branch2a', data=res4b12_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b12_branch2a = bn4b12_branch2a
+        res4b12_branch2a_relu = mx.symbol.Activation(name='res4b12_branch2a_relu', data=scale4b12_branch2a,
+                                                     act_type='relu')
+        res4b12_branch2b = mx.symbol.Convolution(name='res4b12_branch2b', data=res4b12_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b12_branch2b = mx.symbol.BatchNorm(name='bn4b12_branch2b', data=res4b12_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b12_branch2b = bn4b12_branch2b
+        res4b12_branch2b_relu = mx.symbol.Activation(name='res4b12_branch2b_relu', data=scale4b12_branch2b,
+                                                     act_type='relu')
+        res4b12_branch2c = mx.symbol.Convolution(name='res4b12_branch2c', data=res4b12_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2c = mx.symbol.BatchNorm(name='bn4b12_branch2c', data=res4b12_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b12_branch2c = bn4b12_branch2c
+        res4b12 = mx.symbol.broadcast_add(name='res4b12', *[res4b11_relu, scale4b12_branch2c])
+        res4b12_relu = mx.symbol.Activation(name='res4b12_relu', data=res4b12, act_type='relu')
+        res4b13_branch2a = mx.symbol.Convolution(name='res4b13_branch2a', data=res4b12_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2a = mx.symbol.BatchNorm(name='bn4b13_branch2a', data=res4b13_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b13_branch2a = bn4b13_branch2a
+        res4b13_branch2a_relu = mx.symbol.Activation(name='res4b13_branch2a_relu', data=scale4b13_branch2a,
+                                                     act_type='relu')
+        res4b13_branch2b = mx.symbol.Convolution(name='res4b13_branch2b', data=res4b13_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b13_branch2b = mx.symbol.BatchNorm(name='bn4b13_branch2b', data=res4b13_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b13_branch2b = bn4b13_branch2b
+        res4b13_branch2b_relu = mx.symbol.Activation(name='res4b13_branch2b_relu', data=scale4b13_branch2b,
+                                                     act_type='relu')
+        res4b13_branch2c = mx.symbol.Convolution(name='res4b13_branch2c', data=res4b13_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2c = mx.symbol.BatchNorm(name='bn4b13_branch2c', data=res4b13_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b13_branch2c = bn4b13_branch2c
+        res4b13 = mx.symbol.broadcast_add(name='res4b13', *[res4b12_relu, scale4b13_branch2c])
+        res4b13_relu = mx.symbol.Activation(name='res4b13_relu', data=res4b13, act_type='relu')
+        res4b14_branch2a = mx.symbol.Convolution(name='res4b14_branch2a', data=res4b13_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2a = mx.symbol.BatchNorm(name='bn4b14_branch2a', data=res4b14_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b14_branch2a = bn4b14_branch2a
+        res4b14_branch2a_relu = mx.symbol.Activation(name='res4b14_branch2a_relu', data=scale4b14_branch2a,
+                                                     act_type='relu')
+        res4b14_branch2b = mx.symbol.Convolution(name='res4b14_branch2b', data=res4b14_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b14_branch2b = mx.symbol.BatchNorm(name='bn4b14_branch2b', data=res4b14_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b14_branch2b = bn4b14_branch2b
+        res4b14_branch2b_relu = mx.symbol.Activation(name='res4b14_branch2b_relu', data=scale4b14_branch2b,
+                                                     act_type='relu')
+        res4b14_branch2c = mx.symbol.Convolution(name='res4b14_branch2c', data=res4b14_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2c = mx.symbol.BatchNorm(name='bn4b14_branch2c', data=res4b14_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b14_branch2c = bn4b14_branch2c
+        res4b14 = mx.symbol.broadcast_add(name='res4b14', *[res4b13_relu, scale4b14_branch2c])
+        res4b14_relu = mx.symbol.Activation(name='res4b14_relu', data=res4b14, act_type='relu')
+        res4b15_branch2a = mx.symbol.Convolution(name='res4b15_branch2a', data=res4b14_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2a = mx.symbol.BatchNorm(name='bn4b15_branch2a', data=res4b15_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b15_branch2a = bn4b15_branch2a
+        res4b15_branch2a_relu = mx.symbol.Activation(name='res4b15_branch2a_relu', data=scale4b15_branch2a,
+                                                     act_type='relu')
+        res4b15_branch2b = mx.symbol.Convolution(name='res4b15_branch2b', data=res4b15_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b15_branch2b = mx.symbol.BatchNorm(name='bn4b15_branch2b', data=res4b15_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b15_branch2b = bn4b15_branch2b
+        res4b15_branch2b_relu = mx.symbol.Activation(name='res4b15_branch2b_relu', data=scale4b15_branch2b,
+                                                     act_type='relu')
+        res4b15_branch2c = mx.symbol.Convolution(name='res4b15_branch2c', data=res4b15_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2c = mx.symbol.BatchNorm(name='bn4b15_branch2c', data=res4b15_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b15_branch2c = bn4b15_branch2c
+        res4b15 = mx.symbol.broadcast_add(name='res4b15', *[res4b14_relu, scale4b15_branch2c])
+        res4b15_relu = mx.symbol.Activation(name='res4b15_relu', data=res4b15, act_type='relu')
+        res4b16_branch2a = mx.symbol.Convolution(name='res4b16_branch2a', data=res4b15_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2a = mx.symbol.BatchNorm(name='bn4b16_branch2a', data=res4b16_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b16_branch2a = bn4b16_branch2a
+        res4b16_branch2a_relu = mx.symbol.Activation(name='res4b16_branch2a_relu', data=scale4b16_branch2a,
+                                                     act_type='relu')
+        res4b16_branch2b = mx.symbol.Convolution(name='res4b16_branch2b', data=res4b16_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b16_branch2b = mx.symbol.BatchNorm(name='bn4b16_branch2b', data=res4b16_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b16_branch2b = bn4b16_branch2b
+        res4b16_branch2b_relu = mx.symbol.Activation(name='res4b16_branch2b_relu', data=scale4b16_branch2b,
+                                                     act_type='relu')
+        res4b16_branch2c = mx.symbol.Convolution(name='res4b16_branch2c', data=res4b16_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2c = mx.symbol.BatchNorm(name='bn4b16_branch2c', data=res4b16_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b16_branch2c = bn4b16_branch2c
+        res4b16 = mx.symbol.broadcast_add(name='res4b16', *[res4b15_relu, scale4b16_branch2c])
+        res4b16_relu = mx.symbol.Activation(name='res4b16_relu', data=res4b16, act_type='relu')
+        res4b17_branch2a = mx.symbol.Convolution(name='res4b17_branch2a', data=res4b16_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2a = mx.symbol.BatchNorm(name='bn4b17_branch2a', data=res4b17_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b17_branch2a = bn4b17_branch2a
+        res4b17_branch2a_relu = mx.symbol.Activation(name='res4b17_branch2a_relu', data=scale4b17_branch2a,
+                                                     act_type='relu')
+        res4b17_branch2b = mx.symbol.Convolution(name='res4b17_branch2b', data=res4b17_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b17_branch2b = mx.symbol.BatchNorm(name='bn4b17_branch2b', data=res4b17_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b17_branch2b = bn4b17_branch2b
+        res4b17_branch2b_relu = mx.symbol.Activation(name='res4b17_branch2b_relu', data=scale4b17_branch2b,
+                                                     act_type='relu')
+        res4b17_branch2c = mx.symbol.Convolution(name='res4b17_branch2c', data=res4b17_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2c = mx.symbol.BatchNorm(name='bn4b17_branch2c', data=res4b17_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b17_branch2c = bn4b17_branch2c
+        res4b17 = mx.symbol.broadcast_add(name='res4b17', *[res4b16_relu, scale4b17_branch2c])
+        res4b17_relu = mx.symbol.Activation(name='res4b17_relu', data=res4b17, act_type='relu')
+        res4b18_branch2a = mx.symbol.Convolution(name='res4b18_branch2a', data=res4b17_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2a = mx.symbol.BatchNorm(name='bn4b18_branch2a', data=res4b18_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b18_branch2a = bn4b18_branch2a
+        res4b18_branch2a_relu = mx.symbol.Activation(name='res4b18_branch2a_relu', data=scale4b18_branch2a,
+                                                     act_type='relu')
+        res4b18_branch2b = mx.symbol.Convolution(name='res4b18_branch2b', data=res4b18_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b18_branch2b = mx.symbol.BatchNorm(name='bn4b18_branch2b', data=res4b18_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b18_branch2b = bn4b18_branch2b
+        res4b18_branch2b_relu = mx.symbol.Activation(name='res4b18_branch2b_relu', data=scale4b18_branch2b,
+                                                     act_type='relu')
+        res4b18_branch2c = mx.symbol.Convolution(name='res4b18_branch2c', data=res4b18_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2c = mx.symbol.BatchNorm(name='bn4b18_branch2c', data=res4b18_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b18_branch2c = bn4b18_branch2c
+        res4b18 = mx.symbol.broadcast_add(name='res4b18', *[res4b17_relu, scale4b18_branch2c])
+        res4b18_relu = mx.symbol.Activation(name='res4b18_relu', data=res4b18, act_type='relu')
+        res4b19_branch2a = mx.symbol.Convolution(name='res4b19_branch2a', data=res4b18_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2a = mx.symbol.BatchNorm(name='bn4b19_branch2a', data=res4b19_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b19_branch2a = bn4b19_branch2a
+        res4b19_branch2a_relu = mx.symbol.Activation(name='res4b19_branch2a_relu', data=scale4b19_branch2a,
+                                                     act_type='relu')
+        res4b19_branch2b = mx.symbol.Convolution(name='res4b19_branch2b', data=res4b19_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b19_branch2b = mx.symbol.BatchNorm(name='bn4b19_branch2b', data=res4b19_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b19_branch2b = bn4b19_branch2b
+        res4b19_branch2b_relu = mx.symbol.Activation(name='res4b19_branch2b_relu', data=scale4b19_branch2b,
+                                                     act_type='relu')
+        res4b19_branch2c = mx.symbol.Convolution(name='res4b19_branch2c', data=res4b19_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2c = mx.symbol.BatchNorm(name='bn4b19_branch2c', data=res4b19_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b19_branch2c = bn4b19_branch2c
+        res4b19 = mx.symbol.broadcast_add(name='res4b19', *[res4b18_relu, scale4b19_branch2c])
+        res4b19_relu = mx.symbol.Activation(name='res4b19_relu', data=res4b19, act_type='relu')
+        res4b20_branch2a = mx.symbol.Convolution(name='res4b20_branch2a', data=res4b19_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2a = mx.symbol.BatchNorm(name='bn4b20_branch2a', data=res4b20_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b20_branch2a = bn4b20_branch2a
+        res4b20_branch2a_relu = mx.symbol.Activation(name='res4b20_branch2a_relu', data=scale4b20_branch2a,
+                                                     act_type='relu')
+        res4b20_branch2b = mx.symbol.Convolution(name='res4b20_branch2b', data=res4b20_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b20_branch2b = mx.symbol.BatchNorm(name='bn4b20_branch2b', data=res4b20_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b20_branch2b = bn4b20_branch2b
+        res4b20_branch2b_relu = mx.symbol.Activation(name='res4b20_branch2b_relu', data=scale4b20_branch2b,
+                                                     act_type='relu')
+        res4b20_branch2c = mx.symbol.Convolution(name='res4b20_branch2c', data=res4b20_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2c = mx.symbol.BatchNorm(name='bn4b20_branch2c', data=res4b20_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b20_branch2c = bn4b20_branch2c
+        res4b20 = mx.symbol.broadcast_add(name='res4b20', *[res4b19_relu, scale4b20_branch2c])
+        res4b20_relu = mx.symbol.Activation(name='res4b20_relu', data=res4b20, act_type='relu')
+        res4b21_branch2a = mx.symbol.Convolution(name='res4b21_branch2a', data=res4b20_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2a = mx.symbol.BatchNorm(name='bn4b21_branch2a', data=res4b21_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b21_branch2a = bn4b21_branch2a
+        res4b21_branch2a_relu = mx.symbol.Activation(name='res4b21_branch2a_relu', data=scale4b21_branch2a,
+                                                     act_type='relu')
+        res4b21_branch2b = mx.symbol.Convolution(name='res4b21_branch2b', data=res4b21_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b21_branch2b = mx.symbol.BatchNorm(name='bn4b21_branch2b', data=res4b21_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b21_branch2b = bn4b21_branch2b
+        res4b21_branch2b_relu = mx.symbol.Activation(name='res4b21_branch2b_relu', data=scale4b21_branch2b,
+                                                     act_type='relu')
+        res4b21_branch2c = mx.symbol.Convolution(name='res4b21_branch2c', data=res4b21_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2c = mx.symbol.BatchNorm(name='bn4b21_branch2c', data=res4b21_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b21_branch2c = bn4b21_branch2c
+        res4b21 = mx.symbol.broadcast_add(name='res4b21', *[res4b20_relu, scale4b21_branch2c])
+        res4b21_relu = mx.symbol.Activation(name='res4b21_relu', data=res4b21, act_type='relu')
+        res4b22_branch2a = mx.symbol.Convolution(name='res4b22_branch2a', data=res4b21_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2a = mx.symbol.BatchNorm(name='bn4b22_branch2a', data=res4b22_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b22_branch2a = bn4b22_branch2a
+        res4b22_branch2a_relu = mx.symbol.Activation(name='res4b22_branch2a_relu', data=scale4b22_branch2a,
+                                                     act_type='relu')
+        res4b22_branch2b = mx.symbol.Convolution(name='res4b22_branch2b', data=res4b22_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b22_branch2b = mx.symbol.BatchNorm(name='bn4b22_branch2b', data=res4b22_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b22_branch2b = bn4b22_branch2b
+        res4b22_branch2b_relu = mx.symbol.Activation(name='res4b22_branch2b_relu', data=scale4b22_branch2b,
+                                                     act_type='relu')
+        res4b22_branch2c = mx.symbol.Convolution(name='res4b22_branch2c', data=res4b22_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2c = mx.symbol.BatchNorm(name='bn4b22_branch2c', data=res4b22_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps = self.eps)
+        scale4b22_branch2c = bn4b22_branch2c
+        res4b22 = mx.symbol.broadcast_add(name='res4b22', *[res4b21_relu, scale4b22_branch2c])
+        res4b22_relu = mx.symbol.Activation(name='res4b22_relu', data=res4b22, act_type='relu')
+
+        res5a_branch1 = mx.symbol.Convolution(name='res5a_branch1', data=res4b22_relu, num_filter=2048, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch1 = mx.symbol.BatchNorm(name='bn5a_branch1', data=res5a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps = self.eps)
+        scale5a_branch1 = bn5a_branch1
+        res5a_branch2a = mx.symbol.Convolution(name='res5a_branch2a', data=res4b22_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2a = mx.symbol.BatchNorm(name='bn5a_branch2a', data=res5a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5a_branch2a = bn5a_branch2a
+        res5a_branch2a_relu = mx.symbol.Activation(name='res5a_branch2a_relu', data=scale5a_branch2a, act_type='relu')
+        res5a_branch2b = mx.symbol.Convolution(name='res5a_branch2b', data=res5a_branch2a_relu, num_filter=512,
+                                               pad=(2, 2), kernel=(3, 3), dilate=(2, 2), stride=(1, 1), no_bias=True)
+        bn5a_branch2b = mx.symbol.BatchNorm(name='bn5a_branch2b', data=res5a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5a_branch2b = bn5a_branch2b
+        res5a_branch2b_relu = mx.symbol.Activation(name='res5a_branch2b_relu', data=scale5a_branch2b, act_type='relu')
+        res5a_branch2c = mx.symbol.Convolution(name='res5a_branch2c', data=res5a_branch2b_relu, num_filter=2048,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2c = mx.symbol.BatchNorm(name='bn5a_branch2c', data=res5a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5a_branch2c = bn5a_branch2c
+        res5a = mx.symbol.broadcast_add(name='res5a', *[scale5a_branch1, scale5a_branch2c])
+        res5a_relu = mx.symbol.Activation(name='res5a_relu', data=res5a, act_type='relu')
+        res5b_branch2a = mx.symbol.Convolution(name='res5b_branch2a', data=res5a_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2a = mx.symbol.BatchNorm(name='bn5b_branch2a', data=res5b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5b_branch2a = bn5b_branch2a
+        res5b_branch2a_relu = mx.symbol.Activation(name='res5b_branch2a_relu', data=scale5b_branch2a, act_type='relu')
+        res5b_branch2b = mx.symbol.Convolution(name='res5b_branch2b', data=res5b_branch2a_relu, num_filter=512,
+                                               pad=(2, 2), kernel=(3, 3), dilate=(2, 2), stride=(1, 1), no_bias=True)
+        bn5b_branch2b = mx.symbol.BatchNorm(name='bn5b_branch2b', data=res5b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5b_branch2b = bn5b_branch2b
+        res5b_branch2b_relu = mx.symbol.Activation(name='res5b_branch2b_relu', data=scale5b_branch2b, act_type='relu')
+        res5b_branch2c = mx.symbol.Convolution(name='res5b_branch2c', data=res5b_branch2b_relu, num_filter=2048,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2c = mx.symbol.BatchNorm(name='bn5b_branch2c', data=res5b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5b_branch2c = bn5b_branch2c
+        res5b = mx.symbol.broadcast_add(name='res5b', *[res5a_relu, scale5b_branch2c])
+        res5b_relu = mx.symbol.Activation(name='res5b_relu', data=res5b, act_type='relu')
+        res5c_branch2a = mx.symbol.Convolution(name='res5c_branch2a', data=res5b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2a = mx.symbol.BatchNorm(name='bn5c_branch2a', data=res5c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5c_branch2a = bn5c_branch2a
+        res5c_branch2a_relu = mx.symbol.Activation(name='res5c_branch2a_relu', data=scale5c_branch2a, act_type='relu')
+        res5c_branch2b = mx.symbol.Convolution(name='res5c_branch2b', data=res5c_branch2a_relu, num_filter=512,
+                                               pad=(2, 2), kernel=(3, 3), dilate=(2, 2), stride=(1, 1), no_bias=True)
+        bn5c_branch2b = mx.symbol.BatchNorm(name='bn5c_branch2b', data=res5c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5c_branch2b = bn5c_branch2b
+        res5c_branch2b_relu = mx.symbol.Activation(name='res5c_branch2b_relu', data=scale5c_branch2b, act_type='relu')
+        res5c_branch2c = mx.symbol.Convolution(name='res5c_branch2c', data=res5c_branch2b_relu, num_filter=2048,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2c = mx.symbol.BatchNorm(name='bn5c_branch2c', data=res5c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale5c_branch2c = bn5c_branch2c
+        res5c = mx.symbol.broadcast_add(name='res5c', *[res5b_relu, scale5c_branch2c])
+        res5c_relu = mx.symbol.Activation(name='res5c_relu', data=res5c, act_type='relu')
+        return res5c_relu
+
+    def get_train_symbol(self, num_classes):
+        """
+        get symbol for training
+        :param num_classes: num of classes
+        :return: the symbol for training
+        """
+        data = mx.symbol.Variable(name="data")
+        seg_cls_gt = mx.symbol.Variable(name='label')
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_conv(data)
+
+        # subsequent fc layers by haozhi
+        fc6_bias = mx.symbol.Variable('fc6_bias', lr_mult=2.0)
+        fc6_weight = mx.symbol.Variable('fc6_weight', lr_mult=1.0)
+
+        fc6 = mx.symbol.Convolution(data=conv_feat, kernel=(1, 1), pad=(0, 0), num_filter=1024, name="fc6",
+                                    bias=fc6_bias, weight=fc6_weight, workspace=self.workspace)
+        relu_fc6 = mx.sym.Activation(data=fc6, act_type='relu', name='relu_fc6')
+
+        score_bias = mx.symbol.Variable('score_bias', lr_mult=2.0)
+        score_weight = mx.symbol.Variable('score_weight', lr_mult=1.0)
+
+        score = mx.symbol.Convolution(data=relu_fc6, kernel=(1, 1), pad=(0, 0), num_filter=num_classes, name="score",
+                                      bias=score_bias, weight=score_weight, workspace=self.workspace)
+
+        upsampling = mx.symbol.Deconvolution(data=score, num_filter=num_classes, kernel=(32, 32), stride=(16, 16),
+                                             num_group=num_classes, no_bias=True, name='upsampling',
+                                             attr={'lr_mult': '0.0'}, workspace=self.workspace)
+
+        croped_score = mx.symbol.Crop(*[upsampling, data], offset=(8, 8), name='croped_score')
+        softmax = mx.symbol.SoftmaxOutput(data=croped_score, label=seg_cls_gt, normalization='valid', multi_output=True,
+                                          use_ignore=True, ignore_label=255, name="softmax")
+
+        return softmax
+
+    def get_test_symbol(self, num_classes):
+        """
+        get symbol for testing
+        :param num_classes: num of classes
+        :return: the symbol for testing
+        """
+        data = mx.symbol.Variable(name="data")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_conv(data)
+
+        fc6_bias = mx.symbol.Variable('fc6_bias', lr_mult=2.0)
+        fc6_weight = mx.symbol.Variable('fc6_weight', lr_mult=1.0)
+
+        fc6 = mx.symbol.Convolution(
+            data=conv_feat, kernel=(1, 1), pad=(0, 0), num_filter=1024, name="fc6", bias=fc6_bias, weight=fc6_weight,
+            workspace=self.workspace)
+        relu_fc6 = mx.sym.Activation(data=fc6, act_type='relu', name='relu_fc6')
+
+        score_bias = mx.symbol.Variable('score_bias', lr_mult=2.0)
+        score_weight = mx.symbol.Variable('score_weight', lr_mult=1.0)
+
+        score = mx.symbol.Convolution(
+            data=relu_fc6, kernel=(1, 1), pad=(0, 0), num_filter=num_classes, name="score", bias=score_bias,
+            weight=score_weight, workspace=self.workspace)
+
+        upsampling = mx.symbol.Deconvolution(
+            data=score, num_filter=num_classes, kernel=(32, 32), stride=(16, 16), num_group=num_classes, no_bias=True,
+            name='upsampling', attr={'lr_mult': '0.0'}, workspace=self.workspace)
+
+        croped_score = mx.symbol.Crop(*[upsampling, data], offset=(8, 8), name='croped_score')
+
+        softmax = mx.symbol.SoftmaxOutput(data=croped_score, normalization='valid', multi_output=True, use_ignore=True,
+                                          ignore_label=255, name="softmax")
+
+        return softmax
+
+    def get_symbol(self, cfg, is_train=True):
+        """
+        Return the generated symbol; it is also assigned to self.sym.
+        """
+
+        # config alias for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+
+        if is_train:
+            self.sym = self.get_train_symbol(num_classes=num_classes)
+        else:
+            self.sym = self.get_test_symbol(num_classes=num_classes)
+
+        return self.sym
+
+    def init_weights(self, cfg, arg_params, aux_params):
+        arg_params['fc6_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc6_weight'])
+        arg_params['fc6_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc6_bias'])
+        arg_params['score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['score_weight'])
+        arg_params['score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['score_bias'])
+        arg_params['upsampling_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['upsampling_weight'])
+
+        init = mx.init.Initializer()
+        init._init_bilinear('upsampling_weight', arg_params['upsampling_weight'])
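Review note on the `upsampling` layer above: it is a grouped `Deconvolution` with `kernel=(32, 32)`, `stride=(16, 16)` and `lr_mult` 0, so each class channel is upsampled 16x by fixed weights that `init_weights` fills with a bilinear kernel. The sketch below shows what such a kernel looks like. It assumes the common bilinear-filler formula (a triangle centered on the kernel), which I believe is what the initializer applies, and it uses plain Python rather than MXNet. The helper names `bilinear_kernel_1d` and `bilinear_kernel_2d` are hypothetical and do not exist in the repository.

```python
import math


def bilinear_kernel_1d(size):
    """Return 1-D bilinear interpolation weights for a kernel with `size` taps.

    Assumption: the usual bilinear-filler formula, where the upsampling
    factor is ceil(size / 2) and the weights form a triangle.
    """
    f = math.ceil(size / 2.0)              # upsampling factor implied by the kernel size
    c = (2 * f - 1 - f % 2) / (2.0 * f)    # triangle center, in units of f
    return [1.0 - abs(x / float(f) - c) for x in range(size)]


def bilinear_kernel_2d(size):
    """Return the separable 2-D kernel as the outer product of the 1-D weights."""
    w = bilinear_kernel_1d(size)
    return [[wy * wx for wx in w] for wy in w]


# Kernel size used by the 'upsampling' Deconvolution layer.
kernel_1d = bilinear_kernel_1d(32)
kernel_2d = bilinear_kernel_2d(32)

# With stride 16, every output position is covered by two taps per axis,
# kernel_1d[p] and kernel_1d[p + 16]; these pairs should sum to 1.
phase_sums = [kernel_1d[p] + kernel_1d[p + 16] for p in range(16)]
```

Because each pair of taps in `phase_sums` sums to 1, the upsampled score map is a weighted average of neighbouring scores and keeps their scale. For an input of width `n`, a deconvolution with these settings and no padding produces `16 * n + 16` columns, which is why the following `Crop` uses `offset=(8, 8)` to trim the border symmetrically.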
diff --git a/deeplab/symbols/resnet_v1_101_deeplab_dcn.py b/deeplab/symbols/resnet_v1_101_deeplab_dcn.py
new file mode 100644
index 0000000..ebecc0d
--- /dev/null
+++ b/deeplab/symbols/resnet_v1_101_deeplab_dcn.py
@@ -0,0 +1,852 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
+import cPickle
+import mxnet as mx
+from utils.symbol import Symbol
+
+class resnet_v1_101_deeplab_dcn(Symbol):
+    def __init__(self):
+        """
+        Use __init__ to define the parameters the network needs.
+        """
+        self.eps = 1e-5
+        self.use_global_stats = True
+        self.workspace = 4096
+        self.units = (3, 4, 23, 3)  # residual units per stage for ResNet-101
+        self.filter_list = [256, 512, 1024, 2048]
+
+    def get_resnet_conv(self, data):
+        conv1 = mx.symbol.Convolution(name='conv1', data=data, num_filter=64, pad=(3, 3), kernel=(7, 7), stride=(2, 2),
+                                      no_bias=True)
+        bn_conv1 = mx.symbol.BatchNorm(name='bn_conv1', data=conv1, use_global_stats=True, fix_gamma=False, eps = self.eps)
+        scale_conv1 = bn_conv1
+        conv1_relu = mx.symbol.Activation(name='conv1_relu', data=scale_conv1, act_type='relu')
+        pool1 = mx.symbol.Pooling(name='pool1', data=conv1_relu, pooling_convention='full', pad=(0, 0), kernel=(3, 3),
+                                  stride=(2, 2), pool_type='max')
+        res2a_branch1 = mx.symbol.Convolution(name='res2a_branch1', data=pool1, num_filter=256, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch1 = mx.symbol.BatchNorm(name='bn2a_branch1', data=res2a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps = self.eps)
+        scale2a_branch1 = bn2a_branch1
+        res2a_branch2a = mx.symbol.Convolution(name='res2a_branch2a', data=pool1, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2a = mx.symbol.BatchNorm(name='bn2a_branch2a', data=res2a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2a_branch2a = bn2a_branch2a
+        res2a_branch2a_relu = mx.symbol.Activation(name='res2a_branch2a_relu', data=scale2a_branch2a, act_type='relu')
+        res2a_branch2b = mx.symbol.Convolution(name='res2a_branch2b', data=res2a_branch2a_relu, num_filter=64,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2a_branch2b = mx.symbol.BatchNorm(name='bn2a_branch2b', data=res2a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2a_branch2b = bn2a_branch2b
+        res2a_branch2b_relu = mx.symbol.Activation(name='res2a_branch2b_relu', data=scale2a_branch2b, act_type='relu')
+        res2a_branch2c = mx.symbol.Convolution(name='res2a_branch2c', data=res2a_branch2b_relu, num_filter=256,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2c = mx.symbol.BatchNorm(name='bn2a_branch2c', data=res2a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2a_branch2c = bn2a_branch2c
+        res2a = mx.symbol.broadcast_add(name='res2a', *[scale2a_branch1, scale2a_branch2c])
+        res2a_relu = mx.symbol.Activation(name='res2a_relu', data=res2a, act_type='relu')
+        res2b_branch2a = mx.symbol.Convolution(name='res2b_branch2a', data=res2a_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2a = mx.symbol.BatchNorm(name='bn2b_branch2a', data=res2b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2b_branch2a = bn2b_branch2a
+        res2b_branch2a_relu = mx.symbol.Activation(name='res2b_branch2a_relu', data=scale2b_branch2a, act_type='relu')
+        res2b_branch2b = mx.symbol.Convolution(name='res2b_branch2b', data=res2b_branch2a_relu, num_filter=64,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2b_branch2b = mx.symbol.BatchNorm(name='bn2b_branch2b', data=res2b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2b_branch2b = bn2b_branch2b
+        res2b_branch2b_relu = mx.symbol.Activation(name='res2b_branch2b_relu', data=scale2b_branch2b, act_type='relu')
+        res2b_branch2c = mx.symbol.Convolution(name='res2b_branch2c', data=res2b_branch2b_relu, num_filter=256,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2c = mx.symbol.BatchNorm(name='bn2b_branch2c', data=res2b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2b_branch2c = bn2b_branch2c
+        res2b = mx.symbol.broadcast_add(name='res2b', *[res2a_relu, scale2b_branch2c])
+        res2b_relu = mx.symbol.Activation(name='res2b_relu', data=res2b, act_type='relu')
+        res2c_branch2a = mx.symbol.Convolution(name='res2c_branch2a', data=res2b_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2a = mx.symbol.BatchNorm(name='bn2c_branch2a', data=res2c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2c_branch2a = bn2c_branch2a
+        res2c_branch2a_relu = mx.symbol.Activation(name='res2c_branch2a_relu', data=scale2c_branch2a, act_type='relu')
+        res2c_branch2b = mx.symbol.Convolution(name='res2c_branch2b', data=res2c_branch2a_relu, num_filter=64,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2c_branch2b = mx.symbol.BatchNorm(name='bn2c_branch2b', data=res2c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2c_branch2b = bn2c_branch2b
+        res2c_branch2b_relu = mx.symbol.Activation(name='res2c_branch2b_relu', data=scale2c_branch2b, act_type='relu')
+        res2c_branch2c = mx.symbol.Convolution(name='res2c_branch2c', data=res2c_branch2b_relu, num_filter=256,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2c = mx.symbol.BatchNorm(name='bn2c_branch2c', data=res2c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale2c_branch2c = bn2c_branch2c
+        res2c = mx.symbol.broadcast_add(name='res2c', *[res2b_relu, scale2c_branch2c])
+        res2c_relu = mx.symbol.Activation(name='res2c_relu', data=res2c, act_type='relu')
+        res3a_branch1 = mx.symbol.Convolution(name='res3a_branch1', data=res2c_relu, num_filter=512, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch1 = mx.symbol.BatchNorm(name='bn3a_branch1', data=res3a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps = self.eps)
+        scale3a_branch1 = bn3a_branch1
+        res3a_branch2a = mx.symbol.Convolution(name='res3a_branch2a', data=res2c_relu, num_filter=128, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch2a = mx.symbol.BatchNorm(name='bn3a_branch2a', data=res3a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale3a_branch2a = bn3a_branch2a
+        res3a_branch2a_relu = mx.symbol.Activation(name='res3a_branch2a_relu', data=scale3a_branch2a, act_type='relu')
+        res3a_branch2b = mx.symbol.Convolution(name='res3a_branch2b', data=res3a_branch2a_relu, num_filter=128,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3a_branch2b = mx.symbol.BatchNorm(name='bn3a_branch2b', data=res3a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale3a_branch2b = bn3a_branch2b
+        res3a_branch2b_relu = mx.symbol.Activation(name='res3a_branch2b_relu', data=scale3a_branch2b, act_type='relu')
+        res3a_branch2c = mx.symbol.Convolution(name='res3a_branch2c', data=res3a_branch2b_relu, num_filter=512,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3a_branch2c = mx.symbol.BatchNorm(name='bn3a_branch2c', data=res3a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale3a_branch2c = bn3a_branch2c
+        res3a = mx.symbol.broadcast_add(name='res3a', *[scale3a_branch1, scale3a_branch2c])
+        res3a_relu = mx.symbol.Activation(name='res3a_relu', data=res3a, act_type='relu')
+        res3b1_branch2a = mx.symbol.Convolution(name='res3b1_branch2a', data=res3a_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2a = mx.symbol.BatchNorm(name='bn3b1_branch2a', data=res3b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b1_branch2a = bn3b1_branch2a
+        res3b1_branch2a_relu = mx.symbol.Activation(name='res3b1_branch2a_relu', data=scale3b1_branch2a,
+                                                    act_type='relu')
+        res3b1_branch2b = mx.symbol.Convolution(name='res3b1_branch2b', data=res3b1_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b1_branch2b = mx.symbol.BatchNorm(name='bn3b1_branch2b', data=res3b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b1_branch2b = bn3b1_branch2b
+        res3b1_branch2b_relu = mx.symbol.Activation(name='res3b1_branch2b_relu', data=scale3b1_branch2b,
+                                                    act_type='relu')
+        res3b1_branch2c = mx.symbol.Convolution(name='res3b1_branch2c', data=res3b1_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2c = mx.symbol.BatchNorm(name='bn3b1_branch2c', data=res3b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b1_branch2c = bn3b1_branch2c
+        res3b1 = mx.symbol.broadcast_add(name='res3b1', *[res3a_relu, scale3b1_branch2c])
+        res3b1_relu = mx.symbol.Activation(name='res3b1_relu', data=res3b1, act_type='relu')
+        res3b2_branch2a = mx.symbol.Convolution(name='res3b2_branch2a', data=res3b1_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2a = mx.symbol.BatchNorm(name='bn3b2_branch2a', data=res3b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b2_branch2a = bn3b2_branch2a
+        res3b2_branch2a_relu = mx.symbol.Activation(name='res3b2_branch2a_relu', data=scale3b2_branch2a,
+                                                    act_type='relu')
+        res3b2_branch2b = mx.symbol.Convolution(name='res3b2_branch2b', data=res3b2_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b2_branch2b = mx.symbol.BatchNorm(name='bn3b2_branch2b', data=res3b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b2_branch2b = bn3b2_branch2b
+        res3b2_branch2b_relu = mx.symbol.Activation(name='res3b2_branch2b_relu', data=scale3b2_branch2b,
+                                                    act_type='relu')
+        res3b2_branch2c = mx.symbol.Convolution(name='res3b2_branch2c', data=res3b2_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2c = mx.symbol.BatchNorm(name='bn3b2_branch2c', data=res3b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b2_branch2c = bn3b2_branch2c
+        res3b2 = mx.symbol.broadcast_add(name='res3b2', *[res3b1_relu, scale3b2_branch2c])
+        res3b2_relu = mx.symbol.Activation(name='res3b2_relu', data=res3b2, act_type='relu')
+        res3b3_branch2a = mx.symbol.Convolution(name='res3b3_branch2a', data=res3b2_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2a = mx.symbol.BatchNorm(name='bn3b3_branch2a', data=res3b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b3_branch2a = bn3b3_branch2a
+        res3b3_branch2a_relu = mx.symbol.Activation(name='res3b3_branch2a_relu', data=scale3b3_branch2a,
+                                                    act_type='relu')
+        res3b3_branch2b = mx.symbol.Convolution(name='res3b3_branch2b', data=res3b3_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b3_branch2b = mx.symbol.BatchNorm(name='bn3b3_branch2b', data=res3b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b3_branch2b = bn3b3_branch2b
+        res3b3_branch2b_relu = mx.symbol.Activation(name='res3b3_branch2b_relu', data=scale3b3_branch2b,
+                                                    act_type='relu')
+        res3b3_branch2c = mx.symbol.Convolution(name='res3b3_branch2c', data=res3b3_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2c = mx.symbol.BatchNorm(name='bn3b3_branch2c', data=res3b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale3b3_branch2c = bn3b3_branch2c
+        res3b3 = mx.symbol.broadcast_add(name='res3b3', *[res3b2_relu, scale3b3_branch2c])
+        res3b3_relu = mx.symbol.Activation(name='res3b3_relu', data=res3b3, act_type='relu')
+        res4a_branch1 = mx.symbol.Convolution(name='res4a_branch1', data=res3b3_relu, num_filter=1024, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch1 = mx.symbol.BatchNorm(name='bn4a_branch1', data=res4a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps = self.eps)
+        scale4a_branch1 = bn4a_branch1
+        res4a_branch2a = mx.symbol.Convolution(name='res4a_branch2a', data=res3b3_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch2a = mx.symbol.BatchNorm(name='bn4a_branch2a', data=res4a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale4a_branch2a = bn4a_branch2a
+        res4a_branch2a_relu = mx.symbol.Activation(name='res4a_branch2a_relu', data=scale4a_branch2a, act_type='relu')
+        res4a_branch2b = mx.symbol.Convolution(name='res4a_branch2b', data=res4a_branch2a_relu, num_filter=256,
+                                               pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4a_branch2b = mx.symbol.BatchNorm(name='bn4a_branch2b', data=res4a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale4a_branch2b = bn4a_branch2b
+        res4a_branch2b_relu = mx.symbol.Activation(name='res4a_branch2b_relu', data=scale4a_branch2b, act_type='relu')
+        res4a_branch2c = mx.symbol.Convolution(name='res4a_branch2c', data=res4a_branch2b_relu, num_filter=1024,
+                                               pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4a_branch2c = mx.symbol.BatchNorm(name='bn4a_branch2c', data=res4a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps = self.eps)
+        scale4a_branch2c = bn4a_branch2c
+        res4a = mx.symbol.broadcast_add(name='res4a', *[scale4a_branch1, scale4a_branch2c])
+        res4a_relu = mx.symbol.Activation(name='res4a_relu', data=res4a, act_type='relu')
+        res4b1_branch2a = mx.symbol.Convolution(name='res4b1_branch2a', data=res4a_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2a = mx.symbol.BatchNorm(name='bn4b1_branch2a', data=res4b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b1_branch2a = bn4b1_branch2a
+        res4b1_branch2a_relu = mx.symbol.Activation(name='res4b1_branch2a_relu', data=scale4b1_branch2a,
+                                                    act_type='relu')
+        res4b1_branch2b = mx.symbol.Convolution(name='res4b1_branch2b', data=res4b1_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b1_branch2b = mx.symbol.BatchNorm(name='bn4b1_branch2b', data=res4b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b1_branch2b = bn4b1_branch2b
+        res4b1_branch2b_relu = mx.symbol.Activation(name='res4b1_branch2b_relu', data=scale4b1_branch2b,
+                                                    act_type='relu')
+        res4b1_branch2c = mx.symbol.Convolution(name='res4b1_branch2c', data=res4b1_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2c = mx.symbol.BatchNorm(name='bn4b1_branch2c', data=res4b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b1_branch2c = bn4b1_branch2c
+        res4b1 = mx.symbol.broadcast_add(name='res4b1', *[res4a_relu, scale4b1_branch2c])
+        res4b1_relu = mx.symbol.Activation(name='res4b1_relu', data=res4b1, act_type='relu')
+        res4b2_branch2a = mx.symbol.Convolution(name='res4b2_branch2a', data=res4b1_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2a = mx.symbol.BatchNorm(name='bn4b2_branch2a', data=res4b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b2_branch2a = bn4b2_branch2a
+        res4b2_branch2a_relu = mx.symbol.Activation(name='res4b2_branch2a_relu', data=scale4b2_branch2a,
+                                                    act_type='relu')
+        res4b2_branch2b = mx.symbol.Convolution(name='res4b2_branch2b', data=res4b2_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b2_branch2b = mx.symbol.BatchNorm(name='bn4b2_branch2b', data=res4b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b2_branch2b = bn4b2_branch2b
+        res4b2_branch2b_relu = mx.symbol.Activation(name='res4b2_branch2b_relu', data=scale4b2_branch2b,
+                                                    act_type='relu')
+        res4b2_branch2c = mx.symbol.Convolution(name='res4b2_branch2c', data=res4b2_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2c = mx.symbol.BatchNorm(name='bn4b2_branch2c', data=res4b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b2_branch2c = bn4b2_branch2c
+        res4b2 = mx.symbol.broadcast_add(name='res4b2', *[res4b1_relu, scale4b2_branch2c])
+        res4b2_relu = mx.symbol.Activation(name='res4b2_relu', data=res4b2, act_type='relu')
+        res4b3_branch2a = mx.symbol.Convolution(name='res4b3_branch2a', data=res4b2_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2a = mx.symbol.BatchNorm(name='bn4b3_branch2a', data=res4b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b3_branch2a = bn4b3_branch2a
+        res4b3_branch2a_relu = mx.symbol.Activation(name='res4b3_branch2a_relu', data=scale4b3_branch2a,
+                                                    act_type='relu')
+        res4b3_branch2b = mx.symbol.Convolution(name='res4b3_branch2b', data=res4b3_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b3_branch2b = mx.symbol.BatchNorm(name='bn4b3_branch2b', data=res4b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b3_branch2b = bn4b3_branch2b
+        res4b3_branch2b_relu = mx.symbol.Activation(name='res4b3_branch2b_relu', data=scale4b3_branch2b,
+                                                    act_type='relu')
+        res4b3_branch2c = mx.symbol.Convolution(name='res4b3_branch2c', data=res4b3_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2c = mx.symbol.BatchNorm(name='bn4b3_branch2c', data=res4b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b3_branch2c = bn4b3_branch2c
+        res4b3 = mx.symbol.broadcast_add(name='res4b3', *[res4b2_relu, scale4b3_branch2c])
+        res4b3_relu = mx.symbol.Activation(name='res4b3_relu', data=res4b3, act_type='relu')
+        res4b4_branch2a = mx.symbol.Convolution(name='res4b4_branch2a', data=res4b3_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2a = mx.symbol.BatchNorm(name='bn4b4_branch2a', data=res4b4_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b4_branch2a = bn4b4_branch2a
+        res4b4_branch2a_relu = mx.symbol.Activation(name='res4b4_branch2a_relu', data=scale4b4_branch2a,
+                                                    act_type='relu')
+        res4b4_branch2b = mx.symbol.Convolution(name='res4b4_branch2b', data=res4b4_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b4_branch2b = mx.symbol.BatchNorm(name='bn4b4_branch2b', data=res4b4_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b4_branch2b = bn4b4_branch2b
+        res4b4_branch2b_relu = mx.symbol.Activation(name='res4b4_branch2b_relu', data=scale4b4_branch2b,
+                                                    act_type='relu')
+        res4b4_branch2c = mx.symbol.Convolution(name='res4b4_branch2c', data=res4b4_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2c = mx.symbol.BatchNorm(name='bn4b4_branch2c', data=res4b4_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b4_branch2c = bn4b4_branch2c
+        res4b4 = mx.symbol.broadcast_add(name='res4b4', *[res4b3_relu, scale4b4_branch2c])
+        res4b4_relu = mx.symbol.Activation(name='res4b4_relu', data=res4b4, act_type='relu')
+        res4b5_branch2a = mx.symbol.Convolution(name='res4b5_branch2a', data=res4b4_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2a = mx.symbol.BatchNorm(name='bn4b5_branch2a', data=res4b5_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b5_branch2a = bn4b5_branch2a
+        res4b5_branch2a_relu = mx.symbol.Activation(name='res4b5_branch2a_relu', data=scale4b5_branch2a,
+                                                    act_type='relu')
+        res4b5_branch2b = mx.symbol.Convolution(name='res4b5_branch2b', data=res4b5_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b5_branch2b = mx.symbol.BatchNorm(name='bn4b5_branch2b', data=res4b5_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b5_branch2b = bn4b5_branch2b
+        res4b5_branch2b_relu = mx.symbol.Activation(name='res4b5_branch2b_relu', data=scale4b5_branch2b,
+                                                    act_type='relu')
+        res4b5_branch2c = mx.symbol.Convolution(name='res4b5_branch2c', data=res4b5_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2c = mx.symbol.BatchNorm(name='bn4b5_branch2c', data=res4b5_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b5_branch2c = bn4b5_branch2c
+        res4b5 = mx.symbol.broadcast_add(name='res4b5', *[res4b4_relu, scale4b5_branch2c])
+        res4b5_relu = mx.symbol.Activation(name='res4b5_relu', data=res4b5, act_type='relu')
+        res4b6_branch2a = mx.symbol.Convolution(name='res4b6_branch2a', data=res4b5_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2a = mx.symbol.BatchNorm(name='bn4b6_branch2a', data=res4b6_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b6_branch2a = bn4b6_branch2a
+        res4b6_branch2a_relu = mx.symbol.Activation(name='res4b6_branch2a_relu', data=scale4b6_branch2a,
+                                                    act_type='relu')
+        res4b6_branch2b = mx.symbol.Convolution(name='res4b6_branch2b', data=res4b6_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b6_branch2b = mx.symbol.BatchNorm(name='bn4b6_branch2b', data=res4b6_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b6_branch2b = bn4b6_branch2b
+        res4b6_branch2b_relu = mx.symbol.Activation(name='res4b6_branch2b_relu', data=scale4b6_branch2b,
+                                                    act_type='relu')
+        res4b6_branch2c = mx.symbol.Convolution(name='res4b6_branch2c', data=res4b6_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2c = mx.symbol.BatchNorm(name='bn4b6_branch2c', data=res4b6_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b6_branch2c = bn4b6_branch2c
+        res4b6 = mx.symbol.broadcast_add(name='res4b6', *[res4b5_relu, scale4b6_branch2c])
+        res4b6_relu = mx.symbol.Activation(name='res4b6_relu', data=res4b6, act_type='relu')
+        res4b7_branch2a = mx.symbol.Convolution(name='res4b7_branch2a', data=res4b6_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2a = mx.symbol.BatchNorm(name='bn4b7_branch2a', data=res4b7_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b7_branch2a = bn4b7_branch2a
+        res4b7_branch2a_relu = mx.symbol.Activation(name='res4b7_branch2a_relu', data=scale4b7_branch2a,
+                                                    act_type='relu')
+        res4b7_branch2b = mx.symbol.Convolution(name='res4b7_branch2b', data=res4b7_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b7_branch2b = mx.symbol.BatchNorm(name='bn4b7_branch2b', data=res4b7_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b7_branch2b = bn4b7_branch2b
+        res4b7_branch2b_relu = mx.symbol.Activation(name='res4b7_branch2b_relu', data=scale4b7_branch2b,
+                                                    act_type='relu')
+        res4b7_branch2c = mx.symbol.Convolution(name='res4b7_branch2c', data=res4b7_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2c = mx.symbol.BatchNorm(name='bn4b7_branch2c', data=res4b7_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b7_branch2c = bn4b7_branch2c
+        res4b7 = mx.symbol.broadcast_add(name='res4b7', *[res4b6_relu, scale4b7_branch2c])
+        res4b7_relu = mx.symbol.Activation(name='res4b7_relu', data=res4b7, act_type='relu')
+        res4b8_branch2a = mx.symbol.Convolution(name='res4b8_branch2a', data=res4b7_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2a = mx.symbol.BatchNorm(name='bn4b8_branch2a', data=res4b8_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b8_branch2a = bn4b8_branch2a
+        res4b8_branch2a_relu = mx.symbol.Activation(name='res4b8_branch2a_relu', data=scale4b8_branch2a,
+                                                    act_type='relu')
+        res4b8_branch2b = mx.symbol.Convolution(name='res4b8_branch2b', data=res4b8_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b8_branch2b = mx.symbol.BatchNorm(name='bn4b8_branch2b', data=res4b8_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b8_branch2b = bn4b8_branch2b
+        res4b8_branch2b_relu = mx.symbol.Activation(name='res4b8_branch2b_relu', data=scale4b8_branch2b,
+                                                    act_type='relu')
+        res4b8_branch2c = mx.symbol.Convolution(name='res4b8_branch2c', data=res4b8_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2c = mx.symbol.BatchNorm(name='bn4b8_branch2c', data=res4b8_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps = self.eps)
+        scale4b8_branch2c = bn4b8_branch2c
+        res4b8 = mx.symbol.broadcast_add(name='res4b8', *[res4b7_relu, scale4b8_branch2c])
+        res4b8_relu = mx.symbol.Activation(name='res4b8_relu', data=res4b8, act_type='relu')
+        res4b9_branch2a = mx.symbol.Convolution(name='res4b9_branch2a', data=res4b8_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2a = mx.symbol.BatchNorm(name='bn4b9_branch2a', data=res4b9_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2a = bn4b9_branch2a
+        res4b9_branch2a_relu = mx.symbol.Activation(name='res4b9_branch2a_relu', data=scale4b9_branch2a,
+                                                    act_type='relu')
+        res4b9_branch2b = mx.symbol.Convolution(name='res4b9_branch2b', data=res4b9_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b9_branch2b = mx.symbol.BatchNorm(name='bn4b9_branch2b', data=res4b9_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2b = bn4b9_branch2b
+        res4b9_branch2b_relu = mx.symbol.Activation(name='res4b9_branch2b_relu', data=scale4b9_branch2b,
+                                                    act_type='relu')
+        res4b9_branch2c = mx.symbol.Convolution(name='res4b9_branch2c', data=res4b9_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2c = mx.symbol.BatchNorm(name='bn4b9_branch2c', data=res4b9_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2c = bn4b9_branch2c
+        res4b9 = mx.symbol.broadcast_add(name='res4b9', *[res4b8_relu, scale4b9_branch2c])
+        res4b9_relu = mx.symbol.Activation(name='res4b9_relu', data=res4b9, act_type='relu')
+        res4b10_branch2a = mx.symbol.Convolution(name='res4b10_branch2a', data=res4b9_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2a = mx.symbol.BatchNorm(name='bn4b10_branch2a', data=res4b10_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2a = bn4b10_branch2a
+        res4b10_branch2a_relu = mx.symbol.Activation(name='res4b10_branch2a_relu', data=scale4b10_branch2a,
+                                                     act_type='relu')
+        res4b10_branch2b = mx.symbol.Convolution(name='res4b10_branch2b', data=res4b10_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b10_branch2b = mx.symbol.BatchNorm(name='bn4b10_branch2b', data=res4b10_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2b = bn4b10_branch2b
+        res4b10_branch2b_relu = mx.symbol.Activation(name='res4b10_branch2b_relu', data=scale4b10_branch2b,
+                                                     act_type='relu')
+        res4b10_branch2c = mx.symbol.Convolution(name='res4b10_branch2c', data=res4b10_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2c = mx.symbol.BatchNorm(name='bn4b10_branch2c', data=res4b10_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2c = bn4b10_branch2c
+        res4b10 = mx.symbol.broadcast_add(name='res4b10', *[res4b9_relu, scale4b10_branch2c])
+        res4b10_relu = mx.symbol.Activation(name='res4b10_relu', data=res4b10, act_type='relu')
+        res4b11_branch2a = mx.symbol.Convolution(name='res4b11_branch2a', data=res4b10_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2a = mx.symbol.BatchNorm(name='bn4b11_branch2a', data=res4b11_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2a = bn4b11_branch2a
+        res4b11_branch2a_relu = mx.symbol.Activation(name='res4b11_branch2a_relu', data=scale4b11_branch2a,
+                                                     act_type='relu')
+        res4b11_branch2b = mx.symbol.Convolution(name='res4b11_branch2b', data=res4b11_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b11_branch2b = mx.symbol.BatchNorm(name='bn4b11_branch2b', data=res4b11_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2b = bn4b11_branch2b
+        res4b11_branch2b_relu = mx.symbol.Activation(name='res4b11_branch2b_relu', data=scale4b11_branch2b,
+                                                     act_type='relu')
+        res4b11_branch2c = mx.symbol.Convolution(name='res4b11_branch2c', data=res4b11_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2c = mx.symbol.BatchNorm(name='bn4b11_branch2c', data=res4b11_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2c = bn4b11_branch2c
+        res4b11 = mx.symbol.broadcast_add(name='res4b11', *[res4b10_relu, scale4b11_branch2c])
+        res4b11_relu = mx.symbol.Activation(name='res4b11_relu', data=res4b11, act_type='relu')
+        res4b12_branch2a = mx.symbol.Convolution(name='res4b12_branch2a', data=res4b11_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2a = mx.symbol.BatchNorm(name='bn4b12_branch2a', data=res4b12_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2a = bn4b12_branch2a
+        res4b12_branch2a_relu = mx.symbol.Activation(name='res4b12_branch2a_relu', data=scale4b12_branch2a,
+                                                     act_type='relu')
+        res4b12_branch2b = mx.symbol.Convolution(name='res4b12_branch2b', data=res4b12_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b12_branch2b = mx.symbol.BatchNorm(name='bn4b12_branch2b', data=res4b12_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2b = bn4b12_branch2b
+        res4b12_branch2b_relu = mx.symbol.Activation(name='res4b12_branch2b_relu', data=scale4b12_branch2b,
+                                                     act_type='relu')
+        res4b12_branch2c = mx.symbol.Convolution(name='res4b12_branch2c', data=res4b12_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2c = mx.symbol.BatchNorm(name='bn4b12_branch2c', data=res4b12_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2c = bn4b12_branch2c
+        res4b12 = mx.symbol.broadcast_add(name='res4b12', *[res4b11_relu, scale4b12_branch2c])
+        res4b12_relu = mx.symbol.Activation(name='res4b12_relu', data=res4b12, act_type='relu')
+        res4b13_branch2a = mx.symbol.Convolution(name='res4b13_branch2a', data=res4b12_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2a = mx.symbol.BatchNorm(name='bn4b13_branch2a', data=res4b13_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2a = bn4b13_branch2a
+        res4b13_branch2a_relu = mx.symbol.Activation(name='res4b13_branch2a_relu', data=scale4b13_branch2a,
+                                                     act_type='relu')
+        res4b13_branch2b = mx.symbol.Convolution(name='res4b13_branch2b', data=res4b13_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b13_branch2b = mx.symbol.BatchNorm(name='bn4b13_branch2b', data=res4b13_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2b = bn4b13_branch2b
+        res4b13_branch2b_relu = mx.symbol.Activation(name='res4b13_branch2b_relu', data=scale4b13_branch2b,
+                                                     act_type='relu')
+        res4b13_branch2c = mx.symbol.Convolution(name='res4b13_branch2c', data=res4b13_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2c = mx.symbol.BatchNorm(name='bn4b13_branch2c', data=res4b13_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2c = bn4b13_branch2c
+        res4b13 = mx.symbol.broadcast_add(name='res4b13', *[res4b12_relu, scale4b13_branch2c])
+        res4b13_relu = mx.symbol.Activation(name='res4b13_relu', data=res4b13, act_type='relu')
+        res4b14_branch2a = mx.symbol.Convolution(name='res4b14_branch2a', data=res4b13_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2a = mx.symbol.BatchNorm(name='bn4b14_branch2a', data=res4b14_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2a = bn4b14_branch2a
+        res4b14_branch2a_relu = mx.symbol.Activation(name='res4b14_branch2a_relu', data=scale4b14_branch2a,
+                                                     act_type='relu')
+        res4b14_branch2b = mx.symbol.Convolution(name='res4b14_branch2b', data=res4b14_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b14_branch2b = mx.symbol.BatchNorm(name='bn4b14_branch2b', data=res4b14_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2b = bn4b14_branch2b
+        res4b14_branch2b_relu = mx.symbol.Activation(name='res4b14_branch2b_relu', data=scale4b14_branch2b,
+                                                     act_type='relu')
+        res4b14_branch2c = mx.symbol.Convolution(name='res4b14_branch2c', data=res4b14_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2c = mx.symbol.BatchNorm(name='bn4b14_branch2c', data=res4b14_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2c = bn4b14_branch2c
+        res4b14 = mx.symbol.broadcast_add(name='res4b14', *[res4b13_relu, scale4b14_branch2c])
+        res4b14_relu = mx.symbol.Activation(name='res4b14_relu', data=res4b14, act_type='relu')
+        res4b15_branch2a = mx.symbol.Convolution(name='res4b15_branch2a', data=res4b14_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2a = mx.symbol.BatchNorm(name='bn4b15_branch2a', data=res4b15_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2a = bn4b15_branch2a
+        res4b15_branch2a_relu = mx.symbol.Activation(name='res4b15_branch2a_relu', data=scale4b15_branch2a,
+                                                     act_type='relu')
+        res4b15_branch2b = mx.symbol.Convolution(name='res4b15_branch2b', data=res4b15_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b15_branch2b = mx.symbol.BatchNorm(name='bn4b15_branch2b', data=res4b15_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2b = bn4b15_branch2b
+        res4b15_branch2b_relu = mx.symbol.Activation(name='res4b15_branch2b_relu', data=scale4b15_branch2b,
+                                                     act_type='relu')
+        res4b15_branch2c = mx.symbol.Convolution(name='res4b15_branch2c', data=res4b15_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2c = mx.symbol.BatchNorm(name='bn4b15_branch2c', data=res4b15_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2c = bn4b15_branch2c
+        res4b15 = mx.symbol.broadcast_add(name='res4b15', *[res4b14_relu, scale4b15_branch2c])
+        res4b15_relu = mx.symbol.Activation(name='res4b15_relu', data=res4b15, act_type='relu')
+        res4b16_branch2a = mx.symbol.Convolution(name='res4b16_branch2a', data=res4b15_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2a = mx.symbol.BatchNorm(name='bn4b16_branch2a', data=res4b16_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2a = bn4b16_branch2a
+        res4b16_branch2a_relu = mx.symbol.Activation(name='res4b16_branch2a_relu', data=scale4b16_branch2a,
+                                                     act_type='relu')
+        res4b16_branch2b = mx.symbol.Convolution(name='res4b16_branch2b', data=res4b16_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b16_branch2b = mx.symbol.BatchNorm(name='bn4b16_branch2b', data=res4b16_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2b = bn4b16_branch2b
+        res4b16_branch2b_relu = mx.symbol.Activation(name='res4b16_branch2b_relu', data=scale4b16_branch2b,
+                                                     act_type='relu')
+        res4b16_branch2c = mx.symbol.Convolution(name='res4b16_branch2c', data=res4b16_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2c = mx.symbol.BatchNorm(name='bn4b16_branch2c', data=res4b16_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2c = bn4b16_branch2c
+        res4b16 = mx.symbol.broadcast_add(name='res4b16', *[res4b15_relu, scale4b16_branch2c])
+        res4b16_relu = mx.symbol.Activation(name='res4b16_relu', data=res4b16, act_type='relu')
+        res4b17_branch2a = mx.symbol.Convolution(name='res4b17_branch2a', data=res4b16_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2a = mx.symbol.BatchNorm(name='bn4b17_branch2a', data=res4b17_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2a = bn4b17_branch2a
+        res4b17_branch2a_relu = mx.symbol.Activation(name='res4b17_branch2a_relu', data=scale4b17_branch2a,
+                                                     act_type='relu')
+        res4b17_branch2b = mx.symbol.Convolution(name='res4b17_branch2b', data=res4b17_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b17_branch2b = mx.symbol.BatchNorm(name='bn4b17_branch2b', data=res4b17_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2b = bn4b17_branch2b
+        res4b17_branch2b_relu = mx.symbol.Activation(name='res4b17_branch2b_relu', data=scale4b17_branch2b,
+                                                     act_type='relu')
+        res4b17_branch2c = mx.symbol.Convolution(name='res4b17_branch2c', data=res4b17_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2c = mx.symbol.BatchNorm(name='bn4b17_branch2c', data=res4b17_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2c = bn4b17_branch2c
+        res4b17 = mx.symbol.broadcast_add(name='res4b17', *[res4b16_relu, scale4b17_branch2c])
+        res4b17_relu = mx.symbol.Activation(name='res4b17_relu', data=res4b17, act_type='relu')
+        res4b18_branch2a = mx.symbol.Convolution(name='res4b18_branch2a', data=res4b17_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2a = mx.symbol.BatchNorm(name='bn4b18_branch2a', data=res4b18_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2a = bn4b18_branch2a
+        res4b18_branch2a_relu = mx.symbol.Activation(name='res4b18_branch2a_relu', data=scale4b18_branch2a,
+                                                     act_type='relu')
+        res4b18_branch2b = mx.symbol.Convolution(name='res4b18_branch2b', data=res4b18_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b18_branch2b = mx.symbol.BatchNorm(name='bn4b18_branch2b', data=res4b18_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2b = bn4b18_branch2b
+        res4b18_branch2b_relu = mx.symbol.Activation(name='res4b18_branch2b_relu', data=scale4b18_branch2b,
+                                                     act_type='relu')
+        res4b18_branch2c = mx.symbol.Convolution(name='res4b18_branch2c', data=res4b18_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2c = mx.symbol.BatchNorm(name='bn4b18_branch2c', data=res4b18_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2c = bn4b18_branch2c
+        res4b18 = mx.symbol.broadcast_add(name='res4b18', *[res4b17_relu, scale4b18_branch2c])
+        res4b18_relu = mx.symbol.Activation(name='res4b18_relu', data=res4b18, act_type='relu')
+        res4b19_branch2a = mx.symbol.Convolution(name='res4b19_branch2a', data=res4b18_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2a = mx.symbol.BatchNorm(name='bn4b19_branch2a', data=res4b19_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2a = bn4b19_branch2a
+        res4b19_branch2a_relu = mx.symbol.Activation(name='res4b19_branch2a_relu', data=scale4b19_branch2a,
+                                                     act_type='relu')
+        res4b19_branch2b = mx.symbol.Convolution(name='res4b19_branch2b', data=res4b19_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b19_branch2b = mx.symbol.BatchNorm(name='bn4b19_branch2b', data=res4b19_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2b = bn4b19_branch2b
+        res4b19_branch2b_relu = mx.symbol.Activation(name='res4b19_branch2b_relu', data=scale4b19_branch2b,
+                                                     act_type='relu')
+        res4b19_branch2c = mx.symbol.Convolution(name='res4b19_branch2c', data=res4b19_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2c = mx.symbol.BatchNorm(name='bn4b19_branch2c', data=res4b19_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2c = bn4b19_branch2c
+        res4b19 = mx.symbol.broadcast_add(name='res4b19', *[res4b18_relu, scale4b19_branch2c])
+        res4b19_relu = mx.symbol.Activation(name='res4b19_relu', data=res4b19, act_type='relu')
+        res4b20_branch2a = mx.symbol.Convolution(name='res4b20_branch2a', data=res4b19_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2a = mx.symbol.BatchNorm(name='bn4b20_branch2a', data=res4b20_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2a = bn4b20_branch2a
+        res4b20_branch2a_relu = mx.symbol.Activation(name='res4b20_branch2a_relu', data=scale4b20_branch2a,
+                                                     act_type='relu')
+        res4b20_branch2b = mx.symbol.Convolution(name='res4b20_branch2b', data=res4b20_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b20_branch2b = mx.symbol.BatchNorm(name='bn4b20_branch2b', data=res4b20_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2b = bn4b20_branch2b
+        res4b20_branch2b_relu = mx.symbol.Activation(name='res4b20_branch2b_relu', data=scale4b20_branch2b,
+                                                     act_type='relu')
+        res4b20_branch2c = mx.symbol.Convolution(name='res4b20_branch2c', data=res4b20_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2c = mx.symbol.BatchNorm(name='bn4b20_branch2c', data=res4b20_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2c = bn4b20_branch2c
+        res4b20 = mx.symbol.broadcast_add(name='res4b20', *[res4b19_relu, scale4b20_branch2c])
+        res4b20_relu = mx.symbol.Activation(name='res4b20_relu', data=res4b20, act_type='relu')
+        res4b21_branch2a = mx.symbol.Convolution(name='res4b21_branch2a', data=res4b20_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2a = mx.symbol.BatchNorm(name='bn4b21_branch2a', data=res4b21_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2a = bn4b21_branch2a
+        res4b21_branch2a_relu = mx.symbol.Activation(name='res4b21_branch2a_relu', data=scale4b21_branch2a,
+                                                     act_type='relu')
+        res4b21_branch2b = mx.symbol.Convolution(name='res4b21_branch2b', data=res4b21_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b21_branch2b = mx.symbol.BatchNorm(name='bn4b21_branch2b', data=res4b21_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2b = bn4b21_branch2b
+        res4b21_branch2b_relu = mx.symbol.Activation(name='res4b21_branch2b_relu', data=scale4b21_branch2b,
+                                                     act_type='relu')
+        res4b21_branch2c = mx.symbol.Convolution(name='res4b21_branch2c', data=res4b21_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2c = mx.symbol.BatchNorm(name='bn4b21_branch2c', data=res4b21_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2c = bn4b21_branch2c
+        res4b21 = mx.symbol.broadcast_add(name='res4b21', *[res4b20_relu, scale4b21_branch2c])
+        res4b21_relu = mx.symbol.Activation(name='res4b21_relu', data=res4b21, act_type='relu')
+        res4b22_branch2a = mx.symbol.Convolution(name='res4b22_branch2a', data=res4b21_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2a = mx.symbol.BatchNorm(name='bn4b22_branch2a', data=res4b22_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2a = bn4b22_branch2a
+        res4b22_branch2a_relu = mx.symbol.Activation(name='res4b22_branch2a_relu', data=scale4b22_branch2a,
+                                                     act_type='relu')
+        res4b22_branch2b = mx.symbol.Convolution(name='res4b22_branch2b', data=res4b22_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b22_branch2b = mx.symbol.BatchNorm(name='bn4b22_branch2b', data=res4b22_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2b = bn4b22_branch2b
+        res4b22_branch2b_relu = mx.symbol.Activation(name='res4b22_branch2b_relu', data=scale4b22_branch2b,
+                                                     act_type='relu')
+        res4b22_branch2c = mx.symbol.Convolution(name='res4b22_branch2c', data=res4b22_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2c = mx.symbol.BatchNorm(name='bn4b22_branch2c', data=res4b22_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2c = bn4b22_branch2c
+        res4b22 = mx.symbol.broadcast_add(name='res4b22', *[res4b21_relu, scale4b22_branch2c])
+        res4b22_relu = mx.symbol.Activation(name='res4b22_relu', data=res4b22, act_type='relu')
+
+        res5a_branch1 = mx.symbol.Convolution(name='res5a_branch1', data=res4b22_relu, num_filter=2048, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch1 = mx.symbol.BatchNorm(name='bn5a_branch1', data=res5a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale5a_branch1 = bn5a_branch1
+        res5a_branch2a = mx.symbol.Convolution(name='res5a_branch2a', data=res4b22_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2a = mx.symbol.BatchNorm(name='bn5a_branch2a', data=res5a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2a = bn5a_branch2a
+        res5a_branch2a_relu = mx.symbol.Activation(name='res5a_branch2a_relu', data=scale5a_branch2a, act_type='relu')
+        res5a_branch2b_offset_weight = mx.symbol.Variable('res5a_branch2b_offset_weight', lr_mult=1.0)
+        res5a_branch2b_offset_bias = mx.symbol.Variable('res5a_branch2b_offset_bias', lr_mult=2.0)
+        res5a_branch2b_offset = mx.symbol.Convolution(name='res5a_branch2b_offset', data=res5a_branch2a_relu,
+                                                      num_filter=18, pad=(1, 1), kernel=(3, 3), stride=(1, 1),
+                                                      weight=res5a_branch2b_offset_weight, bias=res5a_branch2b_offset_bias)
+        res5a_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5a_branch2b', data=res5a_branch2a_relu, offset=res5a_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=1,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5a_branch2b = mx.symbol.BatchNorm(name='bn5a_branch2b', data=res5a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2b = bn5a_branch2b
+        res5a_branch2b_relu = mx.symbol.Activation(name='res5a_branch2b_relu', data=scale5a_branch2b, act_type='relu')
+        res5a_branch2c = mx.symbol.Convolution(name='res5a_branch2c', data=res5a_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2c = mx.symbol.BatchNorm(name='bn5a_branch2c', data=res5a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2c = bn5a_branch2c
+        res5a = mx.symbol.broadcast_add(name='res5a', *[scale5a_branch1, scale5a_branch2c])
+        res5a_relu = mx.symbol.Activation(name='res5a_relu', data=res5a, act_type='relu')
+        res5b_branch2a = mx.symbol.Convolution(name='res5b_branch2a', data=res5a_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2a = mx.symbol.BatchNorm(name='bn5b_branch2a', data=res5b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2a = bn5b_branch2a
+        res5b_branch2a_relu = mx.symbol.Activation(name='res5b_branch2a_relu', data=scale5b_branch2a, act_type='relu')
+        res5b_branch2b_offset_weight = mx.symbol.Variable('res5b_branch2b_offset_weight', lr_mult=1.0)
+        res5b_branch2b_offset_bias = mx.symbol.Variable('res5b_branch2b_offset_bias', lr_mult=2.0)
+        res5b_branch2b_offset = mx.symbol.Convolution(name='res5b_branch2b_offset', data=res5b_branch2a_relu,
+                                                      num_filter=18, pad=(1, 1), kernel=(3, 3), stride=(1, 1),
+                                                      weight=res5b_branch2b_offset_weight, bias=res5b_branch2b_offset_bias)
+        res5b_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5b_branch2b', data=res5b_branch2a_relu, offset=res5b_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=1,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5b_branch2b = mx.symbol.BatchNorm(name='bn5b_branch2b', data=res5b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2b = bn5b_branch2b
+        res5b_branch2b_relu = mx.symbol.Activation(name='res5b_branch2b_relu', data=scale5b_branch2b, act_type='relu')
+        res5b_branch2c = mx.symbol.Convolution(name='res5b_branch2c', data=res5b_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2c = mx.symbol.BatchNorm(name='bn5b_branch2c', data=res5b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2c = bn5b_branch2c
+        res5b = mx.symbol.broadcast_add(name='res5b', *[res5a_relu, scale5b_branch2c])
+        res5b_relu = mx.symbol.Activation(name='res5b_relu', data=res5b, act_type='relu')
+        res5c_branch2a = mx.symbol.Convolution(name='res5c_branch2a', data=res5b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2a = mx.symbol.BatchNorm(name='bn5c_branch2a', data=res5c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2a = bn5c_branch2a
+        res5c_branch2a_relu = mx.symbol.Activation(name='res5c_branch2a_relu', data=scale5c_branch2a, act_type='relu')
+        res5c_branch2b_offset_weight = mx.symbol.Variable('res5c_branch2b_offset_weight', lr_mult=1.0)
+        res5c_branch2b_offset_bias = mx.symbol.Variable('res5c_branch2b_offset_bias', lr_mult=2.0)
+        res5c_branch2b_offset = mx.symbol.Convolution(name='res5c_branch2b_offset', data = res5c_branch2a_relu,
+                                                      num_filter=18, pad=(1, 1), kernel=(3, 3), stride=(1, 1),
+                                                      weight=res5c_branch2b_offset_weight, bias=res5c_branch2b_offset_bias)
+        res5c_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5c_branch2b', data=res5c_branch2a_relu, offset=res5c_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=1,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5c_branch2b = mx.symbol.BatchNorm(name='bn5c_branch2b', data=res5c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2b = bn5c_branch2b
+        res5c_branch2b_relu = mx.symbol.Activation(name='res5c_branch2b_relu', data=scale5c_branch2b, act_type='relu')
+        res5c_branch2c = mx.symbol.Convolution(name='res5c_branch2c', data=res5c_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2c = mx.symbol.BatchNorm(name='bn5c_branch2c', data=res5c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2c = bn5c_branch2c
+        res5c = mx.symbol.broadcast_add(name='res5c', *[res5b_relu, scale5c_branch2c])
+        res5c_relu = mx.symbol.Activation(name='res5c_relu', data=res5c, act_type='relu')
+
+        return res5c_relu
+
+    def get_train_symbol(self, num_classes):
+        """
+        get symbol for training
+        :param num_classes: num of classes
+        :return: the symbol for training
+        """
+        data = mx.symbol.Variable(name="data")
+        seg_cls_gt = mx.symbol.Variable(name='label')
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_conv(data)
+
+        # subsequent fc layers by haozhi
+        fc6_bias = mx.symbol.Variable('fc6_bias', lr_mult=2.0)
+        fc6_weight = mx.symbol.Variable('fc6_weight', lr_mult=1.0)
+
+        fc6 = mx.symbol.Convolution(data=conv_feat, kernel=(1, 1), pad=(0, 0), num_filter=1024, name="fc6",
+                                    bias=fc6_bias, weight=fc6_weight, workspace=self.workspace)
+        relu_fc6 = mx.sym.Activation(data=fc6, act_type='relu', name='relu_fc6')
+
+        score_bias = mx.symbol.Variable('score_bias', lr_mult=2.0)
+        score_weight = mx.symbol.Variable('score_weight', lr_mult=1.0)
+
+        score = mx.symbol.Convolution(data=relu_fc6, kernel=(1, 1), pad=(0, 0), num_filter=num_classes, name="score",
+                                      bias=score_bias, weight=score_weight, workspace=self.workspace)
+
+        upsampling = mx.symbol.Deconvolution(data=score, num_filter=num_classes, kernel=(32, 32), stride=(16, 16),
+                                             num_group=num_classes, no_bias=True, name='upsampling',
+                                             attr={'lr_mult': '0.0'}, workspace=self.workspace)
+
+        croped_score = mx.symbol.Crop(*[upsampling, data], offset=(8, 8), name='croped_score')
+        softmax = mx.symbol.SoftmaxOutput(data=croped_score, label=seg_cls_gt, normalization='valid', multi_output=True,
+                                          use_ignore=True, ignore_label=255, name="softmax")
+
+        return softmax
+
+    def get_test_symbol(self, num_classes):
+        """
+        get symbol for testing
+        :param num_classes: num of classes
+        :return: the symbol for testing
+        """
+        data = mx.symbol.Variable(name="data")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_conv(data)
+
+        fc6_bias = mx.symbol.Variable('fc6_bias', lr_mult=2.0)
+        fc6_weight = mx.symbol.Variable('fc6_weight', lr_mult=1.0)
+
+        fc6 = mx.symbol.Convolution(
+            data=conv_feat, kernel=(1, 1), pad=(0, 0), num_filter=1024, name="fc6", bias=fc6_bias, weight=fc6_weight,
+            workspace=self.workspace)
+        relu_fc6 = mx.sym.Activation(data=fc6, act_type='relu', name='relu_fc6')
+
+        score_bias = mx.symbol.Variable('score_bias', lr_mult=2.0)
+        score_weight = mx.symbol.Variable('score_weight', lr_mult=1.0)
+
+        score = mx.symbol.Convolution(
+            data=relu_fc6, kernel=(1, 1), pad=(0, 0), num_filter=num_classes, name="score", bias=score_bias,
+            weight=score_weight, workspace=self.workspace)
+
+        upsampling = mx.symbol.Deconvolution(
+            data=score, num_filter=num_classes, kernel=(32, 32), stride=(16, 16), num_group=num_classes, no_bias=True,
+            name='upsampling', attr={'lr_mult': '0.0'}, workspace=self.workspace)
+
+        croped_score = mx.symbol.Crop(*[upsampling, data], offset=(8, 8), name='croped_score')
+
+        softmax = mx.symbol.SoftmaxOutput(data=croped_score, normalization='valid', multi_output=True, use_ignore=True,
+                                          ignore_label=255, name="softmax")
+
+        return softmax
+
+    def get_symbol(self, cfg, is_train=True):
+        """
+        Build and return the symbol; it is also assigned to self.sym.
+        """
+
+        # config alias for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+
+        if is_train:
+            self.sym = self.get_train_symbol(num_classes=num_classes)
+        else:
+            self.sym = self.get_test_symbol(num_classes=num_classes)
+
+        return self.sym
+
+    def init_weights(self, cfg, arg_params, aux_params):
+        arg_params['res5a_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_weight'])
+        arg_params['res5a_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_bias'])
+        arg_params['res5b_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_weight'])
+        arg_params['res5b_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_bias'])
+        arg_params['res5c_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_weight'])
+        arg_params['res5c_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_bias'])
+        arg_params['fc6_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc6_weight'])
+        arg_params['fc6_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc6_bias'])
+        arg_params['score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['score_weight'])
+        arg_params['score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['score_bias'])
+        arg_params['upsampling_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['upsampling_weight'])
+
+        init = mx.init.Initializer()
+        init._init_bilinear('upsampling_weight', arg_params['upsampling_weight'])
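Note on the upsampling layer in the symbol above: `upsampling` is a grouped deconvolution with a 32x32 kernel and stride 16, its learning rate multiplier is 0, and `init_weights` fills it through MXNet's bilinear initializer, so it behaves as fixed bilinear interpolation. The NumPy sketch below illustrates the standard FCN-style bilinear filter under that assumption; it is not the MXNet implementation, and `bilinear_kernel`, `init_upsampling_weight`, and `deconv_output_size` are hypothetical helper names. It also checks the size arithmetic behind `Crop(offset=(8, 8))`, assuming the deconvolution uses no padding.

```python
import numpy as np


def bilinear_kernel(size):
    """Return a (size, size) bilinear interpolation filter."""
    f = np.ceil(size / 2.0)                  # upsampling factor (16 for size 32)
    c = (2 * f - 1 - f % 2) / (2.0 * f)      # centre of the filter in units of f
    profile = 1.0 - np.abs(np.arange(size) / f - c)  # 1-D triangle profile
    return np.outer(profile, profile)


def init_upsampling_weight(num_classes, size=32):
    """Weight for a grouped deconvolution: one identical filter per class."""
    weight = np.zeros((num_classes, 1, size, size), dtype=np.float32)
    weight[:, 0] = bilinear_kernel(size)
    return weight


def deconv_output_size(n, kernel=32, stride=16):
    """Spatial size of an unpadded transposed convolution on an input of side n."""
    return (n - 1) * stride + kernel


w = init_upsampling_weight(num_classes=19)
print(w.shape)

# A feature map of side n upsamples to 16 * n + 16 pixels, so cropping
# 8 pixels from each border recovers a 16 * n image.
n = 48
print(deconv_output_size(n) - 16 * n)
```

With one filter per class and `num_group=num_classes`, each class score map is interpolated independently of the others.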
diff --git a/deeplab/test.py b/deeplab/test.py
new file mode 100644
index 0000000..ab59064
--- /dev/null
+++ b/deeplab/test.py
@@ -0,0 +1,102 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
+import _init_paths
+
+import argparse
+import os
+import sys
+import time
+import logging
+from config.config import config, update_config
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Test a Deeplab Network')
+    # general
+    parser.add_argument('--cfg', help='experiment configure file name', required=True, type=str)
+
+    args, rest = parser.parse_known_args()
+    update_config(args.cfg)
+
+    # testing
+    parser.add_argument('--vis', help='turn on visualization', action='store_true')
+    parser.add_argument('--ignore_cache', help='ignore cached results', action='store_true')
+    parser.add_argument('--shuffle', help='shuffle data on visualization', action='store_true')
+    args = parser.parse_args()
+    return args
+
+args = parse_args()
+curr_path = os.path.abspath(os.path.dirname(__file__))
+sys.path.insert(0, os.path.join(curr_path, '../external/mxnet', config.MXNET_VERSION))
+
+import pprint
+import mxnet as mx
+
+from symbols import *
+from dataset import *
+from core.loader import TestDataLoader
+from core.tester import Predictor, pred_eval
+from utils.load_data import load_gt_segdb, merge_segdb
+from utils.load_model import load_param
+from utils.create_logger import create_logger
+
+def test_deeplab():
+    epoch = config.TEST.test_epoch
+    ctx = [mx.gpu(int(i)) for i in config.gpus.split(',')]
+    image_set = config.dataset.test_image_set
+    root_path = config.dataset.root_path
+    dataset = config.dataset.dataset
+    dataset_path = config.dataset.dataset_path
+
+    logger, final_output_path = create_logger(config.output_path, args.cfg, image_set)
+    prefix = os.path.join(final_output_path, '..', '_'.join([iset for iset in config.dataset.image_set.split('+')]), config.TRAIN.model_prefix)
+
+    # print config
+    pprint.pprint(config)
+    logger.info('testing config:{}\n'.format(pprint.pformat(config)))
+
+    # load symbol and testing data
+    sym_instance = eval(config.symbol + '.' + config.symbol)()
+    sym = sym_instance.get_symbol(config, is_train=False)
+
+    imdb = eval(dataset)(image_set, root_path, dataset_path, result_path=final_output_path)
+    segdb = imdb.gt_segdb()
+
+    # get test data iter
+    test_data = TestDataLoader(segdb, config=config, batch_size=len(ctx))
+
+    # infer shape
+    data_shape_dict = dict(test_data.provide_data_single)
+    sym_instance.infer_shape(data_shape_dict)
+
+    # load model and check parameters
+    arg_params, aux_params = load_param(prefix, epoch, process=True)
+
+    sym_instance.check_parameter_shapes(arg_params, aux_params, data_shape_dict, is_train=False)
+
+    # decide maximum shape
+    data_names = [k[0] for k in test_data.provide_data_single]
+    label_names = ['softmax_label']
+    max_data_shape = [[('data', (1, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]]
+
+    # create predictor
+    predictor = Predictor(sym, data_names, label_names,
+                          context=ctx, max_data_shapes=max_data_shape,
+                          provide_data=test_data.provide_data, provide_label=test_data.provide_label,
+                          arg_params=arg_params, aux_params=aux_params)
+
+    # start segmentation evaluation
+    pred_eval(predictor, test_data, imdb, vis=args.vis, ignore_cache=args.ignore_cache, logger=logger)
+
+def main():
+    print args
+    test_deeplab()
+
+
+if __name__ == '__main__':
+    main()
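Note on `parse_args` in the file above: the parser runs in two passes. `parse_known_args` reads `--cfg` first so that `update_config` can load the YAML file, and the remaining options are registered afterwards, which lets their defaults come from the loaded config. The sketch below shows the pattern in isolation; the `config` dict, the `update_config` stub, and the file names are hypothetical stand-ins, not the repository's code.

```python
import argparse

# Hypothetical stand-in for the repository's global config object.
config = {'frequent': 10}


def update_config(cfg_name):
    # Hypothetical loader: a real one would read the YAML file named by cfg_name.
    if cfg_name == 'fast.yaml':
        config['frequent'] = 100


def parse_args(argv):
    parser = argparse.ArgumentParser(description='two-pass parsing sketch')
    parser.add_argument('--cfg', required=True, type=str)

    # Pass 1: read only --cfg; options not registered yet are left in _rest.
    known, _rest = parser.parse_known_args(argv)
    update_config(known.cfg)

    # Pass 2: register options whose defaults depend on the loaded config.
    parser.add_argument('--frequent', default=config['frequent'], type=int)
    return parser.parse_args(argv)


print(parse_args(['--cfg', 'base.yaml']).frequent)
print(parse_args(['--cfg', 'fast.yaml']).frequent)
print(parse_args(['--cfg', 'fast.yaml', '--frequent', '5']).frequent)
```

An explicit command-line value still overrides the config-derived default, as the last call shows.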
diff --git a/deeplab/train.py b/deeplab/train.py
new file mode 100644
index 0000000..fb2722a
--- /dev/null
+++ b/deeplab/train.py
@@ -0,0 +1,162 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
+import _init_paths
+
+import time
+import argparse
+import logging
+import pprint
+import os
+import sys
+from config.config import config, update_config
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Train deeplab network')
+    # general
+    parser.add_argument('--cfg', help='experiment configure file name', required=True, type=str)
+
+    args, rest = parser.parse_known_args()
+    # update config
+    update_config(args.cfg)
+
+    # training
+    parser.add_argument('--frequent', help='frequency of logging', default=config.default.frequent, type=int)
+    args = parser.parse_args()
+    return args
+
+args = parse_args()
+curr_path = os.path.abspath(os.path.dirname(__file__))
+sys.path.insert(0, os.path.join(curr_path, '../external/mxnet', config.MXNET_VERSION))
+
+
+import shutil
+import numpy as np
+import mxnet as mx
+
+from symbols import *
+from core import callback, metric
+from core.loader import TrainDataLoader
+from core.module import MutableModule
+from utils.load_data import load_gt_segdb, merge_segdb
+from utils.load_model import load_param
+from utils.PrefetchingIter import PrefetchingIter
+from utils.create_logger import create_logger
+from utils.lr_scheduler import WarmupMultiFactorScheduler
+
+def train_net(args, ctx, pretrained, epoch, prefix, begin_epoch, end_epoch, lr, lr_step):
+    logger, final_output_path = create_logger(config.output_path, args.cfg, config.dataset.image_set)
+    prefix = os.path.join(final_output_path, prefix)
+
+    # load symbol
+    shutil.copy2(os.path.join(curr_path, 'symbols', config.symbol + '.py'), final_output_path)
+    sym_instance = eval(config.symbol + '.' + config.symbol)()
+    sym = sym_instance.get_symbol(config, is_train=True)
+    #sym = eval('get_' + args.network + '_train')(num_classes=config.dataset.NUM_CLASSES)
+
+    # setup multi-gpu
+    batch_size = len(ctx)
+    input_batch_size = config.TRAIN.BATCH_IMAGES * batch_size
+
+    # print config
+    pprint.pprint(config)
+    logger.info('training config:{}\n'.format(pprint.pformat(config)))
+
+    # load dataset and prepare imdb for training
+    image_sets = [iset for iset in config.dataset.image_set.split('+')]
+    segdbs = [load_gt_segdb(config.dataset.dataset, image_set, config.dataset.root_path, config.dataset.dataset_path,
+                            result_path=final_output_path, flip=config.TRAIN.FLIP)
+              for image_set in image_sets]
+    segdb = merge_segdb(segdbs)
+
+    # load training data
+    train_data = TrainDataLoader(sym, segdb, config, batch_size=input_batch_size, crop_height=config.TRAIN.CROP_HEIGHT, crop_width=config.TRAIN.CROP_WIDTH,
+                                 shuffle=config.TRAIN.SHUFFLE, ctx=ctx)
+
+    # infer max shape
+    max_scale = [(config.TRAIN.CROP_HEIGHT, config.TRAIN.CROP_WIDTH)]
+    max_data_shape = [('data', (config.TRAIN.BATCH_IMAGES, 3, max([v[0] for v in max_scale]), max([v[1] for v in max_scale])))]
+    max_label_shape = [('label', (config.TRAIN.BATCH_IMAGES, 1, max([v[0] for v in max_scale]), max([v[1] for v in max_scale])))]
+    max_data_shape, max_label_shape = train_data.infer_shape(max_data_shape, max_label_shape)
+    print 'providing maximum shape', max_data_shape, max_label_shape
+
+    # infer shape
+    data_shape_dict = dict(train_data.provide_data_single + train_data.provide_label_single)
+    pprint.pprint(data_shape_dict)
+    sym_instance.infer_shape(data_shape_dict)
+
+    # load and initialize params
+    if config.TRAIN.RESUME:
+        print 'continue training from ', begin_epoch
+        arg_params, aux_params = load_param(prefix, begin_epoch, convert=True)
+    else:
+        print pretrained
+        arg_params, aux_params = load_param(pretrained, epoch, convert=True)
+        sym_instance.init_weights(config, arg_params, aux_params)
+
+    # check parameter shapes
+    sym_instance.check_parameter_shapes(arg_params, aux_params, data_shape_dict)
+
+    # create solver
+    fixed_param_prefix = config.network.FIXED_PARAMS
+    data_names = [k[0] for k in train_data.provide_data_single]
+    label_names = [k[0] for k in train_data.provide_label_single]
+
+    mod = MutableModule(sym, data_names=data_names, label_names=label_names,
+                        logger=logger, context=ctx, max_data_shapes=[max_data_shape for _ in xrange(batch_size)],
+                        max_label_shapes=[max_label_shape for _ in xrange(batch_size)], fixed_param_prefix=fixed_param_prefix)
+
+    # decide training params
+    # metric
+    fcn_loss_metric = metric.FCNLogLossMetric(config.default.frequent * batch_size)
+    eval_metrics = mx.metric.CompositeEvalMetric()
+
+    # segmentation uses a single metric: the FCN log loss
+    for child_metric in [fcn_loss_metric]:
+        eval_metrics.add(child_metric)
+
+    # callback
+    batch_end_callback = callback.Speedometer(train_data.batch_size, frequent=args.frequent)
+    epoch_end_callback = mx.callback.module_checkpoint(mod, prefix, period=1, save_optimizer_states=True)
+
+    # decide learning rate
+    base_lr = lr
+    lr_factor = 0.1
+    lr_epoch = [float(epoch) for epoch in lr_step.split(',')]
+    lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
+    lr = base_lr * (lr_factor ** (len(lr_epoch) - len(lr_epoch_diff)))
+    lr_iters = [int(epoch * len(segdb) / batch_size) for epoch in lr_epoch_diff]
+    print 'lr', lr, 'lr_epoch_diff', lr_epoch_diff, 'lr_iters', lr_iters
+
+    lr_scheduler = WarmupMultiFactorScheduler(lr_iters, lr_factor, config.TRAIN.warmup, config.TRAIN.warmup_lr, config.TRAIN.warmup_step)
+
+    # optimizer
+    optimizer_params = {'momentum': config.TRAIN.momentum,
+                        'wd': config.TRAIN.wd,
+                        'learning_rate': lr,
+                        'lr_scheduler': lr_scheduler,
+                        'rescale_grad': 1.0,
+                        'clip_gradient': None}
+
+    if not isinstance(train_data, PrefetchingIter):
+        train_data = PrefetchingIter(train_data)
+
+    # train
+    mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
+            batch_end_callback=batch_end_callback, kvstore=config.default.kvstore,
+            optimizer='sgd', optimizer_params=optimizer_params,
+            arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
+
+def main():
+    print 'Called with argument:', args
+    ctx = [mx.gpu(int(i)) for i in config.gpus.split(',')]
+    train_net(args, ctx, config.network.pretrained, config.network.pretrained_epoch, config.TRAIN.model_prefix,
+              config.TRAIN.begin_epoch, config.TRAIN.end_epoch, config.TRAIN.lr, config.TRAIN.lr_step)
+
+if __name__ == '__main__':
+    main()
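Note on the "decide learning rate" step in `train_net` above: epochs listed in `lr_step` that are already behind `begin_epoch` are applied to the starting learning rate immediately, and the remaining ones are converted to iteration counts for the scheduler. The sketch below restates that arithmetic as a standalone function; `plan_lr_schedule` and the sample numbers are hypothetical, chosen only to make the result easy to check by hand.

```python
def plan_lr_schedule(base_lr, lr_step, begin_epoch, num_images, batch_size, lr_factor=0.1):
    """Return the starting learning rate and the iterations at which it decays."""
    lr_epoch = [float(e) for e in lr_step.split(',')]
    # Decay points still ahead of us, measured from begin_epoch.
    remaining = [e - begin_epoch for e in lr_epoch if e > begin_epoch]
    # Every decay point already passed multiplies the base rate by lr_factor.
    lr = base_lr * (lr_factor ** (len(lr_epoch) - len(remaining)))
    lr_iters = [int(e * num_images / batch_size) for e in remaining]
    return lr, lr_iters


# Fresh run: one decay at epoch 8, 100 images, 4 images per iteration.
print(plan_lr_schedule(0.0005, '8', 0, 100, 4))

# Resumed at epoch 9 with decays at 8 and 11: the first decay is already applied.
print(plan_lr_schedule(0.0005, '8,11', 9, 100, 4))
```

Because the epochs are parsed as floats, a fractional value in `lr_step` places the decay partway through an epoch.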
diff --git a/demo/deform_conv/000240.jpg b/demo/deform_conv/000240.jpg
new file mode 100644
index 0000000..6becaf4
Binary files /dev/null and b/demo/deform_conv/000240.jpg differ
diff --git a/demo/deform_conv/000437.jpg b/demo/deform_conv/000437.jpg
new file mode 100644
index 0000000..0a90a4a
Binary files /dev/null and b/demo/deform_conv/000437.jpg differ
diff --git a/demo/deform_conv/004072.jpg b/demo/deform_conv/004072.jpg
new file mode 100644
index 0000000..c20ed5c
Binary files /dev/null and b/demo/deform_conv/004072.jpg differ
diff --git a/demo/deform_conv/007912.jpg b/demo/deform_conv/007912.jpg
new file mode 100644
index 0000000..02f6466
Binary files /dev/null and b/demo/deform_conv/007912.jpg differ
diff --git a/demo/deform_psroi/000057.jpg b/demo/deform_psroi/000057.jpg
new file mode 100644
index 0000000..0483cf8
Binary files /dev/null and b/demo/deform_psroi/000057.jpg differ
diff --git a/demo/deform_psroi/000149.jpg b/demo/deform_psroi/000149.jpg
new file mode 100644
index 0000000..574a200
Binary files /dev/null and b/demo/deform_psroi/000149.jpg differ
diff --git a/demo/deform_psroi/000351.jpg b/demo/deform_psroi/000351.jpg
new file mode 100644
index 0000000..d9d6b92
Binary files /dev/null and b/demo/deform_psroi/000351.jpg differ
diff --git a/demo/deform_psroi/002535.jpg b/demo/deform_psroi/002535.jpg
new file mode 100644
index 0000000..6b56d11
Binary files /dev/null and b/demo/deform_psroi/002535.jpg differ
diff --git a/demo/frankfurt_000001_073088_leftImg8bit.png b/demo/frankfurt_000001_073088_leftImg8bit.png
new file mode 100644
index 0000000..9605e69
Binary files /dev/null and b/demo/frankfurt_000001_073088_leftImg8bit.png differ
diff --git a/demo/lindau_000024_000019_leftImg8bit.png b/demo/lindau_000024_000019_leftImg8bit.png
new file mode 100644
index 0000000..3c6217b
Binary files /dev/null and b/demo/lindau_000024_000019_leftImg8bit.png differ
diff --git a/experiments/deeplab/cfgs/deeplab_cityscapes_demo.yaml b/experiments/deeplab/cfgs/deeplab_cityscapes_demo.yaml
new file mode 100644
index 0000000..aac0e21
--- /dev/null
+++ b/experiments/deeplab/cfgs/deeplab_cityscapes_demo.yaml
@@ -0,0 +1,71 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/cityscape"
+symbol: resnet_v1_101_deeplab
+gpus: '0'
+SCALES:
+- 1024
+- 2048
+default:
+  frequent: 10
+  kvstore: device
+dataset:
+  NUM_CLASSES: 19
+  dataset: CityScape
+  dataset_path: "./data/cityscapes/"
+  image_set: leftImg8bit_train
+  root_path: "./data/"
+  test_image_set: leftImg8bit_val
+network:
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  IMAGE_STRIDE: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+TRAIN:
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we will use 4000 warmup step for single GPU
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 53
+  lr: 0.0005
+  lr_step: '40.336'
+  model_prefix: "deeplab_resnet_v1_101_cityscapes_segmentation_dcn"
+  # whether flip image
+  FLIP: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # whether to crop image during training
+  ENABLE_CROP: True
+  # scale of cropped image during training
+  CROP_HEIGHT: 768
+  CROP_WIDTH: 1024
+  # whether resume training
+  RESUME: false
+  # whether shuffle image
+  SHUFFLE: true
+TEST:
+  # size of images for each device
+  BATCH_IMAGES: 1
+  test_epoch: 53
diff --git a/experiments/deeplab/cfgs/deeplab_resnet_v1_101_cityscapes_segmentation_base.yaml b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_cityscapes_segmentation_base.yaml
new file mode 100644
index 0000000..b595850
--- /dev/null
+++ b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_cityscapes_segmentation_base.yaml
@@ -0,0 +1,71 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/cityscape"
+symbol: resnet_v1_101_deeplab
+gpus: '0'
+SCALES:
+- 1024
+- 2048
+default:
+  frequent: 10
+  kvstore: device
+dataset:
+  NUM_CLASSES: 19
+  dataset: CityScape
+  dataset_path: "./data/cityscapes/"
+  image_set: leftImg8bit_train
+  root_path: "./data/"
+  test_image_set: leftImg8bit_val
+network:
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  IMAGE_STRIDE: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+TRAIN:
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we will use 4000 warmup step for single GPU
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 53
+  lr: 0.0005
+  lr_step: '40.336'
+  model_prefix: "deeplab_resnet_v1_101_cityscapes_segmentation_base"
+  # whether flip image
+  FLIP: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # whether to crop image during training
+  ENABLE_CROP: True
+  # scale of cropped image during training
+  CROP_HEIGHT: 768
+  CROP_WIDTH: 1024
+  # whether resume training
+  RESUME: false
+  # whether shuffle image
+  SHUFFLE: true
+TEST:
+  # size of images for each device
+  BATCH_IMAGES: 1
+  test_epoch: 53
diff --git a/experiments/deeplab/cfgs/deeplab_resnet_v1_101_cityscapes_segmentation_dcn.yaml b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_cityscapes_segmentation_dcn.yaml
new file mode 100644
index 0000000..6c4f0c9
--- /dev/null
+++ b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_cityscapes_segmentation_dcn.yaml
@@ -0,0 +1,71 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/cityscape"
+symbol: resnet_v1_101_deeplab_dcn
+gpus: '0'
+SCALES:
+- 1024
+- 2048
+default:
+  frequent: 10
+  kvstore: device
+dataset:
+  NUM_CLASSES: 19
+  dataset: CityScape
+  dataset_path: "./data/cityscapes/"
+  image_set: leftImg8bit_train
+  root_path: "./data/"
+  test_image_set: leftImg8bit_val
+network:
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  IMAGE_STRIDE: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+TRAIN:
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we will use 4000 warmup step for single GPU
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 53
+  lr: 0.0005
+  lr_step: '40.336'
+  model_prefix: "deeplab_resnet_v1_101_cityscapes_segmentation_dcn"
+  # whether flip image
+  FLIP: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # whether to crop image during training
+  ENABLE_CROP: True
+  # scale of cropped image during training
+  CROP_HEIGHT: 768
+  CROP_WIDTH: 1024
+  # whether resume training
+  RESUME: false
+  # whether shuffle image
+  SHUFFLE: true
+TEST:
+  # size of images for each device
+  BATCH_IMAGES: 1
+  test_epoch: 53
diff --git a/experiments/deeplab/cfgs/deeplab_resnet_v1_101_voc12_segmentation_base.yaml b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_voc12_segmentation_base.yaml
new file mode 100644
index 0000000..b8d51de
--- /dev/null
+++ b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_voc12_segmentation_base.yaml
@@ -0,0 +1,71 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/voc12"
+symbol: resnet_v1_101_deeplab
+gpus: '0'
+SCALES:
+- 360
+- 600
+default:
+  frequent: 10
+  kvstore: device
+dataset:
+  NUM_CLASSES: 21
+  dataset: PascalVOC
+  dataset_path: "./data/VOCdevkit2012/"
+  image_set: 2012_train_seg
+  root_path: "./data/"
+  test_image_set: 2012_val_seg
+network:
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  IMAGE_STRIDE: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+TRAIN:
+  warmup: false
+  warmup_lr: 0.00005
+  # typically we will use 4000 warmup step for single GPU
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 12
+  lr: 0.0005
+  lr_step: '8'
+  model_prefix: "deeplab_resnet_v1_101_voc12_segmentation_base"
+  # whether flip image
+  FLIP: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # whether to crop image during training
+  ENABLE_CROP: False
+  # scale of cropped image during training
+  CROP_HEIGHT: 768
+  CROP_WIDTH: 1024
+  # whether resume training
+  RESUME: false
+  # whether shuffle image
+  SHUFFLE: true
+TEST:
+  # size of images for each device
+  BATCH_IMAGES: 1
+  test_epoch: 12
diff --git a/experiments/deeplab/cfgs/deeplab_resnet_v1_101_voc12_segmentation_dcn.yaml b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_voc12_segmentation_dcn.yaml
new file mode 100644
index 0000000..256134b
--- /dev/null
+++ b/experiments/deeplab/cfgs/deeplab_resnet_v1_101_voc12_segmentation_dcn.yaml
@@ -0,0 +1,71 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/voc12"
+symbol: resnet_v1_101_deeplab_dcn
+gpus: '0'
+SCALES:
+- 360
+- 600
+default:
+  frequent: 10
+  kvstore: device
+dataset:
+  NUM_CLASSES: 21
+  dataset: PascalVOC
+  dataset_path: "./data/VOCdevkit2012/"
+  image_set: 2012_train_seg
+  root_path: "./data/"
+  test_image_set: 2012_val_seg
+network:
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  IMAGE_STRIDE: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+TRAIN:
+  warmup: false
+  warmup_lr: 0.00005
+  # typically we will use 4000 warmup step for single GPU
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 12
+  lr: 0.0005
+  lr_step: '8'
+  model_prefix: "deeplab_resnet_v1_101_voc12_segmentation_dcn"
+  # whether flip image
+  FLIP: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # whether to crop image during training
+  ENABLE_CROP: False
+  # scale of cropped image during training
+  CROP_HEIGHT: 768
+  CROP_WIDTH: 1024
+  # whether resume training
+  RESUME: false
+  # whether shuffle image
+  SHUFFLE: true
+TEST:
+  # size of images for each device
+  BATCH_IMAGES: 1
+  test_epoch: 12
diff --git a/experiments/deeplab/deeplab_test.py b/experiments/deeplab/deeplab_test.py
new file mode 100644
index 0000000..5ba2f67
--- /dev/null
+++ b/experiments/deeplab/deeplab_test.py
@@ -0,0 +1,24 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
+import os
+import sys
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+this_dir = os.path.dirname(__file__)
+sys.path.insert(0, os.path.join(this_dir, '..', '..', 'deeplab'))
+
+import test
+
+if __name__ == "__main__":
+    test.main()
+
+
+
+
diff --git a/experiments/deeplab/deeplab_train_test.py b/experiments/deeplab/deeplab_train_test.py
new file mode 100644
index 0000000..22fb362
--- /dev/null
+++ b/experiments/deeplab/deeplab_train_test.py
@@ -0,0 +1,26 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
+import os
+import sys
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+this_dir = os.path.dirname(__file__)
+sys.path.insert(0, os.path.join(this_dir, '..', '..', 'deeplab'))
+
+import train
+import test
+
+if __name__ == "__main__":
+    train.main()
+    test.main()
+
+
+
+
diff --git a/experiments/faster_rcnn/cfgs/resnet_v1_101_coco_trainval_rcnn_dcn_end2end.yaml b/experiments/faster_rcnn/cfgs/resnet_v1_101_coco_trainval_rcnn_dcn_end2end.yaml
new file mode 100644
index 0000000..90b92ca
--- /dev/null
+++ b/experiments/faster_rcnn/cfgs/resnet_v1_101_coco_trainval_rcnn_dcn_end2end.yaml
@@ -0,0 +1,154 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/rcnn/coco"
+symbol: resnet_v1_101_rcnn_dcn
+gpus: '0,1,2,3'
+CLASS_AGNOSTIC: false
+SCALES:
+- 600
+- 1000
+default:
+  frequent: 100
+  kvstore: device
+network:
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  IMAGE_STRIDE: 0
+  RCNN_FEAT_STRIDE: 16
+  RPN_FEAT_STRIDE: 16
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  ANCHOR_RATIOS:
+  - 0.5
+  - 1
+  - 2
+  ANCHOR_SCALES:
+  - 4
+  - 8
+  - 16
+  - 32
+  NUM_ANCHORS: 12
+dataset:
+  NUM_CLASSES: 81
+  dataset: coco
+  dataset_path: "./data/coco"
+  image_set: train2014+val2014
+  root_path: "./data"
+  test_image_set: test-dev2015
+  proposal: rpn
+TRAIN:
+  lr: 0.0005
+  lr_step: '5.333'
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we use 8000 warmup steps for a single GPU on COCO
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 8
+  model_prefix: 'rcnn_coco'
+  # whether resume training
+  RESUME: false
+  # whether flip image
+  FLIP: true
+  # whether shuffle image
+  SHUFFLE: true
+  # whether use OHEM
+  ENABLE_OHEM: false
+  # size of images for each device, 2 for rcnn, 1 for rpn and e2e
+  BATCH_IMAGES: 1
+  # e2e changes behavior of anchor loader and metric
+  END2END: true
+  # group images with similar aspect ratio
+  ASPECT_GROUPING: true
+  # R-CNN
+  # rcnn rois batch size
+  BATCH_ROIS: 128
+  BATCH_ROIS_OHEM: 128
+  # rcnn rois sampling params
+  FG_FRACTION: 0.25
+  FG_THRESH: 0.5
+  BG_THRESH_HI: 0.5
+  BG_THRESH_LO: 0.1
+  # rcnn bounding box regression params
+  BBOX_REGRESSION_THRESH: 0.5
+  BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+
+  # RPN anchor loader
+  # rpn anchors batch size
+  RPN_BATCH_SIZE: 256
+  # rpn anchors sampling params
+  RPN_FG_FRACTION: 0.5
+  RPN_POSITIVE_OVERLAP: 0.7
+  RPN_NEGATIVE_OVERLAP: 0.3
+  RPN_CLOBBER_POSITIVES: false
+  # rpn bounding box regression params
+  RPN_BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+  RPN_POSITIVE_WEIGHT: -1.0
+  # used for end2end training
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # approximate bounding box regression
+  BBOX_NORMALIZATION_PRECOMPUTED: true
+  BBOX_MEANS:
+  - 0.0
+  - 0.0
+  - 0.0
+  - 0.0
+  BBOX_STDS:
+  - 0.1
+  - 0.1
+  - 0.2
+  - 0.2
+TEST:
+  # use rpn to generate proposal
+  HAS_RPN: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # RPN generate proposal
+  PROPOSAL_NMS_THRESH: 0.7
+  PROPOSAL_PRE_NMS_TOP_N: 20000
+  PROPOSAL_POST_NMS_TOP_N: 2000
+  PROPOSAL_MIN_SIZE: 0
+  # RCNN nms
+  NMS: 0.3
+  test_epoch: 8
+  max_per_image: 100
+
diff --git a/experiments/faster_rcnn/cfgs/resnet_v1_101_coco_trainval_rcnn_end2end.yaml b/experiments/faster_rcnn/cfgs/resnet_v1_101_coco_trainval_rcnn_end2end.yaml
new file mode 100644
index 0000000..59439e2
--- /dev/null
+++ b/experiments/faster_rcnn/cfgs/resnet_v1_101_coco_trainval_rcnn_end2end.yaml
@@ -0,0 +1,154 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/rcnn/coco"
+symbol: resnet_v1_101_rcnn
+gpus: '0,1,2,3'
+CLASS_AGNOSTIC: false
+SCALES:
+- 600
+- 1000
+default:
+  frequent: 100
+  kvstore: device
+network:
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  IMAGE_STRIDE: 0
+  RCNN_FEAT_STRIDE: 16
+  RPN_FEAT_STRIDE: 16
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  ANCHOR_RATIOS:
+  - 0.5
+  - 1
+  - 2
+  ANCHOR_SCALES:
+  - 4
+  - 8
+  - 16
+  - 32
+  NUM_ANCHORS: 12
+dataset:
+  NUM_CLASSES: 81
+  dataset: coco
+  dataset_path: "./data/coco"
+  image_set: train2014+val2014
+  root_path: "./data"
+  test_image_set: test-dev2015
+  proposal: rpn
+TRAIN:
+  lr: 0.0005
+  lr_step: '5.333'
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we use 8000 warmup steps for a single GPU on COCO
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 8
+  model_prefix: 'rcnn_coco'
+  # whether resume training
+  RESUME: false
+  # whether flip image
+  FLIP: true
+  # whether shuffle image
+  SHUFFLE: true
+  # whether use OHEM
+  ENABLE_OHEM: false
+  # size of images for each device, 2 for rcnn, 1 for rpn and e2e
+  BATCH_IMAGES: 1
+  # e2e changes behavior of anchor loader and metric
+  END2END: true
+  # group images with similar aspect ratio
+  ASPECT_GROUPING: true
+  # R-CNN
+  # rcnn rois batch size
+  BATCH_ROIS: 128
+  BATCH_ROIS_OHEM: 128
+  # rcnn rois sampling params
+  FG_FRACTION: 0.25
+  FG_THRESH: 0.5
+  BG_THRESH_HI: 0.5
+  BG_THRESH_LO: 0.1
+  # rcnn bounding box regression params
+  BBOX_REGRESSION_THRESH: 0.5
+  BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+
+  # RPN anchor loader
+  # rpn anchors batch size
+  RPN_BATCH_SIZE: 256
+  # rpn anchors sampling params
+  RPN_FG_FRACTION: 0.5
+  RPN_POSITIVE_OVERLAP: 0.7
+  RPN_NEGATIVE_OVERLAP: 0.3
+  RPN_CLOBBER_POSITIVES: false
+  # rpn bounding box regression params
+  RPN_BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+  RPN_POSITIVE_WEIGHT: -1.0
+  # used for end2end training
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # approximate bounding box regression
+  BBOX_NORMALIZATION_PRECOMPUTED: true
+  BBOX_MEANS:
+  - 0.0
+  - 0.0
+  - 0.0
+  - 0.0
+  BBOX_STDS:
+  - 0.1
+  - 0.1
+  - 0.2
+  - 0.2
+TEST:
+  # use rpn to generate proposal
+  HAS_RPN: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # RPN generate proposal
+  PROPOSAL_NMS_THRESH: 0.7
+  PROPOSAL_PRE_NMS_TOP_N: 20000
+  PROPOSAL_POST_NMS_TOP_N: 2000
+  PROPOSAL_MIN_SIZE: 0
+  # RCNN nms
+  NMS: 0.3
+  test_epoch: 8
+  max_per_image: 100
+
diff --git a/experiments/faster_rcnn/cfgs/resnet_v1_101_voc0712_rcnn_dcn_end2end.yaml b/experiments/faster_rcnn/cfgs/resnet_v1_101_voc0712_rcnn_dcn_end2end.yaml
new file mode 100644
index 0000000..8e2701e
--- /dev/null
+++ b/experiments/faster_rcnn/cfgs/resnet_v1_101_voc0712_rcnn_dcn_end2end.yaml
@@ -0,0 +1,152 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/rcnn/voc"
+symbol: resnet_v1_101_rcnn_dcn
+gpus: '0,1,2,3'
+CLASS_AGNOSTIC: false
+SCALES:
+- 600
+- 1000
+default:
+  frequent: 100
+  kvstore: device
+network:
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  IMAGE_STRIDE: 0
+  RCNN_FEAT_STRIDE: 16
+  RPN_FEAT_STRIDE: 16
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  ANCHOR_RATIOS:
+  - 0.5
+  - 1
+  - 2
+  ANCHOR_SCALES:
+  - 8
+  - 16
+  - 32
+  NUM_ANCHORS: 9
+dataset:
+  NUM_CLASSES: 21
+  dataset: PascalVOC
+  dataset_path: "./data/VOCdevkit"
+  image_set: 2007_trainval+2012_trainval
+  root_path: "./data"
+  test_image_set: 2007_test
+  proposal: rpn
+TRAIN:
+  lr: 0.0005
+  lr_step: '4.83'
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we use 4000 warmup steps for a single GPU on VOC
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 7
+  model_prefix: 'rcnn_voc'
+  # whether resume training
+  RESUME: false
+  # whether flip image
+  FLIP: true
+  # whether shuffle image
+  SHUFFLE: true
+  # whether use OHEM
+  ENABLE_OHEM: false
+  # size of images for each device, 2 for rcnn, 1 for rpn and e2e
+  BATCH_IMAGES: 1
+  # e2e changes behavior of anchor loader and metric
+  END2END: true
+  # group images with similar aspect ratio
+  ASPECT_GROUPING: true
+  # R-CNN
+  # rcnn rois batch size
+  BATCH_ROIS: 128
+  BATCH_ROIS_OHEM: 128
+  # rcnn rois sampling params
+  FG_FRACTION: 0.25
+  FG_THRESH: 0.5
+  BG_THRESH_HI: 0.5
+  BG_THRESH_LO: 0.1
+  # rcnn bounding box regression params
+  BBOX_REGRESSION_THRESH: 0.5
+  BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+
+  # RPN anchor loader
+  # rpn anchors batch size
+  RPN_BATCH_SIZE: 256
+  # rpn anchors sampling params
+  RPN_FG_FRACTION: 0.5
+  RPN_POSITIVE_OVERLAP: 0.7
+  RPN_NEGATIVE_OVERLAP: 0.3
+  RPN_CLOBBER_POSITIVES: false
+  # rpn bounding box regression params
+  RPN_BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+  RPN_POSITIVE_WEIGHT: -1.0
+  # used for end2end training
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # approximate bounding box regression
+  BBOX_NORMALIZATION_PRECOMPUTED: true
+  BBOX_MEANS:
+  - 0.0
+  - 0.0
+  - 0.0
+  - 0.0
+  BBOX_STDS:
+  - 0.1
+  - 0.1
+  - 0.2
+  - 0.2
+TEST:
+  # use rpn to generate proposal
+  HAS_RPN: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # RPN generate proposal
+  PROPOSAL_NMS_THRESH: 0.7
+  PROPOSAL_PRE_NMS_TOP_N: 20000
+  PROPOSAL_POST_NMS_TOP_N: 2000
+  PROPOSAL_MIN_SIZE: 0
+  # RCNN nms
+  NMS: 0.3
+  test_epoch: 7
+
diff --git a/experiments/faster_rcnn/cfgs/resnet_v1_101_voc0712_rcnn_end2end.yaml b/experiments/faster_rcnn/cfgs/resnet_v1_101_voc0712_rcnn_end2end.yaml
new file mode 100644
index 0000000..6fad75f
--- /dev/null
+++ b/experiments/faster_rcnn/cfgs/resnet_v1_101_voc0712_rcnn_end2end.yaml
@@ -0,0 +1,152 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/rcnn/voc"
+symbol: resnet_v1_101_rcnn
+gpus: '0,1,2,3'
+CLASS_AGNOSTIC: false
+SCALES:
+- 600
+- 1000
+default:
+  frequent: 100
+  kvstore: device
+network:
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  IMAGE_STRIDE: 0
+  RCNN_FEAT_STRIDE: 16
+  RPN_FEAT_STRIDE: 16
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  ANCHOR_RATIOS:
+  - 0.5
+  - 1
+  - 2
+  ANCHOR_SCALES:
+  - 8
+  - 16
+  - 32
+  NUM_ANCHORS: 9
+dataset:
+  NUM_CLASSES: 21
+  dataset: PascalVOC
+  dataset_path: "./data/VOCdevkit"
+  image_set: 2007_trainval+2012_trainval
+  root_path: "./data"
+  test_image_set: 2007_test
+  proposal: rpn
+TRAIN:
+  lr: 0.0005
+  lr_step: '4.83'
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we use 4000 warmup steps for a single GPU on VOC
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 7
+  model_prefix: 'rcnn_voc'
+  # whether resume training
+  RESUME: false
+  # whether flip image
+  FLIP: true
+  # whether shuffle image
+  SHUFFLE: true
+  # whether use OHEM
+  ENABLE_OHEM: false
+  # size of images for each device, 2 for rcnn, 1 for rpn and e2e
+  BATCH_IMAGES: 1
+  # e2e changes behavior of anchor loader and metric
+  END2END: true
+  # group images with similar aspect ratio
+  ASPECT_GROUPING: true
+  # R-CNN
+  # rcnn rois batch size
+  BATCH_ROIS: 128
+  BATCH_ROIS_OHEM: 128
+  # rcnn rois sampling params
+  FG_FRACTION: 0.25
+  FG_THRESH: 0.5
+  BG_THRESH_HI: 0.5
+  BG_THRESH_LO: 0.1
+  # rcnn bounding box regression params
+  BBOX_REGRESSION_THRESH: 0.5
+  BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+
+  # RPN anchor loader
+  # rpn anchors batch size
+  RPN_BATCH_SIZE: 256
+  # rpn anchors sampling params
+  RPN_FG_FRACTION: 0.5
+  RPN_POSITIVE_OVERLAP: 0.7
+  RPN_NEGATIVE_OVERLAP: 0.3
+  RPN_CLOBBER_POSITIVES: false
+  # rpn bounding box regression params
+  RPN_BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+  RPN_POSITIVE_WEIGHT: -1.0
+  # used for end2end training
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # approximate bounding box regression
+  BBOX_NORMALIZATION_PRECOMPUTED: true
+  BBOX_MEANS:
+  - 0.0
+  - 0.0
+  - 0.0
+  - 0.0
+  BBOX_STDS:
+  - 0.1
+  - 0.1
+  - 0.2
+  - 0.2
+TEST:
+  # use rpn to generate proposal
+  HAS_RPN: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # RPN generate proposal
+  PROPOSAL_NMS_THRESH: 0.7
+  PROPOSAL_PRE_NMS_TOP_N: 20000
+  PROPOSAL_POST_NMS_TOP_N: 2000
+  PROPOSAL_MIN_SIZE: 0
+  # RCNN nms
+  NMS: 0.3
+  test_epoch: 7
+
diff --git a/experiments/faster_rcnn/rcnn_end2end_train_test.py b/experiments/faster_rcnn/rcnn_end2end_train_test.py
new file mode 100644
index 0000000..5598dd0
--- /dev/null
+++ b/experiments/faster_rcnn/rcnn_end2end_train_test.py
@@ -0,0 +1,25 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Guodong Zhang
+# --------------------------------------------------------
+import os
+import sys
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+#os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'
+this_dir = os.path.dirname(__file__)
+sys.path.insert(0, os.path.join(this_dir, '..', '..', 'faster_rcnn'))
+
+import train_end2end
+import test
+
+if __name__ == "__main__":
+    train_end2end.main()
+    test.main()
+
+
+
+
diff --git a/experiments/faster_rcnn/rcnn_test.py b/experiments/faster_rcnn/rcnn_test.py
new file mode 100644
index 0000000..a4ece9a
--- /dev/null
+++ b/experiments/faster_rcnn/rcnn_test.py
@@ -0,0 +1,19 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Guodong Zhang
+# --------------------------------------------------------
+
+import os
+import sys
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+this_dir = os.path.dirname(__file__)
+sys.path.insert(0, os.path.join(this_dir, '..', '..', 'faster_rcnn'))
+
+import test
+
+if __name__ == "__main__":
+    test.main()
diff --git a/experiments/faster_rcnn/rcnn_train_test.py b/experiments/faster_rcnn/rcnn_train_test.py
new file mode 100644
index 0000000..38f8540
--- /dev/null
+++ b/experiments/faster_rcnn/rcnn_train_test.py
@@ -0,0 +1,25 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Guodong Zhang
+# --------------------------------------------------------
+
+import os
+import sys
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+this_dir = os.path.dirname(__file__)
+sys.path.insert(0, os.path.join(this_dir, '..', '..', 'faster_rcnn'))
+
+import train_rcnn
+import test
+
+if __name__ == "__main__":
+    train_rcnn.main()
+    test.main()
+
+
+
+
diff --git a/experiments/rfcn/cfgs/deform_conv_demo.yaml b/experiments/rfcn/cfgs/deform_conv_demo.yaml
new file mode 100644
index 0000000..6e144f8
--- /dev/null
+++ b/experiments/rfcn/cfgs/deform_conv_demo.yaml
@@ -0,0 +1,152 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/rfcn"
+symbol: deform_conv_demo
+gpus: '0'
+CLASS_AGNOSTIC: true
+SCALES:
+- 600
+- 1000
+default:
+  frequent: 100
+  kvstore: device
+network:
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  IMAGE_STRIDE: 0
+  RCNN_FEAT_STRIDE: 16
+  RPN_FEAT_STRIDE: 16
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  ANCHOR_RATIOS:
+  - 0.5
+  - 1
+  - 2
+  ANCHOR_SCALES:
+  - 8
+  - 16
+  - 32
+  NUM_ANCHORS: 9
+dataset:
+  NUM_CLASSES: 21
+  dataset: PascalVOC
+  dataset_path: "./data/VOCdevkit"
+  image_set: 2007_trainval+2012_trainval
+  root_path: "./data"
+  test_image_set: 2007_test
+  proposal: rpn
+TRAIN:
+  lr: 0.0005
+  lr_step: '4.83'
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we use 4000 warmup steps for a single GPU on VOC
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 8
+  model_prefix: 'rfcn_voc'
+  # whether resume training
+  RESUME: false
+  # whether flip image
+  FLIP: true
+  # whether shuffle image
+  SHUFFLE: true
+  # whether use OHEM
+  ENABLE_OHEM: true
+  # size of images for each device, 2 for rcnn, 1 for rpn and e2e
+  BATCH_IMAGES: 1
+  # e2e changes behavior of anchor loader and metric
+  END2END: true
+  # group images with similar aspect ratio
+  ASPECT_GROUPING: true
+  # R-CNN
+  # rcnn rois batch size
+  BATCH_ROIS: -1
+  BATCH_ROIS_OHEM: 128
+  # rcnn rois sampling params
+  FG_FRACTION: 0.25
+  FG_THRESH: 0.5
+  BG_THRESH_HI: 0.5
+  BG_THRESH_LO: 0.0
+  # rcnn bounding box regression params
+  BBOX_REGRESSION_THRESH: 0.5
+  BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+
+  # RPN anchor loader
+  # rpn anchors batch size
+  RPN_BATCH_SIZE: 256
+  # rpn anchors sampling params
+  RPN_FG_FRACTION: 0.5
+  RPN_POSITIVE_OVERLAP: 0.7
+  RPN_NEGATIVE_OVERLAP: 0.3
+  RPN_CLOBBER_POSITIVES: false
+  # rpn bounding box regression params
+  RPN_BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+  RPN_POSITIVE_WEIGHT: -1.0
+  # used for end2end training
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # approximate bounding box regression
+  BBOX_NORMALIZATION_PRECOMPUTED: true
+  BBOX_MEANS:
+  - 0.0
+  - 0.0
+  - 0.0
+  - 0.0
+  BBOX_STDS:
+  - 0.1
+  - 0.1
+  - 0.2
+  - 0.2
+TEST:
+  # use rpn to generate proposal
+  HAS_RPN: true
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # RPN generate proposal
+  PROPOSAL_NMS_THRESH: 0.7
+  PROPOSAL_PRE_NMS_TOP_N: 20000
+  PROPOSAL_POST_NMS_TOP_N: 2000
+  PROPOSAL_MIN_SIZE: 0
+  # RCNN nms
+  NMS: 0.3
+  test_epoch: 7
+
diff --git a/experiments/rfcn/cfgs/deform_psroi_demo.yaml b/experiments/rfcn/cfgs/deform_psroi_demo.yaml
new file mode 100644
index 0000000..4961749
--- /dev/null
+++ b/experiments/rfcn/cfgs/deform_psroi_demo.yaml
@@ -0,0 +1,152 @@
+---
+MXNET_VERSION: "mxnet"
+output_path: "./output/rfcn"
+symbol: deform_psroi_demo
+gpus: '0'
+CLASS_AGNOSTIC: true
+SCALES:
+- 600
+- 1000
+default:
+  frequent: 100
+  kvstore: device
+network:
+  pretrained: "./model/pretrained_model/resnet_v1_101"
+  pretrained_epoch: 0
+  PIXEL_MEANS:
+  - 103.06
+  - 115.90
+  - 123.15
+  IMAGE_STRIDE: 0
+  RCNN_FEAT_STRIDE: 16
+  RPN_FEAT_STRIDE: 16
+  FIXED_PARAMS:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - gamma
+  - beta
+  FIXED_PARAMS_SHARED:
+  - conv1
+  - bn_conv1
+  - res2
+  - bn2
+  - res3
+  - bn3
+  - res4
+  - bn4
+  - gamma
+  - beta
+  ANCHOR_RATIOS:
+  - 0.5
+  - 1
+  - 2
+  ANCHOR_SCALES:
+  - 8
+  - 16
+  - 32
+  NUM_ANCHORS: 9
+dataset:
+  NUM_CLASSES: 21
+  dataset: PascalVOC
+  dataset_path: "./data/VOCdevkit"
+  image_set: 2007_trainval+2012_trainval
+  root_path: "./data"
+  test_image_set: 2007_test
+  proposal: selective_search
+TRAIN:
+  lr: 0.0005
+  lr_step: '4.83'
+  warmup: true
+  warmup_lr: 0.00005
+  # typically we use 4000 warmup steps for a single GPU on VOC
+  warmup_step: 1000
+  begin_epoch: 0
+  end_epoch: 8
+  model_prefix: 'rfcn_voc'
+  # whether resume training
+  RESUME: false
+  # whether flip image
+  FLIP: true
+  # whether shuffle image
+  SHUFFLE: true
+  # whether use OHEM
+  ENABLE_OHEM: true
+  # size of images for each device, 2 for rcnn, 1 for rpn and e2e
+  BATCH_IMAGES: 1
+  # e2e changes behavior of anchor loader and metric
+  END2END: false
+  # group images with similar aspect ratio
+  ASPECT_GROUPING: true
+  # R-CNN
+  # rcnn rois batch size
+  BATCH_ROIS: -1
+  BATCH_ROIS_OHEM: 128
+  # rcnn rois sampling params
+  FG_FRACTION: 0.25
+  FG_THRESH: 0.5
+  BG_THRESH_HI: 0.5
+  BG_THRESH_LO: 0.0
+  # rcnn bounding box regression params
+  BBOX_REGRESSION_THRESH: 0.5
+  BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+
+  # RPN anchor loader
+  # rpn anchors batch size
+  RPN_BATCH_SIZE: 256
+  # rpn anchors sampling params
+  RPN_FG_FRACTION: 0.5
+  RPN_POSITIVE_OVERLAP: 0.7
+  RPN_NEGATIVE_OVERLAP: 0.3
+  RPN_CLOBBER_POSITIVES: false
+  # rpn bounding box regression params
+  RPN_BBOX_WEIGHTS:
+  - 1.0
+  - 1.0
+  - 1.0
+  - 1.0
+  RPN_POSITIVE_WEIGHT: -1.0
+  # used for end2end training
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # approximate bounding box regression
+  BBOX_NORMALIZATION_PRECOMPUTED: true
+  BBOX_MEANS:
+  - 0.0
+  - 0.0
+  - 0.0
+  - 0.0
+  BBOX_STDS:
+  - 0.1
+  - 0.1
+  - 0.2
+  - 0.2
+TEST:
+  # use rpn to generate proposal
+  HAS_RPN: false
+  # size of images for each device
+  BATCH_IMAGES: 1
+  # RPN proposal
+  CXX_PROPOSAL: false
+  RPN_NMS_THRESH: 0.7
+  RPN_PRE_NMS_TOP_N: 6000
+  RPN_POST_NMS_TOP_N: 300
+  RPN_MIN_SIZE: 0
+  # RPN generate proposal
+  PROPOSAL_NMS_THRESH: 0.7
+  PROPOSAL_PRE_NMS_TOP_N: 20000
+  PROPOSAL_POST_NMS_TOP_N: 2000
+  PROPOSAL_MIN_SIZE: 0
+  # RCNN nms
+  NMS: 0.3
+  test_epoch: 7
+
diff --git a/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_dcn_end2end_ohem.yaml b/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_dcn_end2end_ohem.yaml
index 78198e3..dbb8cd3 100644
--- a/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_dcn_end2end_ohem.yaml
+++ b/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_dcn_end2end_ohem.yaml
@@ -1,6 +1,6 @@
 ---
 MXNET_VERSION: "mxnet"
-output_path: "./output/dcn_rfcn/voc"
+output_path: "./output/rfcn_dcn/voc"
 symbol: resnet_v1_101_rfcn_dcn
 gpus: '0,1,2,3'
 CLASS_AGNOSTIC: true
diff --git a/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_end2end_ohem.yaml b/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_end2end_ohem.yaml
index 496d121..79d0494 100644
--- a/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_end2end_ohem.yaml
+++ b/experiments/rfcn/cfgs/resnet_v1_101_voc0712_rfcn_end2end_ohem.yaml
@@ -1,6 +1,6 @@
 ---
 MXNET_VERSION: "mxnet"
-output_path: "./output/dcn_rfcn/voc"
+output_path: "./output/rfcn/voc"
 symbol: resnet_v1_101_rfcn
 gpus: '0,1,2,3'
 CLASS_AGNOSTIC: true
diff --git a/faster_rcnn/__init__.py b/faster_rcnn/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/faster_rcnn/_init_paths.py b/faster_rcnn/_init_paths.py
new file mode 100644
index 0000000..5bbe057
--- /dev/null
+++ b/faster_rcnn/_init_paths.py
@@ -0,0 +1,11 @@
+import os.path as osp
+import sys
+
+def add_path(path):
+    if path not in sys.path:
+        sys.path.insert(0, path)
+
+this_dir = osp.dirname(__file__)
+
+lib_path = osp.join(this_dir, '..', 'lib')
+add_path(lib_path)
diff --git a/faster_rcnn/config/__init__.py b/faster_rcnn/config/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/faster_rcnn/config/config.py b/faster_rcnn/config/config.py
new file mode 100644
index 0000000..70845ea
--- /dev/null
+++ b/faster_rcnn/config/config.py
@@ -0,0 +1,188 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong, Bin Xiao
+# --------------------------------------------------------
+
+import yaml
+import numpy as np
+from easydict import EasyDict as edict
+
+config = edict()
+
+config.MXNET_VERSION = ''
+config.output_path = ''
+config.symbol = ''
+config.gpus = ''
+config.CLASS_AGNOSTIC = True
+config.SCALES = [(600, 1000)]  # first is scale (the shorter side); second is max size
+
+# default training
+config.default = edict()
+config.default.frequent = 20
+config.default.kvstore = 'device'
+
+# network related params
+config.network = edict()
+config.network.pretrained = ''
+config.network.pretrained_epoch = 0
+config.network.PIXEL_MEANS = np.array([0, 0, 0])
+config.network.IMAGE_STRIDE = 0
+config.network.RPN_FEAT_STRIDE = 16
+config.network.RCNN_FEAT_STRIDE = 16
+config.network.FIXED_PARAMS = ['gamma', 'beta']
+config.network.FIXED_PARAMS_SHARED = ['gamma', 'beta']
+config.network.ANCHOR_SCALES = (8, 16, 32)
+config.network.ANCHOR_RATIOS = (0.5, 1, 2)
+config.network.NUM_ANCHORS = len(config.network.ANCHOR_SCALES) * len(config.network.ANCHOR_RATIOS)
+
+# dataset related params
+config.dataset = edict()
+config.dataset.dataset = 'PascalVOC'
+config.dataset.image_set = '2007_trainval'
+config.dataset.test_image_set = '2007_test'
+config.dataset.root_path = './data'
+config.dataset.dataset_path = './data/VOCdevkit'
+config.dataset.NUM_CLASSES = 21
+
+
+config.TRAIN = edict()
+
+config.TRAIN.lr = 0
+config.TRAIN.lr_step = ''
+config.TRAIN.lr_factor = 0.1
+config.TRAIN.warmup = False
+config.TRAIN.warmup_lr = 0
+config.TRAIN.warmup_step = 0
+config.TRAIN.momentum = 0.9
+config.TRAIN.wd = 0.0005
+config.TRAIN.begin_epoch = 0
+config.TRAIN.end_epoch = 0
+config.TRAIN.model_prefix = ''
+
+config.TRAIN.ALTERNATE = edict()
+config.TRAIN.ALTERNATE.RPN_BATCH_IMAGES = 0
+config.TRAIN.ALTERNATE.RCNN_BATCH_IMAGES = 0
+config.TRAIN.ALTERNATE.rpn1_lr = 0
+config.TRAIN.ALTERNATE.rpn1_lr_step = ''    # recommend '2'
+config.TRAIN.ALTERNATE.rpn1_epoch = 0       # recommend 3
+config.TRAIN.ALTERNATE.rfcn1_lr = 0
+config.TRAIN.ALTERNATE.rfcn1_lr_step = ''   # recommend '5'
+config.TRAIN.ALTERNATE.rfcn1_epoch = 0      # recommend 8
+config.TRAIN.ALTERNATE.rpn2_lr = 0
+config.TRAIN.ALTERNATE.rpn2_lr_step = ''    # recommend '2'
+config.TRAIN.ALTERNATE.rpn2_epoch = 0       # recommend 3
+config.TRAIN.ALTERNATE.rfcn2_lr = 0
+config.TRAIN.ALTERNATE.rfcn2_lr_step = ''   # recommend '5'
+config.TRAIN.ALTERNATE.rfcn2_epoch = 0      # recommend 8
+# optional
+config.TRAIN.ALTERNATE.rpn3_lr = 0
+config.TRAIN.ALTERNATE.rpn3_lr_step = ''    # recommend '2'
+config.TRAIN.ALTERNATE.rpn3_epoch = 0       # recommend 3
+
+# whether resume training
+config.TRAIN.RESUME = False
+# whether flip image
+config.TRAIN.FLIP = True
+# whether shuffle image
+config.TRAIN.SHUFFLE = True
+# whether use OHEM
+config.TRAIN.ENABLE_OHEM = False
+# size of images for each device, 2 for rcnn, 1 for rpn and e2e
+config.TRAIN.BATCH_IMAGES = 2
+# e2e changes behavior of anchor loader and metric
+config.TRAIN.END2END = False
+# group images with similar aspect ratio
+config.TRAIN.ASPECT_GROUPING = True
+
+# R-CNN
+# rcnn rois batch size
+config.TRAIN.BATCH_ROIS = 128
+config.TRAIN.BATCH_ROIS_OHEM = 128
+# rcnn rois sampling params
+config.TRAIN.FG_FRACTION = 0.25
+config.TRAIN.FG_THRESH = 0.5
+config.TRAIN.BG_THRESH_HI = 0.5
+config.TRAIN.BG_THRESH_LO = 0.0
+# rcnn bounding box regression params
+config.TRAIN.BBOX_REGRESSION_THRESH = 0.5
+config.TRAIN.BBOX_WEIGHTS = np.array([1.0, 1.0, 1.0, 1.0])
+
+# RPN anchor loader
+# rpn anchors batch size
+config.TRAIN.RPN_BATCH_SIZE = 256
+# rpn anchors sampling params
+config.TRAIN.RPN_FG_FRACTION = 0.5
+config.TRAIN.RPN_POSITIVE_OVERLAP = 0.7
+config.TRAIN.RPN_NEGATIVE_OVERLAP = 0.3
+config.TRAIN.RPN_CLOBBER_POSITIVES = False
+# rpn bounding box regression params
+config.TRAIN.RPN_BBOX_WEIGHTS = (1.0, 1.0, 1.0, 1.0)
+config.TRAIN.RPN_POSITIVE_WEIGHT = -1.0
+
+# used for end2end training
+# RPN proposal
+config.TRAIN.CXX_PROPOSAL = True
+config.TRAIN.RPN_NMS_THRESH = 0.7
+config.TRAIN.RPN_PRE_NMS_TOP_N = 12000
+config.TRAIN.RPN_POST_NMS_TOP_N = 2000
+config.TRAIN.RPN_MIN_SIZE = config.network.RPN_FEAT_STRIDE
+# approximate bounding box regression
+config.TRAIN.BBOX_NORMALIZATION_PRECOMPUTED = False
+config.TRAIN.BBOX_MEANS = (0.0, 0.0, 0.0, 0.0)
+config.TRAIN.BBOX_STDS = (0.1, 0.1, 0.2, 0.2)
+
+config.TEST = edict()
+
+# R-CNN testing
+# use rpn to generate proposal
+config.TEST.HAS_RPN = False
+# size of images for each device
+config.TEST.BATCH_IMAGES = 1
+
+# RPN proposal
+config.TEST.CXX_PROPOSAL = True
+config.TEST.RPN_NMS_THRESH = 0.7
+config.TEST.RPN_PRE_NMS_TOP_N = 6000
+config.TEST.RPN_POST_NMS_TOP_N = 300
+config.TEST.RPN_MIN_SIZE = config.network.RPN_FEAT_STRIDE
+
+# RPN generate proposal
+config.TEST.PROPOSAL_NMS_THRESH = 0.7
+config.TEST.PROPOSAL_PRE_NMS_TOP_N = 20000
+config.TEST.PROPOSAL_POST_NMS_TOP_N = 2000
+config.TEST.PROPOSAL_MIN_SIZE = config.network.RPN_FEAT_STRIDE
+
+# RCNN nms
+config.TEST.NMS = 0.3
+
+config.TEST.max_per_image = 300
+
+# Test Model Epoch
+config.TEST.test_epoch = 0
+
+
+def update_config(config_file):
+    exp_config = None
+    with open(config_file) as f:
+        exp_config = edict(yaml.safe_load(f))
+        for k, v in exp_config.items():
+            if k in config:
+                if isinstance(v, dict):
+                    if k == 'TRAIN':
+                        if 'BBOX_WEIGHTS' in v:
+                            v['BBOX_WEIGHTS'] = np.array(v['BBOX_WEIGHTS'])
+                    elif k == 'network':
+                        if 'PIXEL_MEANS' in v:
+                            v['PIXEL_MEANS'] = np.array(v['PIXEL_MEANS'])
+                    for vk, vv in v.items():
+                        config[k][vk] = vv
+                else:
+                    if k == 'SCALES':
+                        config[k][0] = (tuple(v))
+                    else:
+                        config[k] = v
+            else:
+                raise ValueError("key '{}' must exist in config.py".format(k))
diff --git a/faster_rcnn/core/DataParallelExecutorGroup.py b/faster_rcnn/core/DataParallelExecutorGroup.py
new file mode 100644
index 0000000..9579514
--- /dev/null
+++ b/faster_rcnn/core/DataParallelExecutorGroup.py
@@ -0,0 +1,591 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import logging
+import numpy as np
+
+from mxnet import context as ctx
+from mxnet import ndarray as nd
+from mxnet.io import DataDesc
+from mxnet.executor_manager import _split_input_slice
+
+
+
+def _load_general(data, targets, major_axis):
+    """Load a list of arrays into a list of arrays specified by slices"""
+    for d_src, d_targets in zip(data, targets):
+        if isinstance(d_targets, nd.NDArray):
+            d_src.copyto(d_targets)
+        elif isinstance(d_src, (list, tuple)):
+            for src, dst in zip(d_src, d_targets):
+                src.copyto(dst)
+        else:
+            raise NotImplementedError
+
+
+
+def _load_data(batch, targets, major_axis):
+    """Load data into sliced arrays"""
+    _load_general(batch.data, targets, major_axis)
+
+
+def _load_label(batch, targets, major_axis):
+    """Load label into sliced arrays"""
+    _load_general(batch.label, targets, major_axis)
+
+
+def _merge_multi_context(outputs, major_axis):
+    """Merge outputs that live on multiple contexts into one, so that they look
+    as if they live on one context.
+    """
+    rets = []
+    for tensors, axis in zip(outputs, major_axis):
+        if axis >= 0:
+            rets.append(nd.concatenate(tensors, axis=axis, always_copy=False))
+        else:
+            # a negative axis means that there is no batch_size axis, and all the
+            # results should be the same on each device. We simply take the
+            # first one, without checking that they are actually the same
+            rets.append(tensors[0])
+    return rets
+
+
+
+class DataParallelExecutorGroup(object):
+    """DataParallelExecutorGroup is a group of executors that live on a group of devices.
+    This is a helper class used to implement data parallelization. Each mini-batch will
+    be split and run on the devices.
+
+    Parameters
+    ----------
+    symbol : Symbol
+        The common symbolic computation graph for all executors.
+    contexts : list
+        A list of contexts.
+    workload : list
+        If not `None`, a list of numbers that specify the workload to be assigned
+        to each context. A larger number indicates a heavier workload.
+    data_shapes : list
+        Should be a list of (name, shape) tuples, for the shapes of data. Note the order is
+        important and should be the same as the order in which the `DataIter` provides the data.
+    label_shapes : list
+        Should be a list of (name, shape) tuples, for the shapes of label. Note the order is
+        important and should be the same as the order in which the `DataIter` provides the label.
+    param_names : list
+        A list of strings, indicating the names of parameters (e.g. weights, filters, etc.)
+        in the computation graph.
+    for_training : bool
+        Indicate whether the executors should be bound for training. When not doing training,
+        the memory for gradients will not be allocated.
+    inputs_need_grad : bool
+        Indicate whether the gradients for the input data should be computed. This is currently
+        not used. It will be useful for implementing composition of modules.
+    shared_group : DataParallelExecutorGroup
+        Default is `None`. This is used in bucketing. When not `None`, it should be an executor
+        group corresponding to a different bucket. In other words, it will correspond to a different
+        symbol but with the same set of parameters (e.g. unrolled RNNs with different lengths).
+        In this case, much of the memory will be shared.
+    logger : Logger
+        Default is `logging`.
+    fixed_param_names: list of str
+        Indicate parameters to be fixed during training. Parameters in this list will not allocate
+        space for gradient, nor do gradient calculation.
+    grad_req : str, list of str, dict of str to str
+        Requirement for gradient accumulation. Can be 'write', 'add', or 'null'
+        (default to 'write').
+        Can be specified globally (str) or for each argument (list, dict).
+    """
+    def __init__(self, symbol, contexts, workload, data_shapes, label_shapes, param_names,
+                 for_training, inputs_need_grad, shared_group=None, logger=logging,
+                 fixed_param_names=None, grad_req='write', state_names=None):
+        self.param_names = param_names
+        self.arg_names = symbol.list_arguments()
+        self.aux_names = symbol.list_auxiliary_states()
+
+        self.symbol = symbol
+        self.contexts = contexts
+        self.workload = workload
+
+        self.for_training = for_training
+        self.inputs_need_grad = inputs_need_grad
+
+        self.logger = logger
+        # In the future we should have a better way to profile memory per device (haibin)
+        # self._total_exec_bytes = 0
+        self.fixed_param_names = fixed_param_names
+        if self.fixed_param_names is None:
+            self.fixed_param_names = []
+
+        self.state_names = state_names
+        if self.state_names is None:
+            self.state_names = []
+
+        if not for_training:
+            grad_req = 'null'
+
+        # data_shapes = [x if isinstance(x, DataDesc) else DataDesc(*x) for x in data_shapes]
+        # if label_shapes is not None:
+        #     label_shapes = [x if isinstance(x, DataDesc) else DataDesc(*x) for x in label_shapes]
+
+        data_names = [x.name for x in data_shapes[0]]
+
+        if isinstance(grad_req, str):
+            self.grad_req = {}
+            for k in self.arg_names:
+                if k in self.param_names:
+                    self.grad_req[k] = 'null' if k in self.fixed_param_names else grad_req
+                elif k in data_names:
+                    self.grad_req[k] = grad_req if self.inputs_need_grad else 'null'
+                else:
+                    self.grad_req[k] = 'null'
+        elif isinstance(grad_req, (list, tuple)):
+            assert len(grad_req) == len(self.arg_names)
+            self.grad_req = dict(zip(self.arg_names, grad_req))
+        elif isinstance(grad_req, dict):
+            self.grad_req = {}
+            for k in self.arg_names:
+                if k in self.param_names:
+                    self.grad_req[k] = 'null' if k in self.fixed_param_names else 'write'
+                elif k in data_names:
+                    self.grad_req[k] = 'write' if self.inputs_need_grad else 'null'
+                else:
+                    self.grad_req[k] = 'null'
+            self.grad_req.update(grad_req)
+        else:
+            raise ValueError("grad_req must be one of str, list, tuple, or dict.")
+
+        if shared_group is not None:
+            self.shared_data_arrays = shared_group.shared_data_arrays
+        else:
+            self.shared_data_arrays = [{} for _ in contexts]
+
+        # initialize some instance variables
+        self.batch_size = len(data_shapes)
+        self.slices = None
+        self.execs = []
+        self._default_execs = None
+        self.data_arrays = None
+        self.label_arrays = None
+        self.param_arrays = None
+        self.state_arrays = None
+        self.grad_arrays = None
+        self.aux_arrays = None
+        self.input_grad_arrays = None
+
+        self.data_shapes = None
+        self.label_shapes = None
+        self.data_layouts = None
+        self.label_layouts = None
+        self.output_layouts = [DataDesc.get_batch_axis(self.symbol[name].attr('__layout__'))
+                               for name in self.symbol.list_outputs()]
+        self.bind_exec(data_shapes, label_shapes, shared_group)
+
+    def decide_slices(self, data_shapes):
+        """Decide the slices for each context according to the workload.
+
+        Parameters
+        ----------
+        data_shapes : list
+            list of (name, shape) specifying the shapes for the input data or label.
+        """
+        assert len(data_shapes) > 0
+        major_axis = [DataDesc.get_batch_axis(x.layout) for x in data_shapes]
+
+        for (name, shape), axis in zip(data_shapes, major_axis):
+            if axis == -1:
+                continue
+
+            batch_size = shape[axis]
+            if self.batch_size is not None:
+                assert batch_size == self.batch_size, ("all data must have the same batch size: "
+                                                       + ("batch_size = %d, but " % self.batch_size)
+                                                       + ("%s has shape %s" % (name, shape)))
+            else:
+                self.batch_size = batch_size
+                self.slices = _split_input_slice(self.batch_size, self.workload)
+
+        return major_axis
+
+    def _collect_arrays(self):
+        """Collect internal arrays from executors."""
+        # convenient data structures
+        self.data_arrays = [[e.arg_dict[name] for name, _ in self.data_shapes[0]] for e in self.execs]
+
+        self.state_arrays = [[e.arg_dict[name] for e in self.execs]
+                             for name in self.state_names]
+
+        if self.label_shapes is not None:
+            self.label_arrays = [[e.arg_dict[name] for name, _ in self.label_shapes[0]] for e in self.execs]
+        else:
+            self.label_arrays = None
+
+        self.param_arrays = [[exec_.arg_arrays[i] for exec_ in self.execs]
+                             for i, name in enumerate(self.arg_names)
+                             if name in self.param_names]
+        if self.for_training:
+            self.grad_arrays = [[exec_.grad_arrays[i] for exec_ in self.execs]
+                                for i, name in enumerate(self.arg_names)
+                                if name in self.param_names]
+        else:
+            self.grad_arrays = None
+
+        data_names = [x[0] for x in self.data_shapes[0]]
+        if self.inputs_need_grad:
+            self.input_grad_arrays = [[exec_.grad_arrays[i] for exec_ in self.execs]
+                                      for i, name in enumerate(self.arg_names)
+                                      if name in data_names]
+        else:
+            self.input_grad_arrays = None
+
+        self.aux_arrays = [[exec_.aux_arrays[i] for exec_ in self.execs]
+                           for i in range(len(self.aux_names))]
+
+    def bind_exec(self, data_shapes, label_shapes, shared_group=None, reshape=False):
+        """Bind executors on their respective devices.
+
+        Parameters
+        ----------
+        data_shapes : list
+        label_shapes : list
+        shared_group : DataParallelExecutorGroup
+        reshape : bool
+        """
+        assert reshape or not self.execs
+
+        for i in range(len(self.contexts)):
+            data_shapes_i = data_shapes[i]
+            if label_shapes is not None:
+                label_shapes_i = label_shapes[i]
+            else:
+                label_shapes_i = []
+
+            if reshape:
+                self.execs[i] = self._default_execs[i].reshape(
+                    allow_up_sizing=True, **dict(data_shapes_i + label_shapes_i))
+            else:
+                self.execs.append(self._bind_ith_exec(i, data_shapes_i, label_shapes_i,
+                                                      shared_group))
+
+        self.data_shapes = data_shapes
+        self.label_shapes = label_shapes
+        self._collect_arrays()
+
+    def reshape(self, data_shapes, label_shapes):
+        """Reshape executors.
+
+        Parameters
+        ----------
+        data_shapes : list
+        label_shapes : list
+        """
+        if self._default_execs is None:
+            self._default_execs = [i for i in self.execs]
+        for i in range(len(self.contexts)):
+            self.execs[i] = self._default_execs[i].reshape(
+                allow_up_sizing=True, **dict(data_shapes[i] + (label_shapes[i] if label_shapes is not None else []))
+            )
+        self.data_shapes = data_shapes
+        self.label_shapes = label_shapes
+        self._collect_arrays()
+
+
+    def set_params(self, arg_params, aux_params):
+        """Assign, i.e. copy parameters to all the executors.
+
+        Parameters
+        ----------
+        arg_params : dict
+            A dictionary of name to `NDArray` parameter mapping.
+        aux_params : dict
+            A dictionary of name to `NDArray` auxiliary variable mapping.
+        """
+        for exec_ in self.execs:
+            exec_.copy_params_from(arg_params, aux_params)
+
+    def get_params(self, arg_params, aux_params):
+        """ Copy data from each executor to `arg_params` and `aux_params`.
+
+        Parameters
+        ----------
+        arg_params : dict of str to NDArray
+            Target parameter arrays, keyed by parameter name.
+        aux_params : dict of str to NDArray
+            Target auxiliary arrays, keyed by auxiliary state name.
+
+        Notes
+        -----
+        - This function updates the NDArrays in arg_params and aux_params in place.
+        """
+        for name, block in zip(self.param_names, self.param_arrays):
+            weight = sum(w.copyto(ctx.cpu()) for w in block) / len(block)
+            weight.astype(arg_params[name].dtype).copyto(arg_params[name])
+        for name, block in zip(self.aux_names, self.aux_arrays):
+            weight = sum(w.copyto(ctx.cpu()) for w in block) / len(block)
+            weight.astype(aux_params[name].dtype).copyto(aux_params[name])
+
+    def forward(self, data_batch, is_train=None):
+        """Split `data_batch` according to workload and run forward on each device.
+
+        Parameters
+        ----------
+        data_batch : DataBatch
+            Or could be any object implementing similar interface.
+        is_train : bool
+            The hint for the backend, indicating whether we are in the training phase.
+            Default is `None`, in which case the value of `self.for_training` is used.
+        Returns
+        -------
+
+        """
+        _load_data(data_batch, self.data_arrays, self.data_layouts)
+        if is_train is None:
+            is_train = self.for_training
+
+        if self.label_arrays is not None:
+            assert not is_train or data_batch.label
+            if data_batch.label:
+                _load_label(data_batch, self.label_arrays, self.label_layouts)
+
+        for exec_ in self.execs:
+            exec_.forward(is_train=is_train)
+
+
+    def get_outputs(self, merge_multi_context=True):
+        """Get outputs of the previous forward computation.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the outputs
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look like from a single
+            executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        outputs = [[exec_.outputs[i] for exec_ in self.execs]
+                   for i in range(len(self.execs[0].outputs))]
+        if merge_multi_context:
+            outputs = _merge_multi_context(outputs, self.output_layouts)
+        return outputs
+
+    def get_states(self, merge_multi_context=True):
+        """Get states from all devices
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the states
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look like from a single
+            executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert not merge_multi_context, \
+            "merge_multi_context=True is not supported for get_states yet."
+        return self.state_arrays
+
+    def set_states(self, states=None, value=None):
+        """Set value for states. Only one of states & value can be specified.
+
+        Parameters
+        ----------
+        states : list of list of NDArrays
+            source states arrays formatted like [[state1_dev1, state1_dev2],
+            [state2_dev1, state2_dev2]].
+        value : number
+            a single scalar value for all state arrays.
+        """
+        if states is not None:
+            assert value is None, "Only one of states & value can be specified."
+            _load_general(states, self.state_arrays, (0,)*len(states))
+        else:
+            assert value is not None, "At least one of states & value must be specified."
+            assert states is None, "Only one of states & value can be specified."
+            for d_dst in self.state_arrays:
+                for dst in d_dst:
+                    dst[:] = value
+
+    def get_input_grads(self, merge_multi_context=True):
+        """Get the gradients with respect to the inputs of the module.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the outputs
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look like from a single
+            executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[grad1, grad2]`. Otherwise, it
+        is like `[[grad1_dev1, grad1_dev2], [grad2_dev1, grad2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.inputs_need_grad
+        if merge_multi_context:
+            return _merge_multi_context(self.input_grad_arrays, self.data_layouts)
+        return self.input_grad_arrays
+
+    def backward(self, out_grads=None):
+        """Run backward on all devices. A backward should be called after
+        a call to the forward function. Backward cannot be called unless
+        `self.for_training` is `True`.
+
+        Parameters
+        ----------
+        out_grads : NDArray or list of NDArray, optional
+            Gradient on the outputs to be propagated back.
+            This parameter is only needed when bind is called
+            on outputs that are not a loss function.
+        """
+        assert self.for_training, 're-bind with for_training=True to run backward'
+        if out_grads is None:
+            out_grads = []
+
+        for i, exec_ in enumerate(self.execs):
+            out_grads_slice = []
+            exec_.backward(out_grads=out_grads_slice)
+
+    def update_metric(self, eval_metric, labels):
+        """Accumulate the performance according to `eval_metric` on all devices.
+
+        Parameters
+        ----------
+        eval_metric : EvalMetric
+            The metric used for evaluation.
+        labels : list of NDArray
+            Typically comes from `label` of a `DataBatch`.
+        """
+        for texec, labels_slice in zip(self.execs, labels):
+            eval_metric.update(labels_slice, texec.outputs)
+
+    def _bind_ith_exec(self, i, data_shapes, label_shapes, shared_group):
+        """Internal utility function to bind the i-th executor.
+        """
+        shared_exec = None if shared_group is None else shared_group.execs[i]
+        context = self.contexts[i]
+        shared_data_arrays = self.shared_data_arrays[i]
+
+        input_shapes = dict(data_shapes)
+        if label_shapes is not None:
+            input_shapes.update(dict(label_shapes))
+
+        arg_shapes, _, aux_shapes = self.symbol.infer_shape(**input_shapes)
+        assert arg_shapes is not None, "shape inference failed"
+
+        input_types = {x.name: x.dtype for x in data_shapes}
+        if label_shapes is not None:
+            input_types.update({x.name: x.dtype for x in label_shapes})
+        arg_types, _, aux_types = self.symbol.infer_type(**input_types)
+        assert arg_types is not None, "type inference failed"
+
+        arg_arrays = []
+        grad_arrays = {} if self.for_training else None
+
+        def _get_or_reshape(name, shared_data_arrays, arg_shape, arg_type, context, logger):
+            """Internal helper to get a memory block or re-use by re-shaping"""
+            if name in shared_data_arrays:
+                arg_arr = shared_data_arrays[name]
+
+                if np.prod(arg_arr.shape) >= np.prod(arg_shape):
+                    # nice, we can directly re-use this data blob
+                    assert arg_arr.dtype == arg_type
+                    arg_arr = arg_arr.reshape(arg_shape)
+                else:
+                    logger.warning(('bucketing: data "%s" has a shape %s' % (name, arg_shape)) +
+                                   (', which is larger than already allocated ') +
+                                   ('shape %s' % (arg_arr.shape,)) +
+                                   ('. Need to re-allocate. Consider putting ') +
+                                   ('default_bucket_key to') +
+                                   (' be the bucket taking the largest input for better ') +
+                                   ('memory sharing.'))
+                    arg_arr = nd.zeros(arg_shape, context, dtype=arg_type)
+
+                    # replace existing shared array because the new one is bigger
+                    shared_data_arrays[name] = arg_arr
+            else:
+                arg_arr = nd.zeros(arg_shape, context, dtype=arg_type)
+                shared_data_arrays[name] = arg_arr
+
+            return arg_arr
+
+        # create or borrow arguments and gradients
+        for j in range(len(self.arg_names)):
+            name = self.arg_names[j]
+            if name in self.param_names: # model parameters
+                if shared_exec is None:
+                    arg_arr = nd.zeros(arg_shapes[j], context, dtype=arg_types[j])
+                    if self.grad_req[name] != 'null':
+                        grad_arr = nd.zeros(arg_shapes[j], context, dtype=arg_types[j])
+                        grad_arrays[name] = grad_arr
+                else:
+                    arg_arr = shared_exec.arg_dict[name]
+                    assert arg_arr.shape == arg_shapes[j]
+                    assert arg_arr.dtype == arg_types[j]
+                    if self.grad_req[name] != 'null':
+                        grad_arrays[name] = shared_exec.grad_dict[name]
+            else: # data, label, or states
+                arg_arr = _get_or_reshape(name, shared_data_arrays, arg_shapes[j], arg_types[j],
+                                          context, self.logger)
+
+                # data might also need grad if inputs_need_grad is True
+                if self.grad_req[name] != 'null':
+                    grad_arrays[name] = _get_or_reshape('grad of ' + name, shared_data_arrays,
+                                                        arg_shapes[j], arg_types[j], context,
+                                                        self.logger)
+
+            arg_arrays.append(arg_arr)
+
+        # create or borrow aux variables
+        if shared_exec is None:
+            aux_arrays = [nd.zeros(s, context, dtype=t) for s, t in zip(aux_shapes, aux_types)]
+        else:
+            for j, arr in enumerate(shared_exec.aux_arrays):
+                assert aux_shapes[j] == arr.shape
+                assert aux_types[j] == arr.dtype
+            aux_arrays = shared_exec.aux_arrays[:]
+
+        executor = self.symbol.bind(ctx=context, args=arg_arrays,
+                                    args_grad=grad_arrays, aux_states=aux_arrays,
+                                    grad_req=self.grad_req, shared_exec=shared_exec)
+        # Get the total bytes allocated for this executor
+        return executor
+
+    def _sliced_shape(self, shapes, i, major_axis):
+        """Get the sliced shapes for the i-th executor.
+
+        Parameters
+        ----------
+        shapes : list of (str, tuple)
+            The original (name, shape) pairs.
+        i : int
+            Which executor we are dealing with.
+        """
+        sliced_shapes = []
+        for desc, axis in zip(shapes, major_axis):
+            shape = list(desc.shape)
+            if axis >= 0:
+                shape[axis] = self.slices[i].stop - self.slices[i].start
+            sliced_shapes.append(DataDesc(desc.name, tuple(shape), desc.dtype, desc.layout))
+        return sliced_shapes
+
+    def install_monitor(self, mon):
+        """Install monitor on all executors"""
+        for exe in self.execs:
+            mon.install(exe)
diff --git a/faster_rcnn/core/__init__.py b/faster_rcnn/core/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/faster_rcnn/core/callback.py b/faster_rcnn/core/callback.py
new file mode 100644
index 0000000..4286f43
--- /dev/null
+++ b/faster_rcnn/core/callback.py
@@ -0,0 +1,56 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import time
+import logging
+import mxnet as mx
+
+
+class Speedometer(object):
+    def __init__(self, batch_size, frequent=50):
+        self.batch_size = batch_size
+        self.frequent = frequent
+        self.init = False
+        self.tic = 0
+        self.last_count = 0
+
+    def __call__(self, param):
+        """Callback to show speed."""
+        count = param.nbatch
+        if self.last_count > count:
+            self.init = False
+        self.last_count = count
+
+        if self.init:
+            if count % self.frequent == 0:
+                speed = self.frequent * self.batch_size / (time.time() - self.tic)
+                s = ''
+                if param.eval_metric is not None:
+                    name, value = param.eval_metric.get()
+                    s = "Epoch[%d] Batch [%d]\tSpeed: %.2f samples/sec\tTrain-" % (param.epoch, count, speed)
+                    for n, v in zip(name, value):
+                        s += "%s=%f,\t" % (n, v)
+                else:
+                    s = "Iter[%d] Batch [%d]\tSpeed: %.2f samples/sec" % (param.epoch, count, speed)
+
+                logging.info(s)
+                print(s)
+                self.tic = time.time()
+        else:
+            self.init = True
+            self.tic = time.time()
+
+
+def do_checkpoint(prefix, means, stds):
+    def _callback(iter_no, sym, arg, aux):
+        arg['bbox_pred_weight_test'] = (arg['bbox_pred_weight'].T * mx.nd.array(stds)).T
+        arg['bbox_pred_bias_test'] = arg['bbox_pred_bias'] * mx.nd.array(stds) + mx.nd.array(means)
+        mx.model.save_checkpoint(prefix, iter_no + 1, sym, arg, aux)
+        arg.pop('bbox_pred_weight_test')
+        arg.pop('bbox_pred_bias_test')
+    return _callback
diff --git a/faster_rcnn/core/loader.py b/faster_rcnn/core/loader.py
new file mode 100644
index 0000000..78de81b
--- /dev/null
+++ b/faster_rcnn/core/loader.py
@@ -0,0 +1,506 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import numpy as np
+import mxnet as mx
+from mxnet.executor_manager import _split_input_slice
+
+from config.config import config
+from utils.image import tensor_vstack
+from rpn.rpn import get_rpn_testbatch, get_rpn_batch, assign_anchor
+from rcnn import get_rcnn_testbatch, get_rcnn_batch
+
+
+class TestLoader(mx.io.DataIter):
+    def __init__(self, roidb, config, batch_size=1, shuffle=False,
+                 has_rpn=False):
+        super(TestLoader, self).__init__()
+
+        # save parameters as properties
+        self.cfg = config
+        self.roidb = roidb
+        self.batch_size = batch_size
+        self.shuffle = shuffle
+        self.has_rpn = has_rpn
+
+        # infer properties from roidb
+        self.size = len(self.roidb)
+        self.index = np.arange(self.size)
+
+        # decide data names (there are no labels at test time)
+        if has_rpn:
+            self.data_name = ['data', 'im_info']
+        else:
+            self.data_name = ['data', 'rois']
+        self.label_name = None
+
+        # status variable for synchronization between get_data and get_label
+        self.cur = 0
+        self.data = None
+        self.label = []
+        self.im_info = None
+
+        # get first batch to fill in provide_data and provide_label
+        self.reset()
+        self.get_batch()
+
+    @property
+    def provide_data(self):
+        return [[(k, v.shape) for k, v in zip(self.data_name, idata)] for idata in self.data]
+
+    @property
+    def provide_label(self):
+        return [None for _ in range(len(self.data))]
+
+    @property
+    def provide_data_single(self):
+        return [(k, v.shape) for k, v in zip(self.data_name, self.data[0])]
+
+    @property
+    def provide_label_single(self):
+        return None
+
+    def reset(self):
+        self.cur = 0
+        if self.shuffle:
+            np.random.shuffle(self.index)
+
+    def iter_next(self):
+        return self.cur < self.size
+
+    def next(self):
+        if self.iter_next():
+            self.get_batch()
+            self.cur += self.batch_size
+            return self.im_info, mx.io.DataBatch(data=self.data, label=self.label,
+                                   pad=self.getpad(), index=self.getindex(),
+                                   provide_data=self.provide_data, provide_label=self.provide_label)
+        else:
+            raise StopIteration
+
+    def getindex(self):
+        return self.cur // self.batch_size
+
+    def getpad(self):
+        if self.cur + self.batch_size > self.size:
+            return self.cur + self.batch_size - self.size
+        else:
+            return 0
+
+    def get_batch(self):
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        roidb = [self.roidb[self.index[i]] for i in range(cur_from, cur_to)]
+        if self.has_rpn:
+            data, label, im_info = get_rpn_testbatch(roidb, self.cfg)
+        else:
+            data, label, im_info = get_rcnn_testbatch(roidb, self.cfg)
+        self.data = [[mx.nd.array(idata[name]) for name in self.data_name] for idata in data]
+        self.im_info = im_info
+
+    def get_batch_individual(self):
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        roidb = [self.roidb[self.index[i]] for i in range(cur_from, cur_to)]
+        if self.has_rpn:
+            data, label, im_info = get_rpn_testbatch(roidb, self.cfg)
+        else:
+            data, label, im_info = get_rcnn_testbatch(roidb, self.cfg)
+        self.data = [mx.nd.array(data[name]) for name in self.data_name]
+        self.im_info = im_info
+
+
+class ROIIter(mx.io.DataIter):
+    def __init__(self, roidb, config, batch_size=2, shuffle=False, ctx=None, work_load_list=None, aspect_grouping=False):
+        """
+        This Iter will provide roi data to Fast R-CNN network
+        :param roidb: must be preprocessed
+        :param batch_size: must divide BATCH_SIZE(128)
+        :param shuffle: bool
+        :param ctx: list of contexts
+        :param work_load_list: list of work load
+        :param aspect_grouping: group images with similar aspects
+        :return: ROIIter
+        """
+        super(ROIIter, self).__init__()
+
+        # save parameters as properties
+        self.roidb = roidb
+        self.cfg = config
+        self.batch_size = batch_size
+        self.shuffle = shuffle
+        self.ctx = ctx
+        if self.ctx is None:
+            self.ctx = [mx.cpu()]
+        self.work_load_list = work_load_list
+        self.aspect_grouping = aspect_grouping
+
+        # infer properties from roidb
+        self.size = len(roidb)
+        self.index = np.arange(self.size)
+
+        # decide data and label names (only for training)
+        self.data_name = ['data', 'rois']
+        self.label_name = ['label', 'bbox_target', 'bbox_weight']
+
+        # status variable for synchronization between get_data and get_label
+        self.cur = 0
+        self.batch = None
+        self.data = None
+        self.label = None
+
+        # get first batch to fill in provide_data and provide_label
+        self.reset()
+        self.get_batch_individual()
+
+    @property
+    def provide_data(self):
+        return [[(k, v.shape) for k, v in zip(self.data_name, self.data[i])] for i in xrange(len(self.data))]
+
+    @property
+    def provide_label(self):
+        return [[(k, v.shape) for k, v in zip(self.label_name, self.label[i])] for i in xrange(len(self.data))]
+
+    @property
+    def provide_data_single(self):
+        return [(k, v.shape) for k, v in zip(self.data_name, self.data[0])]
+
+    @property
+    def provide_label_single(self):
+        return [(k, v.shape) for k, v in zip(self.label_name, self.label[0])]
+
+    def reset(self):
+        self.cur = 0
+        if self.shuffle:
+            if self.aspect_grouping:
+                widths = np.array([r['width'] for r in self.roidb])
+                heights = np.array([r['height'] for r in self.roidb])
+                horz = (widths >= heights)
+                vert = np.logical_not(horz)
+                horz_inds = np.where(horz)[0]
+                vert_inds = np.where(vert)[0]
+                inds = np.hstack((np.random.permutation(horz_inds), np.random.permutation(vert_inds)))
+                num_full = inds.shape[0] - inds.shape[0] % self.batch_size  # avoids the empty inds[:-0] slice
+                inds_ = np.reshape(inds[:num_full], (-1, self.batch_size))
+                row_perm = np.random.permutation(np.arange(inds_.shape[0]))
+                inds[:num_full] = np.reshape(inds_[row_perm, :], (-1,))
+                self.index = inds
+            else:
+                np.random.shuffle(self.index)
+
+    def iter_next(self):
+        return self.cur + self.batch_size <= self.size
+
+    def next(self):
+        if self.iter_next():
+            self.get_batch_individual()
+            self.cur += self.batch_size
+            return mx.io.DataBatch(data=self.data, label=self.label,
+                                   pad=self.getpad(), index=self.getindex(),
+                                   provide_data=self.provide_data, provide_label=self.provide_label)
+        else:
+            raise StopIteration
+
+    def getindex(self):
+        return self.cur // self.batch_size
+
+    def getpad(self):
+        if self.cur + self.batch_size > self.size:
+            return self.cur + self.batch_size - self.size
+        else:
+            return 0
+
+    def get_batch(self):
+        # slice roidb
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        roidb = [self.roidb[self.index[i]] for i in range(cur_from, cur_to)]
+
+        # decide multi device slices
+        work_load_list = self.work_load_list
+        ctx = self.ctx
+        if work_load_list is None:
+            work_load_list = [1] * len(ctx)
+        assert isinstance(work_load_list, list) and len(work_load_list) == len(ctx), \
+            "Invalid settings for work load. "
+        slices = _split_input_slice(self.batch_size, work_load_list)
+
+        # get each device
+        data_list = []
+        label_list = []
+        for islice in slices:
+            iroidb = [roidb[i] for i in range(islice.start, islice.stop)]
+            data, label = get_rcnn_batch(iroidb, self.cfg)
+            data_list.append(data)
+            label_list.append(label)
+
+        all_data = dict()
+        for key in data_list[0].keys():
+            all_data[key] = tensor_vstack([batch[key] for batch in data_list])
+
+        all_label = dict()
+        for key in label_list[0].keys():
+            all_label[key] = tensor_vstack([batch[key] for batch in label_list])
+
+        self.data = [mx.nd.array(all_data[name]) for name in self.data_name]
+        self.label = [mx.nd.array(all_label[name]) for name in self.label_name]
+
+    def get_batch_individual(self):
+        # slice roidb
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        roidb = [self.roidb[self.index[i]] for i in range(cur_from, cur_to)]
+
+        # decide multi device slices
+        work_load_list = self.work_load_list
+        ctx = self.ctx
+        if work_load_list is None:
+            work_load_list = [1] * len(ctx)
+        assert isinstance(work_load_list, list) and len(work_load_list) == len(ctx), \
+            "Invalid settings for work load. "
+        slices = _split_input_slice(self.batch_size, work_load_list)
+
+        rst = []
+        for idx, islice in enumerate(slices):
+            iroidb = [roidb[i] for i in range(islice.start, islice.stop)]
+            rst.append(self.parfetch(iroidb))
+
+        all_data = [_['data'] for _ in rst]
+        all_label = [_['label'] for _ in rst]
+        self.data = [[mx.nd.array(data[key]) for key in self.data_name] for data in all_data]
+        self.label = [[mx.nd.array(label[key]) for key in self.label_name] for label in all_label]
+
+    def parfetch(self, iroidb):
+        data, label = get_rcnn_batch(iroidb, self.cfg)
+        return {'data': data, 'label': label}
+
+
+class AnchorLoader(mx.io.DataIter):
+
+    def __init__(self, feat_sym, roidb, cfg, batch_size=1, shuffle=False, ctx=None, work_load_list=None,
+                 feat_stride=16, anchor_scales=(8, 16, 32), anchor_ratios=(0.5, 1, 2), allowed_border=0,
+                 aspect_grouping=False):
+        """
+        This Iter will provide image data and anchor targets for RPN training
+        :param feat_sym: to infer shape of assign_output
+        :param roidb: must be preprocessed
+        :param batch_size: must divide BATCH_SIZE(128)
+        :param shuffle: bool
+        :param ctx: list of contexts
+        :param work_load_list: list of work load
+        :param aspect_grouping: group images with similar aspects
+        :return: AnchorLoader
+        """
+        super(AnchorLoader, self).__init__()
+
+        # save parameters as properties
+        self.feat_sym = feat_sym
+        self.roidb = roidb
+        self.cfg = cfg
+        self.batch_size = batch_size
+        self.shuffle = shuffle
+        self.ctx = ctx
+        if self.ctx is None:
+            self.ctx = [mx.cpu()]
+        self.work_load_list = work_load_list
+        self.feat_stride = feat_stride
+        self.anchor_scales = anchor_scales
+        self.anchor_ratios = anchor_ratios
+        self.allowed_border = allowed_border
+        self.aspect_grouping = aspect_grouping
+
+        # infer properties from roidb
+        self.size = len(roidb)
+        self.index = np.arange(self.size)
+
+        # decide data and label names
+        if cfg.TRAIN.END2END:
+            self.data_name = ['data', 'im_info', 'gt_boxes']
+        else:
+            self.data_name = ['data']
+        self.label_name = ['label', 'bbox_target', 'bbox_weight']
+
+        # status variable for synchronization between get_data and get_label
+        self.cur = 0
+        self.batch = None
+        self.data = None
+        self.label = None
+
+        # get first batch to fill in provide_data and provide_label
+        self.reset()
+        self.get_batch_individual()
+
+    @property
+    def provide_data(self):
+        return [[(k, v.shape) for k, v in zip(self.data_name, self.data[i])] for i in xrange(len(self.data))]
+
+    @property
+    def provide_label(self):
+        return [[(k, v.shape) for k, v in zip(self.label_name, self.label[i])] for i in xrange(len(self.label))]
+
+    @property
+    def provide_data_single(self):
+        return [(k, v.shape) for k, v in zip(self.data_name, self.data[0])]
+
+    @property
+    def provide_label_single(self):
+        return [(k, v.shape) for k, v in zip(self.label_name, self.label[0])]
+
+    def reset(self):
+        self.cur = 0
+        if self.shuffle:
+            if self.aspect_grouping:
+                widths = np.array([r['width'] for r in self.roidb])
+                heights = np.array([r['height'] for r in self.roidb])
+                horz = (widths >= heights)
+                vert = np.logical_not(horz)
+                horz_inds = np.where(horz)[0]
+                vert_inds = np.where(vert)[0]
+                inds = np.hstack((np.random.permutation(horz_inds), np.random.permutation(vert_inds)))
+                num_full = inds.shape[0] - inds.shape[0] % self.batch_size  # avoids the empty inds[:-0] slice
+                inds_ = np.reshape(inds[:num_full], (-1, self.batch_size))
+                row_perm = np.random.permutation(np.arange(inds_.shape[0]))
+                inds[:num_full] = np.reshape(inds_[row_perm, :], (-1,))
+                self.index = inds
+            else:
+                np.random.shuffle(self.index)
+
+    def iter_next(self):
+        return self.cur + self.batch_size <= self.size
+
+    def next(self):
+        if self.iter_next():
+            self.get_batch_individual()
+            self.cur += self.batch_size
+            return mx.io.DataBatch(data=self.data, label=self.label,
+                                   pad=self.getpad(), index=self.getindex(),
+                                   provide_data=self.provide_data, provide_label=self.provide_label)
+        else:
+            raise StopIteration
+
+    def getindex(self):
+        return self.cur // self.batch_size
+
+    def getpad(self):
+        if self.cur + self.batch_size > self.size:
+            return self.cur + self.batch_size - self.size
+        else:
+            return 0
+
+    def infer_shape(self, max_data_shape=None, max_label_shape=None):
+        """ Return maximum data and label shape for single gpu """
+        if max_data_shape is None:
+            max_data_shape = []
+        if max_label_shape is None:
+            max_label_shape = []
+        max_shapes = dict(max_data_shape + max_label_shape)
+        input_batch_size = max_shapes['data'][0]
+        im_info = [[max_shapes['data'][2], max_shapes['data'][3], 1.0]]
+        _, feat_shape, _ = self.feat_sym.infer_shape(**max_shapes)
+        label = assign_anchor(feat_shape[0], np.zeros((0, 5)), im_info, self.cfg,
+                              self.feat_stride, self.anchor_scales, self.anchor_ratios, self.allowed_border)
+        label = [label[k] for k in self.label_name]
+        label_shape = [(k, tuple([input_batch_size] + list(v.shape[1:]))) for k, v in zip(self.label_name, label)]
+        return max_data_shape, label_shape
+
+    def get_batch(self):
+        # slice roidb
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        roidb = [self.roidb[self.index[i]] for i in range(cur_from, cur_to)]
+
+        # decide multi device slice
+        work_load_list = self.work_load_list
+        ctx = self.ctx
+        if work_load_list is None:
+            work_load_list = [1] * len(ctx)
+        assert isinstance(work_load_list, list) and len(work_load_list) == len(ctx), \
+            "Invalid settings for work load. "
+        slices = _split_input_slice(self.batch_size, work_load_list)
+
+        # get training data for each device slice
+        data_list = []
+        label_list = []
+        for islice in slices:
+            iroidb = [roidb[i] for i in range(islice.start, islice.stop)]
+            data, label = get_rpn_batch(iroidb, self.cfg)
+            data_list.append(data)
+            label_list.append(label)
+
+        # pad data first and then assign anchor (read label)
+        data_tensor = tensor_vstack([batch['data'] for batch in data_list])
+        for data, data_pad in zip(data_list, data_tensor):
+            data['data'] = data_pad[np.newaxis, :]
+
+        new_label_list = []
+        for data, label in zip(data_list, label_list):
+            # infer label shape
+            data_shape = {k: v.shape for k, v in data.items()}
+            del data_shape['im_info']
+            _, feat_shape, _ = self.feat_sym.infer_shape(**data_shape)
+            feat_shape = [int(i) for i in feat_shape[0]]
+
+            # add gt_boxes to data for e2e
+            data['gt_boxes'] = label['gt_boxes'][np.newaxis, :, :]
+
+            # assign anchor for label
+            label = assign_anchor(feat_shape, label['gt_boxes'], data['im_info'], self.cfg,
+                                  self.feat_stride, self.anchor_scales,
+                                  self.anchor_ratios, self.allowed_border)
+            new_label_list.append(label)
+
+        all_data = dict()
+        for key in self.data_name:
+            all_data[key] = tensor_vstack([batch[key] for batch in data_list])
+
+        all_label = dict()
+        for key in self.label_name:
+            pad = -1 if key == 'label' else 0
+            all_label[key] = tensor_vstack([batch[key] for batch in new_label_list], pad=pad)
+
+        self.data = [mx.nd.array(all_data[key]) for key in self.data_name]
+        self.label = [mx.nd.array(all_label[key]) for key in self.label_name]
+
+    def get_batch_individual(self):
+        cur_from = self.cur
+        cur_to = min(cur_from + self.batch_size, self.size)
+        roidb = [self.roidb[self.index[i]] for i in range(cur_from, cur_to)]
+        # decide multi device slice
+        work_load_list = self.work_load_list
+        ctx = self.ctx
+        if work_load_list is None:
+            work_load_list = [1] * len(ctx)
+        assert isinstance(work_load_list, list) and len(work_load_list) == len(ctx), \
+            "Invalid settings for work load. "
+        slices = _split_input_slice(self.batch_size, work_load_list)
+        rst = []
+        for idx, islice in enumerate(slices):
+            iroidb = [roidb[i] for i in range(islice.start, islice.stop)]
+            rst.append(self.parfetch(iroidb))
+        all_data = [_['data'] for _ in rst]
+        all_label = [_['label'] for _ in rst]
+        self.data = [[mx.nd.array(data[key]) for key in self.data_name] for data in all_data]
+        self.label = [[mx.nd.array(label[key]) for key in self.label_name] for label in all_label]
+
+    def parfetch(self, iroidb):
+        # get training data for one device slice
+        data, label = get_rpn_batch(iroidb, self.cfg)
+        data_shape = {k: v.shape for k, v in data.items()}
+        del data_shape['im_info']
+        _, feat_shape, _ = self.feat_sym.infer_shape(**data_shape)
+        feat_shape = [int(i) for i in feat_shape[0]]
+
+        # add gt_boxes to data for e2e
+        data['gt_boxes'] = label['gt_boxes'][np.newaxis, :, :]
+
+        # assign anchor for label
+        label = assign_anchor(feat_shape, label['gt_boxes'], data['im_info'], self.cfg,
+                              self.feat_stride, self.anchor_scales,
+                              self.anchor_ratios, self.allowed_border)
+        return {'data': data, 'label': label}
+
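Reviewer note on `reset()` in both loaders: the aspect-grouping branch shuffles landscape and portrait images separately, concatenates the two groups, and then permutes whole batches so that most batches hold a single orientation. The sketch below illustrates that idea in plain Python. `aspect_grouped_order` and its `seed` argument are hypothetical names used only for illustration; they are not part of this patch, and the sketch uses the standard `random` module rather than NumPy.

```python
import random


def aspect_grouped_order(widths, heights, batch_size, seed=None):
    """Return a shuffled index order grouped by image orientation.

    Hypothetical helper for illustration only; it mirrors the idea of the
    aspect-grouping branch, not its exact random number stream.
    """
    rng = random.Random(seed)
    pairs = list(zip(widths, heights))
    # Split indices into landscape (w >= h) and portrait (w < h) images.
    horz = [i for i, (w, h) in enumerate(pairs) if w >= h]
    vert = [i for i, (w, h) in enumerate(pairs) if w < h]
    rng.shuffle(horz)
    rng.shuffle(vert)
    inds = horz + vert
    # Only full batches are permuted; a trailing partial batch stays last.
    num_full = len(inds) - len(inds) % batch_size
    batches = [inds[i:i + batch_size] for i in range(0, num_full, batch_size)]
    rng.shuffle(batches)
    return [i for batch in batches for i in batch] + inds[num_full:]
```

As in the loaders, the one batch that straddles the landscape/portrait boundary can be mixed, and computing the number of full batches explicitly keeps the batch permutation working when the dataset size is an exact multiple of `batch_size`.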
diff --git a/faster_rcnn/core/metric.py b/faster_rcnn/core/metric.py
new file mode 100644
index 0000000..52f885b
--- /dev/null
+++ b/faster_rcnn/core/metric.py
@@ -0,0 +1,176 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import mxnet as mx
+import numpy as np
+
+
+def get_rpn_names():
+    pred = ['rpn_cls_prob', 'rpn_bbox_loss']
+    label = ['rpn_label', 'rpn_bbox_target', 'rpn_bbox_weight']
+    return pred, label
+
+
+def get_rcnn_names(cfg):
+    pred = ['rcnn_cls_prob', 'rcnn_bbox_loss']
+    label = ['rcnn_label', 'rcnn_bbox_target', 'rcnn_bbox_weight']
+    if cfg.TRAIN.ENABLE_OHEM or cfg.TRAIN.END2END:
+        pred.append('rcnn_label')
+    if cfg.TRAIN.END2END:
+        rpn_pred, rpn_label = get_rpn_names()
+        pred = rpn_pred + pred
+        label = rpn_label
+    return pred, label
+
+
+class RPNAccMetric(mx.metric.EvalMetric):
+    def __init__(self):
+        super(RPNAccMetric, self).__init__('RPNAcc')
+        self.pred, self.label = get_rpn_names()
+
+    def update(self, labels, preds):
+        pred = preds[self.pred.index('rpn_cls_prob')]
+        label = labels[self.label.index('rpn_label')]
+
+        # pred (b, c, p) or (b, c, h, w)
+        pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
+        pred_label = pred_label.reshape((pred_label.shape[0], -1))
+        # label (b, p)
+        label = label.asnumpy().astype('int32')
+
+        # filter with keep_inds
+        keep_inds = np.where(label != -1)
+        pred_label = pred_label[keep_inds]
+        label = label[keep_inds]
+
+        self.sum_metric += np.sum(pred_label.flat == label.flat)
+        self.num_inst += len(pred_label.flat)
+
+
+class RCNNAccMetric(mx.metric.EvalMetric):
+    def __init__(self, cfg):
+        super(RCNNAccMetric, self).__init__('RCNNAcc')
+        self.e2e = cfg.TRAIN.END2END
+        self.ohem = cfg.TRAIN.ENABLE_OHEM
+        self.pred, self.label = get_rcnn_names(cfg)
+
+    def update(self, labels, preds):
+        pred = preds[self.pred.index('rcnn_cls_prob')]
+        if self.ohem or self.e2e:
+            label = preds[self.pred.index('rcnn_label')]
+        else:
+            label = labels[self.label.index('rcnn_label')]
+
+        last_dim = pred.shape[-1]
+        pred_label = pred.asnumpy().reshape(-1, last_dim).argmax(axis=1).astype('int32')
+        label = label.asnumpy().reshape(-1,).astype('int32')
+
+        # filter with keep_inds
+        keep_inds = np.where(label != -1)
+        pred_label = pred_label[keep_inds]
+        label = label[keep_inds]
+
+        self.sum_metric += np.sum(pred_label.flat == label.flat)
+        self.num_inst += len(pred_label.flat)
+
+
+class RPNLogLossMetric(mx.metric.EvalMetric):
+    def __init__(self):
+        super(RPNLogLossMetric, self).__init__('RPNLogLoss')
+        self.pred, self.label = get_rpn_names()
+
+    def update(self, labels, preds):
+        pred = preds[self.pred.index('rpn_cls_prob')]
+        label = labels[self.label.index('rpn_label')]
+
+        # label (b, p)
+        label = label.asnumpy().astype('int32').reshape((-1))
+        # pred (b, c, p) or (b, c, h, w) --> (b, p, c) --> (b*p, c)
+        pred = pred.asnumpy().reshape((pred.shape[0], pred.shape[1], -1)).transpose((0, 2, 1))
+        pred = pred.reshape((label.shape[0], -1))
+
+        # filter with keep_inds
+        keep_inds = np.where(label != -1)[0]
+        label = label[keep_inds]
+        cls = pred[keep_inds, label]
+
+        cls += 1e-14
+        cls_loss = -1 * np.log(cls)
+        cls_loss = np.sum(cls_loss)
+        self.sum_metric += cls_loss
+        self.num_inst += label.shape[0]
+
+
+class RCNNLogLossMetric(mx.metric.EvalMetric):
+    def __init__(self, cfg):
+        super(RCNNLogLossMetric, self).__init__('RCNNLogLoss')
+        self.e2e = cfg.TRAIN.END2END
+        self.ohem = cfg.TRAIN.ENABLE_OHEM
+        self.pred, self.label = get_rcnn_names(cfg)
+
+    def update(self, labels, preds):
+        pred = preds[self.pred.index('rcnn_cls_prob')]
+        if self.ohem or self.e2e:
+            label = preds[self.pred.index('rcnn_label')]
+        else:
+            label = labels[self.label.index('rcnn_label')]
+
+        last_dim = pred.shape[-1]
+        pred = pred.asnumpy().reshape(-1, last_dim)
+        label = label.asnumpy().reshape(-1,).astype('int32')
+
+        # filter with keep_inds
+        keep_inds = np.where(label != -1)[0]
+        label = label[keep_inds]
+        cls = pred[keep_inds, label]
+
+        cls += 1e-14
+        cls_loss = -1 * np.log(cls)
+        cls_loss = np.sum(cls_loss)
+        self.sum_metric += cls_loss
+        self.num_inst += label.shape[0]
+
+
+class RPNL1LossMetric(mx.metric.EvalMetric):
+    def __init__(self):
+        super(RPNL1LossMetric, self).__init__('RPNL1Loss')
+        self.pred, self.label = get_rpn_names()
+
+    def update(self, labels, preds):
+        bbox_loss = preds[self.pred.index('rpn_bbox_loss')].asnumpy()
+
+        # calculate num_inst (average over the kept anchors)
+        label = labels[self.label.index('rpn_label')].asnumpy()
+        num_inst = np.sum(label != -1)
+
+        self.sum_metric += np.sum(bbox_loss)
+        self.num_inst += num_inst
+
+
+class RCNNL1LossMetric(mx.metric.EvalMetric):
+    def __init__(self, cfg):
+        super(RCNNL1LossMetric, self).__init__('RCNNL1Loss')
+        self.e2e = cfg.TRAIN.END2END
+        self.ohem = cfg.TRAIN.ENABLE_OHEM
+        self.pred, self.label = get_rcnn_names(cfg)
+
+    def update(self, labels, preds):
+        bbox_loss = preds[self.pred.index('rcnn_bbox_loss')].asnumpy()
+        if self.ohem:
+            label = preds[self.pred.index('rcnn_label')].asnumpy()
+        else:
+            if self.e2e:
+                label = preds[self.pred.index('rcnn_label')].asnumpy()
+            else:
+                label = labels[self.label.index('rcnn_label')].asnumpy()
+
+        # calculate num_inst (average over the kept RoIs)
+        num_inst = np.sum(label != -1)
+
+        self.sum_metric += np.sum(bbox_loss)
+        self.num_inst += num_inst
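Reviewer note on the log-loss metrics: `RPNLogLossMetric` and `RCNNLogLossMetric` both drop entries whose label is `-1`, pick the predicted probability of the true class, add a small epsilon, and accumulate the negative log. A minimal sketch of that accumulation in plain Python follows. `masked_log_loss` is a hypothetical helper, not part of this patch, and it works on nested lists rather than MXNet or NumPy arrays.

```python
import math


def masked_log_loss(probs, labels, ignore_label=-1, eps=1e-14):
    """Sum -log(p[true class]) over rows whose label is not ignore_label.

    probs  : list of per-class probability rows, one row per sample
    labels : list of integer class labels, ignore_label marks skipped rows
    Returns (loss_sum, num_kept), matching sum_metric / num_inst updates.
    """
    loss_sum = 0.0
    num_kept = 0
    for row, label in zip(probs, labels):
        if label == ignore_label:
            continue  # ignored samples contribute to neither sum nor count
        loss_sum += -math.log(row[label] + eps)
        num_kept += 1
    return loss_sum, num_kept
```

Dividing `loss_sum` by `num_kept` gives the average that the metric reports, so ignored anchors or RoIs do not dilute the value.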
diff --git a/faster_rcnn/core/module.py b/faster_rcnn/core/module.py
new file mode 100644
index 0000000..25924fb
--- /dev/null
+++ b/faster_rcnn/core/module.py
@@ -0,0 +1,1067 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+"""A `MutableModule` implements the `BaseModule` API and allows the input shape
+to vary across training iterations. If shapes vary, executors are rebound,
+using shared arrays from the initial module bound with the maximum shape.
+"""
+
+import time
+import logging
+import warnings
+
+from mxnet import context as ctx
+from mxnet.initializer import Uniform, InitDesc
+from mxnet.module.base_module import BaseModule, _check_input_names, _parse_data_desc, _as_list
+from mxnet.model import _create_kvstore, _initialize_kvstore, _update_params, _update_params_on_kvstore, load_checkpoint, BatchEndParam
+from mxnet import metric
+
+from .DataParallelExecutorGroup import DataParallelExecutorGroup
+from mxnet import ndarray as nd
+from mxnet import optimizer as opt
+
+
+class Module(BaseModule):
+    """Module is a basic module that wraps a `Symbol`. It is functionally the same
+    as the `FeedForward` model, except under the module API.
+
+    Parameters
+    ----------
+    symbol : Symbol
+    data_names : list of str
+        Default is `('data',)` for a typical model used in image classification.
+    label_names : list of str
+        Default is `('softmax_label',)` for a typical model used in image
+        classification.
+    logger : Logger
+        Default is `logging`.
+    context : Context or list of Context
+        Default is `cpu()`.
+    work_load_list : list of number
+        Default `None`, indicating uniform workload.
+    fixed_param_names: list of str
+        Default `None`, indicating no network parameters are fixed.
+    state_names : list of str
+        States are similar to data and label, but are not provided by the data iterator.
+        Instead, they are initialized to 0 and can be set by `set_states()`.
+    """
+    def __init__(self, symbol, data_names=('data',), label_names=('softmax_label',),
+                 logger=logging, context=ctx.cpu(), work_load_list=None,
+                 fixed_param_names=None, state_names=None):
+        super(Module, self).__init__(logger=logger)
+
+        if isinstance(context, ctx.Context):
+            context = [context]
+        self._context = context
+        if work_load_list is None:
+            work_load_list = [1] * len(self._context)
+        assert len(work_load_list) == len(self._context)
+        self._work_load_list = work_load_list
+
+        self._symbol = symbol
+
+        data_names = list(data_names) if data_names is not None else []
+        label_names = list(label_names) if label_names is not None else []
+        state_names = list(state_names) if state_names is not None else []
+        fixed_param_names = list(fixed_param_names) if fixed_param_names is not None else []
+
+        _check_input_names(symbol, data_names, "data", True)
+        _check_input_names(symbol, label_names, "label", False)
+        _check_input_names(symbol, state_names, "state", True)
+        _check_input_names(symbol, fixed_param_names, "fixed_param", True)
+
+        arg_names = symbol.list_arguments()
+        input_names = data_names + label_names + state_names
+        self._param_names = [x for x in arg_names if x not in input_names]
+        self._fixed_param_names = fixed_param_names
+        self._aux_names = symbol.list_auxiliary_states()
+        self._data_names = data_names
+        self._label_names = label_names
+        self._state_names = state_names
+        self._output_names = symbol.list_outputs()
+
+        self._arg_params = None
+        self._aux_params = None
+        self._params_dirty = False
+
+        self._optimizer = None
+        self._kvstore = None
+        self._update_on_kvstore = None
+        self._updater = None
+        self._preload_opt_states = None
+        self._grad_req = None
+
+        self._exec_group = None
+        self._data_shapes = None
+        self._label_shapes = None
+
+    @staticmethod
+    def load(prefix, epoch, load_optimizer_states=False, **kwargs):
+        """Create a model from previously saved checkpoint.
+
+        Parameters
+        ----------
+        prefix : str
+            path prefix of saved model files. You should have
+            "prefix-symbol.json", "prefix-xxxx.params", and
+            optionally "prefix-xxxx.states", where xxxx is the
+            epoch number.
+        epoch : int
+            epoch to load.
+        load_optimizer_states : bool
+            whether to load optimizer states. Checkpoint needs
+            to have been made with save_optimizer_states=True.
+        data_names : list of str
+            Default is `('data',)` for a typical model used in image classification.
+        label_names : list of str
+            Default is `('softmax_label',)` for a typical model used in image
+            classification.
+        logger : Logger
+            Default is `logging`.
+        context : Context or list of Context
+            Default is `cpu()`.
+        work_load_list : list of number
+            Default `None`, indicating uniform workload.
+        fixed_param_names: list of str
+            Default `None`, indicating no network parameters are fixed.
+        """
+        sym, args, auxs = load_checkpoint(prefix, epoch)
+        mod = Module(symbol=sym, **kwargs)
+        mod._arg_params = args
+        mod._aux_params = auxs
+        mod.params_initialized = True
+        if load_optimizer_states:
+            mod._preload_opt_states = '%s-%04d.states'%(prefix, epoch)
+        return mod
+
+    def save_checkpoint(self, prefix, epoch, save_optimizer_states=False):
+        """Save current progress to checkpoint.
+        Use mx.callback.module_checkpoint as epoch_end_callback to save during training.
+
+        Parameters
+        ----------
+        prefix : str
+            The file prefix to checkpoint to
+        epoch : int
+            The current epoch number
+        save_optimizer_states : bool
+            Whether to save optimizer states so that training can be resumed
+        """
+        self._symbol.save('%s-symbol.json'%prefix)
+        param_name = '%s-%04d.params' % (prefix, epoch)
+        self.save_params(param_name)
+        logging.info('Saved checkpoint to \"%s\"', param_name)
+        if save_optimizer_states:
+            state_name = '%s-%04d.states' % (prefix, epoch)
+            self.save_optimizer_states(state_name)
+            logging.info('Saved optimizer state to \"%s\"', state_name)
+
+    def _reset_bind(self):
+        """Internal function to reset the bound state."""
+        self.binded = False
+        self._exec_group = None
+        self._data_shapes = None
+        self._label_shapes = None
+
+    @property
+    def data_names(self):
+        """A list of names for data required by this module."""
+        return self._data_names
+
+    @property
+    def label_names(self):
+        """A list of names for labels required by this module."""
+        return self._label_names
+
+    @property
+    def output_names(self):
+        """A list of names for the outputs of this module."""
+        return self._output_names
+
+    @property
+    def data_shapes(self):
+        """Get data shapes.
+        Returns
+        -------
+        A list of `(name, shape)` pairs.
+        """
+        assert self.binded
+        return self._data_shapes
+
+    @property
+    def label_shapes(self):
+        """Get label shapes.
+        Returns
+        -------
+        A list of `(name, shape)` pairs. The return value could be `None` if
+        the module does not need labels, or if the module is not bound for
+        training (in this case, label information is not available).
+        """
+        assert self.binded
+        return self._label_shapes
+
+    @property
+    def output_shapes(self):
+        """Get output shapes.
+        Returns
+        -------
+        A list of `(name, shape)` pairs.
+        """
+        assert self.binded
+        return self._exec_group.get_output_shapes()
+
+    def get_params(self):
+        """Get current parameters.
+        Returns
+        -------
+        `(arg_params, aux_params)`, each a dictionary of name to parameters (in
+        `NDArray`) mapping.
+        """
+        assert self.binded and self.params_initialized
+
+        if self._params_dirty:
+            self._sync_params_from_devices()
+        return (self._arg_params, self._aux_params)
+
+    def init_params(self, initializer=Uniform(0.01), arg_params=None, aux_params=None,
+                    allow_missing=False, force_init=False):
+        """Initialize the parameters and auxiliary states.
+
+        Parameters
+        ----------
+        initializer : Initializer
+            Called to initialize parameters if needed.
+        arg_params : dict
+            If not None, should be a dictionary of existing arg_params. Initialization
+            will be copied from that.
+        aux_params : dict
+            If not None, should be a dictionary of existing aux_params. Initialization
+            will be copied from that.
+        allow_missing : bool
+            If true, params could contain missing values, and the initializer will be
+            called to fill those missing params.
+        force_init : bool
+            If true, will force re-initialize even if already initialized.
+        """
+        if self.params_initialized and not force_init:
+            warnings.warn("Parameters already initialized and force_init=False. "
+                          "init_params call ignored.", stacklevel=2)
+            return
+        assert self.binded, 'call bind before initializing the parameters'
+
+        def _impl(name, arr, cache):
+            """Internal helper for parameter initialization"""
+            if cache is not None:
+                if name in cache:
+                    cache_arr = cache[name]
+
+                    # just in case the cached array is just the target itself
+                    if cache_arr is not arr:
+                        cache_arr.copyto(arr)
+                else:
+                    if not allow_missing:
+                        raise RuntimeError("%s is not present" % name)
+                    if initializer is not None:
+                        initializer(name, arr)
+            else:
+                initializer(name, arr)
+
+        attrs = self._symbol.attr_dict()
+        for name, arr in self._arg_params.items():
+            desc = InitDesc(name, attrs.get(name, None))
+            _impl(desc, arr, arg_params)
+
+        for name, arr in self._aux_params.items():
+            desc = InitDesc(name, attrs.get(name, None))
+            _impl(desc, arr, aux_params)
+
+        self.params_initialized = True
+        self._params_dirty = False
+
+        # copy the initialized parameters to devices
+        self._exec_group.set_params(self._arg_params, self._aux_params)
+
+    def set_params(self, arg_params, aux_params, allow_missing=False, force_init=True):
+        """Assign parameter and aux state values.
+
+        Parameters
+        ----------
+        arg_params : dict
+            Dictionary of name to value (`NDArray`) mapping.
+        aux_params : dict
+            Dictionary of name to value (`NDArray`) mapping.
+        allow_missing : bool
+            If true, params could contain missing values, and the initializer will be
+            called to fill those missing params.
+        force_init : bool
+            If true, will force re-initialize even if already initialized.
+
+        Examples
+        --------
+        An example of setting module parameters::
+            >>> sym, arg_params, aux_params = \
+            ...     mx.model.load_checkpoint(model_prefix, n_epoch_load)
+            >>> mod.set_params(arg_params=arg_params, aux_params=aux_params)
+        """
+        if not allow_missing:
+            self.init_params(initializer=None, arg_params=arg_params, aux_params=aux_params,
+                             allow_missing=allow_missing, force_init=force_init)
+            return
+
+        if self.params_initialized and not force_init:
+            warnings.warn("Parameters already initialized and force_init=False. "
+                          "set_params call ignored.", stacklevel=2)
+            return
+
+        self._exec_group.set_params(arg_params, aux_params)
+
+        # because we didn't update self._arg_params, they are dirty now.
+        self._params_dirty = True
+        self.params_initialized = True
+
+    def bind(self, data_shapes, label_shapes=None, for_training=True,
+             inputs_need_grad=False, force_rebind=False, shared_module=None,
+             grad_req='write'):
+        """Bind the symbols to construct executors. This is necessary before one
+        can perform computation with the module.
+
+        Parameters
+        ----------
+        data_shapes : list of (str, tuple)
+            Typically is `data_iter.provide_data`.
+        label_shapes : list of (str, tuple)
+            Typically is `data_iter.provide_label`.
+        for_training : bool
+            Default is `True`. Whether the executors should be bound for training.
+        inputs_need_grad : bool
+            Default is `False`. Whether the gradients to the input data need to be computed.
+            Typically this is not needed. But this might be needed when implementing composition
+            of modules.
+        force_rebind : bool
+            Default is `False`. This function does nothing if the executors are already
+            bound. But if this is `True`, the executors will be forced to rebind.
+        shared_module : Module
+            Default is `None`. This is used in bucketing. When not `None`, the shared module
+            essentially corresponds to a different bucket -- a module with different symbol
+            but with the same sets of parameters (e.g. unrolled RNNs with different lengths).
+        """
+        # force rebinding is typically used when one wants to switch from
+        # training to prediction phase.
+        if force_rebind:
+            self._reset_bind()
+
+        if self.binded:
+            self.logger.warning('Already binded, ignoring bind()')
+            return
+
+        self.for_training = for_training
+        self.inputs_need_grad = inputs_need_grad
+        self.binded = True
+        self._grad_req = grad_req
+
+        if not for_training:
+            assert not inputs_need_grad
+        else:
+            pass
+            # this does not always hold, as some modules might not contain a loss function
+            # that consumes the labels
+            # assert label_shapes is not None
+
+        # self._data_shapes, self._label_shapes = _parse_data_desc(
+        #     self.data_names, self.label_names, data_shapes, label_shapes)
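+        # label_shapes defaults to None; expand it to one entry per data shape so zip() works
+        if label_shapes is None:
+            label_shapes = [None] * len(data_shapes)
+        self._data_shapes, self._label_shapes = zip(*[
+            _parse_data_desc(self.data_names, self.label_names, data_shape, label_shape)
+            for data_shape, label_shape in zip(data_shapes, label_shapes)])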
+        if self._label_shapes.count(None) == len(self._label_shapes):
+            self._label_shapes = None
+
+        if shared_module is not None:
+            assert isinstance(shared_module, Module) and \
+                    shared_module.binded and shared_module.params_initialized
+            shared_group = shared_module._exec_group
+        else:
+            shared_group = None
+
+        self._exec_group = DataParallelExecutorGroup(self._symbol, self._context,
+                                                     self._work_load_list, self._data_shapes,
+                                                     self._label_shapes, self._param_names,
+                                                     for_training, inputs_need_grad,
+                                                     shared_group, logger=self.logger,
+                                                     fixed_param_names=self._fixed_param_names,
+                                                     grad_req=grad_req,
+                                                     state_names=self._state_names)
+        # self._total_exec_bytes = self._exec_group._total_exec_bytes
+        if shared_module is not None:
+            self.params_initialized = True
+            self._arg_params = shared_module._arg_params
+            self._aux_params = shared_module._aux_params
+        elif self.params_initialized:
+            # if the parameters are already initialized, we are re-binding
+            # so automatically copy the already initialized params
+            self._exec_group.set_params(self._arg_params, self._aux_params)
+        else:
+            assert self._arg_params is None and self._aux_params is None
+            param_arrays = [
+                nd.zeros(x[0].shape, dtype=x[0].dtype)
+                for x in self._exec_group.param_arrays
+            ]
+            self._arg_params = {name: arr for name, arr in zip(self._param_names, param_arrays)}
+
+            aux_arrays = [
+                nd.zeros(x[0].shape, dtype=x[0].dtype)
+                for x in self._exec_group.aux_arrays
+            ]
+            self._aux_params = {name: arr for name, arr in zip(self._aux_names, aux_arrays)}
+
+        if shared_module is not None and shared_module.optimizer_initialized:
+            self.borrow_optimizer(shared_module)
+
+
+    def reshape(self, data_shapes, label_shapes=None):
+        """Reshape the module for new input shapes.
+
+        Parameters
+        ----------
+        data_shapes : list of (str, tuple)
+            Typically is `data_iter.provide_data`.
+        label_shapes : list of (str, tuple)
+            Typically is `data_iter.provide_label`.
+        """
+        assert self.binded
+        # self._data_shapes, self._label_shapes = _parse_data_desc(
+        #     self.data_names, self.label_names, data_shapes, label_shapes)
+        self._data_shapes, self._label_shapes = zip(*[_parse_data_desc(self.data_names, self.label_names, data_shape, label_shape)
+                                                      for data_shape, label_shape in zip(data_shapes, label_shapes)])
+
+        self._exec_group.reshape(self._data_shapes, self._label_shapes)
+
+
+    def init_optimizer(self, kvstore='local', optimizer='sgd',
+                       optimizer_params=(('learning_rate', 0.01),), force_init=False):
+        """Install and initialize optimizers.
+
+        Parameters
+        ----------
+        kvstore : str or KVStore
+            Default `'local'`.
+        optimizer : str or Optimizer
+            Default `'sgd'`
+        optimizer_params : dict
+            Default `(('learning_rate', 0.01),)`. The default value is not a dictionary,
+            just to avoid pylint warning of dangerous default values.
+        force_init : bool
+            Default `False`, indicating whether we should force re-initializing the
+            optimizer in the case an optimizer is already installed.
+        """
+        assert self.binded and self.params_initialized
+
+        if self.optimizer_initialized and not force_init:
+            self.logger.warning('optimizer already initialized, ignoring...')
+            return
+
+        (kvstore, update_on_kvstore) = \
+                _create_kvstore(kvstore, len(self._context), self._arg_params)
+
+        batch_size = self._exec_group.batch_size
+        if kvstore and 'dist' in kvstore.type and '_sync' in kvstore.type:
+            batch_size *= kvstore.num_workers
+        rescale_grad = 1.0/batch_size
+
+        if isinstance(optimizer, str):
+            idx2name = {}
+            if update_on_kvstore:
+                idx2name.update(enumerate(self._exec_group.param_names))
+            else:
+                for k in range(len(self._context)):
+                    idx2name.update({i*len(self._context)+k: n
+                                     for i, n in enumerate(self._exec_group.param_names)})
+            optimizer_params = dict(optimizer_params)
+            if 'rescale_grad' not in optimizer_params:
+                optimizer_params['rescale_grad'] = rescale_grad
+            optimizer = opt.create(optimizer,
+                                   sym=self.symbol, param_idx2name=idx2name,
+                                   **optimizer_params)
+        else:
+            assert isinstance(optimizer, opt.Optimizer)
+            if optimizer.rescale_grad != rescale_grad:
+                #pylint: disable=no-member
+                warnings.warn(
+                    "Optimizer created manually outside Module but rescale_grad " +
+                    "is not normalized to 1.0/batch_size/num_workers (%s vs. %s). "%(
+                        optimizer.rescale_grad, rescale_grad) +
+                    "Is this intended?", stacklevel=2)
+
+        self._optimizer = optimizer
+        self._kvstore = kvstore
+        self._update_on_kvstore = update_on_kvstore
+        self._updater = None
+
+        if kvstore:
+            # copy initialized local parameters to kvstore
+            _initialize_kvstore(kvstore=kvstore,
+                                param_arrays=self._exec_group.param_arrays,
+                                arg_params=self._arg_params,
+                                param_names=self._param_names,
+                                update_on_kvstore=update_on_kvstore)
+        if update_on_kvstore:
+            kvstore.set_optimizer(self._optimizer)
+        else:
+            self._updater = opt.get_updater(optimizer)
+
+        self.optimizer_initialized = True
+
+        if self._preload_opt_states is not None:
+            self.load_optimizer_states(self._preload_opt_states)
+            self._preload_opt_states = None
+
+    def borrow_optimizer(self, shared_module):
+        """Borrow optimizer from a shared module. Used in bucketing, where exactly the same
+        optimizer (esp. kvstore) is used.
+
+        Parameters
+        ----------
+        shared_module : Module
+        """
+        assert shared_module.optimizer_initialized
+        self._optimizer = shared_module._optimizer
+        self._kvstore = shared_module._kvstore
+        self._update_on_kvstore = shared_module._update_on_kvstore
+        self._updater = shared_module._updater
+        self.optimizer_initialized = True
+
+    def forward(self, data_batch, is_train=None):
+        """Forward computation.
+
+        Parameters
+        ----------
+        data_batch : DataBatch
+            Could be anything with similar API implemented.
+        is_train : bool
+            Default is `None`, which means `is_train` takes the value of `self.for_training`.
+        """
+        assert self.binded and self.params_initialized
+        self._exec_group.forward(data_batch, is_train)
+
+    def backward(self, out_grads=None):
+        """Backward computation.
+
+        Parameters
+        ----------
+        out_grads : NDArray or list of NDArray, optional
+            Gradient on the outputs to be propagated back.
+            This parameter is only needed when bind is called
+            on outputs that are not a loss function.
+        """
+        assert self.binded and self.params_initialized
+        self._exec_group.backward(out_grads=out_grads)
+
+    def update(self):
+        """Update parameters according to the installed optimizer and the gradients computed
+        in the previous forward-backward batch.
+        """
+        assert self.binded and self.params_initialized and self.optimizer_initialized
+
+        self._params_dirty = True
+        if self._update_on_kvstore:
+            _update_params_on_kvstore(self._exec_group.param_arrays,
+                                      self._exec_group.grad_arrays,
+                                      self._kvstore)
+        else:
+            _update_params(self._exec_group.param_arrays,
+                           self._exec_group.grad_arrays,
+                           updater=self._updater,
+                           num_device=len(self._context),
+                           kvstore=self._kvstore)
+
+    def get_outputs(self, merge_multi_context=True):
+        """Get outputs of the previous forward computation.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the outputs
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look as if they came from
+            a single executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.binded and self.params_initialized
+        return self._exec_group.get_outputs(merge_multi_context=merge_multi_context)
+
+    def get_input_grads(self, merge_multi_context=True):
+        """Get the gradients with respect to the inputs of the module.
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the gradients
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look as if they came from
+            a single executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[grad1, grad2]`. Otherwise, it
+        is like `[[grad1_dev1, grad1_dev2], [grad2_dev1, grad2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.binded and self.params_initialized and self.inputs_need_grad
+        return self._exec_group.get_input_grads(merge_multi_context=merge_multi_context)
+
+    def get_states(self, merge_multi_context=True):
+        """Get states from all devices
+
+        Parameters
+        ----------
+        merge_multi_context : bool
+            Default is `True`. In the case when data-parallelism is used, the states
+            will be collected from multiple devices. A `True` value indicates that we
+            should merge the collected results so that they look as if they came from
+            a single executor.
+
+        Returns
+        -------
+        If `merge_multi_context` is `True`, it is like `[out1, out2]`. Otherwise, it
+        is like `[[out1_dev1, out1_dev2], [out2_dev1, out2_dev2]]`. All the output
+        elements are `NDArray`.
+        """
+        assert self.binded and self.params_initialized
+        return self._exec_group.get_states(merge_multi_context=merge_multi_context)
+
+    def set_states(self, states=None, value=None):
+        """Set value for states. Only one of states & value can be specified.
+
+        Parameters
+        ----------
+        states : list of list of NDArrays
+            source states arrays formatted like [[state1_dev1, state1_dev2],
+            [state2_dev1, state2_dev2]].
+        value : number
+            a single scalar value for all state arrays.
+        """
+        assert self.binded and self.params_initialized
+        self._exec_group.set_states(states, value)
+
+    def update_metric(self, eval_metric, labels):
+        """Evaluate and accumulate evaluation metric on outputs of the last forward computation.
+
+        Parameters
+        ----------
+        eval_metric : EvalMetric
+        labels : list of NDArray
+            Typically `data_batch.label`.
+        """
+        self._exec_group.update_metric(eval_metric, labels)
+
+    def _sync_params_from_devices(self):
+        """Synchronize parameters from devices to CPU. This function should be called after
+        calling `update` that updates the parameters on the devices, before one can read the
+        latest parameters from `self._arg_params` and `self._aux_params`.
+        """
+        self._exec_group.get_params(self._arg_params, self._aux_params)
+        self._params_dirty = False
+
+    def save_optimizer_states(self, fname):
+        """Save optimizer (updater) state to file
+
+        Parameters
+        ----------
+        fname : str
+            Path to output states file.
+        """
+        assert self.optimizer_initialized
+
+        if self._update_on_kvstore:
+            self._kvstore.save_optimizer_states(fname)
+        else:
+            with open(fname, 'wb') as fout:
+                fout.write(self._updater.get_states())
+
+    def load_optimizer_states(self, fname):
+        """Load optimizer (updater) state from file
+
+        Parameters
+        ----------
+        fname : str
+            Path to input states file.
+        """
+        assert self.optimizer_initialized
+
+        if self._update_on_kvstore:
+            self._kvstore.load_optimizer_states(fname)
+        else:
+            with open(fname, 'rb') as fin:
+                self._updater.set_states(fin.read())
+
+    def install_monitor(self, mon):
+        """Install monitor on all executors."""
+        assert self.binded
+        self._exec_group.install_monitor(mon)
+
+
+class MutableModule(BaseModule):
+    """A mutable module is a module that supports variable input data.
+
+    Parameters
+    ----------
+    symbol : Symbol
+    data_names : list of str
+    label_names : list of str
+    logger : Logger
+    context : Context or list of Context
+    work_load_list : list of number
+    max_data_shapes : list of (name, shape) tuple, designating inputs whose shapes vary
+    max_label_shapes : list of (name, shape) tuple, designating inputs whose shapes vary
+    fixed_param_prefix : list of str, indicating fixed parameters
+    """
+    def __init__(self, symbol, data_names, label_names,
+                 logger=logging, context=ctx.cpu(), work_load_list=None,
+                 max_data_shapes=None, max_label_shapes=None, fixed_param_prefix=None):
+        super(MutableModule, self).__init__(logger=logger)
+        self._symbol = symbol
+        self._data_names = data_names
+        self._label_names = label_names
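+        # a single Context is wrapped in a list, since later code calls len(self._context)
+        self._context = [context] if isinstance(context, ctx.Context) else context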
+        self._work_load_list = work_load_list
+
+        self._curr_module = None
+        self._max_data_shapes = max_data_shapes
+        self._max_label_shapes = max_label_shapes
+        self._fixed_param_prefix = fixed_param_prefix
+
+        fixed_param_names = list()
+        if fixed_param_prefix is not None:
+            for name in self._symbol.list_arguments():
+                for prefix in self._fixed_param_prefix:
+                    if prefix in name:
+                        fixed_param_names.append(name)
+        self._fixed_param_names = fixed_param_names
+        self._preload_opt_states = None
+
+    def _reset_bind(self):
+        self.binded = False
+        self._curr_module = None
+
+    @property
+    def data_names(self):
+        return self._data_names
+
+    @property
+    def output_names(self):
+        return self._symbol.list_outputs()
+
+    @property
+    def data_shapes(self):
+        assert self.binded
+        return self._curr_module.data_shapes
+
+    @property
+    def label_shapes(self):
+        assert self.binded
+        return self._curr_module.label_shapes
+
+    @property
+    def output_shapes(self):
+        assert self.binded
+        return self._curr_module.output_shapes
+
+    def get_params(self):
+        assert self.binded and self.params_initialized
+        return self._curr_module.get_params()
+
+    def init_params(self, initializer=Uniform(0.01), arg_params=None, aux_params=None,
+                    allow_missing=False, force_init=False):
+        if self.params_initialized and not force_init:
+            return
+        assert self.binded, 'call bind before initializing the parameters'
+        self._curr_module.init_params(initializer=initializer, arg_params=arg_params,
+                                      aux_params=aux_params, allow_missing=allow_missing,
+                                      force_init=force_init)
+        self.params_initialized = True
+
+    def bind(self, data_shapes, label_shapes=None, for_training=True,
+             inputs_need_grad=False, force_rebind=False, shared_module=None, grad_req='write'):
+        # in case we already initialized params, keep them
+        if self.params_initialized:
+            arg_params, aux_params = self.get_params()
+
+        # force rebinding is typically used when one wants to switch from
+        # training to prediction phase.
+        if force_rebind:
+            self._reset_bind()
+
+        if self.binded:
+            self.logger.warning('Already binded, ignoring bind()')
+            return
+
+        assert shared_module is None, 'shared_module for MutableModule is not supported'
+
+        self.for_training = for_training
+        self.inputs_need_grad = inputs_need_grad
+        self.binded = True
+
+        max_shapes_dict = dict()
+        if self._max_data_shapes is not None:
+            max_shapes_dict.update(dict(self._max_data_shapes[0]))
+        if self._max_label_shapes is not None:
+            max_shapes_dict.update(dict(self._max_label_shapes[0]))
+
+        max_data_shapes = list()
+        for name, shape in data_shapes[0]:
+            if name in max_shapes_dict:
+                max_data_shapes.append((name, max_shapes_dict[name]))
+            else:
+                max_data_shapes.append((name, shape))
+
+        max_label_shapes = list()
+        if label_shapes is not None and label_shapes.count(None) != len(label_shapes):
+            for name, shape in label_shapes[0]:
+                if name in max_shapes_dict:
+                    max_label_shapes.append((name, max_shapes_dict[name]))
+                else:
+                    max_label_shapes.append((name, shape))
+
+        if len(max_label_shapes) == 0:
+            max_label_shapes = None
+
+        module = Module(self._symbol, self._data_names, self._label_names, logger=self.logger,
+                        context=self._context, work_load_list=self._work_load_list,
+                        fixed_param_names=self._fixed_param_names)
+        module.bind([max_data_shapes for _ in range(len(self._context))],
+                    [max_label_shapes for _ in range(len(self._context))],
+                    for_training, inputs_need_grad, force_rebind=False, shared_module=None,
+                    grad_req=grad_req)
+        self._curr_module = module
+
+        # copy back saved params, if already initialized
+        if self.params_initialized:
+            self.set_params(arg_params, aux_params)
+
+    def save_checkpoint(self, prefix, epoch, save_optimizer_states=False):
+        """Save current progress to checkpoint.
+        Use mx.callback.module_checkpoint as epoch_end_callback to save during training.
+
+        Parameters
+        ----------
+        prefix : str
+            The file prefix to checkpoint to
+        epoch : int
+            The current epoch number
+        save_optimizer_states : bool
+            Whether to save optimizer states for continued training
+        """
+        self._curr_module.save_checkpoint(prefix, epoch, save_optimizer_states)
+
+    def init_optimizer(self, kvstore='local', optimizer='sgd',
+                       optimizer_params=(('learning_rate', 0.01),), force_init=False):
+        assert self.binded and self.params_initialized
+        if self.optimizer_initialized and not force_init:
+            self.logger.warning('optimizer already initialized, ignoring.')
+            return
+
+        self._curr_module._preload_opt_states = self._preload_opt_states
+        self._curr_module.init_optimizer(kvstore, optimizer, optimizer_params,
+                                         force_init=force_init)
+        self.optimizer_initialized = True
+
+    def fit(self, train_data, eval_data=None, eval_metric='acc',
+            epoch_end_callback=None, batch_end_callback=None, kvstore='local',
+            optimizer='sgd', optimizer_params=(('learning_rate', 0.01),),
+            eval_end_callback=None,
+            eval_batch_end_callback=None, initializer=Uniform(0.01),
+            arg_params=None, aux_params=None, allow_missing=False,
+            force_rebind=False, force_init=False, begin_epoch=0, num_epoch=None,
+            validation_metric=None, monitor=None, prefix=None):
+        """Train the module parameters.
+
+        Parameters
+        ----------
+        train_data : DataIter
+        eval_data : DataIter
+            If not `None`, will be used as validation set and evaluate the performance
+            after each epoch.
+        eval_metric : str or EvalMetric
+            Default `'acc'`. The performance measure used to display during training.
+        epoch_end_callback : function or list of function
+            Each callback will be called with the current `epoch`, `symbol`, `arg_params`
+            and `aux_params`.
+        batch_end_callback : function or list of function
+            Each callback will be called with a `BatchEndParam`.
+        kvstore : str or KVStore
+            Default `'local'`.
+        optimizer : str or Optimizer
+            Default `'sgd'`
+        optimizer_params : dict
+            Default `(('learning_rate', 0.01),)`. The parameters for the optimizer constructor.
+            The default value is not a `dict`, just to avoid pylint warning on dangerous
+            default values.
+        eval_end_callback : function or list of function
+            These will be called at the end of each full evaluation, with the metrics over
+            the entire evaluation set.
+        eval_batch_end_callback : function or list of function
+            These will be called at the end of each minibatch during evaluation
+        initializer : Initializer
+            Will be called to initialize the module parameters if not already initialized.
+        arg_params : dict
+            Default `None`, if not `None`, should be existing parameters from a trained
+            model or loaded from a checkpoint (previously saved model). In this case,
+            the value here will be used to initialize the module parameters, unless they
+            are already initialized by the user via a call to `init_params` or `fit`.
+            `arg_params` has higher priority than `initializer`.
+        aux_params : dict
+            Default `None`. Similar to `arg_params`, except for auxiliary states.
+        allow_missing : bool
+            Default `False`. Indicate whether we allow missing parameters when `arg_params`
+            and `aux_params` are not `None`. If this is `True`, then the missing parameters
+            will be initialized via the `initializer`.
+        force_rebind : bool
+            Default `False`. Whether to force rebinding the executors if already bound.
+        force_init : bool
+            Default `False`. Indicate whether we should force initialization even if the
+            parameters are already initialized.
+        begin_epoch : int
+            Default `0`. Indicate the starting epoch. Usually, if we are resuming from a
+            checkpoint saved at a previous training phase at epoch N, then we should specify
+            this value as N+1.
+        num_epoch : int
+            Number of epochs to run training.
+
+        Examples
+        --------
+        An example of using fit for training::
+            >>> # Assume training dataIter and validation dataIter are ready
+            >>> mod.fit(train_data=train_dataiter, eval_data=val_dataiter,
+            ...         optimizer_params={'learning_rate': 0.01, 'momentum': 0.9},
+            ...         num_epoch=10)
+        """
+        assert num_epoch is not None, 'please specify number of epochs'
+
+        self.bind(data_shapes=train_data.provide_data, label_shapes=train_data.provide_label,
+                  for_training=True, force_rebind=force_rebind)
+        if monitor is not None:
+            self.install_monitor(monitor)
+        self.init_params(initializer=initializer, arg_params=arg_params, aux_params=aux_params,
+                         allow_missing=allow_missing, force_init=force_init)
+        self.init_optimizer(kvstore=kvstore, optimizer=optimizer,
+                            optimizer_params=optimizer_params)
+
+        if validation_metric is None:
+            validation_metric = eval_metric
+        if not isinstance(eval_metric, metric.EvalMetric):
+            eval_metric = metric.create(eval_metric)
+
+        ################################################################################
+        # training loop
+        ################################################################################
+        for epoch in range(begin_epoch, num_epoch):
+            tic = time.time()
+            eval_metric.reset()
+            for nbatch, data_batch in enumerate(train_data):
+                if monitor is not None:
+                    monitor.tic()
+                self.forward_backward(data_batch)
+                self.update()
+                self.update_metric(eval_metric, data_batch.label)
+
+                if monitor is not None:
+                    monitor.toc_print()
+
+                if batch_end_callback is not None:
+                    batch_end_params = BatchEndParam(epoch=epoch, nbatch=nbatch,
+                                                     eval_metric=eval_metric,
+                                                     locals=locals())
+                    for callback in _as_list(batch_end_callback):
+                        callback(batch_end_params)
+
+            # one epoch of training is finished
+            for name, val in eval_metric.get_name_value():
+                self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)
+            toc = time.time()
+            self.logger.info('Epoch[%d] Time cost=%.3f', epoch, (toc-tic))
+
+            # sync aux params across devices
+            arg_params, aux_params = self.get_params()
+            self.set_params(arg_params, aux_params)
+
+            if epoch_end_callback is not None:
+                for callback in _as_list(epoch_end_callback):
+                    callback(epoch, self.symbol, arg_params, aux_params)
+
+            #----------------------------------------
+            # evaluation on validation set
+            if eval_data:
+                res = self.score(eval_data, validation_metric,
+                                 score_end_callback=eval_end_callback,
+                                 batch_end_callback=eval_batch_end_callback, epoch=epoch)
+                #TODO: pull this into default
+                for name, val in res:
+                    self.logger.info('Epoch[%d] Validation-%s=%f', epoch, name, val)
+
+            # end of 1 epoch, reset the data-iter for another epoch
+            train_data.reset()
+
+
+    def forward(self, data_batch, is_train=None):
+        assert self.binded and self.params_initialized
+
+        # get current_shapes
+        if self._curr_module.label_shapes is not None:
+            current_shapes = [dict(self._curr_module.data_shapes[i] + self._curr_module.label_shapes[i]) for i in xrange(len(self._context))]
+        else:
+            current_shapes = [dict(self._curr_module.data_shapes[i]) for i in xrange(len(self._context))]
+
+        # get input_shapes
+        if is_train:
+            input_shapes = [dict(data_batch.provide_data[i] + data_batch.provide_label[i]) for i in xrange(len(self._context))]
+        else:
+            input_shapes = [dict(data_batch.provide_data[i]) for i in xrange(len(data_batch.provide_data))]
+
+        # decide if shape changed
+        shape_changed = len(current_shapes) != len(input_shapes)
+        for pre, cur in zip(current_shapes, input_shapes):
+            for k, v in pre.items():
+                if v != cur[k]:
+                    shape_changed = True
+
+        if shape_changed:
+            # self._curr_module.reshape(data_batch.provide_data, data_batch.provide_label)
+            module = Module(self._symbol, self._data_names, self._label_names,
+                            logger=self.logger, context=[self._context[i] for i in xrange(len(data_batch.provide_data))],
+                            work_load_list=self._work_load_list,
+                            fixed_param_names=self._fixed_param_names)
+            module.bind(data_batch.provide_data, data_batch.provide_label, self._curr_module.for_training,
+                        self._curr_module.inputs_need_grad, force_rebind=False,
+                        shared_module=self._curr_module)
+            self._curr_module = module
+
+        self._curr_module.forward(data_batch, is_train=is_train)
+
+    def backward(self, out_grads=None):
+        assert self.binded and self.params_initialized
+        self._curr_module.backward(out_grads=out_grads)
+
+    def update(self):
+        assert self.binded and self.params_initialized and self.optimizer_initialized
+        self._curr_module.update()
+
+    def get_outputs(self, merge_multi_context=True):
+        assert self.binded and self.params_initialized
+        return self._curr_module.get_outputs(merge_multi_context=merge_multi_context)
+    def get_input_grads(self, merge_multi_context=True):
+        assert self.binded and self.params_initialized and self.inputs_need_grad
+        return self._curr_module.get_input_grads(merge_multi_context=merge_multi_context)
+
+    def update_metric(self, eval_metric, labels):
+        assert self.binded and self.params_initialized
+        self._curr_module.update_metric(eval_metric, labels)
+
+    def install_monitor(self, mon):
+        """ Install monitor on all executors """
+        assert self.binded
+        self._curr_module.install_monitor(mon)
diff --git a/faster_rcnn/core/rcnn.py b/faster_rcnn/core/rcnn.py
new file mode 100644
index 0000000..d3863ac
--- /dev/null
+++ b/faster_rcnn/core/rcnn.py
@@ -0,0 +1,186 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+"""
+Fast R-CNN:
+data =
+    {'data': [num_images, c, h, w],
+    'rois': [num_rois, 5]}
+label =
+    {'label': [num_rois],
+    'bbox_target': [num_rois, 4 * num_classes],
+    'bbox_weight': [num_rois, 4 * num_classes]}
+roidb extended format [image_index]
+    ['image', 'height', 'width', 'flipped',
+     'boxes', 'gt_classes', 'gt_overlaps', 'max_classes', 'max_overlaps', 'bbox_targets']
+"""
+
+import numpy as np
+import numpy.random as npr
+
+from utils.image import get_image, tensor_vstack
+from bbox.bbox_transform import bbox_overlaps, bbox_transform
+from bbox.bbox_regression import expand_bbox_regression_targets
+
+
+def get_rcnn_testbatch(roidb, cfg):
+    """
+    return a dict of testbatch
+    :param roidb: ['image', 'flipped'] + ['boxes']
+    :return: data, label, im_info
+    """
+    # assert len(roidb) == 1, 'Single batch only'
+    imgs, roidb = get_image(roidb, cfg)
+    im_array = imgs
+    im_info = [np.array([roidb[i]['im_info']], dtype=np.float32) for i in range(len(roidb))]
+
+    im_rois = [roidb[i]['boxes'] for i in range(len(roidb))]
+    rois = im_rois
+    rois_array = [np.hstack((np.zeros((rois[i].shape[0], 1)), rois[i])) for i in range(len(rois))]
+
+    data = [{'data': im_array[i],
+             'rois': rois_array[i]} for i in range(len(roidb))]
+    label = {}
+
+    return data, label, im_info
+
+
+def get_rcnn_batch(roidb, cfg):
+    """
+    return a dict of multiple images
+    :param roidb: a list of dict, whose length controls batch size
+    ['images', 'flipped'] + ['gt_boxes', 'boxes', 'gt_overlap'] => ['bbox_targets']
+    :return: data, label
+    """
+    num_images = len(roidb)
+    imgs, roidb = get_image(roidb, cfg)
+    im_array = tensor_vstack(imgs)
+
+    assert cfg.TRAIN.BATCH_ROIS == -1 or cfg.TRAIN.BATCH_ROIS % cfg.TRAIN.BATCH_IMAGES == 0, \
+        'BATCH_IMAGES {} must divide BATCH_ROIS {}'.format(cfg.TRAIN.BATCH_IMAGES, cfg.TRAIN.BATCH_ROIS)
+
+    if cfg.TRAIN.BATCH_ROIS == -1:
+        rois_per_image = np.sum([iroidb['boxes'].shape[0] for iroidb in roidb])
+        fg_rois_per_image = rois_per_image
+    else:
+        rois_per_image = cfg.TRAIN.BATCH_ROIS // cfg.TRAIN.BATCH_IMAGES
+        fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(int)
+
+    rois_array = list()
+    labels_array = list()
+    bbox_targets_array = list()
+    bbox_weights_array = list()
+
+    for im_i in range(num_images):
+        roi_rec = roidb[im_i]
+
+        # infer num_classes from gt_overlaps
+        num_classes = roi_rec['gt_overlaps'].shape[1]
+
+        # label = class RoI has max overlap with
+        rois = roi_rec['boxes']
+        labels = roi_rec['max_classes']
+        overlaps = roi_rec['max_overlaps']
+        bbox_targets = roi_rec['bbox_targets']
+
+        im_rois, labels, bbox_targets, bbox_weights = \
+            sample_rois(rois, fg_rois_per_image, rois_per_image, num_classes, cfg,
+                        labels, overlaps, bbox_targets)
+
+        # project im_rois
+        # do not round roi
+        rois = im_rois
+        batch_index = im_i * np.ones((rois.shape[0], 1))
+        rois_array_this_image = np.hstack((batch_index, rois))
+        rois_array.append(rois_array_this_image)
+
+        # add labels
+        labels_array.append(labels)
+        bbox_targets_array.append(bbox_targets)
+        bbox_weights_array.append(bbox_weights)
+
+    rois_array = np.array(rois_array)
+    labels_array = np.array(labels_array)
+    bbox_targets_array = np.array(bbox_targets_array)
+    bbox_weights_array = np.array(bbox_weights_array)
+
+    data = {'data': im_array,
+            'rois': rois_array}
+    label = {'label': labels_array,
+             'bbox_target': bbox_targets_array,
+             'bbox_weight': bbox_weights_array}
+
+    return data, label
+
+
+def sample_rois(rois, fg_rois_per_image, rois_per_image, num_classes, cfg,
+                labels=None, overlaps=None, bbox_targets=None, gt_boxes=None):
+    """
+    generate random sample of ROIs comprising foreground and background examples
+    :param rois: all_rois [n, 4]; e2e: [n, 5] with batch_index
+    :param fg_rois_per_image: foreground roi number
+    :param rois_per_image: total roi number
+    :param num_classes: number of classes
+    :param labels: maybe precomputed
+    :param overlaps: maybe precomputed (max_overlaps)
+    :param bbox_targets: maybe precomputed
+    :param gt_boxes: optional for e2e [n, 5] (x1, y1, x2, y2, cls)
+    :return: (rois, labels, bbox_targets, bbox_weights)
+    """
+    if labels is None:
+        overlaps = bbox_overlaps(rois[:, 1:].astype(float), gt_boxes[:, :4].astype(float))
+        gt_assignment = overlaps.argmax(axis=1)
+        overlaps = overlaps.max(axis=1)
+        labels = gt_boxes[gt_assignment, 4]
+
+    # foreground RoI with FG_THRESH overlap
+    fg_indexes = np.where(overlaps >= cfg.TRAIN.FG_THRESH)[0]
+    # guard against the case when an image has fewer than fg_rois_per_image foreground RoIs
+    fg_rois_per_this_image = np.minimum(fg_rois_per_image, fg_indexes.size)
+    # Sample foreground regions without replacement
+    if len(fg_indexes) > fg_rois_per_this_image:
+        fg_indexes = npr.choice(fg_indexes, size=fg_rois_per_this_image, replace=False)
+
+    # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
+    bg_indexes = np.where((overlaps < cfg.TRAIN.BG_THRESH_HI) & (overlaps >= cfg.TRAIN.BG_THRESH_LO))[0]
+    # Compute number of background RoIs to take from this image (guarding against there being fewer than desired)
+    bg_rois_per_this_image = rois_per_image - fg_rois_per_this_image
+    bg_rois_per_this_image = np.minimum(bg_rois_per_this_image, bg_indexes.size)
+    # Sample background regions without replacement
+    if len(bg_indexes) > bg_rois_per_this_image:
+        bg_indexes = npr.choice(bg_indexes, size=bg_rois_per_this_image, replace=False)
+
+    # indexes selected
+    keep_indexes = np.append(fg_indexes, bg_indexes)
+
+    # pad more to ensure a fixed minibatch size
+    while keep_indexes.shape[0] < rois_per_image:
+        gap = np.minimum(len(rois), rois_per_image - keep_indexes.shape[0])
+        gap_indexes = npr.choice(range(len(rois)), size=gap, replace=False)
+        keep_indexes = np.append(keep_indexes, gap_indexes)
+
+    # select labels
+    labels = labels[keep_indexes]
+    # set labels of bg_rois to be 0
+    labels[fg_rois_per_this_image:] = 0
+    rois = rois[keep_indexes]
+
+    # load or compute bbox_target
+    if bbox_targets is not None:
+        bbox_target_data = bbox_targets[keep_indexes, :]
+    else:
+        targets = bbox_transform(rois[:, 1:], gt_boxes[gt_assignment[keep_indexes], :4])
+        if cfg.TRAIN.BBOX_NORMALIZATION_PRECOMPUTED:
+            targets = ((targets - np.array(cfg.TRAIN.BBOX_MEANS))
+                       / np.array(cfg.TRAIN.BBOX_STDS))
+        bbox_target_data = np.hstack((labels[:, np.newaxis], targets))
+
+    bbox_targets, bbox_weights = \
+        expand_bbox_regression_targets(bbox_target_data, num_classes, cfg)
+
+    return rois, labels, bbox_targets, bbox_weights
+
diff --git a/faster_rcnn/core/tester.py b/faster_rcnn/core/tester.py
new file mode 100644
index 0000000..db7e433
--- /dev/null
+++ b/faster_rcnn/core/tester.py
@@ -0,0 +1,307 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import cPickle
+import os
+import time
+import mxnet as mx
+import numpy as np
+
+from module import MutableModule
+from utils import image
+from bbox.bbox_transform import bbox_pred, clip_boxes
+from nms.nms import py_nms_wrapper, cpu_nms_wrapper, gpu_nms_wrapper
+from utils.PrefetchingIter import PrefetchingIter
+
+
+class Predictor(object):
+    def __init__(self, symbol, data_names, label_names,
+                 context=mx.cpu(), max_data_shapes=None,
+                 provide_data=None, provide_label=None,
+                 arg_params=None, aux_params=None):
+        self._mod = MutableModule(symbol, data_names, label_names,
+                                  context=context, max_data_shapes=max_data_shapes)
+        self._mod.bind(provide_data, provide_label, for_training=False)
+        self._mod.init_params(arg_params=arg_params, aux_params=aux_params)
+
+    def predict(self, data_batch):
+        self._mod.forward(data_batch)
+        # one dict per device, mapping each output name to that device's output NDArray
+        return [dict(zip(self._mod.output_names, _)) for _ in zip(*self._mod.get_outputs(merge_multi_context=False))]
+
+
+def im_proposal(predictor, data_batch, data_names, scales):
+    output_all = predictor.predict(data_batch)
+
+    data_dict_all = [dict(zip(data_names, data_batch.data[i])) for i in xrange(len(data_batch.data))]
+    scores_all = []
+    boxes_all = []
+
+    for output, data_dict, scale in zip(output_all, data_dict_all, scales):
+        # drop the batch index
+        boxes = output['rois_output'].asnumpy()[:, 1:]
+        scores = output['rois_score'].asnumpy()
+
+        # transform to original scale
+        boxes = boxes / scale
+        scores_all.append(scores)
+        boxes_all.append(boxes)
+
+    return scores_all, boxes_all, data_dict_all
+
+
+def generate_proposals(predictor, test_data, imdb, cfg, vis=False, thresh=0.):
+    """
+    Generate detection results using RPN.
+    :param predictor: Predictor
+    :param test_data: data iterator, must be non-shuffled
+    :param imdb: image database
+    :param vis: controls visualization
+    :param thresh: thresh for valid detections
+    :return: list of detected boxes
+    """
+    assert vis or not test_data.shuffle
+    data_names = [k[0] for k in test_data.provide_data[0]]
+
+    if not isinstance(test_data, PrefetchingIter):
+        test_data = PrefetchingIter(test_data)
+
+    idx = 0
+    t = time.time()
+    imdb_boxes = list()
+    original_boxes = list()
+    for im_info, data_batch in test_data:
+        t1 = time.time() - t
+        t = time.time()
+
+        scales = [iim_info[0, 2] for iim_info in im_info]
+        scores_all, boxes_all, data_dict_all = im_proposal(predictor, data_batch, data_names, scales)
+        t2 = time.time() - t
+        t = time.time()
+        for delta, (scores, boxes, data_dict, scale) in enumerate(zip(scores_all, boxes_all, data_dict_all, scales)):
+            # assemble proposals
+            dets = np.hstack((boxes, scores))
+            original_boxes.append(dets)
+
+            # filter proposals
+            keep = np.where(dets[:, 4:] > thresh)[0]
+            dets = dets[keep, :]
+            imdb_boxes.append(dets)
+
+            if vis:
+                vis_all_detection(data_dict['data'].asnumpy(), [dets], ['obj'], scale, cfg)
+
+            print 'generating %d/%d' % (idx + 1, imdb.num_images), 'proposal %d' % (dets.shape[0]), \
+                'data %.4fs net %.4fs' % (t1, t2 / test_data.batch_size)
+            idx += 1
+
+
+    assert len(imdb_boxes) == imdb.num_images, 'calculations not complete'
+
+    # save results
+    rpn_folder = os.path.join(imdb.result_path, 'rpn_data')
+    if not os.path.exists(rpn_folder):
+        os.mkdir(rpn_folder)
+
+    rpn_file = os.path.join(rpn_folder, imdb.name + '_rpn.pkl')
+    with open(rpn_file, 'wb') as f:
+        cPickle.dump(imdb_boxes, f, cPickle.HIGHEST_PROTOCOL)
+
+    if thresh > 0:
+        full_rpn_file = os.path.join(rpn_folder, imdb.name + '_full_rpn.pkl')
+        with open(full_rpn_file, 'wb') as f:
+            cPickle.dump(original_boxes, f, cPickle.HIGHEST_PROTOCOL)
+
+    print 'wrote rpn proposals to {}'.format(rpn_file)
+    return imdb_boxes
+
+
+def im_detect(predictor, data_batch, data_names, scales, cfg):
+    output_all = predictor.predict(data_batch)
+
+    data_dict_all = [dict(zip(data_names, idata)) for idata in data_batch.data]
+    scores_all = []
+    pred_boxes_all = []
+    for output, data_dict, scale in zip(output_all, data_dict_all, scales):
+        if cfg.TEST.HAS_RPN:
+            rois = output['rois_output'].asnumpy()[:, 1:]
+        else:
+            rois = data_dict['rois'].asnumpy().reshape((-1, 5))[:, 1:]
+        im_shape = data_dict['data'].shape
+
+        # save output
+        scores = output['cls_prob_reshape_output'].asnumpy()[0]
+        bbox_deltas = output['bbox_pred_reshape_output'].asnumpy()[0]
+
+        # post processing
+        pred_boxes = bbox_pred(rois, bbox_deltas)
+        pred_boxes = clip_boxes(pred_boxes, im_shape[-2:])
+
+        # we used scaled image & roi to train, so it is necessary to transform them back
+        pred_boxes = pred_boxes / scale
+
+        scores_all.append(scores)
+        pred_boxes_all.append(pred_boxes)
+    return scores_all, pred_boxes_all, data_dict_all
+
+
+def pred_eval(predictor, test_data, imdb, cfg, vis=False, thresh=1e-3, logger=None, ignore_cache=True):
+    """
+    wrapper for calculating offline validation for faster data analysis
+    in this example, all thresholds are set by hand
+    :param predictor: Predictor
+    :param test_data: data iterator, must be non-shuffled
+    :param imdb: image database
+    :param vis: controls visualization
+    :param thresh: valid detection threshold
+    :return:
+    """
+
+    det_file = os.path.join(imdb.result_path, imdb.name + '_detections.pkl')
+    if os.path.exists(det_file) and not ignore_cache:
+        with open(det_file, 'rb') as fid:
+            all_boxes = cPickle.load(fid)
+        info_str = imdb.evaluate_detections(all_boxes)
+        if logger:
+            logger.info('evaluate detections: \n{}'.format(info_str))
+        return
+
+    assert vis or not test_data.shuffle
+    data_names = [k[0] for k in test_data.provide_data[0]]
+
+    if not isinstance(test_data, PrefetchingIter):
+        test_data = PrefetchingIter(test_data)
+
+    nms = py_nms_wrapper(cfg.TEST.NMS)
+
+    # limit detections to max_per_image over all classes
+    max_per_image = cfg.TEST.max_per_image
+
+    num_images = imdb.num_images
+    # all detections are collected into:
+    #    all_boxes[cls][image] = N x 5 array of detections in
+    #    (x1, y1, x2, y2, score)
+    all_boxes = [[[] for _ in range(num_images)]
+                 for _ in range(imdb.num_classes)]
+
+    idx = 0
+    data_time, net_time, post_time = 0.0, 0.0, 0.0
+    t = time.time()
+    for im_info, data_batch in test_data:
+        t1 = time.time() - t
+        t = time.time()
+
+        scales = [iim_info[0, 2] for iim_info in im_info]
+        scores_all, boxes_all, data_dict_all = im_detect(predictor, data_batch, data_names, scales, cfg)
+
+        t2 = time.time() - t
+        t = time.time()
+        for delta, (scores, boxes, data_dict) in enumerate(zip(scores_all, boxes_all, data_dict_all)):
+            for j in range(1, imdb.num_classes):
+                indexes = np.where(scores[:, j] > thresh)[0]
+                cls_scores = scores[indexes, j, np.newaxis]
+                cls_boxes = boxes[indexes, 4:8] if cfg.CLASS_AGNOSTIC else boxes[indexes, j * 4:(j + 1) * 4]
+                cls_dets = np.hstack((cls_boxes, cls_scores))
+                keep = nms(cls_dets)
+                all_boxes[j][idx+delta] = cls_dets[keep, :]
+
+            if max_per_image > 0:
+                image_scores = np.hstack([all_boxes[j][idx+delta][:, -1]
+                                          for j in range(1, imdb.num_classes)])
+                if len(image_scores) > max_per_image:
+                    image_thresh = np.sort(image_scores)[-max_per_image]
+                    for j in range(1, imdb.num_classes):
+                        keep = np.where(all_boxes[j][idx+delta][:, -1] >= image_thresh)[0]
+                        all_boxes[j][idx+delta] = all_boxes[j][idx+delta][keep, :]
+
+            if vis:
+                boxes_this_image = [[]] + [all_boxes[j][idx+delta] for j in range(1, imdb.num_classes)]
+                vis_all_detection(data_dict['data'].asnumpy(), boxes_this_image, imdb.classes, scales[delta], cfg)
+
+        idx += test_data.batch_size
+        t3 = time.time() - t
+        t = time.time()
+        data_time += t1
+        net_time += t2
+        post_time += t3
+        print 'testing {}/{} data {:.4f}s net {:.4f}s post {:.4f}s'.format(idx, imdb.num_images, data_time / idx * test_data.batch_size, net_time / idx * test_data.batch_size, post_time / idx * test_data.batch_size)
+        if logger:
+            logger.info('testing {}/{} data {:.4f}s net {:.4f}s post {:.4f}s'.format(idx, imdb.num_images, data_time / idx * test_data.batch_size, net_time / idx * test_data.batch_size, post_time / idx * test_data.batch_size))
+
+    with open(det_file, 'wb') as f:
+        cPickle.dump(all_boxes, f, protocol=cPickle.HIGHEST_PROTOCOL)
+
+    info_str = imdb.evaluate_detections(all_boxes)
+    if logger:
+        logger.info('evaluate detections: \n{}'.format(info_str))
+
+
+def vis_all_detection(im_array, detections, class_names, scale, cfg, threshold=1e-3):
+    """
+    visualize all detections in one image
+    :param im_array: [b=1 c h w] in rgb
+    :param detections: [ numpy.ndarray([[x1 y1 x2 y2 score]]) for j in classes ]
+    :param class_names: list of names in imdb
+    :param scale: visualize the scaled image
+    :return:
+    """
+    import matplotlib.pyplot as plt
+    import random
+    im = image.transform_inverse(im_array, cfg.network.PIXEL_MEANS)
+    plt.imshow(im)
+    for j, name in enumerate(class_names):
+        if name == '__background__':
+            continue
+        color = (random.random(), random.random(), random.random())  # generate a random color
+        dets = detections[j]
+        for det in dets:
+            bbox = det[:4] * scale
+            score = det[-1]
+            if score < threshold:
+                continue
+            rect = plt.Rectangle((bbox[0], bbox[1]),
+                                 bbox[2] - bbox[0],
+                                 bbox[3] - bbox[1], fill=False,
+                                 edgecolor=color, linewidth=3.5)
+            plt.gca().add_patch(rect)
+            plt.gca().text(bbox[0], bbox[1] - 2,
+                           '{:s} {:.3f}'.format(name, score),
+                           bbox=dict(facecolor=color, alpha=0.5), fontsize=12, color='white')
+    plt.show()
+
+
+def draw_all_detection(im_array, detections, class_names, scale, cfg, threshold=1e-1):
+    """
+    draw all detections on one image and return the annotated image
+    :param im_array: [b=1 c h w] in rgb
+    :param detections: [ numpy.ndarray([[x1 y1 x2 y2 score]]) for j in classes ]
+    :param class_names: list of names in imdb
+    :param scale: visualize the scaled image
+    :return: image in bgr with boxes and labels drawn
+    """
+    import cv2
+    import random
+    color_white = (255, 255, 255)
+    im = image.transform_inverse(im_array, cfg.network.PIXEL_MEANS)
+    # change to bgr
+    im = cv2.cvtColor(im, cv2.COLOR_RGB2BGR)
+    for j, name in enumerate(class_names):
+        if name == '__background__':
+            continue
+        color = (random.randint(0, 255), random.randint(0, 255), random.randint(0, 255))  # generate a random color
+        dets = detections[j]
+        for det in dets:
+            bbox = det[:4] * scale
+            score = det[-1]
+            if score < threshold:
+                continue
+            bbox = map(int, bbox)
+            cv2.rectangle(im, (bbox[0], bbox[1]), (bbox[2], bbox[3]), color=color, thickness=2)
+            cv2.putText(im, '%s %.3f' % (class_names[j], score), (bbox[0], bbox[1] + 10),
+                        color=color_white, fontFace=cv2.FONT_HERSHEY_COMPLEX, fontScale=0.5)
+    return im
diff --git a/faster_rcnn/function/__init__.py b/faster_rcnn/function/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/faster_rcnn/function/test_rcnn.py b/faster_rcnn/function/test_rcnn.py
new file mode 100644
index 0000000..f25de84
--- /dev/null
+++ b/faster_rcnn/function/test_rcnn.py
@@ -0,0 +1,73 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Guodong Zhang
+# --------------------------------------------------------
+
+import argparse
+import pprint
+import logging
+import time
+import os
+import mxnet as mx
+
+from symbols import *
+from dataset import *
+from core.loader import TestLoader
+from core.tester import Predictor, pred_eval
+from utils.load_model import load_param
+
+
+def test_rcnn(cfg, dataset, image_set, root_path, dataset_path,
+              ctx, prefix, epoch,
+              vis, ignore_cache, shuffle, has_rpn, proposal, thresh, logger=None, output_path=None):
+    # a logger is mandatory here; fail early with a clear message
+    assert logger, 'require a logger'
+
+    # print cfg
+    pprint.pprint(cfg)
+    logger.info('testing cfg:{}\n'.format(pprint.pformat(cfg)))
+
+    # load symbol and testing data
+    if has_rpn:
+        sym_instance = eval(cfg.symbol + '.' + cfg.symbol)()
+        sym = sym_instance.get_symbol(cfg, is_train=False)
+        imdb = eval(dataset)(image_set, root_path, dataset_path, result_path=output_path)
+        roidb = imdb.gt_roidb()
+    else:
+        sym_instance = eval(cfg.symbol + '.' + cfg.symbol)()
+        sym = sym_instance.get_symbol_rcnn(cfg, is_train=False)
+        imdb = eval(dataset)(image_set, root_path, dataset_path, result_path=output_path)
+        gt_roidb = imdb.gt_roidb()
+        roidb = eval('imdb.' + proposal + '_roidb')(gt_roidb)
+
+    # get test data iter
+    test_data = TestLoader(roidb, cfg, batch_size=len(ctx), shuffle=shuffle, has_rpn=has_rpn)
+
+    # load model
+    arg_params, aux_params = load_param(prefix, epoch, process=True)
+
+    # infer shape
+    data_shape_dict = dict(test_data.provide_data_single)
+    sym_instance.infer_shape(data_shape_dict)
+
+    sym_instance.check_parameter_shapes(arg_params, aux_params, data_shape_dict, is_train=False)
+
+    # decide maximum shape
+    data_names = [k[0] for k in test_data.provide_data_single]
+    label_names = None
+    max_data_shape = [[('data', (1, 3, max([v[0] for v in cfg.SCALES]), max([v[1] for v in cfg.SCALES])))]]
+    if not has_rpn:
+        max_data_shape.append(('rois', (cfg.TEST.PROPOSAL_POST_NMS_TOP_N + 30, 5)))
+
+    # create predictor
+    predictor = Predictor(sym, data_names, label_names,
+                          context=ctx, max_data_shapes=max_data_shape,
+                          provide_data=test_data.provide_data, provide_label=test_data.provide_label,
+                          arg_params=arg_params, aux_params=aux_params)
+
+    # start detection
+    pred_eval(predictor, test_data, imdb, cfg, vis=vis, ignore_cache=ignore_cache, thresh=thresh, logger=logger)
+
diff --git a/faster_rcnn/function/test_rpn.py b/faster_rcnn/function/test_rpn.py
new file mode 100644
index 0000000..8393495
--- /dev/null
+++ b/faster_rcnn/function/test_rpn.py
@@ -0,0 +1,71 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import argparse
+import pprint
+import logging
+import mxnet as mx
+
+from symbols import *
+from dataset import *
+from core.loader import TestLoader
+from core.tester import Predictor, generate_proposals
+from utils.load_model import load_param
+
+
+def test_rpn(cfg, dataset, image_set, root_path, dataset_path,
+             ctx, prefix, epoch,
+             vis, shuffle, thresh, logger=None, output_path=None):
+    # set up logger
+    if not logger:
+        logging.basicConfig()
+        logger = logging.getLogger()
+        logger.setLevel(logging.INFO)
+
+    # rpn generate proposal cfg
+    cfg.TEST.HAS_RPN = True
+
+    # print cfg
+    pprint.pprint(cfg)
+    logger.info('testing rpn cfg:{}\n'.format(pprint.pformat(cfg)))
+
+    # load symbol
+    sym_instance = eval(cfg.symbol + '.' + cfg.symbol)()
+    sym = sym_instance.get_symbol_rpn(cfg, is_train=False)
+
+    # load dataset and prepare imdb for testing
+    imdb = eval(dataset)(image_set, root_path, dataset_path, result_path=output_path)
+    roidb = imdb.gt_roidb()
+    test_data = TestLoader(roidb, cfg, batch_size=len(ctx), shuffle=shuffle, has_rpn=True)
+
+    # load model
+    arg_params, aux_params = load_param(prefix, epoch)
+
+    # infer shape
+    data_shape_dict = dict(test_data.provide_data_single)
+    sym_instance.infer_shape(data_shape_dict)
+
+    # check parameters
+    sym_instance.check_parameter_shapes(arg_params, aux_params, data_shape_dict, is_train=False)
+
+    # decide maximum shape
+    data_names = [k[0] for k in test_data.provide_data[0]]
+    label_names = None if test_data.provide_label[0] is None else [k[0] for k in test_data.provide_label[0]]
+    max_data_shape = [[('data', (1, 3, max([v[0] for v in cfg.SCALES]), max([v[1] for v in cfg.SCALES])))]]
+
+    # create predictor
+    predictor = Predictor(sym, data_names, label_names,
+                          context=ctx, max_data_shapes=max_data_shape,
+                          provide_data=test_data.provide_data, provide_label=test_data.provide_label,
+                          arg_params=arg_params, aux_params=aux_params)
+
+    # start testing
+    imdb_boxes = generate_proposals(predictor, test_data, imdb, cfg, vis=vis, thresh=thresh)
+
+    all_log_info = imdb.evaluate_recall(roidb, candidate_boxes=imdb_boxes)
+    logger.info(all_log_info)
diff --git a/faster_rcnn/function/train_rcnn.py b/faster_rcnn/function/train_rcnn.py
new file mode 100644
index 0000000..c9e2691
--- /dev/null
+++ b/faster_rcnn/function/train_rcnn.py
@@ -0,0 +1,136 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Guodong Zhang
+# --------------------------------------------------------
+
+import argparse
+import logging
+import pprint
+import os
+import mxnet as mx
+import numpy as np
+
+from symbols import *
+from core import callback, metric
+from core.loader import ROIIter
+from core.module import MutableModule
+from bbox.bbox_regression import add_bbox_regression_targets
+from utils.load_data import load_proposal_roidb, merge_roidb, filter_roidb
+from utils.load_model import load_param
+from utils.PrefetchingIter import PrefetchingIter
+from utils.lr_scheduler import WarmupMultiFactorScheduler
+
+
+def train_rcnn(cfg, dataset, image_set, root_path, dataset_path,
+               frequent, kvstore, flip, shuffle, resume,
+               ctx, pretrained, epoch, prefix, begin_epoch, end_epoch,
+               train_shared, lr, lr_step, proposal, logger=None, output_path=None):
+    mx.random.seed(3)
+    np.random.seed(3)
+    # set up logger
+    if not logger:
+        logging.basicConfig()
+        logger = logging.getLogger()
+        logger.setLevel(logging.INFO)
+
+    # load symbol
+    sym_instance = eval(cfg.symbol + '.' + cfg.symbol)()
+    sym = sym_instance.get_symbol_rcnn(cfg, is_train=True)
+
+    # setup multi-gpu
+    batch_size = len(ctx)
+    input_batch_size = cfg.TRAIN.BATCH_IMAGES * batch_size
+
+    # print cfg
+    pprint.pprint(cfg)
+    logger.info('training rcnn cfg:{}\n'.format(pprint.pformat(cfg)))
+
+    # load dataset and prepare imdb for training
+    image_sets = image_set.split('+')
+    roidbs = [load_proposal_roidb(dataset, image_set, root_path, dataset_path,
+                                  proposal=proposal, append_gt=True, flip=flip, result_path=output_path)
+              for image_set in image_sets]
+    roidb = merge_roidb(roidbs)
+    roidb = filter_roidb(roidb, cfg)
+    means, stds = add_bbox_regression_targets(roidb, cfg)
+
+    # load training data
+    train_data = ROIIter(roidb, cfg, batch_size=input_batch_size, shuffle=shuffle,
+                         ctx=ctx, aspect_grouping=cfg.TRAIN.ASPECT_GROUPING)
+
+    # infer max shape
+    max_data_shape = [('data', (cfg.TRAIN.BATCH_IMAGES, 3, max([v[0] for v in cfg.SCALES]), max([v[1] for v in cfg.SCALES])))]
+
+    # infer shape
+    data_shape_dict = dict(train_data.provide_data_single + train_data.provide_label_single)
+    sym_instance.infer_shape(data_shape_dict)
+
+    # load and initialize params
+    if resume:
+        print('continue training from ', begin_epoch)
+        arg_params, aux_params = load_param(prefix, begin_epoch, convert=True)
+    else:
+        arg_params, aux_params = load_param(pretrained, epoch, convert=True)
+        sym_instance.init_weight_rcnn(cfg, arg_params, aux_params)
+
+    # check parameter shapes
+    sym_instance.check_parameter_shapes(arg_params, aux_params, data_shape_dict)
+
+    # prepare training
+    # create solver
+    data_names = [k[0] for k in train_data.provide_data_single]
+    label_names = [k[0] for k in train_data.provide_label_single]
+    if train_shared:
+        fixed_param_prefix = cfg.network.FIXED_PARAMS_SHARED
+    else:
+        fixed_param_prefix = cfg.network.FIXED_PARAMS
+    mod = MutableModule(sym, data_names=data_names, label_names=label_names,
+                        logger=logger, context=ctx,
+                        max_data_shapes=[max_data_shape for _ in range(batch_size)], fixed_param_prefix=fixed_param_prefix)
+
+    if cfg.TRAIN.RESUME:
+        mod._preload_opt_states = '%s-%04d.states'%(prefix, begin_epoch)
+
+
+    # decide training params
+    # metric
+    eval_metric = metric.RCNNAccMetric(cfg)
+    cls_metric = metric.RCNNLogLossMetric(cfg)
+    bbox_metric = metric.RCNNL1LossMetric(cfg)
+    eval_metrics = mx.metric.CompositeEvalMetric()
+    for child_metric in [eval_metric, cls_metric, bbox_metric]:
+        eval_metrics.add(child_metric)
+    # callback
+    batch_end_callback = callback.Speedometer(train_data.batch_size, frequent=frequent)
+    epoch_end_callback = [mx.callback.module_checkpoint(mod, prefix, period=1, save_optimizer_states=True),
+                          callback.do_checkpoint(prefix, means, stds)]
+    # decide learning rate
+    base_lr = lr
+    lr_factor = cfg.TRAIN.lr_factor
+    lr_epoch = [float(epoch) for epoch in lr_step.split(',')]
+    lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
+    lr = base_lr * (lr_factor ** (len(lr_epoch) - len(lr_epoch_diff)))
+    lr_iters = [int(epoch * len(roidb) / batch_size) for epoch in lr_epoch_diff]
+    print('lr', lr, 'lr_epoch_diff', lr_epoch_diff, 'lr_iters', lr_iters)
+    lr_scheduler = WarmupMultiFactorScheduler(lr_iters, lr_factor, cfg.TRAIN.warmup, cfg.TRAIN.warmup_lr, cfg.TRAIN.warmup_step)
+    # optimizer
+    optimizer_params = {'momentum': cfg.TRAIN.momentum,
+                        'wd': cfg.TRAIN.wd,
+                        'learning_rate': lr,
+                        'lr_scheduler': lr_scheduler,
+                        'rescale_grad': 1.0,
+                        'clip_gradient': None}
+
+    # train
+
+    if not isinstance(train_data, PrefetchingIter):
+        train_data = PrefetchingIter(train_data)
+
+    mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
+            batch_end_callback=batch_end_callback, kvstore=kvstore,
+            optimizer='sgd', optimizer_params=optimizer_params,
+            arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
+
diff --git a/faster_rcnn/function/train_rpn.py b/faster_rcnn/function/train_rpn.py
new file mode 100644
index 0000000..be1be47
--- /dev/null
+++ b/faster_rcnn/function/train_rpn.py
@@ -0,0 +1,131 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import argparse
+import logging
+import pprint
+import mxnet as mx
+
+from symbols import *
+from core import callback, metric
+from core.loader import AnchorLoader
+from core.module import MutableModule
+from utils.load_data import load_gt_roidb, merge_roidb, filter_roidb
+from utils.load_model import load_param
+from utils.PrefetchingIter import PrefetchingIter
+from utils.lr_scheduler import WarmupMultiFactorScheduler
+
+
+def train_rpn(cfg, dataset, image_set, root_path, dataset_path,
+              frequent, kvstore, flip, shuffle, resume,
+              ctx, pretrained, epoch, prefix, begin_epoch, end_epoch,
+              train_shared, lr, lr_step, logger=None, output_path=None):
+    # set up logger
+    if not logger:
+        logging.basicConfig()
+        logger = logging.getLogger()
+        logger.setLevel(logging.INFO)
+
+    # set up config
+    cfg.TRAIN.BATCH_IMAGES = cfg.TRAIN.ALTERNATE.RPN_BATCH_IMAGES
+
+    # load symbol
+    sym_instance = eval(cfg.symbol + '.' + cfg.symbol)()
+    sym = sym_instance.get_symbol_rpn(cfg, is_train=True)
+    feat_sym = sym.get_internals()['rpn_cls_score_output']
+
+    # setup multi-gpu
+    batch_size = len(ctx)
+    input_batch_size = cfg.TRAIN.BATCH_IMAGES * batch_size
+
+    # print cfg
+    pprint.pprint(cfg)
+    logger.info('training rpn cfg:{}\n'.format(pprint.pformat(cfg)))
+
+    # load dataset and prepare imdb for training
+    image_sets = [iset for iset in image_set.split('+')]
+    roidbs = [load_gt_roidb(dataset, image_set, root_path, dataset_path, result_path=output_path,
+                            flip=flip)
+              for image_set in image_sets]
+    roidb = merge_roidb(roidbs)
+    roidb = filter_roidb(roidb, cfg)
+
+    # load training data
+    train_data = AnchorLoader(feat_sym, roidb, cfg, batch_size=input_batch_size, shuffle=shuffle,
+                              ctx=ctx, feat_stride=cfg.network.RPN_FEAT_STRIDE, anchor_scales=cfg.network.ANCHOR_SCALES,
+                              anchor_ratios=cfg.network.ANCHOR_RATIOS, aspect_grouping=cfg.TRAIN.ASPECT_GROUPING)
+
+    # infer max shape
+    max_data_shape = [('data', (cfg.TRAIN.BATCH_IMAGES, 3, max([v[0] for v in cfg.SCALES]), max([v[1] for v in cfg.SCALES])))]
+    max_data_shape, max_label_shape = train_data.infer_shape(max_data_shape)
+    print('providing maximum shape', max_data_shape, max_label_shape)
+
+    # infer shape
+    data_shape_dict = dict(train_data.provide_data_single + train_data.provide_label_single)
+    sym_instance.infer_shape(data_shape_dict)
+
+    # load and initialize params
+    if resume:
+        print('continue training from ', begin_epoch)
+        arg_params, aux_params = load_param(prefix, begin_epoch, convert=True)
+    else:
+        arg_params, aux_params = load_param(pretrained, epoch, convert=True)
+        sym_instance.init_weight_rpn(cfg, arg_params, aux_params)
+
+    # check parameter shapes
+    sym_instance.check_parameter_shapes(arg_params, aux_params, data_shape_dict)
+
+    # create solver
+    data_names = [k[0] for k in train_data.provide_data_single]
+    label_names = [k[0] for k in train_data.provide_label_single]
+    if train_shared:
+        fixed_param_prefix = cfg.network.FIXED_PARAMS_SHARED
+    else:
+        fixed_param_prefix = cfg.network.FIXED_PARAMS
+    mod = MutableModule(sym, data_names=data_names, label_names=label_names,
+                        logger=logger, context=ctx, max_data_shapes=[max_data_shape for _ in range(batch_size)],
+                        max_label_shapes=[max_label_shape for _ in range(batch_size)], fixed_param_prefix=fixed_param_prefix)
+
+    # decide training params
+    # metric
+    eval_metric = metric.RPNAccMetric()
+    cls_metric = metric.RPNLogLossMetric()
+    bbox_metric = metric.RPNL1LossMetric()
+    eval_metrics = mx.metric.CompositeEvalMetric()
+    for child_metric in [eval_metric, cls_metric, bbox_metric]:
+        eval_metrics.add(child_metric)
+    # callback
+    batch_end_callback = callback.Speedometer(train_data.batch_size, frequent=frequent)
+    # epoch_end_callback = mx.callback.do_checkpoint(prefix)
+    epoch_end_callback = mx.callback.module_checkpoint(mod, prefix, period=1, save_optimizer_states=True)
+    # decide learning rate
+    base_lr = lr
+    lr_factor = cfg.TRAIN.lr_factor
+    lr_epoch = [int(epoch) for epoch in lr_step.split(',')]
+    lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
+    lr = base_lr * (lr_factor ** (len(lr_epoch) - len(lr_epoch_diff)))
+    lr_iters = [int(epoch * len(roidb) / batch_size) for epoch in lr_epoch_diff]
+    print('lr', lr, 'lr_epoch_diff', lr_epoch_diff, 'lr_iters', lr_iters)
+    lr_scheduler = WarmupMultiFactorScheduler(lr_iters, lr_factor, cfg.TRAIN.warmup, cfg.TRAIN.warmup_lr, cfg.TRAIN.warmup_step)
+    # optimizer
+    optimizer_params = {'momentum': cfg.TRAIN.momentum,
+                        'wd': cfg.TRAIN.wd,
+                        'learning_rate': lr,
+                        'lr_scheduler': lr_scheduler,
+                        'rescale_grad': 1.0,
+                        'clip_gradient': None}
+
+    if not isinstance(train_data, PrefetchingIter):
+        train_data = PrefetchingIter(train_data)
+
+    # train
+    mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
+            batch_end_callback=batch_end_callback, kvstore=kvstore,
+            optimizer='sgd', optimizer_params=optimizer_params,
+            arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
+
diff --git a/faster_rcnn/operator_cxx/deformable_convolution-inl.h b/faster_rcnn/operator_cxx/deformable_convolution-inl.h
new file mode 100644
index 0000000..ccc6bb3
--- /dev/null
+++ b/faster_rcnn/operator_cxx/deformable_convolution-inl.h
@@ -0,0 +1,487 @@
+/*!
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file deformable_convolution-inl.h
+ * \brief
+ * \ref: https://github.com/Yangqing/caffe/wiki/Convolution-in-Caffe:-a-memo
+ * \ref: https://arxiv.org/abs/1703.06211
+ * \author Yuwen Xiong, Haozhi Qi, Jifeng Dai
+*/
+#ifndef MXNET_OPERATOR_DEFORMABLE_CONVOLUTION_INL_H_
+#define MXNET_OPERATOR_DEFORMABLE_CONVOLUTION_INL_H_
+
+#include <mxnet/io.h>
+#include <mxnet/base.h>
+#include <mxnet/ndarray.h>
+#include <mxnet/operator.h>
+#include <mxnet/operator_util.h>
+#include <dmlc/logging.h>
+#include <dmlc/optional.h>
+#include <algorithm>
+#include <map>
+#include <vector>
+#include <string>
+#include <utility>
+#include "../operator_common.h"
+#include "../nn/im2col.h"
+#include "./nn/deformable_im2col.h"
+
+
+namespace mxnet {
+  namespace op {
+
+    namespace conv {
+      enum DeformableConvolutionOpInputs { kData, kOffset, kWeight, kBias };
+      enum DeformableConvolutionOpOutputs { kOut };
+      enum DeformableConvolutionOpResource { kTempSpace };
+    }
+
+    struct DeformableConvolutionParam : public dmlc::Parameter<DeformableConvolutionParam> {
+      TShape kernel;
+      TShape stride;
+      TShape dilate;
+      TShape pad;
+      uint32_t num_filter;
+      uint32_t num_group;
+      uint32_t num_deformable_group;
+      uint64_t workspace;
+      bool no_bias;
+      dmlc::optional<int> layout;
+      DMLC_DECLARE_PARAMETER(DeformableConvolutionParam) {
+        DMLC_DECLARE_FIELD(kernel).describe("convolution kernel size: (h, w) or (d, h, w)");
+        DMLC_DECLARE_FIELD(stride).set_default(TShape())
+          .describe("convolution stride: (h, w) or (d, h, w)");
+        DMLC_DECLARE_FIELD(dilate).set_default(TShape())
+          .describe("convolution dilate: (h, w) or (d, h, w)");
+        DMLC_DECLARE_FIELD(pad).set_default(TShape())
+          .describe("pad for convolution: (h, w) or (d, h, w)");
+        DMLC_DECLARE_FIELD(num_filter).set_range(1, 100000)
+          .describe("convolution filter(channel) number");
+        DMLC_DECLARE_FIELD(num_group).set_default(1)
+          .describe("Number of group partitions.");
+        DMLC_DECLARE_FIELD(num_deformable_group).set_default(1)
+          .describe("Number of deformable group partitions.");
+        DMLC_DECLARE_FIELD(workspace).set_default(1024).set_range(0, 8192)
+          .describe("Maximum temporary workspace allowed for convolution (MB).");
+        DMLC_DECLARE_FIELD(no_bias).set_default(false)
+          .describe("Whether to disable bias parameter.");
+        DMLC_DECLARE_FIELD(layout)
+          .add_enum("NCW", mshadow::kNCW)
+          .add_enum("NCHW", mshadow::kNCHW)
+          .add_enum("NCDHW", mshadow::kNCDHW)
+          .set_default(dmlc::optional<int>())
+          .describe("Set layout for input, output and weight. Empty for\n    "
+            "default layout: NCW for 1d, NCHW for 2d and NCDHW for 3d.");
+      }
+    };
+
+    template<typename xpu, typename DType>
+    class DeformableConvolutionOp : public Operator {
+    public:
+      explicit DeformableConvolutionOp(DeformableConvolutionParam p) {
+        this->param_ = p;
+        // convert MBytes first to Bytes and then to elements.
+        param_.workspace = (param_.workspace << 20) / sizeof(DType);
+        CHECK(param_.layout.value() == mshadow::kNCW ||
+          param_.layout.value() == mshadow::kNCHW ||
+          param_.layout.value() == mshadow::kNCDHW)
+          << "Only support NCW, NCHW and NCDHW layout";
+      }
+
+      virtual void Forward(const OpContext &ctx,
+        const std::vector<TBlob> &in_data,
+        const std::vector<OpReqType> &req,
+        const std::vector<TBlob> &out_data,
+        const std::vector<TBlob> &aux_args) {
+        using namespace mshadow;
+        using namespace mshadow::expr;
+        CHECK_EQ(req[conv::kOut], kWriteTo);
+        size_t expected = param_.no_bias ? 3 : 4;
+        CHECK_EQ(in_data.size(), expected);
+        CHECK_EQ(out_data.size(), 1U);
+        LayerSetUp(in_data[conv::kData].shape_, in_data[conv::kOffset].shape_, out_data[conv::kOut].shape_);
+        Stream<xpu>* s = ctx.get_stream<xpu>();
+        // allocate workspace for col_buffer
+        Tensor<xpu, 1, DType> workspace = ctx.requested[conv::kTempSpace]
+          .get_space_typed<xpu, 1, DType>(Shape1(col_buffer_size_), s);
+        // calculate the shape of col_buffer
+        TShape col_buffer_shape(num_spatial_axes_ + 1);
+        col_buffer_shape[0] = conv_in_channels_ * param_.kernel.Size();
+        for (index_t i = 1; i < col_buffer_shape.ndim(); ++i) {
+          col_buffer_shape[i] = out_data[0].shape_[i + 1];
+        }
+        // create a column buffer using workspace and col_buffer_shape
+        TBlob col_buffer(workspace.dptr_, col_buffer_shape, xpu::kDevMask, DataType<DType>::kFlag);
+
+        // initialize weight and col_buffer 3D tensors for using gemm
+        index_t M = conv_out_channels_ / group_;
+        index_t N = conv_out_spatial_dim_;
+        index_t K = kernel_dim_;
+        Tensor<xpu, 3, DType> weight_3d = in_data[conv::kWeight].get_with_shape<xpu, 3, DType>(
+          Shape3(group_, M, K), s);
+        Tensor<xpu, 3, DType> col_buffer_3d = col_buffer.get_with_shape<xpu, 3, DType>(
+          Shape3(group_, K, N), s);
+        Tensor<xpu, 4, DType> output_4d = out_data[conv::kOut].get_with_shape<xpu, 4, DType>(
+          Shape4(num_, group_, M, N), s);
+        for (index_t n = 0; n < num_; ++n) {
+          // transform image to col_buffer in order to use gemm
+          deformable_im2col(s, in_data[conv::kData].dptr<DType>() + n*input_dim_,
+            in_data[conv::kOffset].dptr<DType>() + n*input_offset_dim_, in_data[conv::kData].shape_,
+            col_buffer.shape_, param_.kernel, param_.pad, param_.stride, param_.dilate, param_.num_deformable_group,
+            col_buffer.dptr<DType>());
+          Tensor<xpu, 3, DType> output_3d = output_4d[n];
+          for (index_t g = 0; g < group_; ++g) {
+            ASSIGN_DISPATCH(output_3d[g], req[conv::kOut], dot(weight_3d[g], col_buffer_3d[g]));
+          }
+        }
+        if (bias_term_) {
+          Tensor<xpu, 1, DType> bias = in_data[conv::kBias].get<xpu, 1, DType>(s);
+          Tensor<xpu, 3, DType> output_3d = out_data[conv::kOut].get_with_shape<xpu, 3, DType>(
+            Shape3(num_, conv_out_channels_, conv_out_spatial_dim_), s);
+          // has bias term, broadcast it to the same shape of output_3d in channel dim
+          output_3d += mshadow::expr::broadcast<1>(bias, output_3d.shape_);
+        }
+      }
+
+      virtual void Backward(const OpContext &ctx,
+        const std::vector<TBlob>& out_grad,
+        const std::vector<TBlob>& in_data,
+        const std::vector<TBlob>& out_data,
+        const std::vector<OpReqType>& req,
+        const std::vector<TBlob>& in_grad,
+        const std::vector<TBlob>& aux_args) {
+        using namespace mshadow;
+        using namespace mshadow::expr;
+        CHECK_EQ(out_grad.size(), 1U);
+        size_t expected = param_.no_bias == 0 ? 4 : 3;
+        CHECK(in_data.size() == expected && in_grad.size() == expected);
+        CHECK_EQ(req.size(), expected);
+        CHECK_EQ(in_data[conv::kWeight].CheckContiguous(), true);
+        LayerSetUp(in_grad[conv::kData].shape_, in_grad[conv::kOffset].shape_, out_grad[conv::kOut].shape_);
+        Stream<xpu> *s = ctx.get_stream<xpu>();
+        // allocate workspace for col_buffer
+        Tensor<xpu, 1, DType> workspace = ctx.requested[conv::kTempSpace]
+          .get_space_typed<xpu, 1, DType>(Shape1(col_buffer_size_), s);
+        // calculate the shape of col_buffer
+        TShape col_buffer_shape(num_spatial_axes_ + 1);
+        col_buffer_shape[0] = conv_in_channels_ * param_.kernel.Size();
+        for (index_t i = 1; i < col_buffer_shape.ndim(); ++i) {
+          col_buffer_shape[i] = out_grad[conv::kOut].shape_[i + 1];
+        }
+        // create a column buffer using workspace and col_buffer_shape
+        TBlob col_buffer(workspace.dptr_, col_buffer_shape, xpu::kDevMask, DataType<DType>::kFlag);
+
+        // initialize weight and col_buffer 3D tensors for using gemm
+        // For computing dLoss/d(in_data[kData])
+        index_t M = kernel_dim_;
+        index_t N = conv_out_spatial_dim_;
+        index_t K = conv_out_channels_ / group_;
+        Tensor<xpu, 3, DType> weight_3d = in_data[conv::kWeight].get_with_shape<xpu, 3, DType>(
+          Shape3(group_, K, M), s);
+        Tensor<xpu, 4, DType> out_grad_4d = out_grad[conv::kOut].get_with_shape<xpu, 4, DType>(
+          Shape4(num_, group_, K, N), s);
+        Tensor<xpu, 3, DType> col_buffer_3d = col_buffer.get_with_shape<xpu, 3, DType>(
+          Shape3(group_, M, N), s);
+        // For computing dLoss/dWeight
+        Tensor<xpu, 3, DType> dweight_3d = in_grad[conv::kWeight].get_with_shape<xpu, 3, DType>(
+          Shape3(group_, K, M), s);
+
+        Tensor<xpu, 1, DType> data_grad = in_grad[conv::kData].FlatTo1D<xpu, DType>(s);
+        data_grad = 0;
+
+
+        for (index_t n = 0; n < num_; ++n) {
+          Tensor<xpu, 3, DType> out_grad_3d = out_grad_4d[n];
+          for (index_t g = 0; g < group_; ++g) {
+            col_buffer_3d[g] = dot(weight_3d[g].T(), out_grad_3d[g]);
+          }
+
+          // gradient w.r.t. input coordinate data
+          deformable_col2im_coord(s, col_buffer.dptr<DType>(),
+            in_data[conv::kData].dptr<DType>() + n*input_dim_, in_data[conv::kOffset].dptr<DType>() + n*input_offset_dim_,
+            in_grad[conv::kData].shape_, col_buffer.shape_,
+            param_.kernel, param_.pad, param_.stride, param_.dilate, param_.num_deformable_group,
+            in_grad[conv::kOffset].dptr<DType>() + n*input_offset_dim_,
+            req[conv::kOffset]);
+
+          // gradient w.r.t. input data
+          deformable_col2im(s, col_buffer.dptr<DType>(),
+            in_data[conv::kOffset].dptr<DType>() + n*input_offset_dim_, in_grad[conv::kData].shape_, col_buffer.shape_,
+            param_.kernel, param_.pad, param_.stride, param_.dilate, param_.num_deformable_group,
+            in_grad[conv::kData].dptr<DType>() + n*input_dim_,
+            req[conv::kData]);
+
+          // gradient w.r.t. weight, dWeight should accumulate across the batch and group
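+          deformable_im2col(s, in_data[conv::kData].dptr<DType>() + n*input_dim_,
+            in_data[conv::kOffset].dptr<DType>() + n*input_offset_dim_, in_data[conv::kData].shape_, col_buffer.shape_,
+            param_.kernel, param_.pad, param_.stride, param_.dilate, param_.num_deformable_group, col_buffer.dptr<DType>());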
+          for (index_t g = 0; g < group_; ++g) {
+            if (0 == n) {
+              ASSIGN_DISPATCH(dweight_3d[g], req[conv::kWeight],
+                dot(out_grad_3d[g], col_buffer_3d[g].T()));
+            }
+            else {
+              dweight_3d[g] += dot(out_grad_3d[g], col_buffer_3d[g].T());
+            }
+          }
+        }
+
+        // gradient w.r.t bias
+        if (bias_term_) {
+          Tensor<xpu, 1, DType> dbias = in_grad[conv::kBias].get<xpu, 1, DType>(s);
+          Tensor<xpu, 3, DType> dout = out_grad[conv::kOut].get_with_shape<xpu, 3, DType>(
+            Shape3(num_, conv_out_channels_, conv_out_spatial_dim_), s);
+          ASSIGN_DISPATCH(dbias, req[conv::kBias], sumall_except_dim<1>(dout));
+        }
+
+      }
+
+    private:
+      void LayerSetUp(const TShape& ishape, const TShape& offset_shape, const TShape& oshape) {
+        channel_axis_ = 1;  // hard code channel axis
+        const index_t first_spatial_axis = channel_axis_ + 1;
+        const index_t num_axes = param_.kernel.ndim() + 2;
+        num_spatial_axes_ = num_axes - first_spatial_axis;
+        is_1x1_ = true;
+        for (index_t i = 0; i < param_.kernel.ndim(); ++i) {
+          is_1x1_ &= param_.kernel[i] == 1 && param_.stride[i] == 1 && param_.pad[i] == 0;
+          if (!is_1x1_) break;
+        }
+
+        // batch size
+        num_ = ishape[0];
+        // number of input channels
+        channels_ = ishape[1];
+        group_ = param_.num_group;
+        conv_out_channels_ = param_.num_filter;
+        conv_in_channels_ = channels_;
+        bias_term_ = !param_.no_bias;
+        kernel_dim_ = conv_in_channels_ / group_ * param_.kernel.Size();
+        weight_offset_ = conv_out_channels_ * kernel_dim_ / group_;
+        conv_out_spatial_dim_ = oshape.ProdShape(2, oshape.ndim());
+        col_offset_ = kernel_dim_ * conv_out_spatial_dim_;
+        output_offset_ = conv_out_channels_ * conv_out_spatial_dim_ / group_;
+        // size of the column buffer used for storing im2col-ed pixels
+        col_buffer_size_ = kernel_dim_ * group_ * conv_out_spatial_dim_;
+        // input/output image size (#channels * height * width)
+        input_dim_ = ishape.ProdShape(1, ishape.ndim());
+        input_offset_dim_ = offset_shape.ProdShape(1, offset_shape.ndim());
+        output_dim_ = oshape.ProdShape(1, oshape.ndim());
+        num_kernels_im2col_ = conv_in_channels_ * conv_out_spatial_dim_;
+        num_kernels_col2im_ = input_dim_;
+      }
+
+    private:
+      DeformableConvolutionParam param_;
+      index_t channel_axis_;  // channel axis of the input
+      index_t channels_;  // number of channels of input image
+      index_t num_spatial_axes_;  // number of spatial axes
+      index_t num_;  // batch size
+      index_t group_;  // number of groups
+      index_t conv_out_channels_;  // number of output channels (num_filter)
+      index_t conv_out_spatial_dim_;  // number of pixels of output images per channel
+      index_t conv_in_channels_;  // number of input channels
+      index_t kernel_dim_;  // number of input channels per group * kernel size
+      index_t weight_offset_;  // number of output channels per group * kernel_dim_
+      index_t col_offset_;
+      index_t output_offset_;
+      index_t col_buffer_size_;
+      index_t input_dim_;
+      index_t input_offset_dim_;
+      index_t output_dim_;
+      index_t num_kernels_im2col_;
+      index_t num_kernels_col2im_;
+      bool bias_term_;  // has bias term?
+      bool is_1x1_;
+    };  // class DeformableConvolutionOp
+
+    template<typename xpu>
+    Operator* CreateOp(DeformableConvolutionParam param, int dtype,
+      std::vector<TShape> *in_shape,
+      std::vector<TShape> *out_shape,
+      Context ctx);
+
+#if DMLC_USE_CXX11
+    class DeformableConvolutionProp : public OperatorProperty {
+    public:
+      std::vector<std::string> ListArguments() const override {
+        if (!param_.no_bias) {
+          return{ "data", "offset", "weight", "bias" };
+        }
+        else {
+          return{ "data", "offset", "weight" };
+        }
+      }
+
+      void Init(const std::vector<std::pair<std::string, std::string> >& kwargs) override {
+        using namespace mshadow;
+        param_.Init(kwargs);
+        if (param_.kernel.ndim() == 2) {
+          param_.layout = param_.layout ? param_.layout.value() : mshadow::kNCHW;
+          if (param_.stride.ndim() == 0) param_.stride = Shape2(1, 1);
+          if (param_.dilate.ndim() == 0) param_.dilate = Shape2(1, 1);
+          if (param_.pad.ndim() == 0) param_.pad = Shape2(0, 0);
+        }
+        else {
+          LOG(FATAL) << "not implemented";
+        }
+      }
+
+      std::map<std::string, std::string> GetParams() const override {
+        return param_.__DICT__();
+      }
+
+      bool InferShape(std::vector<TShape> *in_shape,
+        std::vector<TShape> *out_shape,
+        std::vector<TShape> *aux_shape) const override {
+        using namespace mshadow;
+        if (!param_.no_bias) {
+          CHECK_EQ(in_shape->size(), 4U) << "Input:[data, offset, weight, bias]";
+        }
+        else {
+          CHECK_EQ(in_shape->size(), 3U) << "Input:[data, offset, weight]";
+        }
+        out_shape->resize(1, TShape());
+        const TShape &dshp = (*in_shape)[conv::kData];
+        const TShape &oshp = (*in_shape)[conv::kOffset];
+        if (dshp.ndim() == 0) return false;
+        if (param_.kernel.ndim() == 2) {
+          // 2d conv
+          CHECK_EQ(dshp.ndim(), 4U) \
+            << "Input data should be 4D in batch-num_filter-y-x";
+          CHECK_EQ(oshp.ndim(), 4U) \
+            << "Input offset should be 4D in batch-num_filter-y-x";
+          Shape<4> dshape = ConvertLayout(dshp.get<4>(), param_.layout.value(), kNCHW);
+          Shape<4> offsetshape = ConvertLayout(oshp.get<4>(), param_.layout.value(), kNCHW);
+          Shape<4> wshape = Shape4(param_.num_filter / param_.num_group, dshape[1] / param_.num_group,
+            param_.kernel[0], param_.kernel[1]);
+          wshape = ConvertLayout(wshape, kNCHW, param_.layout.value());
+          wshape[0] *= param_.num_group;
+          SHAPE_ASSIGN_CHECK(*in_shape, conv::kWeight, wshape);
+          if (!param_.no_bias) {
+            SHAPE_ASSIGN_CHECK(*in_shape, conv::kBias, Shape1(param_.num_filter));
+          }
+
+          const index_t ksize_y = static_cast<index_t>(param_.kernel[0]);
+          const index_t ksize_x = static_cast<index_t>(param_.kernel[1]);
+          CHECK_EQ(dshape[1] % param_.num_group, 0U) \
+            << "input channels must be divisible by num_group";
+          CHECK_EQ(dshape[1] % param_.num_deformable_group, 0U) \
+            << "input channels must be divisible by num_deformable_group";
+          CHECK_EQ(param_.num_filter % param_.num_group, 0U) \
+            << "output num_filter must be divisible by num_group";
+          CHECK_GT(param_.kernel.Size(), 0U) \
+            << "incorrect kernel size: " << param_.kernel;
+          CHECK_GT(param_.stride.Size(), 0U) \
+            << "incorrect stride size: " << param_.stride;
+          CHECK_GT(param_.dilate.Size(), 0U) \
+            << "incorrect dilate size: " << param_.dilate;
+          Shape<4> oshape;
+          oshape[0] = dshape[0];
+          oshape[1] = param_.num_filter;
+          oshape[2] = (dshape[2] + 2 * param_.pad[0] -
+            (param_.dilate[0] * (ksize_y - 1) + 1)) / param_.stride[0] + 1;
+          oshape[3] = (dshape[3] + 2 * param_.pad[1] -
+            (param_.dilate[1] * (ksize_x - 1) + 1)) / param_.stride[1] + 1;
+          SHAPE_ASSIGN_CHECK(*out_shape, 0, ConvertLayout(oshape, kNCHW, param_.layout.value()));
+          CHECK_EQ(oshape[1] % param_.num_deformable_group, 0U) \
+            << "output num_filter must be divisible by num_deformable_group";
+          CHECK_EQ(oshape[2], offsetshape[2]) \
+            << "output height must equal offset map height";
+          CHECK_EQ(oshape[3], offsetshape[3]) \
+            << "output width must equal offset map width";
+          CHECK_EQ(offsetshape[1] % (param_.kernel[0] * param_.kernel[1]), 0U) \
+            << "offset channels must be divisible by kernel_h * kernel_w";
+          CHECK_EQ(offsetshape[1] / (2 * param_.kernel[0] * param_.kernel[1]), param_.num_deformable_group) \
+            << "offset channels must be 2 * kernel_h * kernel_w * num_deformable_group";
+          // Perform incomplete shape inference. Fill in the missing values in data shape.
+          // 1) We can always fill in the batch_size.
+          // 2) We can back-calculate the input height/width if the corresponding stride is 1.
+          oshape = ConvertLayout((*out_shape)[0].get<4>(), param_.layout.value(), kNCHW);
+          dshape[0] = oshape[0];
+          if (param_.stride[0] == 1) {
+            dshape[2] = oshape[2] + param_.dilate[0] * (ksize_y - 1) - 2 * param_.pad[0];
+          }
+          if (param_.stride[1] == 1) {
+            dshape[3] = oshape[3] + param_.dilate[1] * (ksize_x - 1) - 2 * param_.pad[1];
+          }
+          SHAPE_ASSIGN_CHECK(*in_shape, conv::kData,
+            ConvertLayout(dshape, kNCHW, param_.layout.value()));
+          // Check whether the kernel sizes are valid
+          if (dshape[2] != 0) {
+            CHECK_LE(ksize_y, dshape[2] + 2 * param_.pad[0]) << "kernel size exceeds input";
+          }
+          if (dshape[3] != 0) {
+            CHECK_LE(ksize_x, dshape[3] + 2 * param_.pad[1]) << "kernel size exceeds input";
+          }
+          return true;
+        }
+        else {
+          LOG(FATAL) << "not implemented";
+          return false;
+        }
+      }
+
+      bool InferType(std::vector<int> *in_type,
+        std::vector<int> *out_type,
+        std::vector<int> *aux_type) const override {
+        CHECK_GE(in_type->size(), 1U);
+        int dtype = (*in_type)[0];
+        CHECK_NE(dtype, -1) << "First input must have specified type";
+        for (index_t i = 0; i < in_type->size(); ++i) {
+          if ((*in_type)[i] == -1) {
+            (*in_type)[i] = dtype;
+          }
+          else {
+            CHECK_EQ((*in_type)[i], dtype) << "This layer requires uniform type. "
+              << "Expected " << dtype << " v.s. given "
+              << (*in_type)[i] << " at " << ListArguments()[i];
+          }
+        }
+        out_type->clear();
+        out_type->push_back(dtype);
+        return true;
+      }
+
+      OperatorProperty* Copy() const override {
+        auto ptr = new DeformableConvolutionProp();
+        ptr->param_ = param_;
+        return ptr;
+      }
+
+      std::string TypeString() const override {
+        return "_contrib_DeformableConvolution";
+      }
+
+      std::vector<int> DeclareBackwardDependency(
+        const std::vector<int> &out_grad,
+        const std::vector<int> &in_data,
+        const std::vector<int> &out_data) const override {
+        return{ out_grad[conv::kOut], in_data[conv::kData], in_data[conv::kOffset], in_data[conv::kWeight] };
+      }
+
+      std::vector<ResourceRequest> ForwardResource(
+        const std::vector<TShape> &in_shape) const override {
+        return{ ResourceRequest::kTempSpace };
+      }
+
+      std::vector<ResourceRequest> BackwardResource(
+        const std::vector<TShape> &in_shape) const override {
+        return{ ResourceRequest::kTempSpace };
+      }
+
+      Operator* CreateOperator(Context ctx) const override {
+        LOG(FATAL) << "Not Implemented.";
+        return NULL;
+      }
+
+      Operator* CreateOperatorEx(Context ctx, std::vector<TShape> *in_shape,
+        std::vector<int> *in_type) const override;
+
+    private:
+      DeformableConvolutionParam param_;
+    };  // class DeformableConvolutionProp
+#endif  // DMLC_USE_CXX11
+  }  // namespace op
+}  // namespace mxnet
+#endif  // MXNET_OPERATOR_DEFORMABLE_CONVOLUTION_INL_H_
diff --git a/faster_rcnn/operator_cxx/deformable_convolution.cc b/faster_rcnn/operator_cxx/deformable_convolution.cc
new file mode 100644
index 0000000..a5916a5
--- /dev/null
+++ b/faster_rcnn/operator_cxx/deformable_convolution.cc
@@ -0,0 +1,89 @@
+/*!
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file deformable_convolution.cc
+ * \brief
+ * \author Yuwen Xiong, Haozhi Qi, Jifeng Dai
+*/
+
+#include "./deformable_convolution-inl.h"
+
+namespace mxnet {
+namespace op {
+DMLC_REGISTER_PARAMETER(DeformableConvolutionParam);
+
+template<>
+Operator* CreateOp<cpu>(DeformableConvolutionParam param, int dtype,
+                        std::vector<TShape> *in_shape,
+                        std::vector<TShape> *out_shape,
+                        Context ctx) {
+  Operator *op = NULL;
+  MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
+    op = new DeformableConvolutionOp<cpu, DType>(param);
+  })
+  return op;
+}
+
+// DO_BIND_DISPATCH comes from operator_common.h
+Operator *DeformableConvolutionProp::CreateOperatorEx(Context ctx,
+                                            std::vector<TShape> *in_shape,
+                                            std::vector<int> *in_type) const {
+  std::vector<TShape> out_shape, aux_shape;
+  std::vector<int> out_type, aux_type;
+  CHECK(InferType(in_type, &out_type, &aux_type));
+  CHECK(InferShape(in_shape, &out_shape, &aux_shape));
+  DO_BIND_DISPATCH(CreateOp, param_, (*in_type)[0], in_shape, &out_shape, ctx);
+}
+
+MXNET_REGISTER_OP_PROPERTY(_contrib_DeformableConvolution, DeformableConvolutionProp)
+.describe(R"code(Compute *N*-D convolution on *(N+2)*-D input.
+
+In the 2-D convolution, given input data with shape *(batch_size,
+channel, height, width)*, the output is computed by
+
+.. math::
+
+   out[n,i,:,:] = bias[i] + \sum_{j=0}^{channel-1} data[n,j,:,:] \star
+   weight[i,j,:,:]
+
+where :math:`\star` is the 2-D cross-correlation operator.
+
+For general 2-D convolution, the shapes are
+
+- **data**: *(batch_size, channel, height, width)*
+- **weight**: *(num_filter, channel, kernel[0], kernel[1])*
+- **bias**: *(num_filter,)*
+- **out**: *(batch_size, num_filter, out_height, out_width)*.
+
+Define::
+
+  f(x,k,p,s,d) = floor((x+2*p-d*(k-1)-1)/s)+1
+
+then we have::
+
+  out_height=f(height, kernel[0], pad[0], stride[0], dilate[0])
+  out_width=f(width, kernel[1], pad[1], stride[1], dilate[1])
+
+If ``no_bias`` is set to be true, then the ``bias`` term is ignored.
+
+The default data ``layout`` is *NCHW*, namely *(batch_size, channel, height,
+width)*. We can choose other layouts such as *NHWC*.
+
+If ``num_group`` is larger than 1, denoted by *g*, then split the input ``data``
+evenly into *g* parts along the channel axis, and also evenly split ``weight``
+along the first dimension. Next compute the convolution on the *i*-th part of
+the data with the *i*-th weight part. The output is obtained by concatenating all
+the *g* results.
+
+Both ``weight`` and ``bias`` are learnable parameters.
+
+
+)code" ADD_FILELINE)
+.add_argument("data", "NDArray-or-Symbol", "Input data to the DeformableConvolutionOp.")
+.add_argument("offset", "NDArray-or-Symbol", "Input offset to the DeformableConvolutionOp.")
+.add_argument("weight", "NDArray-or-Symbol", "Weight matrix.")
+.add_argument("bias", "NDArray-or-Symbol", "Bias parameter.")
+.add_arguments(DeformableConvolutionParam::__FIELDS__());
+
+}  // namespace op
+}  // namespace mxnet
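Review note on the output-size formula in the operator description above: the definition f(x,k,p,s,d) = floor((x+2*p-d*(k-1)-1)/s)+1 can be checked with a small standalone helper. This is a minimal sketch, not code from the patch; the name `ConvOutSize` is hypothetical, and it assumes the dilated kernel fits inside the padded input so that integer division matches `floor`.

```cpp
#include <cassert>

// Output extent along one spatial axis, following the formula in the
// operator description:
//   f(x,k,p,s,d) = floor((x + 2*p - d*(k-1) - 1) / s) + 1
// Hypothetical helper for illustration; it is not part of the patch.
inline int ConvOutSize(int x, int k, int p, int s, int d) {
  assert(k > 0 && s > 0 && d > 0 && p >= 0);
  const int numer = x + 2 * p - d * (k - 1) - 1;
  // Assumption: the dilated kernel fits inside the padded input, so the
  // numerator is non-negative and integer division equals floor().
  assert(numer >= 0);
  return numer / s + 1;
}
```

For a 224-pixel axis, a 7-wide kernel with pad 3 and stride 2 gives `ConvOutSize(224, 7, 3, 2, 1) == 112`, and a 3-wide kernel with pad 1 and stride 1 keeps the size at 224.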
diff --git a/faster_rcnn/operator_cxx/deformable_convolution.cu b/faster_rcnn/operator_cxx/deformable_convolution.cu
new file mode 100644
index 0000000..59948fd
--- /dev/null
+++ b/faster_rcnn/operator_cxx/deformable_convolution.cu
@@ -0,0 +1,29 @@
+/*!
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file deformable_convolution.cu
+ * \brief
+ * \author Yuwen Xiong, Haozhi Qi, Jifeng Dai
+*/
+
+#include "./deformable_convolution-inl.h"
+#include <vector>
+
+namespace mxnet {
+  namespace op {
+
+    template<>
+    Operator* CreateOp<gpu>(DeformableConvolutionParam param, int dtype,
+      std::vector<TShape> *in_shape,
+      std::vector<TShape> *out_shape,
+      Context ctx) {
+      Operator *op = NULL;
+      MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
+        op = new DeformableConvolutionOp<gpu, DType>(param);
+      })
+      return op;
+    }
+
+  }  // namespace op
+}  // namespace mxnet
+
diff --git a/faster_rcnn/operator_cxx/deformable_psroi_pooling-inl.h b/faster_rcnn/operator_cxx/deformable_psroi_pooling-inl.h
new file mode 100644
index 0000000..7226299
--- /dev/null
+++ b/faster_rcnn/operator_cxx/deformable_psroi_pooling-inl.h
@@ -0,0 +1,280 @@
+/*!
+* Copyright (c) 2017 Microsoft
+* Licensed under The Apache-2.0 License [see LICENSE for details]
+* \file deformable_psroi_pooling-inl.h
+* \brief deformable psroi pooling operator and symbol
+* \author Yi Li, Guodong Zhang, Jifeng Dai
+*/
+#ifndef MXNET_OPERATOR_DEFORMABLE_PSROI_POOLING_INL_H_
+#define MXNET_OPERATOR_DEFORMABLE_PSROI_POOLING_INL_H_
+
+#include <dmlc/logging.h>
+#include <dmlc/parameter.h>
+#include <mxnet/operator.h>
+#include <map>
+#include <vector>
+#include <string>
+#include <utility>
+#include "../mshadow_op.h"
+#include "../operator_common.h"
+
+
+namespace mxnet {
+  namespace op {
+
+    // Declare enumeration of input order to make code more intuitive.
+    // These enums are only visible within this header
+    namespace deformablepsroipool {
+      enum DeformablePSROIPoolingOpInputs { kData, kBox, kTrans };
+      enum DeformablePSROIPoolingOpOutputs { kOut, kTopCount };
+    }  // namespace deformablepsroipool
+
+    struct DeformablePSROIPoolingParam : public dmlc::Parameter<DeformablePSROIPoolingParam> {
+      // TShape pooled_size;
+      float spatial_scale;
+      int output_dim;
+      int group_size;
+      int pooled_size;
+      int part_size;
+      int sample_per_part;
+      float trans_std;
+      bool no_trans;
+      DMLC_DECLARE_PARAMETER(DeformablePSROIPoolingParam) {
+        DMLC_DECLARE_FIELD(spatial_scale).set_range(0.0, 1.0)
+          .describe("Ratio of input feature map height (or width) to raw image height (or width). "
+            "Equals the reciprocal of total stride in convolutional layers");
+        DMLC_DECLARE_FIELD(output_dim).describe("Number of output channels.");
+        DMLC_DECLARE_FIELD(group_size).describe("Number of position-sensitive groups along each spatial axis.");
+        DMLC_DECLARE_FIELD(pooled_size).describe("Height and width of the pooled output.");
+        DMLC_DECLARE_FIELD(part_size).set_default(0).describe("Height and width of the trans offset map; 0 means use pooled_size.");
+        DMLC_DECLARE_FIELD(sample_per_part).set_default(1).describe("Number of samples along each axis within a bin.");
+        DMLC_DECLARE_FIELD(trans_std).set_default(0.0).set_range(0.0, 1.0).describe("Scale factor applied to the trans offsets.");
+        DMLC_DECLARE_FIELD(no_trans).set_default(false)
+          .describe("Whether to disable trans parameter.");
+      }
+    };
+
+    template<typename xpu, typename DType>
+    class DeformablePSROIPoolingOp : public Operator {
+    public:
+      explicit DeformablePSROIPoolingOp(DeformablePSROIPoolingParam p) {
+        this->param_ = p;
+      }
+
+      virtual void Forward(const OpContext &ctx,
+        const std::vector<TBlob> &in_data,
+        const std::vector<OpReqType> &req,
+        const std::vector<TBlob> &out_data,
+        const std::vector<TBlob> &aux_args) {
+        using namespace mshadow;
+        size_t in_expected = param_.no_trans? 2 : 3;
+        size_t out_expected = 2;
+        CHECK_EQ(in_data.size(), in_expected);
+        CHECK_EQ(out_data.size(), out_expected);
+        CHECK_EQ(out_data[deformablepsroipool::kOut].shape_[0], in_data[deformablepsroipool::kBox].shape_[0]);
+        CHECK_EQ(out_data[deformablepsroipool::kTopCount].shape_[0], in_data[deformablepsroipool::kBox].shape_[0]);
+        Stream<xpu> *s = ctx.get_stream<xpu>();
+
+        Tensor<xpu, 4, DType> data = in_data[deformablepsroipool::kData].get<xpu, 4, DType>(s);
+        Tensor<xpu, 2, DType> bbox = in_data[deformablepsroipool::kBox].get<xpu, 2, DType>(s);
+        Tensor<xpu, 4, DType> out = out_data[deformablepsroipool::kOut].get<xpu, 4, DType>(s);
+        Tensor<xpu, 4, DType> top_count = out_data[deformablepsroipool::kTopCount].get<xpu, 4, DType>(s);
+        CHECK_EQ(data.CheckContiguous(), true);
+        CHECK_EQ(bbox.CheckContiguous(), true);
+        CHECK_EQ(out.CheckContiguous(), true);
+        CHECK_EQ(top_count.CheckContiguous(), true);
+        out = -FLT_MAX;
+        top_count = 0.0f;
+
+        Tensor<xpu, 4, DType> trans;
+        if (!param_.no_trans) {
+          trans = in_data[deformablepsroipool::kTrans].get<xpu, 4, DType>(s);
+        }
+        DeformablePSROIPoolForward(out, data, bbox, trans, top_count, param_.no_trans, param_.spatial_scale, 
+          param_.output_dim, param_.group_size, param_.pooled_size, param_.part_size, param_.sample_per_part, param_.trans_std);
+      }
+
+      virtual void Backward(const OpContext &ctx,
+        const std::vector<TBlob> &out_grad,
+        const std::vector<TBlob> &in_data,
+        const std::vector<TBlob> &out_data,
+        const std::vector<OpReqType> &req,
+        const std::vector<TBlob> &in_grad,
+        const std::vector<TBlob> &aux_args) {
+        using namespace mshadow;
+        size_t in_expected = param_.no_trans ? 2 : 3;
+        size_t out_expected = 2;
+        CHECK_EQ(in_data.size(), in_expected);
+        CHECK_EQ(out_data.size(), out_expected);
+        CHECK_EQ(out_grad[deformablepsroipool::kOut].shape_[0], in_data[deformablepsroipool::kBox].shape_[0]);
+        CHECK_EQ(out_data[deformablepsroipool::kTopCount].shape_[0], in_data[deformablepsroipool::kBox].shape_[0]);
+        CHECK_NE(req[deformablepsroipool::kData], kWriteInplace) <<
+          "DeformablePSROIPooling: Backward doesn't support kWriteInplace.";
+        CHECK_NE(req[deformablepsroipool::kBox], kWriteInplace) <<
+          "DeformablePSROIPooling: Backward doesn't support kWriteInplace.";
+        // CHECK_NE(req[deformablepsroipool::kTrans], kWriteInplace) <<
+        //  "DeformablePSROIPooling: Backward doesn't support kWriteInplace.";
+        Stream<xpu> *s = ctx.get_stream<xpu>();
+
+        Tensor<xpu, 4, DType> grad_out = out_grad[deformablepsroipool::kOut].get<xpu, 4, DType>(s);
+        Tensor<xpu, 4, DType> data = in_data[deformablepsroipool::kData].get<xpu, 4, DType>(s);
+        Tensor<xpu, 2, DType> bbox = in_data[deformablepsroipool::kBox].get<xpu, 2, DType>(s);
+        Tensor<xpu, 4, DType> top_count = out_data[deformablepsroipool::kTopCount].get<xpu, 4, DType>(s);
+        Tensor<xpu, 4, DType> grad_in = in_grad[deformablepsroipool::kData].get<xpu, 4, DType>(s);
+        Tensor<xpu, 2, DType> grad_roi = in_grad[deformablepsroipool::kBox].get<xpu, 2, DType>(s);
+        Tensor<xpu, 4, DType> grad_trans;
+        Tensor<xpu, 4, DType> trans;
+        if (!param_.no_trans) {
+          CHECK_EQ(in_grad.size(), 3);
+          trans = in_data[deformablepsroipool::kTrans].get<xpu, 4, DType>(s);
+          grad_trans = in_grad[deformablepsroipool::kTrans].get<xpu, 4, DType>(s);
+        }
+
+        CHECK_EQ(grad_out.CheckContiguous(), true);
+        CHECK_EQ(data.CheckContiguous(), true);
+        CHECK_EQ(bbox.CheckContiguous(), true);
+        CHECK_EQ(top_count.CheckContiguous(), true);
+        CHECK_EQ(grad_in.CheckContiguous(), true);
+
+        Assign(grad_in, req[deformablepsroipool::kData], 0);
+        if (!param_.no_trans) {
+          Assign(grad_trans, req[deformablepsroipool::kTrans], 0);
+        }
+        DeformablePSROIPoolBackwardAcc(grad_in, grad_trans, grad_out, data, bbox, trans, top_count, param_.no_trans,
+          param_.spatial_scale, param_.output_dim, param_.group_size, param_.pooled_size, param_.part_size,
+          param_.sample_per_part, param_.trans_std);
+        Assign(grad_roi, req[deformablepsroipool::kBox], 0);
+      }
+
+    private:
+      DeformablePSROIPoolingParam param_;
+    };  // class DeformablePSROIPoolingOp
+
+    // Declare factory function, used for dispatch specialization
+    template<typename xpu>
+    Operator* CreateOp(DeformablePSROIPoolingParam param, int dtype);
+
+#if DMLC_USE_CXX11
+    class DeformablePSROIPoolingProp : public OperatorProperty {
+    public:
+      std::vector<std::string> ListArguments() const override {
+        if (param_.no_trans) {
+          return{ "data", "rois" };
+        }
+        else {
+          return{ "data", "rois", "trans" };
+        }
+      }
+
+      std::vector<std::string> ListOutputs() const override {
+        return{ "output", "top_count" };
+      }
+
+      int NumOutputs() const override {
+        return 2;
+      }
+
+      int NumVisibleOutputs() const override {
+        return 1;
+      }
+
+      void Init(const std::vector<std::pair<std::string, std::string> >& kwargs) override {
+        param_.Init(kwargs);
+        if (param_.part_size == 0) {
+          param_.part_size = param_.pooled_size;
+        }
+      }
+
+      std::map<std::string, std::string> GetParams() const override {
+        return param_.__DICT__();
+      }
+
+      bool InferShape(std::vector<TShape> *in_shape,
+        std::vector<TShape> *out_shape,
+        std::vector<TShape> *aux_shape) const override {
+        using namespace mshadow;
+        if (param_.no_trans) {
+          CHECK_EQ(in_shape->size(), 2) << "Input:[data, rois]";
+        }
+        else {
+          CHECK_EQ(in_shape->size(), 3) << "Input:[data, rois, trans]";
+          // trans: [num_rois, 2 * num_classes, part_size, part_size]
+          TShape tshape = in_shape->at(deformablepsroipool::kTrans);
+          CHECK_EQ(tshape.ndim(), 4) << "trans should be a 4D tensor";
+        }
+
+        // data: [batch_size, c, h, w]
+        TShape dshape = in_shape->at(deformablepsroipool::kData);
+        CHECK_EQ(dshape.ndim(), 4) << "data should be a 4D tensor";
+
+        // bbox: [num_rois, 5]
+        TShape bshape = in_shape->at(deformablepsroipool::kBox);
+        CHECK_EQ(bshape.ndim(), 2) << "bbox should be a 2D tensor of shape [num_rois, 5]";
+        CHECK_EQ(bshape[1], 5) << "bbox should be a 2D tensor of shape [num_rois, 5]";
+
+        // out: [num_rois, c, pooled_h, pooled_w]
+        // top_count: [num_rois, c, pooled_h, pooled_w]
+        out_shape->clear();
+        out_shape->push_back(
+          Shape4(bshape[0], param_.output_dim, param_.pooled_size, param_.pooled_size));
+        out_shape->push_back(
+          Shape4(bshape[0], param_.output_dim, param_.pooled_size, param_.pooled_size));
+        return true;
+      }
+
+      bool InferType(std::vector<int> *in_type,
+        std::vector<int> *out_type,
+        std::vector<int> *aux_type) const override {
+        CHECK_GE(in_type->size(), 2);
+        int dtype = (*in_type)[0];
+        CHECK_EQ(dtype, (*in_type)[1]);
+        CHECK_NE(dtype, -1) << "Input must have specified type";
+
+        out_type->clear();
+        out_type->push_back(dtype);
+        out_type->push_back(dtype);
+        return true;
+      }
+
+      OperatorProperty* Copy() const override {
+        DeformablePSROIPoolingProp* deformable_psroi_pooling_sym = new DeformablePSROIPoolingProp();
+        deformable_psroi_pooling_sym->param_ = this->param_;
+        return deformable_psroi_pooling_sym;
+      }
+
+      std::string TypeString() const override {
+        return "_contrib_DeformablePSROIPooling";
+      }
+
+      // declare dependency and inplace optimization options
+      std::vector<int> DeclareBackwardDependency(
+        const std::vector<int> &out_grad,
+        const std::vector<int> &in_data,
+        const std::vector<int> &out_data) const override {
+        if (param_.no_trans) {
+          return{ out_grad[deformablepsroipool::kOut], in_data[deformablepsroipool::kData], in_data[deformablepsroipool::kBox],
+            out_data[deformablepsroipool::kTopCount] };
+        }
+        else {
+          return{ out_grad[deformablepsroipool::kOut], in_data[deformablepsroipool::kData], in_data[deformablepsroipool::kBox],
+            in_data[deformablepsroipool::kTrans], out_data[deformablepsroipool::kTopCount] };
+        }
+      }
+
+
+      Operator* CreateOperator(Context ctx) const override {
+        LOG(FATAL) << "Not Implemented.";
+        return NULL;
+      }
+
+      Operator* CreateOperatorEx(Context ctx, std::vector<TShape> *in_shape,
+        std::vector<int> *in_type) const override;
+
+
+    private:
+      DeformablePSROIPoolingParam param_;
+    };  // class DeformablePSROIPoolingProp
+#endif
+  }  // namespace op
+}  // namespace mxnet
+#endif  // MXNET_OPERATOR_DEFORMABLE_PSROI_POOLING_INL_H_
\ No newline at end of file
diff --git a/faster_rcnn/operator_cxx/deformable_psroi_pooling.cc b/faster_rcnn/operator_cxx/deformable_psroi_pooling.cc
new file mode 100644
index 0000000..4a21a79
--- /dev/null
+++ b/faster_rcnn/operator_cxx/deformable_psroi_pooling.cc
@@ -0,0 +1,96 @@
+/*!
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file deformable_psroi_pooling.cc
+ * \brief 
+ * \author Yi Li, Guodong Zhang, Jifeng Dai
+*/
+#include "./deformable_psroi_pooling-inl.h"
+#include <mshadow/base.h>
+#include <mshadow/tensor.h>
+#include <mshadow/packet-inl.h>
+#include <mshadow/dot_engine-inl.h>
+#include <cassert>
+
+using std::max;
+using std::min;
+using std::floor;
+using std::ceil;
+
+namespace mshadow {
+  template<typename DType>
+  inline void DeformablePSROIPoolForward(const Tensor<cpu, 4, DType> &out,
+    const Tensor<cpu, 4, DType> &data,
+    const Tensor<cpu, 2, DType> &bbox,
+    const Tensor<cpu, 4, DType> &trans,
+    const Tensor<cpu, 4, DType> &top_count,
+    const bool no_trans,
+    const float spatial_scale,
+    const int output_dim,
+    const int group_size,
+    const int pooled_size,
+    const int part_size,
+    const int sample_per_part,
+    const float trans_std) {
+    // Not implemented on CPU; this overload is a no-op.
+    return;
+  }
+
+  template<typename DType>
+  inline void DeformablePSROIPoolBackwardAcc(const Tensor<cpu, 4, DType> &in_grad,
+    const Tensor<cpu, 4, DType> &trans_grad,
+    const Tensor<cpu, 4, DType> &out_grad,
+    const Tensor<cpu, 4, DType> &data,
+    const Tensor<cpu, 2, DType> &bbox,
+    const Tensor<cpu, 4, DType> &trans,
+    const Tensor<cpu, 4, DType> &top_count,
+    const bool no_trans,
+    const float spatial_scale,
+    const int output_dim,
+    const int group_size,
+    const int pooled_size,
+    const int part_size,
+    const int sample_per_part,
+    const float trans_std) {
+    // Not implemented on CPU; this overload is a no-op.
+    return;
+  }
+}  // namespace mshadow
+
+namespace mxnet {
+  namespace op {
+
+    template<>
+    Operator *CreateOp<cpu>(DeformablePSROIPoolingParam param, int dtype) {
+      Operator* op = NULL;
+      MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
+        op = new DeformablePSROIPoolingOp<cpu, DType>(param);
+      });
+      return op;
+    }
+
+    Operator *DeformablePSROIPoolingProp::CreateOperatorEx(Context ctx, std::vector<TShape> *in_shape,
+      std::vector<int> *in_type) const {
+      std::vector<TShape> out_shape, aux_shape;
+      std::vector<int> out_type, aux_type;
+      CHECK(InferType(in_type, &out_type, &aux_type));
+      CHECK(InferShape(in_shape, &out_shape, &aux_shape));
+      DO_BIND_DISPATCH(CreateOp, param_, in_type->at(0));
+    }
+
+    DMLC_REGISTER_PARAMETER(DeformablePSROIPoolingParam);
+
+    MXNET_REGISTER_OP_PROPERTY(_contrib_DeformablePSROIPooling, DeformablePSROIPoolingProp)
+      .describe("Performs deformable position-sensitive region-of-interest pooling on inputs. "
+        "Bounding box coordinates are scaled by spatial_scale and each bin is shifted by the "
+        "offsets in trans. Samples in a bin are averaged to a pooled_size x pooled_size output. "
+        "batch_size will change to the number of region bounding boxes after DeformablePSROIPooling")
+      .add_argument("data", "Symbol", "Input data to the pooling operator, a 4D feature map")
+      .add_argument("rois", "Symbol", "Bounding box coordinates, a 2D array of "
+        "[[batch_index, x1, y1, x2, y2]]. (x1, y1) and (x2, y2) are top-left and bottom-right corners "
+        "of designated region of interest. batch_index indicates the index of corresponding image "
+        "in the input data")
+      .add_argument("trans", "Symbol", "transition parameter")
+      .add_arguments(DeformablePSROIPoolingParam::__FIELDS__());
+  }  // namespace op
+}  // namespace mxnet
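Review note on the ROI scaling mentioned in the operator description above: since the CPU overloads in this file are stubs, a host-side reference for the per-ROI geometry may help when checking the GPU path. The following is a minimal sketch that mirrors the arithmetic at the top of `DeformablePSROIPoolForwardKernel` in deformable_psroi_pooling.cu (rounding, the half-pixel shift, and the 0.1 minimum extent). The struct and function names are hypothetical and not part of the patch, and `float` is assumed in place of `DType`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical container for the per-ROI geometry; not part of the patch.
struct RoiGeometry {
  float start_w, start_h;  // ROI origin in feature-map coordinates
  float width, height;     // ROI extent, clamped to at least 0.1
  float bin_w, bin_h;      // size of one pooled bin
};

// Scales an ROI given in image coordinates onto the feature map and
// derives the bin size, using the same steps as the CUDA forward kernel.
inline RoiGeometry ComputeRoiGeometry(float x1, float y1, float x2, float y2,
                                      float spatial_scale, int pooled_size) {
  assert(pooled_size > 0);
  RoiGeometry g;
  g.start_w = std::round(x1) * spatial_scale - 0.5f;
  g.start_h = std::round(y1) * spatial_scale - 0.5f;
  // The end coordinate is exclusive, hence the +1 before scaling.
  const float end_w = (std::round(x2) + 1.0f) * spatial_scale - 0.5f;
  const float end_h = (std::round(y2) + 1.0f) * spatial_scale - 0.5f;
  // Clamp to a small positive extent so the bin size is never zero.
  g.width = std::max(end_w - g.start_w, 0.1f);
  g.height = std::max(end_h - g.start_h, 0.1f);
  g.bin_w = g.width / static_cast<float>(pooled_size);
  g.bin_h = g.height / static_cast<float>(pooled_size);
  return g;
}
```

With `spatial_scale = 1/16`, an ROI spanning x in [0, 15] and y in [0, 31] maps to a 1 x 2 region on the feature map starting at (-0.5, -0.5), so a 4 x 4 pooled output has bins of size 0.25 x 0.5.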
diff --git a/faster_rcnn/operator_cxx/deformable_psroi_pooling.cu b/faster_rcnn/operator_cxx/deformable_psroi_pooling.cu
new file mode 100644
index 0000000..5b8f361
--- /dev/null
+++ b/faster_rcnn/operator_cxx/deformable_psroi_pooling.cu
@@ -0,0 +1,402 @@
+/*!
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file deformable_psroi_pooling.cu
+ * \brief
+ * \author Yi Li, Guodong Zhang, Jifeng Dai
+*/
+#include "./deformable_psroi_pooling-inl.h"
+#include <mshadow/tensor.h>
+#include <mshadow/cuda/reduce.cuh>
+#include <algorithm>
+#include <vector>
+#include "../../common/cuda_utils.h"
+#include "../mxnet_op.h"
+
+#define DeformablePSROIPOOLING_CUDA_CHECK(condition) \
+  /* Code block avoids redefinition of cudaError_t error */ \
+  do { \
+    cudaError_t error = condition; \
+    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
+  } while (0)
+#define CUDA_KERNEL_LOOP(i, n) \
+for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
+      i < (n); \
+      i += blockDim.x * gridDim.x)
+
+namespace mshadow {
+  namespace cuda {
+    template <typename DType>
+    __device__ DType bilinear_interp(
+      const DType* data,
+      const DType x,
+      const DType y,
+      const int width,
+      const int height) {
+      int x1 = floor(x);
+      int x2 = ceil(x);
+      int y1 = floor(y);
+      int y2 = ceil(y);
+      DType dist_x = static_cast<DType>(x - x1);
+      DType dist_y = static_cast<DType>(y - y1);
+      DType value11 = data[y1*width + x1];
+      DType value12 = data[y2*width + x1];
+      DType value21 = data[y1*width + x2];
+      DType value22 = data[y2*width + x2];
+      DType value = (1 - dist_x)*(1 - dist_y)*value11 + (1 - dist_x)*dist_y*value12
+        + dist_x*(1 - dist_y)*value21 + dist_x*dist_y*value22;
+      return value;
+    }
+
+    template <typename DType>
+    __global__ void DeformablePSROIPoolForwardKernel(
+      const int count,
+      const DType* bottom_data,
+      const DType spatial_scale,
+      const int channels,
+      const int height, const int width,
+      const int pooled_height, const int pooled_width,
+      const DType* bottom_rois, const DType* bottom_trans,
+      const bool no_trans,
+      const DType trans_std,
+      const int sample_per_part,
+      const int output_dim,
+      const int group_size,
+      const int part_size,
+      const int num_classes,
+      const int channels_each_class,
+      DType* top_data,
+      DType* top_count) {
+      CUDA_KERNEL_LOOP(index, count) {
+        // The output is in order (n, ctop, ph, pw)
+        int pw = index % pooled_width;
+        int ph = (index / pooled_width) % pooled_height;
+        int ctop = (index / pooled_width / pooled_height) % output_dim;
+        int n = index / pooled_width / pooled_height / output_dim;
+
+        // [start, end) interval for spatial sampling
+        const DType* offset_bottom_rois = bottom_rois + n * 5;
+        int roi_batch_ind = offset_bottom_rois[0];
+        DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
+        DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
+        DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
+        DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
+
+        // Clamp degenerate ROIs to a minimum extent of 0.1
+        DType roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
+        DType roi_height = max(roi_end_h - roi_start_h, 0.1);
+
+        // Compute w and h at bottom
+        DType bin_size_h = roi_height / static_cast<DType>(pooled_height);
+        DType bin_size_w = roi_width / static_cast<DType>(pooled_width);
+
+        DType sub_bin_size_h = bin_size_h / static_cast<DType>(sample_per_part);
+        DType sub_bin_size_w = bin_size_w / static_cast<DType>(sample_per_part);
+
+        int part_h = floor(static_cast<DType>(ph) / pooled_height*part_size);
+        int part_w = floor(static_cast<DType>(pw) / pooled_width*part_size);
+        int class_id = ctop / channels_each_class;
+        DType trans_x = no_trans ? static_cast<DType>(0) :
+          bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h)*part_size + part_w] * trans_std;
+        DType trans_y = no_trans ? static_cast<DType>(0) :
+          bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h)*part_size + part_w] * trans_std;
+        
+        DType wstart = static_cast<DType>(pw)* bin_size_w
+          + roi_start_w;
+        wstart += trans_x * roi_width;
+        DType hstart = static_cast<DType>(ph) * bin_size_h
+          + roi_start_h;
+        hstart += trans_y * roi_height;
+        
+        DType sum = 0;
+        int count = 0;
+        int gw = floor(static_cast<DType>(pw) * group_size / pooled_width);
+        int gh = floor(static_cast<DType>(ph)* group_size / pooled_height);
+        gw = min(max(gw, 0), group_size - 1);
+        gh = min(max(gh, 0), group_size - 1);
+
+        const DType* offset_bottom_data = bottom_data + (roi_batch_ind * channels) * height * width;
+        for (int ih = 0; ih < sample_per_part; ih++) {
+          for (int iw = 0; iw < sample_per_part; iw++) {
+            DType w = wstart + iw*sub_bin_size_w;
+            DType h = hstart + ih*sub_bin_size_h;
+            // bilinear interpolation
+            if (w<-0.5 || w>width - 0.5 || h<-0.5 || h>height - 0.5) {
+              continue;
+            }
+            w = min(max(w, 0.), width - 1.);
+            h = min(max(h, 0.), height - 1.);
+            int c = (ctop*group_size + gh)*group_size + gw;
+            DType val = bilinear_interp(offset_bottom_data + c*height*width, w, h, width, height);
+            sum += val;
+            count++;
+          }
+        }
+        top_data[index] = count == 0 ? static_cast<DType>(0) : sum / count;
+        top_count[index] = count;
+      }
+    }
+
+    template<typename DType>
+    inline void DeformablePSROIPoolForward(const Tensor<gpu, 4, DType> &out,
+      const Tensor<gpu, 4, DType> &data,
+      const Tensor<gpu, 2, DType> &bbox,
+      const Tensor<gpu, 4, DType> &trans,
+      const Tensor<gpu, 4, DType> &top_count,
+      const bool no_trans,
+      const float spatial_scale,
+      const int output_dim,
+      const int group_size,
+      const int pooled_size,
+      const int part_size,
+      const int sample_per_part,
+      const float trans_std) {
+      // LOG(INFO) << "DeformablePSROIPoolForward";
+      const DType *bottom_data = data.dptr_;
+      const DType *bottom_rois = bbox.dptr_;
+      const DType *bottom_trans = no_trans ? NULL : trans.dptr_;
+      DType *top_data = out.dptr_;
+      DType *top_count_data = top_count.dptr_;
+      const int count = out.shape_.Size();
+      const int channels = data.size(1);
+      const int height = data.size(2);
+      const int width = data.size(3);
+      const int pooled_height = pooled_size;
+      const int pooled_width = pooled_size;
+      const int num_classes = no_trans ? 1 : trans.size(1) / 2;
+      const int channels_each_class = no_trans ? output_dim : output_dim / num_classes;
+
+      cudaStream_t stream = Stream<gpu>::GetStream(out.stream_);
+      DeformablePSROIPoolForwardKernel<DType> << <mxnet::op::mxnet_op::cuda_get_num_blocks(count),
+        kBaseThreadNum, 0, stream >> >(
+        count, bottom_data, spatial_scale, channels, height, width, pooled_height, pooled_width,
+        bottom_rois, bottom_trans, no_trans, trans_std, sample_per_part, output_dim, 
+        group_size, part_size, num_classes, channels_each_class, top_data, top_count_data);
+      DeformablePSROIPOOLING_CUDA_CHECK(cudaPeekAtLastError());
+    }
+
+
+    template <typename DType>
+    __global__ void DeformablePSROIPoolBackwardAccKernel(
+      const int count,
+      const DType* top_diff,
+      const DType* top_count,
+      const int num_rois,
+      const DType spatial_scale,
+      const int channels,
+      const int height, const int width,
+      const int pooled_height, const int pooled_width,
+      const int output_dim,
+      DType* bottom_data_diff, DType* bottom_trans_diff,
+      const DType* bottom_data,
+      const DType* bottom_rois,
+      const DType* bottom_trans,
+      const bool no_trans,
+      const DType trans_std,
+      const int sample_per_part,
+      const int group_size,
+      const int part_size,
+      const int num_classes,
+      const int channels_each_class) {
+      CUDA_KERNEL_LOOP(index, count) {
+        // The output is in order (n, ctop, ph, pw)
+        int pw = index % pooled_width;
+        int ph = (index / pooled_width) % pooled_height;
+        int ctop = (index / pooled_width / pooled_height) % output_dim;
+        int n = index / pooled_width / pooled_height / output_dim;
+
+        // [start, end) interval for spatial sampling
+        const DType* offset_bottom_rois = bottom_rois + n * 5;
+        int roi_batch_ind = offset_bottom_rois[0];
+        DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
+        DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
+        DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
+        DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
+
+        // Clamp degenerate ROIs to a minimum extent of 0.1
+        DType roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
+        DType roi_height = max(roi_end_h - roi_start_h, 0.1);
+
+        // Compute w and h at bottom
+        DType bin_size_h = roi_height / static_cast<DType>(pooled_height);
+        DType bin_size_w = roi_width / static_cast<DType>(pooled_width);
+
+        DType sub_bin_size_h = bin_size_h / static_cast<DType>(sample_per_part);
+        DType sub_bin_size_w = bin_size_w / static_cast<DType>(sample_per_part);
+
+        int part_h = floor(static_cast<DType>(ph) / pooled_height*part_size);
+        int part_w = floor(static_cast<DType>(pw) / pooled_width*part_size);
+        int class_id = ctop / channels_each_class;
+        DType trans_x = no_trans ? static_cast<DType>(0) :
+          bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h)*part_size + part_w] * trans_std;
+        DType trans_y = no_trans ? static_cast<DType>(0) :
+          bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h)*part_size + part_w] * trans_std;
+
+        DType wstart = static_cast<DType>(pw)* bin_size_w
+          + roi_start_w;
+        wstart += trans_x * roi_width;
+        DType hstart = static_cast<DType>(ph) * bin_size_h
+          + roi_start_h;
+        hstart += trans_y * roi_height;
+
+        if (top_count[index] <= 0) {
+          continue;
+        }
+        DType diff_val = top_diff[index] / top_count[index];
+        const DType* offset_bottom_data = bottom_data + roi_batch_ind * channels * height * width;
+        DType* offset_bottom_data_diff = bottom_data_diff + roi_batch_ind * channels * height * width;
+        int gw = floor(static_cast<DType>(pw)* group_size / pooled_width);
+        int gh = floor(static_cast<DType>(ph)* group_size / pooled_height);
+        gw = min(max(gw, 0), group_size - 1);
+        gh = min(max(gh, 0), group_size - 1);
+
+        for (int ih = 0; ih < sample_per_part; ih++) {
+          for (int iw = 0; iw < sample_per_part; iw++) {
+            DType w = wstart + iw*sub_bin_size_w;
+            DType h = hstart + ih*sub_bin_size_h;
+            // bilinear interpolation
+            if (w<-0.5 || w>width - 0.5 || h<-0.5 || h>height - 0.5) {
+              continue;
+            }
+            w = min(max(w, 0.), width - 1.);
+            h = min(max(h, 0.), height - 1.);
+            int c = (ctop*group_size + gh)*group_size + gw;
+            // backward on feature
+            int x0 = floor(w);
+            int x1 = ceil(w);
+            int y0 = floor(h);
+            int y1 = ceil(h);
+            DType dist_x = w - x0, dist_y = h - y0;
+            DType q00 = (1 - dist_x)*(1 - dist_y);
+            DType q01 = (1 - dist_x)*dist_y;
+            DType q10 = dist_x*(1 - dist_y);
+            DType q11 = dist_x*dist_y;
+            int bottom_index_base = c * height *width;
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y0*width + x0, q00*diff_val);
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y1*width + x0, q01*diff_val);
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y0*width + x1, q10*diff_val);
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y1*width + x1, q11*diff_val);
+
+            if (no_trans) {
+              continue;
+            }
+            DType U00 = offset_bottom_data[bottom_index_base + y0*width + x0];
+            DType U01 = offset_bottom_data[bottom_index_base + y1*width + x0];
+            DType U10 = offset_bottom_data[bottom_index_base + y0*width + x1];
+            DType U11 = offset_bottom_data[bottom_index_base + y1*width + x1];
+            DType diff_x = (U11*dist_y + U10*(1 - dist_y) - U01*dist_y - U00*(1 - dist_y))
+              *trans_std*diff_val;
+            diff_x *= roi_width;
+            DType diff_y = (U11*dist_x + U01*(1 - dist_x) - U10*dist_x - U00*(1 - dist_x))
+              *trans_std*diff_val;
+            diff_y *= roi_height;
+
+            atomicAdd(bottom_trans_diff + (((n * num_classes + class_id) * 2) * part_size + part_h)*part_size + part_w, diff_x);
+            atomicAdd(bottom_trans_diff + (((n * num_classes + class_id) * 2 + 1)*part_size + part_h)*part_size + part_w, diff_y);
+          }
+        }
+      }
+    }
+
+
+    template<typename DType>
+    inline void DeformablePSROIPoolBackwardAcc(const Tensor<gpu, 4, DType> &in_grad,
+      const Tensor<gpu, 4, DType> &trans_grad,
+      const Tensor<gpu, 4, DType> &out_grad,
+      const Tensor<gpu, 4, DType> &data,
+      const Tensor<gpu, 2, DType> &bbox,
+      const Tensor<gpu, 4, DType> &trans,
+      const Tensor<gpu, 4, DType> &top_count,
+      const bool no_trans,
+      const float spatial_scale,
+      const int output_dim,
+      const int group_size,
+      const int pooled_size,
+      const int part_size,
+      const int sample_per_part,
+      const float trans_std) {
+      // LOG(INFO) << "DeformablePSROIPoolBackward";
+      const DType *top_diff = out_grad.dptr_;
+      const DType *bottom_data = data.dptr_;
+      const DType *bottom_rois = bbox.dptr_;
+      const DType *bottom_trans = no_trans ? NULL : trans.dptr_;
+      DType *bottom_data_diff = in_grad.dptr_;
+      DType *bottom_trans_diff = no_trans ? NULL : trans_grad.dptr_;
+      const DType *top_count_data = top_count.dptr_;
+      const int count = out_grad.shape_.Size();
+      const int num_rois = bbox.size(0);
+      const int channels = in_grad.size(1);
+      const int height = in_grad.size(2);
+      const int width = in_grad.size(3);
+      const int pooled_height = pooled_size;
+      const int pooled_width = pooled_size;
+      const int num_classes = no_trans ? 1 : trans_grad.size(1) / 2;
+      const int channels_each_class = no_trans ? output_dim : output_dim / num_classes;
+
+      cudaStream_t stream = Stream<gpu>::GetStream(in_grad.stream_);
+      DeformablePSROIPoolBackwardAccKernel<DType> << <mxnet::op::mxnet_op::cuda_get_num_blocks(count),
+        kBaseThreadNum, 0, stream >> >(
+        count, top_diff, top_count_data, num_rois, spatial_scale, channels, height, width,
+        pooled_height, pooled_width, output_dim, bottom_data_diff, bottom_trans_diff,
+        bottom_data, bottom_rois, bottom_trans, no_trans, trans_std, sample_per_part,
+        group_size, part_size, num_classes, channels_each_class);
+      DeformablePSROIPOOLING_CUDA_CHECK(cudaPeekAtLastError());
+    }
+
+  }  // namespace cuda
+
+  template<typename DType>
+  inline void DeformablePSROIPoolForward(const Tensor<gpu, 4, DType> &out,
+    const Tensor<gpu, 4, DType> &data,
+    const Tensor<gpu, 2, DType> &bbox,
+    const Tensor<gpu, 4, DType> &trans,
+    const Tensor<gpu, 4, DType> &top_count,
+    const bool no_trans,
+    const float spatial_scale,
+    const int output_dim,
+    const int group_size,
+    const int pooled_size,
+    const int part_size,
+    const int sample_per_part,
+    const float trans_std) {
+    cuda::DeformablePSROIPoolForward(out, data, bbox, trans, top_count, no_trans, spatial_scale,
+      output_dim, group_size, pooled_size, part_size, sample_per_part, trans_std);
+  }
+
+  template<typename DType>
+  inline void DeformablePSROIPoolBackwardAcc(const Tensor<gpu, 4, DType> &in_grad,
+    const Tensor<gpu, 4, DType> &trans_grad,
+    const Tensor<gpu, 4, DType> &out_grad,
+    const Tensor<gpu, 4, DType> &data,
+    const Tensor<gpu, 2, DType> &bbox,
+    const Tensor<gpu, 4, DType> &trans,
+    const Tensor<gpu, 4, DType> &top_count,
+    const bool no_trans,
+    const float spatial_scale,
+    const int output_dim,
+    const int group_size,
+    const int pooled_size,
+    const int part_size,
+    const int sample_per_part,
+    const float trans_std) {
+    cuda::DeformablePSROIPoolBackwardAcc(in_grad, trans_grad, out_grad, data, bbox, trans, top_count, no_trans,
+      spatial_scale, output_dim, group_size, pooled_size, part_size, sample_per_part, trans_std);
+  }
+
+}  // namespace mshadow
+
+
+namespace mxnet {
+  namespace op {
+
+    template<>
+    Operator* CreateOp<gpu>(DeformablePSROIPoolingParam param, int dtype) {
+      Operator* op = NULL;
+      MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
+        op = new DeformablePSROIPoolingOp<gpu, DType>(param);
+      });
+      return op;
+    }
+
+  }  // namespace op
+}  // namespace mxnet
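Reviewer note on the backward kernel above: the per-sample math can be checked on the host. The following is a minimal sketch, assuming plain C++ with no MXNet or CUDA dependency. The names `BilinearSample`, `MakeSample`, `Interpolate` and `DInterpolateDx` are hypothetical and do not appear in the patch. They mirror the `q00..q11` weights and the bracketed factor of `diff_x` (before it is scaled by `trans_std`, `diff_val` and `roi_width`).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side mirror of the per-sample math in
// DeformablePSROIPoolBackwardAccKernel; illustrative only.
struct BilinearSample {
  int x0, x1, y0, y1;         // corner indices: floor/ceil of (w, h)
  double dist_x, dist_y;      // fractional parts of w and h
  double q00, q01, q10, q11;  // weights, named as in the kernel
};

inline BilinearSample MakeSample(double w, double h) {
  BilinearSample s;
  s.x0 = static_cast<int>(std::floor(w));
  s.x1 = static_cast<int>(std::ceil(w));
  s.y0 = static_cast<int>(std::floor(h));
  s.y1 = static_cast<int>(std::ceil(h));
  s.dist_x = w - s.x0;
  s.dist_y = h - s.y0;
  s.q00 = (1 - s.dist_x) * (1 - s.dist_y);
  s.q01 = (1 - s.dist_x) * s.dist_y;
  s.q10 = s.dist_x * (1 - s.dist_y);
  s.q11 = s.dist_x * s.dist_y;
  return s;
}

// Interpolated value. As in the kernel, U01 is the feature at (x0, y1)
// and U10 is the feature at (x1, y0).
inline double Interpolate(const BilinearSample& s, double U00, double U01,
                          double U10, double U11) {
  return s.q00 * U00 + s.q01 * U01 + s.q10 * U10 + s.q11 * U11;
}

// Derivative of the interpolated value with respect to w (the x coordinate).
inline double DInterpolateDx(const BilinearSample& s, double U00, double U01,
                             double U10, double U11) {
  return U11 * s.dist_y + U10 * (1 - s.dist_y)
       - U01 * s.dist_y - U00 * (1 - s.dist_y);
}
```

The four weights sum to 1 for any in-range sample. When `w` is an integer, `x0 == x1` and `dist_x == 0`, so the `q10` and `q11` terms vanish. A finite difference of `Interpolate` in `w` within one cell should agree with `DInterpolateDx`.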
diff --git a/faster_rcnn/operator_cxx/nn/deformable_im2col.cuh b/faster_rcnn/operator_cxx/nn/deformable_im2col.cuh
new file mode 100644
index 0000000..d9e7b97
--- /dev/null
+++ b/faster_rcnn/operator_cxx/nn/deformable_im2col.cuh
@@ -0,0 +1,525 @@
+/*!
+ ******************* BEGIN Caffe Copyright Notice and Disclaimer ****************
+ *
+ * COPYRIGHT
+ * 
+ * All contributions by the University of California:
+ * Copyright (c) 2014-2017 The Regents of the University of California (Regents)
+ * All rights reserved.
+ * 
+ * All other contributions:
+ * Copyright (c) 2014-2017, the respective contributors
+ * All rights reserved.
+ * 
+ * Caffe uses a shared copyright model: each contributor holds copyright over
+ * their contributions to Caffe. The project versioning records all such
+ * contribution and copyright details. If a contributor wants to further mark
+ * their specific copyright on a particular contribution, they should indicate
+ * their copyright solely in the commit message of the change when it is
+ * committed.
+ * 
+ * LICENSE
+ * 
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met: 
+ * 
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer. 
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution. 
+ * 
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ * 
+ * CONTRIBUTION AGREEMENT
+ * 
+ * By contributing to the BVLC/caffe repository through pull-request, comment,
+ * or otherwise, the contributor releases their content to the
+ * license and copyright terms herein.
+ *
+ ***************** END Caffe Copyright Notice and Disclaimer ********************
+ *
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file deformable_im2col.cuh
+ * \brief Function definitions of converting an image to
+ * column matrix based on kernel, padding, dilation, and offset.
+ * These functions are mainly used in deformable convolution operators.
+ * \ref: https://arxiv.org/abs/1703.06211
+ * \author Yuwen Xiong, Haozhi Qi, Jifeng Dai
+ */
+
+#ifndef MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_CUH_
+#define MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_CUH_
+
+#include <mxnet/base.h>
+#include <mxnet/operator.h>
+#include <algorithm>
+#include <cstring>
+#include <vector>
+#include "../../mxnet_op.h"
+#include "../../../common/cuda_utils.h"
+
+
+
+namespace mxnet {
+namespace op {
+
+template <typename DType>
+__device__ DType deformable_im2col_bilinear(const DType* bottom_data, const int data_width, 
+  const int height, const int width, DType h, DType w) {
+
+  int h_low = floor(h);
+  int w_low = floor(w);
+  int h_high;
+  int w_high;
+  if (h_low >= height - 1) {
+    h_high = h_low = height - 1;
+    h = (DType)h_low;
+  }
+  else {
+    h_high = h_low + 1;
+  }
+
+  if (w_low >= width - 1) {
+    w_high = w_low = width - 1;
+    w = (DType)w_low;
+  }
+  else {
+    w_high = w_low + 1;
+  }
+
+  DType lh = h - h_low;
+  DType lw = w - w_low;
+  DType hh = 1 - lh, hw = 1 - lw;
+
+  DType v1 = bottom_data[h_low * data_width + w_low];
+  DType v2 = bottom_data[h_low * data_width + w_high];
+  DType v3 = bottom_data[h_high * data_width + w_low];
+  DType v4 = bottom_data[h_high * data_width + w_high];
+  DType w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;
+
+  DType val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);
+  return val;
+}
+
+
+template <typename DType>
+__device__ DType get_gradient_weight(DType argmax_h, DType argmax_w, 
+  const int h, const int w, const int height, const int width) {
+
+  if (argmax_h < 0 || argmax_h > height || argmax_w < 0 || argmax_w > width) {
+    //empty
+    return 0;
+  }
+
+  argmax_h = max(argmax_h, (DType)0.0f);
+  argmax_w = max(argmax_w, (DType)0.0f);
+
+  int argmax_h_low = (int)argmax_h;
+  int argmax_w_low = (int)argmax_w;
+  int argmax_h_high;
+  int argmax_w_high;
+  if (argmax_h_low >= height - 1) {
+    argmax_h_high = argmax_h_low = height - 1;
+    argmax_h = (DType)argmax_h_low;
+  } else {
+    argmax_h_high = argmax_h_low + 1;
+  }
+  if (argmax_w_low >= width - 1)
+  {
+    argmax_w_high = argmax_w_low = width - 1;
+    argmax_w = (DType)argmax_w_low;
+  } else {
+    argmax_w_high = argmax_w_low + 1;
+  }
+  DType weight = 0;
+  if (h == argmax_h_low) {
+    if (w == argmax_w_low) {
+      weight = (h + 1 - argmax_h) * (w + 1 - argmax_w);
+    } else if (w == argmax_w_high) {
+      weight = (h + 1 - argmax_h) * (argmax_w + 1 - w);
+    }
+  } else if (h == argmax_h_high) {
+    if (w == argmax_w_low) {
+      weight = (argmax_h + 1 - h) * (w + 1 - argmax_w);
+    } else if (w == argmax_w_high) {
+      weight = (argmax_h + 1 - h) * (argmax_w + 1 - w);
+    }
+  }
+  return weight;
+}
+
+
+template <typename DType>
+__device__ DType get_coordinate_weight(DType argmax_h, DType argmax_w,
+  const int height, const int width, const DType* im_data,
+  const int data_width, const int bp_dir) {
+
+  if (argmax_h < 0 || argmax_h > height || argmax_w < 0 || argmax_w > width)
+  {
+    //empty
+    return 0;
+  }
+
+  if (argmax_h < 0) argmax_h = 0;
+  if (argmax_w < 0) argmax_w = 0;
+
+  int argmax_h_low = (int)argmax_h;
+  int argmax_w_low = (int)argmax_w;
+  int argmax_h_high;
+  int argmax_w_high;
+  if (argmax_h_low >= height - 1) {
+    argmax_h_high = argmax_h_low = height - 1;
+    argmax_h = (DType)argmax_h_low;
+  } else {
+    argmax_h_high = argmax_h_low + 1;
+  }
+  if (argmax_w_low >= width - 1) {
+    argmax_w_high = argmax_w_low = width - 1;
+    argmax_w = (DType)argmax_w_low;
+  } else {
+    argmax_w_high = argmax_w_low + 1;
+  }
+  DType weight = 0;
+
+  if (bp_dir == 0) {
+    weight += -1 * (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_low * data_width + argmax_w_low];
+    weight += -1 * (argmax_w - argmax_w_low) * im_data[argmax_h_low * data_width + argmax_w_high];
+    weight += (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_high * data_width + argmax_w_low];
+    weight += (argmax_w - argmax_w_low) * im_data[argmax_h_high * data_width + argmax_w_high];
+  } else if (bp_dir == 1) {
+    weight += -1 * (argmax_h_low + 1 - argmax_h) * im_data[argmax_h_low * data_width + argmax_w_low];
+    weight += (argmax_h_low + 1 - argmax_h) * im_data[argmax_h_low * data_width + argmax_w_high];
+    weight += -1 * (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_low];
+    weight += (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_high];
+  }
+
+  return weight;
+}
+
+
+/*!
+ * \brief deformable_im2col gpu kernel.
+ * DO NOT call this directly. Use wrapper function deformable_im2col() instead.
+ */
+template <typename DType>
+__global__ void deformable_im2col_gpu_kernel(const int n, const DType* data_im, const DType* data_offset,
+  const int height, const int width, const int kernel_h, const int kernel_w,
+  const int pad_h, const int pad_w,
+  const int stride_h, const int stride_w,
+  const int dilation_h, const int dilation_w,
+  const int channel_per_deformable_group,
+  const int height_col, const int width_col,
+  DType* data_col) {
+  CUDA_KERNEL_LOOP(index, n) {
+    // index: linear index over (c_im, h_col, w_col), one thread per output position and input channel
+    const int w_col = index % width_col;
+    const int h_col = (index / width_col) % height_col;
+    const int c_im = (index / width_col) / height_col;
+    const int c_col = c_im * kernel_h * kernel_w;
+
+    // compute deformable group index
+    const int deformable_group_index = c_im / channel_per_deformable_group;
+
+    const int h_in = h_col * stride_h - pad_h;
+    const int w_in = w_col * stride_w - pad_w;
+    DType* data_col_ptr = data_col + (c_col * height_col + h_col) * width_col + w_col;
+    const DType* data_im_ptr = data_im + (c_im * height + h_in) * width + w_in;
+    const DType* data_offset_ptr = data_offset + deformable_group_index * 2 * kernel_h * kernel_w * height_col * width_col;
+
+
+    for (int i = 0; i < kernel_h; ++i) {
+      for (int j = 0; j < kernel_w; ++j) {
+        const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_col) * width_col + w_col;
+        const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_col) * width_col + w_col;
+        const DType offset_h = data_offset_ptr[data_offset_h_ptr];
+        const DType offset_w = data_offset_ptr[data_offset_w_ptr];
+        DType val = static_cast<DType>(0);
+        const DType h_im = h_in + i * dilation_h + offset_h;
+        const DType w_im = w_in + j * dilation_w + offset_w;
+        if (h_im >= 0 && w_im >= 0 && h_im < height && w_im < width) {
+          const DType map_h = i * dilation_h + offset_h;
+          const DType map_w = j * dilation_w + offset_w;
+          const int cur_height = height - h_in;
+          const int cur_width = width - w_in;
+          val = deformable_im2col_bilinear(data_im_ptr, width, cur_height, cur_width, map_h, map_w);
+        }
+        *data_col_ptr = val;
+        data_col_ptr += height_col * width_col;
+      }
+    }
+  }
+}
+
+
+
+
+
+
+/*!\brief
+ * gpu function of deformable_im2col algorithm
+ * \param s device stream
+ * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W,)
+ * \param col_shape column buffer shape (#channels, output_im_height, output_im_width, ...)
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups that the deformable convolution uses
+ * \param data_col column buffer pointer
+ */
+template <typename DType>
+inline void deformable_im2col(mshadow::Stream<gpu>* s,
+  const DType* data_im, const DType* data_offset, 
+  const TShape& im_shape, const TShape& col_shape, const TShape& kernel_shape,
+  const TShape& pad, const TShape& stride, const TShape& dilation, 
+  const uint32_t deformable_group, DType* data_col) {
+  // num_axes should be smaller than block size
+  index_t num_spatial_axes = kernel_shape.ndim();
+  CHECK_LT(num_spatial_axes, mshadow::cuda::kBaseThreadNum);
+  index_t channel_per_deformable_group = im_shape[1] / deformable_group;
+  index_t num_kernels = im_shape[1] * col_shape.ProdShape(1, col_shape.ndim());
+  using namespace mxnet_op;
+  switch (num_spatial_axes) {
+  case 2:
+    deformable_im2col_gpu_kernel<DType> // NOLINT_NEXT_LINE(whitespace/operators)
+        <<<cuda_get_num_blocks(num_kernels), mshadow::cuda::kBaseThreadNum,
+           0, mshadow::Stream<gpu>::GetStream(s)>>>(
+        num_kernels, data_im, data_offset, im_shape[2], im_shape[3], kernel_shape[0], kernel_shape[1],
+        pad[0], pad[1], stride[0], stride[1], dilation[0], dilation[1], channel_per_deformable_group,
+        col_shape[1], col_shape[2], data_col);
+    MSHADOW_CUDA_POST_KERNEL_CHECK(deformable_im2col_gpu_kernel);
+    break;
+  default:
+    LOG(FATAL) << "deformable_im2col (gpu) does not support computation with "
+               << num_spatial_axes << " spatial axes";
+  }
+}
+
+
+/*!
+* \brief deformable_col2im gpu kernel.
+* DO NOT call this directly. Use wrapper function deformable_col2im() instead.
+*/
+template <typename DType>
+__global__ void deformable_col2im_gpu_kernel(const int n, const DType* data_col, const DType* data_offset,
+  const int channels, const int height, const int width,
+  const int kernel_h, const int kernel_w,
+  const int pad_h, const int pad_w,
+  const int stride_h, const int stride_w,
+  const int dilation_h, const int dilation_w,
+  const int channel_per_deformable_group,
+  const int height_col, const int width_col,
+  DType* grad_im, OpReqType req) {
+  CUDA_KERNEL_LOOP(index, n) {
+    const int j = (index / width_col / height_col) % kernel_w;
+    const int i = (index / width_col / height_col / kernel_w) % kernel_h;
+    const int c = index / width_col / height_col / kernel_w / kernel_h;
+    // deformable group that input channel c belongs to
+
+    const int deformable_group_index = c / channel_per_deformable_group;
+
+    int w_out = index % width_col;
+    int h_out = (index / width_col) % height_col;
+    int w_in = w_out * stride_w - pad_w;
+    int h_in = h_out * stride_h - pad_h;
+
+    const DType* data_offset_ptr = data_offset + deformable_group_index * 2 * kernel_h * kernel_w * height_col * width_col;
+    const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out;
+    const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out;
+    const DType offset_h = data_offset_ptr[data_offset_h_ptr];
+    const DType offset_w = data_offset_ptr[data_offset_w_ptr];
+    const DType cur_inv_h_data = h_in + i * dilation_h + offset_h;
+    const DType cur_inv_w_data = w_in + j * dilation_w + offset_w;
+
+    const DType cur_top_grad = data_col[index];
+    const int cur_h = (int)cur_inv_h_data;
+    const int cur_w = (int)cur_inv_w_data;
+    for (int dy = -2; dy <= 2; dy++) {
+      for (int dx = -2; dx <= 2; dx++) {
+        if (cur_h + dy >= 0 && cur_h + dy < height &&
+          cur_w + dx >= 0 && cur_w + dx < width &&
+          abs(cur_inv_h_data - (cur_h + dy)) < 1 &&
+          abs(cur_inv_w_data - (cur_w + dx)) < 1
+          ) {
+          int cur_bottom_grad_pos = (c * height + cur_h + dy) * width + cur_w + dx;
+          DType weight = get_gradient_weight(cur_inv_h_data, cur_inv_w_data, cur_h + dy, cur_w + dx, height, width);
+          atomicAdd(grad_im + cur_bottom_grad_pos, weight * cur_top_grad);
+        }
+      }
+    }
+  }
+}
+
+
+/*!\brief
+ * gpu function of deformable_col2im algorithm
+ * \param s device stream
+ * \param data_col start pointer of the column buffer holding the gradient to be propagated
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W,)
+ * \param col_shape column buffer shape
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups that the deformable convolution uses
+ * \param grad_im pointer of an image gradient (C, H, W,...) in the image batch
+ */
+template <typename DType>
+inline void deformable_col2im(mshadow::Stream<gpu>* s,
+  const DType* data_col, const DType* data_offset,
+  const TShape& im_shape, const TShape& col_shape, const TShape& kernel_shape,
+  const TShape& pad, const TShape& stride,
+  const TShape& dilation, const uint32_t deformable_group,
+  DType* grad_im, OpReqType req) {
+  index_t num_spatial_axes = kernel_shape.ndim();
+  index_t im_size = im_shape.ProdShape(1, im_shape.ndim());
+  index_t channel_per_deformable_group = im_shape[1] / deformable_group;
+  index_t num_kernels = col_shape.ProdShape(0, col_shape.ndim());
+  // num_axes should be smaller than block size
+  CHECK_LT(num_spatial_axes, mshadow::cuda::kBaseThreadNum);
+  using namespace mxnet_op;
+  switch (num_spatial_axes) {
+  case 2:
+    // Launch one thread per column-buffer element. Several column entries can map to
+    // the same input pixel, so the kernel accumulates into grad_im with atomicAdd.
+    // NOLINT_NEXT_LINE(whitespace/operators)
+    deformable_col2im_gpu_kernel<DType><<<cuda_get_num_blocks(num_kernels), mshadow::cuda::kBaseThreadNum,
+                               0, mshadow::Stream<gpu>::GetStream(s)>>>(
+        num_kernels, data_col, data_offset, im_shape[1], im_shape[2], im_shape[3],
+        kernel_shape[0], kernel_shape[1], pad[0], pad[1], stride[0], stride[1],
+        dilation[0], dilation[1], channel_per_deformable_group, col_shape[1], col_shape[2], grad_im, req);
+    MSHADOW_CUDA_POST_KERNEL_CHECK(deformable_col2im_gpu_kernel);
+    break;
+  default:
+    LOG(FATAL) << "deformable_col2im (gpu) does not support computation with "
+               << num_spatial_axes << " spatial axes";
+  }
+}
+
+
+/*!
+ * \brief deformable_col2im_coord gpu kernel.
+ * DO NOT call this directly. Use wrapper function deformable_col2im_coord() instead.
+ */
+template <typename DType>
+__global__ void deformable_col2im_coord_gpu_kernel(const int n, const DType* data_col, 
+  const DType* data_im, const DType* data_offset,
+  const int channels, const int height, const int width,
+  const int kernel_h, const int kernel_w,
+  const int pad_h, const int pad_w,
+  const int stride_h, const int stride_w,
+  const int dilation_h, const int dilation_w,
+  const int channel_per_deformable_group,
+  const int height_col, const int width_col,
+  DType* grad_offset, OpReqType req) {
+  CUDA_KERNEL_LOOP(index, n) {
+    DType val = 0;
+    int w = index % width_col;
+    int h = (index / width_col) % height_col;
+    int c = index / width_col / height_col;
+    // c indexes the offset channels: 2 * kernel_h * kernel_w per deformable group
+
+    const int deformable_group_index = c / (2 * kernel_h * kernel_w);
+    const int col_step = kernel_h * kernel_w;
+    int cnt = 0;
+    const DType* data_col_ptr = data_col + deformable_group_index * channel_per_deformable_group * width_col * height_col;
+    const DType* data_im_ptr = data_im + deformable_group_index * channel_per_deformable_group / kernel_h / kernel_w * height * width;
+    const DType* data_offset_ptr = data_offset + deformable_group_index * 2 * kernel_h * kernel_w * height_col * width_col;
+
+    const int offset_c = c - deformable_group_index * 2 * kernel_h * kernel_w;
+
+    for (int col_c = (offset_c / 2); col_c < channel_per_deformable_group; col_c += col_step) {
+      const int col_pos = ((col_c * height_col) + h) * width_col + w;
+      const int bp_dir = offset_c % 2;
+
+      int j = (col_pos / width_col / height_col) % kernel_w;
+      int i = (col_pos / width_col / height_col / kernel_w) % kernel_h;
+      int w_out = col_pos % width_col;
+      int h_out = (col_pos / width_col) % height_col;
+      int w_in = w_out * stride_w - pad_w;
+      int h_in = h_out * stride_h - pad_h;
+      const int data_offset_h_ptr = (((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out);
+      const int data_offset_w_ptr = (((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out);
+      const DType offset_h = data_offset_ptr[data_offset_h_ptr];
+      const DType offset_w = data_offset_ptr[data_offset_w_ptr];
+      DType inv_h = h_in + i * dilation_h + offset_h;
+      DType inv_w = w_in + j * dilation_w + offset_w;
+      if (inv_h < 0 || inv_w < 0 || inv_h >= height || inv_w >= width) {
+        inv_h = inv_w = -1;
+      }
+      const DType weight = get_coordinate_weight(
+        inv_h, inv_w,
+        height, width, data_im_ptr + cnt * height * width, width, bp_dir);
+      val += weight * data_col_ptr[col_pos];
+      cnt += 1;
+    }
+
+    grad_offset[index] = val;
+  }
+}
+
+/*!\brief
+ * gpu function of deformable_col2im_coord algorithm
+ * \param s device stream
+ * \param data_col start pointer of the column buffer holding the gradient to be propagated
+ * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W,)
+ * \param col_shape column buffer shape
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups that the deformable convolution uses
+ * \param grad_offset pointer of the offset (C, H, W,...) in the offset batch
+ */
+template <typename DType>
+inline void deformable_col2im_coord(mshadow::Stream<gpu>* s,
+  const DType* data_col, const DType* data_im, const DType* data_offset, const TShape& im_shape,
+  const TShape& col_shape, const TShape& kernel_shape,
+  const TShape& pad, const TShape& stride,
+  const TShape& dilation, const uint32_t deformable_group, DType* grad_offset, OpReqType req) {
+  index_t num_spatial_axes = kernel_shape.ndim();
+  index_t num_kernels = col_shape[1] * col_shape[2] * 2 * kernel_shape[0] * kernel_shape[1] * deformable_group;
+  index_t channel_per_deformable_group = col_shape[0] / deformable_group;
+  // num_axes should be smaller than block size
+  CHECK_LT(num_spatial_axes, mshadow::cuda::kBaseThreadNum);
+  using namespace mxnet_op;
+  switch (num_spatial_axes) {
+  case 2:
+    // Launch one thread per offset element. Each thread sums over the column
+    // channels of its deformable group, so no atomic operations are needed.
+    // NOLINT_NEXT_LINE(whitespace/operators)
+
+    deformable_col2im_coord_gpu_kernel<DType> << <cuda_get_num_blocks(num_kernels), mshadow::cuda::kBaseThreadNum,
+      0, mshadow::Stream<gpu>::GetStream(s) >> >(
+        num_kernels, data_col, data_im, data_offset, im_shape[1], im_shape[2], im_shape[3],
+        kernel_shape[0], kernel_shape[1], pad[0], pad[1], stride[0], stride[1],
+        dilation[0], dilation[1], channel_per_deformable_group, col_shape[1], col_shape[2], grad_offset, req);
+    MSHADOW_CUDA_POST_KERNEL_CHECK(deformable_col2im_coord_gpu_kernel);
+    break;
+  default:
+    LOG(FATAL) << "deformable_col2im_coord (gpu) does not support computation with "
+      << num_spatial_axes << " spatial axes";
+  }
+}
+
+
+}  // namespace op
+}  // namespace mxnet
+
+#endif  // MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_CUH_
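Reviewer note on `deformable_im2col_bilinear` above: because the CPU path in this patch is left unimplemented, a host-side reference can help when validating the GPU helper. The following is a minimal sketch, assuming plain C++ with no MXNet dependency. The function name `BilinearRef` is hypothetical. Like the device function, it assumes the caller has already checked that `h` and `w` are non-negative.

```cpp
#include <cassert>
#include <cmath>

// Host-side sketch mirroring deformable_im2col_bilinear; not part of the patch.
// data points at the top-left element of the window, data_width is the row
// stride, and (height, width) is the extent available from that element.
template <typename DType>
DType BilinearRef(const DType* data, int data_width,
                  int height, int width, DType h, DType w) {
  int h_low = static_cast<int>(std::floor(h));
  int w_low = static_cast<int>(std::floor(w));
  int h_high, w_high;
  // Clamp to the last row/column so the "high" neighbor never reads past the edge.
  if (h_low >= height - 1) {
    h_high = h_low = height - 1;
    h = static_cast<DType>(h_low);
  } else {
    h_high = h_low + 1;
  }
  if (w_low >= width - 1) {
    w_high = w_low = width - 1;
    w = static_cast<DType>(w_low);
  } else {
    w_high = w_low + 1;
  }
  const DType lh = h - h_low, lw = w - w_low;  // fractional parts
  const DType hh = 1 - lh, hw = 1 - lw;
  const DType v1 = data[h_low * data_width + w_low];
  const DType v2 = data[h_low * data_width + w_high];
  const DType v3 = data[h_high * data_width + w_low];
  const DType v4 = data[h_high * data_width + w_high];
  return hh * hw * v1 + hh * lw * v2 + lh * hw * v3 + lh * lw * v4;
}
```

For a 2x2 image `{1, 2, 3, 4}` stored row-major, sampling at the center `(0.5, 0.5)` averages all four pixels. Sampling at `(1, 1)` hits the clamped branch and returns the bottom-right pixel unchanged.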
diff --git a/faster_rcnn/operator_cxx/nn/deformable_im2col.h b/faster_rcnn/operator_cxx/nn/deformable_im2col.h
new file mode 100644
index 0000000..93a5551
--- /dev/null
+++ b/faster_rcnn/operator_cxx/nn/deformable_im2col.h
@@ -0,0 +1,157 @@
+/*!
+ ******************* BEGIN Caffe Copyright Notice and Disclaimer ****************
+ *
+ * COPYRIGHT
+ * 
+ * All contributions by the University of California:
+ * Copyright (c) 2014-2017 The Regents of the University of California (Regents)
+ * All rights reserved.
+ * 
+ * All other contributions:
+ * Copyright (c) 2014-2017, the respective contributors
+ * All rights reserved.
+ * 
+ * Caffe uses a shared copyright model: each contributor holds copyright over
+ * their contributions to Caffe. The project versioning records all such
+ * contribution and copyright details. If a contributor wants to further mark
+ * their specific copyright on a particular contribution, they should indicate
+ * their copyright solely in the commit message of the change when it is
+ * committed.
+ * 
+ * LICENSE
+ * 
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met: 
+ * 
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer. 
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution. 
+ * 
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ * 
+ * CONTRIBUTION AGREEMENT
+ * 
+ * By contributing to the BVLC/caffe repository through pull-request, comment,
+ * or otherwise, the contributor releases their content to the
+ * license and copyright terms herein.
+ *
+ ***************** END Caffe Copyright Notice and Disclaimer ********************
+ *
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file deformable_im2col.h
+ * \brief Function definitions of converting an image to
+ * column matrix based on kernel, padding, dilation, and offset.
+ * These functions are mainly used in deformable convolution operators.
+ * \ref: https://arxiv.org/abs/1703.06211
+ * \author Yuwen Xiong, Haozhi Qi, Jifeng Dai
+ */
+
+#ifndef MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_H_
+#define MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_H_
+
+#include <mxnet/base.h>
+#include <mxnet/operator.h>
+#include <cstring>
+#include <vector>
+#include "../../mxnet_op.h"
+
+namespace mxnet {
+namespace op {
+
+/*!\brief 
+ * cpu function of deformable_im2col algorithm
+ * \param s device stream
+ * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W)
+ * \param col_shape column buffer shape (#channels, output_im_height, output_im_width, ...)
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups used by the deformable convolution
+ * \param data_col column buffer pointer
+ */
+template <typename DType>
+inline void deformable_im2col(mshadow::Stream<cpu>* s,
+  const DType* data_im, const DType* data_offset, 
+  const TShape& im_shape, const TShape& col_shape, const TShape& kernel_shape,
+  const TShape& pad, const TShape& stride, const TShape& dilation, 
+  const uint32_t deformable_group, DType* data_col) {
+  if (2 == kernel_shape.ndim()) {
+    LOG(FATAL) << "not implemented";
+  } else {
+    LOG(FATAL) << "not implemented";
+  }
+}
+
+
+/*!\brief
+ * cpu function of deformable_col2im algorithm
+ * \param s device stream
+ * \param data_col start pointer of the column buffer to be filled
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W)
+ * \param col_shape column buffer shape
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups used by the deformable convolution
+ * \param grad_im pointer to the image gradient (C, H, W, ...) in the image batch
+ */
+template <typename DType>
+inline void deformable_col2im(mshadow::Stream<cpu>* s,
+  const DType* data_col, const DType* data_offset,
+  const TShape& im_shape, const TShape& col_shape, const TShape& kernel_shape,
+  const TShape& pad, const TShape& stride,
+  const TShape& dilation, const uint32_t deformable_group,
+  DType* grad_im, OpReqType req) {
+  index_t num_spatial_axes = kernel_shape.ndim();
+  LOG(FATAL) << "not implemented";
+}
+
+
+/*!\brief
+ * cpu function of deformable_col2im_coord algorithm
+ * \param s device stream
+ * \param data_col start pointer of the column buffer to be filled
+ * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W)
+ * \param col_shape column buffer shape
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups used by the deformable convolution
+ * \param grad_offset pointer of the offset (C, H, W,...) in the offset batch
+ */
+
+template <typename DType>
+inline void deformable_col2im_coord(mshadow::Stream<cpu>* s,
+  const DType* data_col, const DType* data_im, const DType* data_offset, const TShape& im_shape,
+  const TShape& col_shape, const TShape& kernel_shape,
+  const TShape& pad, const TShape& stride,
+  const TShape& dilation, const uint32_t deformable_group, DType* grad_offset, OpReqType req) {
+  LOG(FATAL) << "not implemented";
+}
+
+}  // namespace op
+}  // namespace mxnet
+#ifdef __CUDACC__
+#include "./deformable_im2col.cuh"
+#endif
+#endif  // MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_H_
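The CPU entry points in this header are stubs that abort with "not implemented"; only the CUDA path does real work. As a rough illustration of what a deformable im2col has to compute, the sketch below gathers one kernel window whose sample positions are shifted by fractional offsets and read with bilinear interpolation, as described in the paper referenced in the file header. It is plain Python rather than the operator's C++, and it is not the authors' kernel: `bilinear_sample` and `deformable_patch` are hypothetical names, out-of-range pixels are assumed to read as zero, and stride, padding, dilation and deformable groups are left out.

```python
import math


def bilinear_sample(image, y, x):
    """Read a 2D image (list of rows) at a fractional position (y, x)."""
    height, width = len(image), len(image[0])
    # Positions entirely outside the image contribute nothing.
    if y <= -1 or y >= height or x <= -1 or x >= width:
        return 0.0
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy1, wx1 = y - y0, x - x0          # weights of the lower/right neighbours
    wy0, wx0 = 1.0 - wy1, 1.0 - wx1    # weights of the upper/left neighbours

    def pixel(r, c):
        # Assumption: neighbours outside the image are treated as zero.
        if 0 <= r < height and 0 <= c < width:
            return image[r][c]
        return 0.0

    return (wy0 * wx0 * pixel(y0, x0) + wy0 * wx1 * pixel(y0, x1) +
            wy1 * wx0 * pixel(y1, x0) + wy1 * wx1 * pixel(y1, x1))


def deformable_patch(image, top, left, kernel, offsets):
    """Gather one kernel x kernel window; offsets[i * kernel + j] = (dy, dx)."""
    column = []
    for i in range(kernel):
        for j in range(kernel):
            dy, dx = offsets[i * kernel + j]
            column.append(bilinear_sample(image, top + i + dy, left + j + dx))
    return column
```

With all offsets set to `(0, 0)` the window reduces to the regular im2col patch, which is a convenient sanity check for any real implementation.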
diff --git a/faster_rcnn/operator_cxx/psroi_pooling-inl.h b/faster_rcnn/operator_cxx/psroi_pooling-inl.h
new file mode 100644
index 0000000..956861c
--- /dev/null
+++ b/faster_rcnn/operator_cxx/psroi_pooling-inl.h
@@ -0,0 +1,235 @@
+/*!
+ * Copyright (c) 2017 by Contributors
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file psroi_pooling-inl.h
+ * \brief psroi pooling operator and symbol
+ * \author Yi Li, Tairui Chen, Guodong Zhang, Jifeng Dai
+*/
+#ifndef MXNET_OPERATOR_PSROI_POOLING_INL_H_
+#define MXNET_OPERATOR_PSROI_POOLING_INL_H_
+
+#include <dmlc/logging.h>
+#include <dmlc/parameter.h>
+#include <mxnet/operator.h>
+#include <map>
+#include <vector>
+#include <string>
+#include <utility>
+#include "../mshadow_op.h"
+#include "../operator_common.h"
+
+
+namespace mxnet {
+namespace op {
+
+// Declare enumeration of input order to make code more intuitive.
+// These enums are only visible within this header
+namespace psroipool {
+enum PSROIPoolingOpInputs {kData, kBox};
+enum PSROIPoolingOpOutputs {kOut, kMappingChannel};
+}  // psroipool
+
+struct PSROIPoolingParam : public dmlc::Parameter<PSROIPoolingParam> {
+  // TShape pooled_size;
+  float spatial_scale;
+  int output_dim;
+  int pooled_size;
+  int group_size;
+  DMLC_DECLARE_PARAMETER(PSROIPoolingParam) {
+    DMLC_DECLARE_FIELD(spatial_scale).set_range(0.0, 1.0)
+    .describe("Ratio of input feature map height (or w) to raw image height (or w). "
+    "Equals the reciprocal of total stride in convolutional layers");
+    DMLC_DECLARE_FIELD(output_dim).describe("Number of output channels per ROI");
+    DMLC_DECLARE_FIELD(pooled_size).describe("Height and width of the pooled output");
+    DMLC_DECLARE_FIELD(group_size).set_default(0).describe("Group size; 0 means pooled_size");
+  }
+};
+
+template<typename xpu, typename DType>
+class PSROIPoolingOp : public Operator {
+ public:
+  explicit PSROIPoolingOp(PSROIPoolingParam p) {
+    this->param_ = p;
+  }
+
+  virtual void Forward(const OpContext &ctx,
+                       const std::vector<TBlob> &in_data,
+                       const std::vector<OpReqType> &req,
+                       const std::vector<TBlob> &out_data,
+                       const std::vector<TBlob> &aux_args) {
+    using namespace mshadow;
+    size_t expected = 2;
+    CHECK_EQ(in_data.size(), expected);
+    CHECK_EQ(out_data.size(), expected);
+    CHECK_EQ(out_data[psroipool::kOut].shape_[0], in_data[psroipool::kBox].shape_[0]);
+    CHECK_EQ(out_data[psroipool::kMappingChannel].shape_[0], in_data[psroipool::kBox].shape_[0]);
+    Stream<xpu> *s = ctx.get_stream<xpu>();
+
+    Tensor<xpu, 4, DType> data = in_data[psroipool::kData].get<xpu, 4, DType>(s);
+    Tensor<xpu, 2, DType> bbox = in_data[psroipool::kBox].get<xpu, 2, DType>(s);
+    Tensor<xpu, 4, DType> out = out_data[psroipool::kOut].get<xpu, 4, DType>(s);
+    Tensor<xpu, 4, DType> mapping_channel = out_data[psroipool::kMappingChannel].get<xpu, 4, DType>(s);
+    CHECK_EQ(data.CheckContiguous(), true);
+    CHECK_EQ(bbox.CheckContiguous(), true);
+    CHECK_EQ(out.CheckContiguous(), true);
+    CHECK_EQ(mapping_channel.CheckContiguous(), true);
+    out = -FLT_MAX;
+    mapping_channel = -1.0f;
+    PSROIPoolForward(out, data, bbox, mapping_channel, param_.spatial_scale, param_.output_dim, param_.group_size);
+  }
+
+  virtual void Backward(const OpContext &ctx,
+                        const std::vector<TBlob> &out_grad,
+                        const std::vector<TBlob> &in_data,
+                        const std::vector<TBlob> &out_data,
+                        const std::vector<OpReqType> &req,
+                        const std::vector<TBlob> &in_grad,
+                        const std::vector<TBlob> &aux_args) {
+    using namespace mshadow;
+    size_t expected = 2;
+    CHECK_EQ(in_data.size(), expected);
+    CHECK_EQ(out_data.size(), expected);
+    CHECK_EQ(out_grad[psroipool::kOut].shape_[0], in_data[psroipool::kBox].shape_[0]);
+    CHECK_EQ(out_data[psroipool::kMappingChannel].shape_[0], in_data[psroipool::kBox].shape_[0]);
+    CHECK_NE(req[psroipool::kData], kWriteInplace) <<
+      "PSROIPooling: Backward doesn't support kWriteInplace.";
+    CHECK_NE(req[psroipool::kBox], kWriteInplace) <<
+      "PSROIPooling: Backward doesn't support kWriteInplace.";
+    Stream<xpu> *s = ctx.get_stream<xpu>();
+
+    Tensor<xpu, 4, DType> grad_out = out_grad[psroipool::kOut].get<xpu, 4, DType>(s);
+    Tensor<xpu, 2, DType> bbox = in_data[psroipool::kBox].get<xpu, 2, DType>(s);
+    Tensor<xpu, 4, DType> mapping_channel = out_data[psroipool::kMappingChannel].get<xpu, 4, DType>(s);
+    Tensor<xpu, 4, DType> grad_in = in_grad[psroipool::kData].get<xpu, 4, DType>(s);
+    Tensor<xpu, 2, DType> grad_roi = in_grad[psroipool::kBox].get<xpu, 2, DType>(s);
+
+    CHECK_EQ(grad_out.CheckContiguous(), true);
+    CHECK_EQ(bbox.CheckContiguous(), true);
+    CHECK_EQ(mapping_channel.CheckContiguous(), true);
+    CHECK_EQ(grad_in.CheckContiguous(), true);
+
+    if (kAddTo == req[psroipool::kData] || kWriteTo == req[psroipool::kData]) {
+      if (kWriteTo == req[psroipool::kData]) {
+        grad_in = 0.0f;
+      }
+      PSROIPoolBackwardAcc(grad_in, grad_out, bbox, mapping_channel, param_.spatial_scale, param_.output_dim);
+    }
+    if (kWriteTo == req[psroipool::kBox]) {
+      grad_roi = 0.0f;
+    }
+
+  }
+
+ private:
+  PSROIPoolingParam param_;
+};  // class PSROIPoolingOp
+
+// Declare factory function, used for dispatch specialization
+template<typename xpu>
+Operator* CreateOp(PSROIPoolingParam param, int dtype);
+
+#if DMLC_USE_CXX11
+class PSROIPoolingProp : public OperatorProperty {
+ public:
+  std::vector<std::string> ListArguments() const override {
+    return {"data", "rois"};
+  }
+
+  std::vector<std::string> ListOutputs() const override {
+    return {"output", "maxidx"};
+  }
+
+  int NumOutputs() const override {
+    return 2;
+  }
+
+  int NumVisibleOutputs() const override {
+    return 1;
+  }
+
+  void Init(const std::vector<std::pair<std::string, std::string> >& kwargs) override {
+    param_.Init(kwargs);
+    if (param_.group_size == 0) {
+      param_.group_size = param_.pooled_size;
+    }
+  }
+
+  std::map<std::string, std::string> GetParams() const override {
+    return param_.__DICT__();
+  }
+
+  bool InferShape(std::vector<TShape> *in_shape,
+                  std::vector<TShape> *out_shape,
+                  std::vector<TShape> *aux_shape) const override {
+    using namespace mshadow;
+    CHECK_EQ(in_shape->size(), 2) << "Input:[data, rois]";
+
+    // data: [batch_size, c, h, w]
+    TShape dshape = in_shape->at(psroipool::kData);
+    CHECK_EQ(dshape.ndim(), 4) << "data should be a 4D tensor";
+
+    // bbox: [num_rois, 5]
+    TShape bshape = in_shape->at(psroipool::kBox);
+    CHECK_EQ(bshape.ndim(), 2) << "bbox should be a 2D tensor of shape [num_rois, 5]";
+    CHECK_EQ(bshape[1], 5) << "bbox should be a 2D tensor of shape [num_rois, 5]";
+
+    // out: [num_rois, output_dim, pooled_size, pooled_size]
+    // mapping_channel: [num_rois, output_dim, pooled_size, pooled_size]
+    out_shape->clear();
+    out_shape->push_back(
+         Shape4(bshape[0], param_.output_dim, param_.pooled_size, param_.pooled_size));
+    out_shape->push_back(
+         Shape4(bshape[0], param_.output_dim, param_.pooled_size, param_.pooled_size));
+    return true;
+  }
+
+  bool InferType(std::vector<int> *in_type,
+                 std::vector<int> *out_type,
+                 std::vector<int> *aux_type) const override {
+    CHECK_EQ(in_type->size(), 2);
+    int dtype = (*in_type)[0];
+    CHECK_EQ(dtype, (*in_type)[1]);
+    CHECK_NE(dtype, -1) << "Input must have specified type";
+
+    out_type->clear();
+    out_type->push_back(dtype);
+    out_type->push_back(dtype);
+    return true;
+  }
+
+  OperatorProperty* Copy() const override {
+    PSROIPoolingProp* psroi_pooling_sym = new PSROIPoolingProp();
+    psroi_pooling_sym->param_ = this->param_;
+    return psroi_pooling_sym;
+  }
+
+  std::string TypeString() const override {
+    return "_contrib_PSROIPooling";
+  }
+
+  // declare dependency and inplace optimization options
+  std::vector<int> DeclareBackwardDependency(
+    const std::vector<int> &out_grad,
+    const std::vector<int> &in_data,
+    const std::vector<int> &out_data) const override {
+    return {out_grad[psroipool::kOut], in_data[psroipool::kBox], out_data[psroipool::kMappingChannel]};
+  }
+
+
+  Operator* CreateOperator(Context ctx) const override {
+    LOG(FATAL) << "Not Implemented.";
+    return NULL;
+  }
+
+  Operator* CreateOperatorEx(Context ctx, std::vector<TShape> *in_shape,
+                             std::vector<int> *in_type) const override;
+
+
+ private:
+  PSROIPoolingParam param_;
+};  // class PSROIPoolingProp
+#endif
+}  // namespace op
+}  // namespace mxnet
+#endif  // MXNET_OPERATOR_PSROI_POOLING_INL_H_
\ No newline at end of file
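The shape rules enforced by `PSROIPoolingProp::InferShape` and the `group_size` default applied in `Init` can be restated compactly. The following is a hedged Python sketch of those rules only; `infer_psroi_shapes` is a hypothetical helper, not part of MXNet. The channel-count check is an extra assumption: the header does not verify it, but the channel index used by the GPU kernel, `(ctop*group_size + gh)*group_size + gw`, only stays in range when the input has `output_dim * group_size * group_size` channels.

```python
def infer_psroi_shapes(data_shape, rois_shape, output_dim, pooled_size, group_size=0):
    """Return (out_shape, mapping_channel_shape) for a PSROIPooling call."""
    if len(data_shape) != 4:
        raise ValueError("data should be a 4D tensor")
    if len(rois_shape) != 2 or rois_shape[1] != 5:
        raise ValueError("rois should be a 2D tensor of shape [num_rois, 5]")
    # Mirrors Init(): group_size == 0 falls back to pooled_size.
    if group_size == 0:
        group_size = pooled_size
    # Assumption, not checked by the operator itself: one input channel
    # per (output channel, group row, group column) triple.
    expected_channels = output_dim * group_size * group_size
    if data_shape[1] != expected_channels:
        raise ValueError("data should have %d channels" % expected_channels)
    out_shape = (rois_shape[0], output_dim, pooled_size, pooled_size)
    # Both outputs (pooled values and the channel map) share one shape.
    return out_shape, out_shape
```

For example, 300 ROIs pooled to 7x7 with 21 output channels need an input with 7 * 7 * 21 = 1029 channels and yield two outputs of shape `(300, 21, 7, 7)`.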
diff --git a/faster_rcnn/operator_cxx/psroi_pooling.cc b/faster_rcnn/operator_cxx/psroi_pooling.cc
new file mode 100644
index 0000000..4edaf4f
--- /dev/null
+++ b/faster_rcnn/operator_cxx/psroi_pooling.cc
@@ -0,0 +1,81 @@
+/*!
+ * Copyright (c) 2017 by Contributors
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file psroi_pooling.cc
+ * \brief psroi pooling operator
+ * \author Yi Li, Tairui Chen, Guodong Zhang, Jifeng Dai
+*/
+#include "./psroi_pooling-inl.h"
+#include <mshadow/base.h>
+#include <mshadow/tensor.h>
+#include <mshadow/packet-inl.h>
+#include <mshadow/dot_engine-inl.h>
+#include <cassert>
+
+using std::max;
+using std::min;
+using std::floor;
+using std::ceil;
+
+namespace mshadow {
+template<typename DType>
+inline void PSROIPoolForward(const Tensor<cpu, 4, DType> &out,
+                           const Tensor<cpu, 4, DType> &data,
+                           const Tensor<cpu, 2, DType> &bbox,
+                           const Tensor<cpu, 4, DType> &mapping_channel,
+                           const float spatial_scale_,
+                           const int output_dim_, 
+                           const int group_size_) {
+  // Not implemented on CPU: this call is a silent no-op.
+  return;
+}
+
+template<typename DType>
+inline void PSROIPoolBackwardAcc(const Tensor<cpu, 4, DType> &in_grad,
+                            const Tensor<cpu, 4, DType> &out_grad,
+                            const Tensor<cpu, 2, DType> &bbox,
+                            const Tensor<cpu, 4, DType> &mapping_channel,
+                            const float spatial_scale_,
+                            const int output_dim_) {
+  // Not implemented on CPU: this call is a silent no-op.
+  return;
+}
+}  // namespace mshadow
+
+namespace mxnet {
+namespace op {
+
+template<>
+Operator *CreateOp<cpu>(PSROIPoolingParam param, int dtype) {
+  Operator* op = NULL;
+  MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
+    op = new PSROIPoolingOp<cpu, DType>(param);
+  });
+  return op;
+}
+
+Operator *PSROIPoolingProp::CreateOperatorEx(Context ctx, std::vector<TShape> *in_shape,
+                                           std::vector<int> *in_type) const {
+  std::vector<TShape> out_shape, aux_shape;
+  std::vector<int> out_type, aux_type;
+  CHECK(InferType(in_type, &out_type, &aux_type));
+  CHECK(InferShape(in_shape, &out_shape, &aux_shape));
+  DO_BIND_DISPATCH(CreateOp, param_, in_type->at(0));
+}
+
+DMLC_REGISTER_PARAMETER(PSROIPoolingParam);
+
+MXNET_REGISTER_OP_PROPERTY(_contrib_PSROIPooling, PSROIPoolingProp)
+.describe("Performs position-sensitive region-of-interest pooling on inputs. Resize bounding "
+"box coordinates by spatial_scale and crop input feature maps accordingly. Each bin of the "
+"cropped region is average-pooled from its own channel group to a fixed size output indicated "
+"by pooled_size. batch_size will change to the number of bounding boxes after PSROIPooling")
+.add_argument("data", "Symbol", "Input data to the pooling operator, a 4D Feature maps")
+.add_argument("rois", "Symbol", "Bounding box coordinates, a 2D array of "
+"[[batch_index, x1, y1, x2, y2]]. (x1, y1) and (x2, y2) are top left and down right corners "
+"of designated region of interest. batch_index indicates the index of corresponding image "
+"in the input data")
+.add_arguments(PSROIPoolingParam::__FIELDS__());
+}  // namespace op
+}  // namespace mxnet
\ No newline at end of file
diff --git a/faster_rcnn/operator_cxx/psroi_pooling.cu b/faster_rcnn/operator_cxx/psroi_pooling.cu
new file mode 100644
index 0000000..43c57ee
--- /dev/null
+++ b/faster_rcnn/operator_cxx/psroi_pooling.cu
@@ -0,0 +1,263 @@
+/*!
+ * Copyright (c) 2017 by Contributors
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
+ * \file psroi_pooling.cu
+ * \brief psroi pooling operator
+ * \author Yi Li, Tairui Chen, Guodong Zhang, Jifeng Dai
+*/
+#include "./psroi_pooling-inl.h"
+#include <mshadow/tensor.h>
+#include <mshadow/cuda/reduce.cuh>
+#include <algorithm>
+#include <vector>
+#include "../../common/cuda_utils.h"
+#include "../mxnet_op.h"
+
+#define PSROIPOOLING_CUDA_CHECK(condition) \
+  /* Code block avoids redefinition of cudaError_t error */ \
+  do { \
+    cudaError_t error = condition; \
+    CHECK_EQ(error, cudaSuccess) << " " << cudaGetErrorString(error); \
+  } while (0)
+#define CUDA_KERNEL_LOOP(i, n) \
+for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
+      i < (n); \
+      i += blockDim.x * gridDim.x)
+
+namespace mshadow {
+namespace cuda {
+
+template <typename DType>
+__global__ void PSROIPoolForwardKernel(
+  const int count,
+  const DType* bottom_data,
+  const DType spatial_scale,
+  const int channels,
+  const int height, const int width,
+  const int pooled_height, const int pooled_width,
+  const DType* bottom_rois,
+  const int output_dim,
+  const int group_size,
+  DType* top_data,
+  DType* mapping_channel) {
+  CUDA_KERNEL_LOOP(index, count) {
+    // The output is in order (n, ctop, ph, pw)
+    int pw = index % pooled_width;
+    int ph = (index / pooled_width) % pooled_height;
+    int ctop = (index / pooled_width / pooled_height) % output_dim;
+    int n = index / pooled_width / pooled_height / output_dim;
+
+    // [start, end) interval for spatial sampling
+    const DType* offset_bottom_rois = bottom_rois + n * 5;
+    int roi_batch_ind = offset_bottom_rois[0];
+    DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale;
+    DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale;
+    DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale;
+    DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale;
+
+    // Clamp degenerate ROIs to a minimum extent of 0.1 to avoid division by zero
+    DType roi_width = max(roi_end_w - roi_start_w, 0.1);
+    DType roi_height = max(roi_end_h - roi_start_h, 0.1);
+
+    // Compute w and h at bottom
+    DType bin_size_h = roi_height / static_cast<DType>(pooled_height);
+    DType bin_size_w = roi_width / static_cast<DType>(pooled_width);
+
+    int hstart = floor(static_cast<DType>(ph) * bin_size_h
+                        + roi_start_h);
+    int wstart = floor(static_cast<DType>(pw)* bin_size_w
+                        + roi_start_w);
+    int hend = ceil(static_cast<DType>(ph + 1) * bin_size_h
+                      + roi_start_h);
+    int wend = ceil(static_cast<DType>(pw + 1) * bin_size_w
+                      + roi_start_w);
+    // Add roi offsets and clip to input boundaries
+    hstart = min(max(hstart, 0), height);
+    hend = min(max(hend, 0), height);
+    wstart = min(max(wstart, 0), width);
+    wend = min(max(wend, 0), width);
+    bool is_empty = (hend <= hstart) || (wend <= wstart);
+
+    int gw = floor(static_cast<DType>(pw)* group_size / pooled_width);
+    int gh = floor(static_cast<DType>(ph)* group_size / pooled_height);
+    gw = min(max(gw, 0), group_size - 1);
+    gh = min(max(gh, 0), group_size - 1);
+    int c = (ctop*group_size + gh)*group_size + gw;
+
+    const DType* offset_bottom_data = bottom_data + (roi_batch_ind * channels + c) * height * width;
+    DType out_sum = 0;
+    for (int h = hstart; h < hend; ++h){
+      for (int w = wstart; w < wend; ++w){
+        int bottom_index = h*width + w;
+        out_sum += offset_bottom_data[bottom_index];
+      }
+    }
+
+    DType bin_area = (hend - hstart)*(wend - wstart);
+    top_data[index] = is_empty? (DType)0. : out_sum/bin_area;
+    mapping_channel[index] = c;
+  }
+}
+
+template<typename DType>
+inline void PSROIPoolForward(const Tensor<gpu, 4, DType> &out,
+                           const Tensor<gpu, 4, DType> &data,
+                           const Tensor<gpu, 2, DType> &bbox,
+                           const Tensor<gpu, 4, DType> &mapping_channel,
+                           const float spatial_scale,
+                           const int output_dim_,
+                           const int group_size_) {
+  // LOG(INFO) << "PSROIPoolForward";
+  const DType *bottom_data = data.dptr_;
+  const DType *bottom_rois = bbox.dptr_;
+  DType *top_data = out.dptr_;
+  DType *mapping_channel_ptr = mapping_channel.dptr_;
+  const int count = out.shape_.Size();
+  const int channels = data.size(1);
+  const int height = data.size(2);
+  const int width = data.size(3);
+  const int pooled_height = out.size(2);
+  const int pooled_width = out.size(3);
+  cudaStream_t stream = Stream<gpu>::GetStream(out.stream_);
+  PSROIPoolForwardKernel<DType> << <mxnet::op::mxnet_op::cuda_get_num_blocks(count),
+    kBaseThreadNum, 0, stream >> >(
+      count, bottom_data, spatial_scale, channels, height, width,
+      pooled_height, pooled_width, bottom_rois, output_dim_, group_size_, top_data, mapping_channel_ptr);
+  PSROIPOOLING_CUDA_CHECK(cudaPeekAtLastError());
+}
+
+
+template <typename DType>
+__global__ void PSROIPoolBackwardAccKernel(
+  const int count,
+  const DType* top_diff,
+  const DType* mapping_channel,
+  const int num_rois,
+  const DType spatial_scale,
+  const int channels,
+  const int height, const int width,
+  const int pooled_height, const int pooled_width,
+  const int output_dim,
+  DType* bottom_diff,
+  const DType* bottom_rois) {
+  CUDA_KERNEL_LOOP(index, count) {
+    // The output is in order (n, ctop, ph, pw)
+    int pw = index % pooled_width;
+    int ph = (index / pooled_width) % pooled_height;
+    int n = index / pooled_width / pooled_height / output_dim;
+
+    // [start, end) interval for spatial sampling
+    const DType* offset_bottom_rois = bottom_rois + n * 5;
+    int roi_batch_ind = offset_bottom_rois[0];
+    DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale;
+    DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale;
+    DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale;
+    DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale;
+
+    // Clamp degenerate ROIs to a minimum extent of 0.1 to avoid division by zero
+    DType roi_width = max(roi_end_w - roi_start_w, 0.1);
+    DType roi_height = max(roi_end_h - roi_start_h, 0.1);
+
+    // Compute w and h at bottom
+    DType bin_size_h = roi_height / static_cast<DType>(pooled_height);
+    DType bin_size_w = roi_width / static_cast<DType>(pooled_width);
+
+    int hstart = floor(static_cast<DType>(ph)* bin_size_h
+      + roi_start_h);
+    int wstart = floor(static_cast<DType>(pw)* bin_size_w
+      + roi_start_w);
+    int hend = ceil(static_cast<DType>(ph + 1) * bin_size_h
+      + roi_start_h);
+    int wend = ceil(static_cast<DType>(pw + 1) * bin_size_w
+      + roi_start_w);
+    // Add roi offsets and clip to input boundaries
+    hstart = min(max(hstart, 0), height);
+    hend = min(max(hend, 0), height);
+    wstart = min(max(wstart, 0), width);
+    wend = min(max(wend, 0), width);
+    bool is_empty = (hend <= hstart) || (wend <= wstart);
+
+    // Compute c at bottom
+    int c = mapping_channel[index];
+    DType* offset_bottom_diff = bottom_diff + (roi_batch_ind * channels + c) * height * width;
+    DType bin_area = (hend - hstart)*(wend - wstart);
+    DType diff_val = is_empty ? (DType)0. : top_diff[index] / bin_area;
+    for (int h = hstart; h < hend; ++h){
+      for (int w = wstart; w < wend; ++w){
+        int bottom_index = h*width + w;
+        // mxnet_gpu_atomic_add(diff_val, offset_bottom_diff + bottom_index);
+        atomicAdd(offset_bottom_diff + bottom_index, diff_val);
+      }
+    }
+  }
+}
+
+
+template<typename DType>
+inline void PSROIPoolBackwardAcc(const Tensor<gpu, 4, DType> &in_grad,
+                            const Tensor<gpu, 4, DType> &out_grad,
+                            const Tensor<gpu, 2, DType> &bbox,
+                            const Tensor<gpu, 4, DType> &mapping_channel,
+                            const float spatial_scale,
+                            const int output_dim_) {
+  // LOG(INFO) << "PSROIPoolBackward";
+  const DType *top_diff = out_grad.dptr_;
+  const DType *bottom_rois = bbox.dptr_;
+  DType *bottom_diff = in_grad.dptr_;
+  DType *mapping_channel_ptr = mapping_channel.dptr_;
+  const int count = out_grad.shape_.Size();
+  const int num_rois = bbox.size(0);
+  const int channels = in_grad.size(1);
+  const int height = in_grad.size(2);
+  const int width = in_grad.size(3);
+  const int pooled_height = out_grad.size(2);
+  const int pooled_width = out_grad.size(3);
+  cudaStream_t stream = Stream<gpu>::GetStream(in_grad.stream_);
+  PSROIPoolBackwardAccKernel<DType> << <mxnet::op::mxnet_op::cuda_get_num_blocks(count),
+    kBaseThreadNum, 0, stream >> >(
+      count, top_diff, mapping_channel_ptr, num_rois, spatial_scale, channels, height, width,
+      pooled_height, pooled_width, output_dim_, bottom_diff, bottom_rois);
+  PSROIPOOLING_CUDA_CHECK(cudaPeekAtLastError());
+}
+
+}  // namespace cuda
+
+template<typename DType>
+inline void PSROIPoolForward(const Tensor<gpu, 4, DType> &out,
+                           const Tensor<gpu, 4, DType> &data,
+                           const Tensor<gpu, 2, DType> &bbox,
+                           const Tensor<gpu, 4, DType> &mapping_channel,
+                           const float spatial_scale,
+                           const int output_dim_,
+                           const int group_size_) {
+  cuda::PSROIPoolForward(out, data, bbox, mapping_channel, spatial_scale, output_dim_, group_size_);
+}
+
+template<typename DType>
+inline void PSROIPoolBackwardAcc(const Tensor<gpu, 4, DType> &in_grad,
+                            const Tensor<gpu, 4, DType> &out_grad,
+                            const Tensor<gpu, 2, DType> &bbox,
+                            const Tensor<gpu, 4, DType> &mapping_channel,
+                            const float spatial_scale,
+                            const int output_dim_) {
+  cuda::PSROIPoolBackwardAcc(in_grad, out_grad, bbox, mapping_channel, spatial_scale, output_dim_);
+}
+
+}  // namespace mshadow
+
+
+namespace mxnet {
+namespace op {
+
+template<>
+Operator* CreateOp<gpu>(PSROIPoolingParam param, int dtype) {
+  Operator* op = NULL;
+  MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
+    op = new PSROIPoolingOp<gpu, DType>(param);
+  });
+  return op;
+}
+
+}  // namespace op
+}  // namespace mxnet
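Since the CPU versions of `PSROIPoolForward` and `PSROIPoolBackwardAcc` are empty, a small reference can help when checking the GPU kernel by hand. The sketch below follows the arithmetic of `PSROIPoolForwardKernel` for a single output element in plain Python. `psroi_pool_bin` is a hypothetical name, `data` is assumed to be the `[channels][height][width]` feature map of the image the ROI belongs to, and rounding uses `floor(v + 0.5)`, which should agree with CUDA `round` for non-negative coordinates.

```python
import math


def psroi_pool_bin(data, roi, spatial_scale, pooled_size, group_size, ctop, ph, pw):
    """Average-pool one (ctop, ph, pw) bin; returns (value, source_channel)."""
    height, width = len(data[0]), len(data[0][0])
    x1, y1, x2, y2 = [math.floor(v + 0.5) for v in roi]
    # [start, end) interval of the ROI on the feature map.
    roi_start_w, roi_start_h = x1 * spatial_scale, y1 * spatial_scale
    roi_end_w, roi_end_h = (x2 + 1) * spatial_scale, (y2 + 1) * spatial_scale
    # Degenerate ROIs are clamped to an extent of 0.1, as in the kernel.
    bin_h = max(roi_end_h - roi_start_h, 0.1) / pooled_size
    bin_w = max(roi_end_w - roi_start_w, 0.1) / pooled_size

    def clip(value, upper):
        return min(max(int(value), 0), upper)

    hstart = clip(math.floor(ph * bin_h + roi_start_h), height)
    hend = clip(math.ceil((ph + 1) * bin_h + roi_start_h), height)
    wstart = clip(math.floor(pw * bin_w + roi_start_w), width)
    wend = clip(math.ceil((pw + 1) * bin_w + roi_start_w), width)

    # Position-sensitive part: each bin reads from its own input channel.
    gh = min(max(ph * group_size // pooled_size, 0), group_size - 1)
    gw = min(max(pw * group_size // pooled_size, 0), group_size - 1)
    channel = (ctop * group_size + gh) * group_size + gw

    if hend <= hstart or wend <= wstart:
        return 0.0, channel
    total = sum(data[channel][h][w]
                for h in range(hstart, hend) for w in range(wstart, wend))
    return total / float((hend - hstart) * (wend - wstart)), channel
```

The returned channel corresponds to what the kernel stores in `mapping_channel`, which the backward pass later reads to route each gradient back to the same input channel.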
diff --git a/faster_rcnn/operator_py/__init__.py b/faster_rcnn/operator_py/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/faster_rcnn/operator_py/box_annotator_ohem.py b/faster_rcnn/operator_py/box_annotator_ohem.py
new file mode 100644
index 0000000..f11b7b5
--- /dev/null
+++ b/faster_rcnn/operator_py/box_annotator_ohem.py
@@ -0,0 +1,86 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Yuwen Xiong
+# --------------------------------------------------------
+
+"""
+Box Annotator OHEM Operator keeps the roi_per_img RoIs with the highest loss and masks out the rest (label -1, bbox weight 0).
+"""
+
+import mxnet as mx
+import numpy as np
+from distutils.util import strtobool
+
+
+
+
+class BoxAnnotatorOHEMOperator(mx.operator.CustomOp):
+    def __init__(self, num_classes, num_reg_classes, roi_per_img):
+        super(BoxAnnotatorOHEMOperator, self).__init__()
+        self._num_classes = num_classes
+        self._num_reg_classes = num_reg_classes
+        self._roi_per_img = roi_per_img
+
+    def forward(self, is_train, req, in_data, out_data, aux):
+
+        cls_score    = in_data[0]
+        bbox_pred    = in_data[1]
+        labels       = in_data[2].asnumpy()
+        bbox_targets = in_data[3]
+        bbox_weights = in_data[4]
+
+        per_roi_loss_cls = mx.nd.SoftmaxActivation(cls_score) + 1e-14
+        per_roi_loss_cls = per_roi_loss_cls.asnumpy()
+        per_roi_loss_cls = per_roi_loss_cls[np.arange(per_roi_loss_cls.shape[0], dtype='int'), labels.astype('int')]
+        per_roi_loss_cls = -1 * np.log(per_roi_loss_cls)
+        per_roi_loss_cls = np.reshape(per_roi_loss_cls, newshape=(-1,))
+
+        per_roi_loss_bbox = bbox_weights * mx.nd.smooth_l1((bbox_pred - bbox_targets), scalar=1.0)
+        per_roi_loss_bbox = mx.nd.sum(per_roi_loss_bbox, axis=1).asnumpy()
+
+        top_k_per_roi_loss = np.argsort(per_roi_loss_cls + per_roi_loss_bbox)
+        labels_ohem = labels
+        labels_ohem[top_k_per_roi_loss[::-1][self._roi_per_img:]] = -1
+        bbox_weights_ohem = bbox_weights.asnumpy()
+        bbox_weights_ohem[top_k_per_roi_loss[::-1][self._roi_per_img:]] = 0
+
+        labels_ohem = mx.nd.array(labels_ohem)
+        bbox_weights_ohem = mx.nd.array(bbox_weights_ohem)
+
+        for ind, val in enumerate([labels_ohem, bbox_weights_ohem]):
+            self.assign(out_data[ind], req[ind], val)
+
+
+    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
+        for i in range(len(in_grad)):
+            self.assign(in_grad[i], req[i], 0)
+
+
+@mx.operator.register('BoxAnnotatorOHEM')
+class BoxAnnotatorOHEMProp(mx.operator.CustomOpProp):
+    def __init__(self, num_classes, num_reg_classes, roi_per_img):
+        super(BoxAnnotatorOHEMProp, self).__init__(need_top_grad=False)
+        self._num_classes = int(num_classes)
+        self._num_reg_classes = int(num_reg_classes)
+        self._roi_per_img = int(roi_per_img)
+
+    def list_arguments(self):
+        return ['cls_score', 'bbox_pred', 'labels', 'bbox_targets', 'bbox_weights']
+
+    def list_outputs(self):
+        return ['labels_ohem', 'bbox_weights_ohem']
+
+    def infer_shape(self, in_shape):
+        labels_shape = in_shape[2]
+        bbox_weights_shape = in_shape[4]
+
+        return in_shape, \
+               [labels_shape, bbox_weights_shape]
+
+    def create_operator(self, ctx, shapes, dtypes):
+        return BoxAnnotatorOHEMOperator(self._num_classes, self._num_reg_classes, self._roi_per_img)
+
+    def declare_backward_dependency(self, out_grad, in_data, out_data):
+        return []
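The selection step in `BoxAnnotatorOHEMOperator.forward` can be isolated from MXNet and NumPy to make the indexing easier to follow. The sketch below is a plain-Python restatement under that assumption; `ohem_select` is a hypothetical helper, and `losses` stands for the per-RoI sum of classification and bbox regression loss that the operator computes before sorting. Tie-breaking between equal losses may differ from the `np.argsort(...)[::-1]` order used in the operator.

```python
def ohem_select(losses, labels, bbox_weights, rois_per_image):
    """Keep the rois_per_image hardest RoIs and mask out all others."""
    # Indices sorted from highest to lowest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    labels_ohem = list(labels)
    weights_ohem = [list(row) for row in bbox_weights]
    for i in order[rois_per_image:]:
        labels_ohem[i] = -1                                 # dropped from the classification loss
        weights_ohem[i] = [0.0] * len(weights_ohem[i])      # dropped from the bbox loss
    return labels_ohem, weights_ohem
```

Unlike the operator, which writes into the NumPy copy of `labels` in place, this version copies its inputs, so the caller's lists are left untouched.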
diff --git a/faster_rcnn/operator_py/proposal.py b/faster_rcnn/operator_py/proposal.py
new file mode 100644
index 0000000..c8b868d
--- /dev/null
+++ b/faster_rcnn/operator_py/proposal.py
@@ -0,0 +1,230 @@
+"""
+Proposal Operator transforms anchor coordinates into RoI coordinates using the predicted classification
+probabilities and bounding box deltas, together with the image size and scale information.
+"""
+
+import mxnet as mx
+import numpy as np
+import numpy.random as npr
+from distutils.util import strtobool
+
+from bbox.bbox_transform import bbox_pred, clip_boxes
+from rpn.generate_anchor import generate_anchors
+from nms.nms import py_nms_wrapper, cpu_nms_wrapper, gpu_nms_wrapper
+
+DEBUG = False
+
+
+class ProposalOperator(mx.operator.CustomOp):
+    def __init__(self, feat_stride, scales, ratios, output_score,
+                 rpn_pre_nms_top_n, rpn_post_nms_top_n, threshold, rpn_min_size):
+        super(ProposalOperator, self).__init__()
+        self._feat_stride = feat_stride
+        self._scales = np.fromstring(scales[1:-1], dtype=float, sep=',')
+        self._ratios = np.fromstring(ratios[1:-1], dtype=float, sep=',')
+        self._anchors = generate_anchors(base_size=self._feat_stride, scales=self._scales, ratios=self._ratios)
+        self._num_anchors = self._anchors.shape[0]
+        self._output_score = output_score
+        self._rpn_pre_nms_top_n = rpn_pre_nms_top_n
+        self._rpn_post_nms_top_n = rpn_post_nms_top_n
+        self._threshold = threshold
+        self._rpn_min_size = rpn_min_size
+
+        if DEBUG:
+            print('feat_stride: {}'.format(self._feat_stride))
+            print('anchors:')
+            print(self._anchors)
+
+    def forward(self, is_train, req, in_data, out_data, aux):
+        nms = gpu_nms_wrapper(self._threshold, in_data[0].context.device_id)
+
+        batch_size = in_data[0].shape[0]
+        if batch_size > 1:
+            raise ValueError("Sorry, multiple images per device is not implemented")
+
+        # for each (H, W) location i
+        #   generate A anchor boxes centered on cell i
+        #   apply predicted bbox deltas at cell i to each of the A anchors
+        # clip predicted boxes to image
+        # remove predicted boxes with either height or width < threshold
+        # sort all (proposal, score) pairs by score from highest to lowest
+        # take top pre_nms_topN proposals before NMS
+        # apply NMS with the configured threshold (e.g. 0.7) to remaining proposals
+        # take top post_nms_topN proposals after NMS
+        # return the top proposals (-> RoIs top, scores top)
+
+        pre_nms_topN = self._rpn_pre_nms_top_n
+        post_nms_topN = self._rpn_post_nms_top_n
+        min_size = self._rpn_min_size
+
+        # the first A channels are background probabilities, one per anchor
+        # keep the second A channels, the foreground probabilities
+        scores = in_data[0].asnumpy()[:, self._num_anchors:, :, :]
+        bbox_deltas = in_data[1].asnumpy()
+        im_info = in_data[2].asnumpy()[0, :]
+
+        if DEBUG:
+            print 'im_size: ({}, {})'.format(im_info[0], im_info[1])
+            print 'scale: {}'.format(im_info[2])
+
+        # 1. Generate proposals from bbox_deltas and shifted anchors
+        # use real image size instead of padded feature map sizes
+        height, width = int(im_info[0] / self._feat_stride), int(im_info[1] / self._feat_stride)
+
+        if DEBUG:
+            print 'score map size: {}'.format(scores.shape)
+            print "residual: {}".format((scores.shape[2] - height, scores.shape[3] - width))
+
+        # Enumerate all shifts
+        shift_x = np.arange(0, width) * self._feat_stride
+        shift_y = np.arange(0, height) * self._feat_stride
+        shift_x, shift_y = np.meshgrid(shift_x, shift_y)
+        shifts = np.vstack((shift_x.ravel(), shift_y.ravel(), shift_x.ravel(), shift_y.ravel())).transpose()
+
+        # Enumerate all shifted anchors:
+        #
+        # add A anchors (1, A, 4) to
+        # cell K shifts (K, 1, 4) to get
+        # shift anchors (K, A, 4)
+        # reshape to (K*A, 4) shifted anchors
+        A = self._num_anchors
+        K = shifts.shape[0]
+        anchors = self._anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2))
+        anchors = anchors.reshape((K * A, 4))
+
+        # Transpose and reshape predicted bbox transformations to get them
+        # into the same order as the anchors:
+        #
+        # bbox deltas will be (1, 4 * A, H, W) format
+        # transpose to (1, H, W, 4 * A)
+        # reshape to (1 * H * W * A, 4) where rows are ordered by (h, w, a)
+        # in slowest to fastest order
+        bbox_deltas = self._clip_pad(bbox_deltas, (height, width))
+        bbox_deltas = bbox_deltas.transpose((0, 2, 3, 1)).reshape((-1, 4))
+
+        # Same story for the scores:
+        #
+        # scores are (1, A, H, W) format
+        # transpose to (1, H, W, A)
+        # reshape to (1 * H * W * A, 1) where rows are ordered by (h, w, a)
+        scores = self._clip_pad(scores, (height, width))
+        scores = scores.transpose((0, 2, 3, 1)).reshape((-1, 1))
+
+        # Convert anchors into proposals via bbox transformations
+        proposals = bbox_pred(anchors, bbox_deltas)
+
+        # 2. clip predicted boxes to image
+        proposals = clip_boxes(proposals, im_info[:2])
+
+        # 3. remove predicted boxes with either height or width < threshold
+        # (NOTE: convert min_size to input image scale stored in im_info[2])
+        keep = self._filter_boxes(proposals, min_size * im_info[2])
+        proposals = proposals[keep, :]
+        scores = scores[keep]
+
+        # 4. sort all (proposal, score) pairs by score from highest to lowest
+        # 5. take top pre_nms_topN (e.g. 6000)
+        order = scores.ravel().argsort()[::-1]
+        if pre_nms_topN > 0:
+            order = order[:pre_nms_topN]
+        proposals = proposals[order, :]
+        scores = scores[order]
+
+        # 6. apply nms (e.g. threshold = 0.7)
+        # 7. take after_nms_topN (e.g. 300)
+        # 8. return the top proposals (-> RoIs top)
+        det = np.hstack((proposals, scores)).astype(np.float32)
+        keep = nms(det)
+        if post_nms_topN > 0:
+            keep = keep[:post_nms_topN]
+        # pad to ensure output size remains unchanged
+        if len(keep) < post_nms_topN:
+            pad = npr.choice(keep, size=post_nms_topN - len(keep))
+            keep = np.hstack((keep, pad))
+        proposals = proposals[keep, :]
+        scores = scores[keep]
+
+        # Output rois array
+        # Our RPN implementation only supports a single input image, so all
+        # batch inds are 0
+        batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
+        blob = np.hstack((batch_inds, proposals.astype(np.float32, copy=False)))
+        self.assign(out_data[0], req[0], blob)
+
+        if self._output_score:
+            self.assign(out_data[1], req[1], scores.astype(np.float32, copy=False))
+
+    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
+        self.assign(in_grad[0], req[0], 0)
+        self.assign(in_grad[1], req[1], 0)
+        self.assign(in_grad[2], req[2], 0)
+
+    @staticmethod
+    def _filter_boxes(boxes, min_size):
+        """ Remove all boxes with any side smaller than min_size """
+        ws = boxes[:, 2] - boxes[:, 0] + 1
+        hs = boxes[:, 3] - boxes[:, 1] + 1
+        keep = np.where((ws >= min_size) & (hs >= min_size))[0]
+        return keep
+
+    @staticmethod
+    def _clip_pad(tensor, pad_shape):
+        """
+        Crop the padded area off the tensor.
+        :param tensor: [n, c, H, W]
+        :param pad_shape: [h, w]
+        :return: [n, c, h, w]
+        """
+        H, W = tensor.shape[2:]
+        h, w = pad_shape
+
+        if h < H or w < W:
+            tensor = tensor[:, :, :h, :w].copy()
+
+        return tensor
+
+
+@mx.operator.register("proposal")
+class ProposalProp(mx.operator.CustomOpProp):
+    def __init__(self, feat_stride='16', scales='(8, 16, 32)', ratios='(0.5, 1, 2)', output_score='False',
+                 rpn_pre_nms_top_n='6000', rpn_post_nms_top_n='300', threshold='0.3', rpn_min_size='16'):
+        super(ProposalProp, self).__init__(need_top_grad=False)
+        self._feat_stride = int(feat_stride)
+        self._scales = scales
+        self._ratios = ratios
+        self._output_score = strtobool(output_score)
+        self._rpn_pre_nms_top_n = int(rpn_pre_nms_top_n)
+        self._rpn_post_nms_top_n = int(rpn_post_nms_top_n)
+        self._threshold = float(threshold)
+        self._rpn_min_size = int(rpn_min_size)
+
+    def list_arguments(self):
+        return ['cls_prob', 'bbox_pred', 'im_info']
+
+    def list_outputs(self):
+        if self._output_score:
+            return ['output', 'score']
+        else:
+            return ['output']
+
+    def infer_shape(self, in_shape):
+        cls_prob_shape = in_shape[0]
+        bbox_pred_shape = in_shape[1]
+        assert cls_prob_shape[0] == bbox_pred_shape[0], 'batch size differs between cls_prob and bbox_pred'
+
+        batch_size = cls_prob_shape[0]
+        im_info_shape = (batch_size, 3)
+        output_shape = (self._rpn_post_nms_top_n, 5)
+        score_shape = (self._rpn_post_nms_top_n, 1)
+
+        if self._output_score:
+            return [cls_prob_shape, bbox_pred_shape, im_info_shape], [output_shape, score_shape]
+        else:
+            return [cls_prob_shape, bbox_pred_shape, im_info_shape], [output_shape]
+
+    def create_operator(self, ctx, shapes, dtypes):
+        return ProposalOperator(self._feat_stride, self._scales, self._ratios, self._output_score,
+                                self._rpn_pre_nms_top_n, self._rpn_post_nms_top_n, self._threshold, self._rpn_min_size)
+
+    def declare_backward_dependency(self, out_grad, in_data, out_data):
+        return []
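Note on the anchor enumeration in `ProposalOperator.forward` above: the comments describe adding A base anchors of shape (1, A, 4) to K cell shifts of shape (K, 1, 4) to get (K*A, 4) shifted anchors. A minimal standalone sketch of that broadcasting step follows, assuming NumPy only. The helper name `enumerate_shifted_anchors` and the two base anchors are hypothetical and chosen for illustration; they are not the output of `generate_anchors`.

```python
import numpy as np


def enumerate_shifted_anchors(base_anchors, height, width, feat_stride):
    """Tile A base anchors over a (height, width) grid.

    Returns an array of shape (K * A, 4) with K = height * width,
    rows ordered by (h, w, a) from slowest to fastest.
    """
    shift_x = np.arange(0, width) * feat_stride
    shift_y = np.arange(0, height) * feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    # One (x1, y1, x2, y2) offset per feature-map cell.
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                        shift_x.ravel(), shift_y.ravel())).transpose()
    A = base_anchors.shape[0]
    K = shifts.shape[0]
    # (1, A, 4) + (K, 1, 4) broadcasts to (K, A, 4).
    anchors = base_anchors.reshape((1, A, 4)) + shifts.reshape((K, 1, 4))
    return anchors.reshape((K * A, 4))


# Hypothetical base anchors in (x1, y1, x2, y2) form.
base = np.array([[-8., -8., 23., 23.],
                 [0., 0., 15., 15.]])
anchors = enumerate_shifted_anchors(base, height=2, width=3, feat_stride=16)
print(anchors.shape)
print(anchors[2])
```

The sketch reshapes the shifts directly to (K, 1, 4), which yields the same array as the `reshape((1, K, 4)).transpose((1, 0, 2))` form used in the operator.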
diff --git a/faster_rcnn/operator_py/proposal_target.py b/faster_rcnn/operator_py/proposal_target.py
new file mode 100644
index 0000000..d56ae5b
--- /dev/null
+++ b/faster_rcnn/operator_py/proposal_target.py
@@ -0,0 +1,116 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+"""
+Proposal Target Operator selects foreground and background ROIs and assigns labels and bbox regression targets to them.
+"""
+
+import mxnet as mx
+import numpy as np
+from distutils.util import strtobool
+from easydict import EasyDict as edict
+import cPickle
+
+
+from core.rcnn import sample_rois
+
+DEBUG = False
+
+
+class ProposalTargetOperator(mx.operator.CustomOp):
+    def __init__(self, num_classes, batch_images, batch_rois, cfg, fg_fraction):
+        super(ProposalTargetOperator, self).__init__()
+        self._num_classes = num_classes
+        self._batch_images = batch_images
+        self._batch_rois = batch_rois
+        self._cfg = cfg
+        self._fg_fraction = fg_fraction
+
+        if DEBUG:
+            self._count = 0
+            self._fg_num = 0
+            self._bg_num = 0
+
+    def forward(self, is_train, req, in_data, out_data, aux):
+        assert self._batch_rois == -1 or self._batch_rois % self._batch_images == 0, \
+            'batch_images {} must divide batch_rois {}'.format(self._batch_images, self._batch_rois)
+        all_rois = in_data[0].asnumpy()
+        gt_boxes = in_data[1].asnumpy()
+
+        if self._batch_rois == -1:
+            rois_per_image = all_rois.shape[0] + gt_boxes.shape[0]
+            fg_rois_per_image = rois_per_image
+        else:
+            rois_per_image = self._batch_rois / self._batch_images
+            fg_rois_per_image = np.round(self._fg_fraction * rois_per_image).astype(int)
+
+
+        # Include ground-truth boxes in the set of candidate rois
+        zeros = np.zeros((gt_boxes.shape[0], 1), dtype=gt_boxes.dtype)
+        all_rois = np.vstack((all_rois, np.hstack((zeros, gt_boxes[:, :-1]))))
+        # Sanity check: single batch only
+        assert np.all(all_rois[:, 0] == 0), 'Only single item batches are supported'
+
+        rois, labels, bbox_targets, bbox_weights = \
+            sample_rois(all_rois, fg_rois_per_image, rois_per_image, self._num_classes, self._cfg, gt_boxes=gt_boxes)
+
+        if DEBUG:
+            print "labels=", labels
+            print 'num fg: {}'.format((labels > 0).sum())
+            print 'num bg: {}'.format((labels == 0).sum())
+            self._count += 1
+            self._fg_num += (labels > 0).sum()
+            self._bg_num += (labels == 0).sum()
+            print "self._count=", self._count
+            print 'num fg avg: {}'.format(self._fg_num / self._count)
+            print 'num bg avg: {}'.format(self._bg_num / self._count)
+            print 'ratio: {:.3f}'.format(float(self._fg_num) / float(self._bg_num))
+
+        for ind, val in enumerate([rois, labels, bbox_targets, bbox_weights]):
+            self.assign(out_data[ind], req[ind], val)
+
+    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
+        self.assign(in_grad[0], req[0], 0)
+        self.assign(in_grad[1], req[1], 0)
+
+
+@mx.operator.register('proposal_target')
+class ProposalTargetProp(mx.operator.CustomOpProp):
+    def __init__(self, num_classes, batch_images, batch_rois, cfg, fg_fraction='0.25'):
+        super(ProposalTargetProp, self).__init__(need_top_grad=False)
+        self._num_classes = int(num_classes)
+        self._batch_images = int(batch_images)
+        self._batch_rois = int(batch_rois)
+        self._cfg = cPickle.loads(cfg)
+        self._fg_fraction = float(fg_fraction)
+
+    def list_arguments(self):
+        return ['rois', 'gt_boxes']
+
+    def list_outputs(self):
+        return ['rois_output', 'label', 'bbox_target', 'bbox_weight']
+
+    def infer_shape(self, in_shape):
+        rpn_rois_shape = in_shape[0]
+        gt_boxes_shape = in_shape[1]
+
+        rois = rpn_rois_shape[0] + gt_boxes_shape[0] if self._batch_rois == -1 else self._batch_rois
+
+        output_rois_shape = (rois, 5)
+        label_shape = (rois, )
+        bbox_target_shape = (rois, self._num_classes * 4)
+        bbox_weight_shape = (rois, self._num_classes * 4)
+
+        return [rpn_rois_shape, gt_boxes_shape], \
+               [output_rois_shape, label_shape, bbox_target_shape, bbox_weight_shape]
+
+    def create_operator(self, ctx, shapes, dtypes):
+        return ProposalTargetOperator(self._num_classes, self._batch_images, self._batch_rois, self._cfg, self._fg_fraction)
+
+    def declare_backward_dependency(self, out_grad, in_data, out_data):
+        return []
diff --git a/faster_rcnn/symbols/__init__.py b/faster_rcnn/symbols/__init__.py
new file mode 100644
index 0000000..1f747da
--- /dev/null
+++ b/faster_rcnn/symbols/__init__.py
@@ -0,0 +1,2 @@
+import resnet_v1_101_rcnn
+import resnet_v1_101_rcnn_dcn
diff --git a/faster_rcnn/symbols/resnet_v1_101_rcnn.py b/faster_rcnn/symbols/resnet_v1_101_rcnn.py
new file mode 100644
index 0000000..12059ba
--- /dev/null
+++ b/faster_rcnn/symbols/resnet_v1_101_rcnn.py
@@ -0,0 +1,1009 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Guodong Zhang
+# Modified by Bin Xiao
+# --------------------------------------------------------
+
+import cPickle
+import mxnet as mx
+from utils.symbol import Symbol
+from operator_py.proposal import *
+from operator_py.proposal_target import *
+from operator_py.box_annotator_ohem import *
+
+
+class resnet_v1_101_rcnn(Symbol):
+    def __init__(self):
+        """
+        Use __init__ to define the parameters the network needs
+        """
+        self.eps = 1e-5
+        self.use_global_stats = True
+        self.workspace = 512
+        self.units = (3, 4, 23, 3) # use for 101
+        self.filter_list = [256, 512, 1024, 2048]
+
+    def get_resnet_v1_conv4(self, data):
+        conv1 = mx.symbol.Convolution(name='conv1', data=data, num_filter=64, pad=(3, 3), kernel=(7, 7), stride=(2, 2),
+                                      no_bias=True)
+        bn_conv1 = mx.symbol.BatchNorm(name='bn_conv1', data=conv1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale_conv1 = bn_conv1
+        conv1_relu = mx.symbol.Activation(name='conv1_relu', data=scale_conv1, act_type='relu')
+        pool1 = mx.symbol.Pooling(name='pool1', data=conv1_relu, pooling_convention='full', pad=(0, 0), kernel=(3, 3),
+                                  stride=(2, 2), pool_type='max')
+        res2a_branch1 = mx.symbol.Convolution(name='res2a_branch1', data=pool1, num_filter=256, pad=(0, 0), kernel=(1, 1),
+                                              stride=(1, 1), no_bias=True)
+        bn2a_branch1 = mx.symbol.BatchNorm(name='bn2a_branch1', data=res2a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale2a_branch1 = bn2a_branch1
+        res2a_branch2a = mx.symbol.Convolution(name='res2a_branch2a', data=pool1, num_filter=64, pad=(0, 0), kernel=(1, 1),
+                                               stride=(1, 1), no_bias=True)
+        bn2a_branch2a = mx.symbol.BatchNorm(name='bn2a_branch2a', data=res2a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2a = bn2a_branch2a
+        res2a_branch2a_relu = mx.symbol.Activation(name='res2a_branch2a_relu', data=scale2a_branch2a, act_type='relu')
+        res2a_branch2b = mx.symbol.Convolution(name='res2a_branch2b', data=res2a_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2a_branch2b = mx.symbol.BatchNorm(name='bn2a_branch2b', data=res2a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2b = bn2a_branch2b
+        res2a_branch2b_relu = mx.symbol.Activation(name='res2a_branch2b_relu', data=scale2a_branch2b, act_type='relu')
+        res2a_branch2c = mx.symbol.Convolution(name='res2a_branch2c', data=res2a_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2c = mx.symbol.BatchNorm(name='bn2a_branch2c', data=res2a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2c = bn2a_branch2c
+        res2a = mx.symbol.broadcast_add(name='res2a', *[scale2a_branch1, scale2a_branch2c])
+        res2a_relu = mx.symbol.Activation(name='res2a_relu', data=res2a, act_type='relu')
+        res2b_branch2a = mx.symbol.Convolution(name='res2b_branch2a', data=res2a_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2a = mx.symbol.BatchNorm(name='bn2b_branch2a', data=res2b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2a = bn2b_branch2a
+        res2b_branch2a_relu = mx.symbol.Activation(name='res2b_branch2a_relu', data=scale2b_branch2a, act_type='relu')
+        res2b_branch2b = mx.symbol.Convolution(name='res2b_branch2b', data=res2b_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2b_branch2b = mx.symbol.BatchNorm(name='bn2b_branch2b', data=res2b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2b = bn2b_branch2b
+        res2b_branch2b_relu = mx.symbol.Activation(name='res2b_branch2b_relu', data=scale2b_branch2b, act_type='relu')
+        res2b_branch2c = mx.symbol.Convolution(name='res2b_branch2c', data=res2b_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2c = mx.symbol.BatchNorm(name='bn2b_branch2c', data=res2b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2c = bn2b_branch2c
+        res2b = mx.symbol.broadcast_add(name='res2b', *[res2a_relu, scale2b_branch2c])
+        res2b_relu = mx.symbol.Activation(name='res2b_relu', data=res2b, act_type='relu')
+        res2c_branch2a = mx.symbol.Convolution(name='res2c_branch2a', data=res2b_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2a = mx.symbol.BatchNorm(name='bn2c_branch2a', data=res2c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2a = bn2c_branch2a
+        res2c_branch2a_relu = mx.symbol.Activation(name='res2c_branch2a_relu', data=scale2c_branch2a, act_type='relu')
+        res2c_branch2b = mx.symbol.Convolution(name='res2c_branch2b', data=res2c_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2c_branch2b = mx.symbol.BatchNorm(name='bn2c_branch2b', data=res2c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2b = bn2c_branch2b
+        res2c_branch2b_relu = mx.symbol.Activation(name='res2c_branch2b_relu', data=scale2c_branch2b, act_type='relu')
+        res2c_branch2c = mx.symbol.Convolution(name='res2c_branch2c', data=res2c_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2c = mx.symbol.BatchNorm(name='bn2c_branch2c', data=res2c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2c = bn2c_branch2c
+        res2c = mx.symbol.broadcast_add(name='res2c', *[res2b_relu, scale2c_branch2c])
+        res2c_relu = mx.symbol.Activation(name='res2c_relu', data=res2c, act_type='relu')
+        res3a_branch1 = mx.symbol.Convolution(name='res3a_branch1', data=res2c_relu, num_filter=512, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch1 = mx.symbol.BatchNorm(name='bn3a_branch1', data=res3a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale3a_branch1 = bn3a_branch1
+        res3a_branch2a = mx.symbol.Convolution(name='res3a_branch2a', data=res2c_relu, num_filter=128, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch2a = mx.symbol.BatchNorm(name='bn3a_branch2a', data=res3a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2a = bn3a_branch2a
+        res3a_branch2a_relu = mx.symbol.Activation(name='res3a_branch2a_relu', data=scale3a_branch2a, act_type='relu')
+        res3a_branch2b = mx.symbol.Convolution(name='res3a_branch2b', data=res3a_branch2a_relu, num_filter=128, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3a_branch2b = mx.symbol.BatchNorm(name='bn3a_branch2b', data=res3a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2b = bn3a_branch2b
+        res3a_branch2b_relu = mx.symbol.Activation(name='res3a_branch2b_relu', data=scale3a_branch2b, act_type='relu')
+        res3a_branch2c = mx.symbol.Convolution(name='res3a_branch2c', data=res3a_branch2b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3a_branch2c = mx.symbol.BatchNorm(name='bn3a_branch2c', data=res3a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2c = bn3a_branch2c
+        res3a = mx.symbol.broadcast_add(name='res3a', *[scale3a_branch1, scale3a_branch2c])
+        res3a_relu = mx.symbol.Activation(name='res3a_relu', data=res3a, act_type='relu')
+        res3b1_branch2a = mx.symbol.Convolution(name='res3b1_branch2a', data=res3a_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2a = mx.symbol.BatchNorm(name='bn3b1_branch2a', data=res3b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2a = bn3b1_branch2a
+        res3b1_branch2a_relu = mx.symbol.Activation(name='res3b1_branch2a_relu', data=scale3b1_branch2a, act_type='relu')
+        res3b1_branch2b = mx.symbol.Convolution(name='res3b1_branch2b', data=res3b1_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b1_branch2b = mx.symbol.BatchNorm(name='bn3b1_branch2b', data=res3b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2b = bn3b1_branch2b
+        res3b1_branch2b_relu = mx.symbol.Activation(name='res3b1_branch2b_relu', data=scale3b1_branch2b, act_type='relu')
+        res3b1_branch2c = mx.symbol.Convolution(name='res3b1_branch2c', data=res3b1_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2c = mx.symbol.BatchNorm(name='bn3b1_branch2c', data=res3b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2c = bn3b1_branch2c
+        res3b1 = mx.symbol.broadcast_add(name='res3b1', *[res3a_relu, scale3b1_branch2c])
+        res3b1_relu = mx.symbol.Activation(name='res3b1_relu', data=res3b1, act_type='relu')
+        res3b2_branch2a = mx.symbol.Convolution(name='res3b2_branch2a', data=res3b1_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2a = mx.symbol.BatchNorm(name='bn3b2_branch2a', data=res3b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2a = bn3b2_branch2a
+        res3b2_branch2a_relu = mx.symbol.Activation(name='res3b2_branch2a_relu', data=scale3b2_branch2a, act_type='relu')
+        res3b2_branch2b = mx.symbol.Convolution(name='res3b2_branch2b', data=res3b2_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b2_branch2b = mx.symbol.BatchNorm(name='bn3b2_branch2b', data=res3b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2b = bn3b2_branch2b
+        res3b2_branch2b_relu = mx.symbol.Activation(name='res3b2_branch2b_relu', data=scale3b2_branch2b, act_type='relu')
+        res3b2_branch2c = mx.symbol.Convolution(name='res3b2_branch2c', data=res3b2_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2c = mx.symbol.BatchNorm(name='bn3b2_branch2c', data=res3b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2c = bn3b2_branch2c
+        res3b2 = mx.symbol.broadcast_add(name='res3b2', *[res3b1_relu, scale3b2_branch2c])
+        res3b2_relu = mx.symbol.Activation(name='res3b2_relu', data=res3b2, act_type='relu')
+        res3b3_branch2a = mx.symbol.Convolution(name='res3b3_branch2a', data=res3b2_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2a = mx.symbol.BatchNorm(name='bn3b3_branch2a', data=res3b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2a = bn3b3_branch2a
+        res3b3_branch2a_relu = mx.symbol.Activation(name='res3b3_branch2a_relu', data=scale3b3_branch2a, act_type='relu')
+        res3b3_branch2b = mx.symbol.Convolution(name='res3b3_branch2b', data=res3b3_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b3_branch2b = mx.symbol.BatchNorm(name='bn3b3_branch2b', data=res3b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2b = bn3b3_branch2b
+        res3b3_branch2b_relu = mx.symbol.Activation(name='res3b3_branch2b_relu', data=scale3b3_branch2b, act_type='relu')
+        res3b3_branch2c = mx.symbol.Convolution(name='res3b3_branch2c', data=res3b3_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2c = mx.symbol.BatchNorm(name='bn3b3_branch2c', data=res3b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2c = bn3b3_branch2c
+        res3b3 = mx.symbol.broadcast_add(name='res3b3', *[res3b2_relu, scale3b3_branch2c])
+        res3b3_relu = mx.symbol.Activation(name='res3b3_relu', data=res3b3, act_type='relu')
+        res4a_branch1 = mx.symbol.Convolution(name='res4a_branch1', data=res3b3_relu, num_filter=1024, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch1 = mx.symbol.BatchNorm(name='bn4a_branch1', data=res4a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale4a_branch1 = bn4a_branch1
+        res4a_branch2a = mx.symbol.Convolution(name='res4a_branch2a', data=res3b3_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch2a = mx.symbol.BatchNorm(name='bn4a_branch2a', data=res4a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2a = bn4a_branch2a
+        res4a_branch2a_relu = mx.symbol.Activation(name='res4a_branch2a_relu', data=scale4a_branch2a, act_type='relu')
+        res4a_branch2b = mx.symbol.Convolution(name='res4a_branch2b', data=res4a_branch2a_relu, num_filter=256, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4a_branch2b = mx.symbol.BatchNorm(name='bn4a_branch2b', data=res4a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2b = bn4a_branch2b
+        res4a_branch2b_relu = mx.symbol.Activation(name='res4a_branch2b_relu', data=scale4a_branch2b, act_type='relu')
+        res4a_branch2c = mx.symbol.Convolution(name='res4a_branch2c', data=res4a_branch2b_relu, num_filter=1024, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4a_branch2c = mx.symbol.BatchNorm(name='bn4a_branch2c', data=res4a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2c = bn4a_branch2c
+        res4a = mx.symbol.broadcast_add(name='res4a', *[scale4a_branch1, scale4a_branch2c])
+        res4a_relu = mx.symbol.Activation(name='res4a_relu', data=res4a, act_type='relu')
+        res4b1_branch2a = mx.symbol.Convolution(name='res4b1_branch2a', data=res4a_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2a = mx.symbol.BatchNorm(name='bn4b1_branch2a', data=res4b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2a = bn4b1_branch2a
+        res4b1_branch2a_relu = mx.symbol.Activation(name='res4b1_branch2a_relu', data=scale4b1_branch2a, act_type='relu')
+        res4b1_branch2b = mx.symbol.Convolution(name='res4b1_branch2b', data=res4b1_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b1_branch2b = mx.symbol.BatchNorm(name='bn4b1_branch2b', data=res4b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2b = bn4b1_branch2b
+        res4b1_branch2b_relu = mx.symbol.Activation(name='res4b1_branch2b_relu', data=scale4b1_branch2b, act_type='relu')
+        res4b1_branch2c = mx.symbol.Convolution(name='res4b1_branch2c', data=res4b1_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2c = mx.symbol.BatchNorm(name='bn4b1_branch2c', data=res4b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2c = bn4b1_branch2c
+        res4b1 = mx.symbol.broadcast_add(name='res4b1', *[res4a_relu, scale4b1_branch2c])
+        res4b1_relu = mx.symbol.Activation(name='res4b1_relu', data=res4b1, act_type='relu')
+        res4b2_branch2a = mx.symbol.Convolution(name='res4b2_branch2a', data=res4b1_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2a = mx.symbol.BatchNorm(name='bn4b2_branch2a', data=res4b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2a = bn4b2_branch2a
+        res4b2_branch2a_relu = mx.symbol.Activation(name='res4b2_branch2a_relu', data=scale4b2_branch2a, act_type='relu')
+        res4b2_branch2b = mx.symbol.Convolution(name='res4b2_branch2b', data=res4b2_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b2_branch2b = mx.symbol.BatchNorm(name='bn4b2_branch2b', data=res4b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2b = bn4b2_branch2b
+        res4b2_branch2b_relu = mx.symbol.Activation(name='res4b2_branch2b_relu', data=scale4b2_branch2b, act_type='relu')
+        res4b2_branch2c = mx.symbol.Convolution(name='res4b2_branch2c', data=res4b2_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2c = mx.symbol.BatchNorm(name='bn4b2_branch2c', data=res4b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2c = bn4b2_branch2c
+        res4b2 = mx.symbol.broadcast_add(name='res4b2', *[res4b1_relu, scale4b2_branch2c])
+        res4b2_relu = mx.symbol.Activation(name='res4b2_relu', data=res4b2, act_type='relu')
+        res4b3_branch2a = mx.symbol.Convolution(name='res4b3_branch2a', data=res4b2_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2a = mx.symbol.BatchNorm(name='bn4b3_branch2a', data=res4b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2a = bn4b3_branch2a
+        res4b3_branch2a_relu = mx.symbol.Activation(name='res4b3_branch2a_relu', data=scale4b3_branch2a, act_type='relu')
+        res4b3_branch2b = mx.symbol.Convolution(name='res4b3_branch2b', data=res4b3_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b3_branch2b = mx.symbol.BatchNorm(name='bn4b3_branch2b', data=res4b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2b = bn4b3_branch2b
+        res4b3_branch2b_relu = mx.symbol.Activation(name='res4b3_branch2b_relu', data=scale4b3_branch2b, act_type='relu')
+        res4b3_branch2c = mx.symbol.Convolution(name='res4b3_branch2c', data=res4b3_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2c = mx.symbol.BatchNorm(name='bn4b3_branch2c', data=res4b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2c = bn4b3_branch2c
+        res4b3 = mx.symbol.broadcast_add(name='res4b3', *[res4b2_relu, scale4b3_branch2c])
+        res4b3_relu = mx.symbol.Activation(name='res4b3_relu', data=res4b3, act_type='relu')
+        res4b4_branch2a = mx.symbol.Convolution(name='res4b4_branch2a', data=res4b3_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2a = mx.symbol.BatchNorm(name='bn4b4_branch2a', data=res4b4_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2a = bn4b4_branch2a
+        res4b4_branch2a_relu = mx.symbol.Activation(name='res4b4_branch2a_relu', data=scale4b4_branch2a, act_type='relu')
+        res4b4_branch2b = mx.symbol.Convolution(name='res4b4_branch2b', data=res4b4_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b4_branch2b = mx.symbol.BatchNorm(name='bn4b4_branch2b', data=res4b4_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2b = bn4b4_branch2b
+        res4b4_branch2b_relu = mx.symbol.Activation(name='res4b4_branch2b_relu', data=scale4b4_branch2b, act_type='relu')
+        res4b4_branch2c = mx.symbol.Convolution(name='res4b4_branch2c', data=res4b4_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2c = mx.symbol.BatchNorm(name='bn4b4_branch2c', data=res4b4_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2c = bn4b4_branch2c
+        res4b4 = mx.symbol.broadcast_add(name='res4b4', *[res4b3_relu, scale4b4_branch2c])
+        res4b4_relu = mx.symbol.Activation(name='res4b4_relu', data=res4b4, act_type='relu')
+        res4b5_branch2a = mx.symbol.Convolution(name='res4b5_branch2a', data=res4b4_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2a = mx.symbol.BatchNorm(name='bn4b5_branch2a', data=res4b5_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2a = bn4b5_branch2a
+        res4b5_branch2a_relu = mx.symbol.Activation(name='res4b5_branch2a_relu', data=scale4b5_branch2a, act_type='relu')
+        res4b5_branch2b = mx.symbol.Convolution(name='res4b5_branch2b', data=res4b5_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b5_branch2b = mx.symbol.BatchNorm(name='bn4b5_branch2b', data=res4b5_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2b = bn4b5_branch2b
+        res4b5_branch2b_relu = mx.symbol.Activation(name='res4b5_branch2b_relu', data=scale4b5_branch2b, act_type='relu')
+        res4b5_branch2c = mx.symbol.Convolution(name='res4b5_branch2c', data=res4b5_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2c = mx.symbol.BatchNorm(name='bn4b5_branch2c', data=res4b5_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2c = bn4b5_branch2c
+        res4b5 = mx.symbol.broadcast_add(name='res4b5', *[res4b4_relu, scale4b5_branch2c])
+        res4b5_relu = mx.symbol.Activation(name='res4b5_relu', data=res4b5, act_type='relu')
+        res4b6_branch2a = mx.symbol.Convolution(name='res4b6_branch2a', data=res4b5_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2a = mx.symbol.BatchNorm(name='bn4b6_branch2a', data=res4b6_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2a = bn4b6_branch2a
+        res4b6_branch2a_relu = mx.symbol.Activation(name='res4b6_branch2a_relu', data=scale4b6_branch2a, act_type='relu')
+        res4b6_branch2b = mx.symbol.Convolution(name='res4b6_branch2b', data=res4b6_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b6_branch2b = mx.symbol.BatchNorm(name='bn4b6_branch2b', data=res4b6_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2b = bn4b6_branch2b
+        res4b6_branch2b_relu = mx.symbol.Activation(name='res4b6_branch2b_relu', data=scale4b6_branch2b, act_type='relu')
+        res4b6_branch2c = mx.symbol.Convolution(name='res4b6_branch2c', data=res4b6_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2c = mx.symbol.BatchNorm(name='bn4b6_branch2c', data=res4b6_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2c = bn4b6_branch2c
+        res4b6 = mx.symbol.broadcast_add(name='res4b6', *[res4b5_relu, scale4b6_branch2c])
+        res4b6_relu = mx.symbol.Activation(name='res4b6_relu', data=res4b6, act_type='relu')
+        res4b7_branch2a = mx.symbol.Convolution(name='res4b7_branch2a', data=res4b6_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2a = mx.symbol.BatchNorm(name='bn4b7_branch2a', data=res4b7_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2a = bn4b7_branch2a
+        res4b7_branch2a_relu = mx.symbol.Activation(name='res4b7_branch2a_relu', data=scale4b7_branch2a, act_type='relu')
+        res4b7_branch2b = mx.symbol.Convolution(name='res4b7_branch2b', data=res4b7_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b7_branch2b = mx.symbol.BatchNorm(name='bn4b7_branch2b', data=res4b7_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2b = bn4b7_branch2b
+        res4b7_branch2b_relu = mx.symbol.Activation(name='res4b7_branch2b_relu', data=scale4b7_branch2b, act_type='relu')
+        res4b7_branch2c = mx.symbol.Convolution(name='res4b7_branch2c', data=res4b7_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2c = mx.symbol.BatchNorm(name='bn4b7_branch2c', data=res4b7_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2c = bn4b7_branch2c
+        res4b7 = mx.symbol.broadcast_add(name='res4b7', *[res4b6_relu, scale4b7_branch2c])
+        res4b7_relu = mx.symbol.Activation(name='res4b7_relu', data=res4b7, act_type='relu')
+        res4b8_branch2a = mx.symbol.Convolution(name='res4b8_branch2a', data=res4b7_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2a = mx.symbol.BatchNorm(name='bn4b8_branch2a', data=res4b8_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2a = bn4b8_branch2a
+        res4b8_branch2a_relu = mx.symbol.Activation(name='res4b8_branch2a_relu', data=scale4b8_branch2a, act_type='relu')
+        res4b8_branch2b = mx.symbol.Convolution(name='res4b8_branch2b', data=res4b8_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b8_branch2b = mx.symbol.BatchNorm(name='bn4b8_branch2b', data=res4b8_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2b = bn4b8_branch2b
+        res4b8_branch2b_relu = mx.symbol.Activation(name='res4b8_branch2b_relu', data=scale4b8_branch2b, act_type='relu')
+        res4b8_branch2c = mx.symbol.Convolution(name='res4b8_branch2c', data=res4b8_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2c = mx.symbol.BatchNorm(name='bn4b8_branch2c', data=res4b8_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2c = bn4b8_branch2c
+        res4b8 = mx.symbol.broadcast_add(name='res4b8', *[res4b7_relu, scale4b8_branch2c])
+        res4b8_relu = mx.symbol.Activation(name='res4b8_relu', data=res4b8, act_type='relu')
+        res4b9_branch2a = mx.symbol.Convolution(name='res4b9_branch2a', data=res4b8_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2a = mx.symbol.BatchNorm(name='bn4b9_branch2a', data=res4b9_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2a = bn4b9_branch2a
+        res4b9_branch2a_relu = mx.symbol.Activation(name='res4b9_branch2a_relu', data=scale4b9_branch2a, act_type='relu')
+        res4b9_branch2b = mx.symbol.Convolution(name='res4b9_branch2b', data=res4b9_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b9_branch2b = mx.symbol.BatchNorm(name='bn4b9_branch2b', data=res4b9_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2b = bn4b9_branch2b
+        res4b9_branch2b_relu = mx.symbol.Activation(name='res4b9_branch2b_relu', data=scale4b9_branch2b, act_type='relu')
+        res4b9_branch2c = mx.symbol.Convolution(name='res4b9_branch2c', data=res4b9_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2c = mx.symbol.BatchNorm(name='bn4b9_branch2c', data=res4b9_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2c = bn4b9_branch2c
+        res4b9 = mx.symbol.broadcast_add(name='res4b9', *[res4b8_relu, scale4b9_branch2c])
+        res4b9_relu = mx.symbol.Activation(name='res4b9_relu', data=res4b9, act_type='relu')
+        res4b10_branch2a = mx.symbol.Convolution(name='res4b10_branch2a', data=res4b9_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2a = mx.symbol.BatchNorm(name='bn4b10_branch2a', data=res4b10_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2a = bn4b10_branch2a
+        res4b10_branch2a_relu = mx.symbol.Activation(name='res4b10_branch2a_relu', data=scale4b10_branch2a, act_type='relu')
+        res4b10_branch2b = mx.symbol.Convolution(name='res4b10_branch2b', data=res4b10_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b10_branch2b = mx.symbol.BatchNorm(name='bn4b10_branch2b', data=res4b10_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2b = bn4b10_branch2b
+        res4b10_branch2b_relu = mx.symbol.Activation(name='res4b10_branch2b_relu', data=scale4b10_branch2b, act_type='relu')
+        res4b10_branch2c = mx.symbol.Convolution(name='res4b10_branch2c', data=res4b10_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2c = mx.symbol.BatchNorm(name='bn4b10_branch2c', data=res4b10_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2c = bn4b10_branch2c
+        res4b10 = mx.symbol.broadcast_add(name='res4b10', *[res4b9_relu, scale4b10_branch2c])
+        res4b10_relu = mx.symbol.Activation(name='res4b10_relu', data=res4b10, act_type='relu')
+        res4b11_branch2a = mx.symbol.Convolution(name='res4b11_branch2a', data=res4b10_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2a = mx.symbol.BatchNorm(name='bn4b11_branch2a', data=res4b11_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2a = bn4b11_branch2a
+        res4b11_branch2a_relu = mx.symbol.Activation(name='res4b11_branch2a_relu', data=scale4b11_branch2a, act_type='relu')
+        res4b11_branch2b = mx.symbol.Convolution(name='res4b11_branch2b', data=res4b11_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b11_branch2b = mx.symbol.BatchNorm(name='bn4b11_branch2b', data=res4b11_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2b = bn4b11_branch2b
+        res4b11_branch2b_relu = mx.symbol.Activation(name='res4b11_branch2b_relu', data=scale4b11_branch2b, act_type='relu')
+        res4b11_branch2c = mx.symbol.Convolution(name='res4b11_branch2c', data=res4b11_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2c = mx.symbol.BatchNorm(name='bn4b11_branch2c', data=res4b11_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2c = bn4b11_branch2c
+        res4b11 = mx.symbol.broadcast_add(name='res4b11', *[res4b10_relu, scale4b11_branch2c])
+        res4b11_relu = mx.symbol.Activation(name='res4b11_relu', data=res4b11, act_type='relu')
+        res4b12_branch2a = mx.symbol.Convolution(name='res4b12_branch2a', data=res4b11_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2a = mx.symbol.BatchNorm(name='bn4b12_branch2a', data=res4b12_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2a = bn4b12_branch2a
+        res4b12_branch2a_relu = mx.symbol.Activation(name='res4b12_branch2a_relu', data=scale4b12_branch2a, act_type='relu')
+        res4b12_branch2b = mx.symbol.Convolution(name='res4b12_branch2b', data=res4b12_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b12_branch2b = mx.symbol.BatchNorm(name='bn4b12_branch2b', data=res4b12_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2b = bn4b12_branch2b
+        res4b12_branch2b_relu = mx.symbol.Activation(name='res4b12_branch2b_relu', data=scale4b12_branch2b, act_type='relu')
+        res4b12_branch2c = mx.symbol.Convolution(name='res4b12_branch2c', data=res4b12_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2c = mx.symbol.BatchNorm(name='bn4b12_branch2c', data=res4b12_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2c = bn4b12_branch2c
+        res4b12 = mx.symbol.broadcast_add(name='res4b12', *[res4b11_relu, scale4b12_branch2c])
+        res4b12_relu = mx.symbol.Activation(name='res4b12_relu', data=res4b12, act_type='relu')
+        res4b13_branch2a = mx.symbol.Convolution(name='res4b13_branch2a', data=res4b12_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2a = mx.symbol.BatchNorm(name='bn4b13_branch2a', data=res4b13_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2a = bn4b13_branch2a
+        res4b13_branch2a_relu = mx.symbol.Activation(name='res4b13_branch2a_relu', data=scale4b13_branch2a, act_type='relu')
+        res4b13_branch2b = mx.symbol.Convolution(name='res4b13_branch2b', data=res4b13_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b13_branch2b = mx.symbol.BatchNorm(name='bn4b13_branch2b', data=res4b13_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2b = bn4b13_branch2b
+        res4b13_branch2b_relu = mx.symbol.Activation(name='res4b13_branch2b_relu', data=scale4b13_branch2b, act_type='relu')
+        res4b13_branch2c = mx.symbol.Convolution(name='res4b13_branch2c', data=res4b13_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2c = mx.symbol.BatchNorm(name='bn4b13_branch2c', data=res4b13_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2c = bn4b13_branch2c
+        res4b13 = mx.symbol.broadcast_add(name='res4b13', *[res4b12_relu, scale4b13_branch2c])
+        res4b13_relu = mx.symbol.Activation(name='res4b13_relu', data=res4b13, act_type='relu')
+        res4b14_branch2a = mx.symbol.Convolution(name='res4b14_branch2a', data=res4b13_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2a = mx.symbol.BatchNorm(name='bn4b14_branch2a', data=res4b14_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2a = bn4b14_branch2a
+        res4b14_branch2a_relu = mx.symbol.Activation(name='res4b14_branch2a_relu', data=scale4b14_branch2a, act_type='relu')
+        res4b14_branch2b = mx.symbol.Convolution(name='res4b14_branch2b', data=res4b14_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b14_branch2b = mx.symbol.BatchNorm(name='bn4b14_branch2b', data=res4b14_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2b = bn4b14_branch2b
+        res4b14_branch2b_relu = mx.symbol.Activation(name='res4b14_branch2b_relu', data=scale4b14_branch2b, act_type='relu')
+        res4b14_branch2c = mx.symbol.Convolution(name='res4b14_branch2c', data=res4b14_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2c = mx.symbol.BatchNorm(name='bn4b14_branch2c', data=res4b14_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2c = bn4b14_branch2c
+        res4b14 = mx.symbol.broadcast_add(name='res4b14', *[res4b13_relu, scale4b14_branch2c])
+        res4b14_relu = mx.symbol.Activation(name='res4b14_relu', data=res4b14, act_type='relu')
+        res4b15_branch2a = mx.symbol.Convolution(name='res4b15_branch2a', data=res4b14_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2a = mx.symbol.BatchNorm(name='bn4b15_branch2a', data=res4b15_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2a = bn4b15_branch2a
+        res4b15_branch2a_relu = mx.symbol.Activation(name='res4b15_branch2a_relu', data=scale4b15_branch2a, act_type='relu')
+        res4b15_branch2b = mx.symbol.Convolution(name='res4b15_branch2b', data=res4b15_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b15_branch2b = mx.symbol.BatchNorm(name='bn4b15_branch2b', data=res4b15_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2b = bn4b15_branch2b
+        res4b15_branch2b_relu = mx.symbol.Activation(name='res4b15_branch2b_relu', data=scale4b15_branch2b, act_type='relu')
+        res4b15_branch2c = mx.symbol.Convolution(name='res4b15_branch2c', data=res4b15_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2c = mx.symbol.BatchNorm(name='bn4b15_branch2c', data=res4b15_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2c = bn4b15_branch2c
+        res4b15 = mx.symbol.broadcast_add(name='res4b15', *[res4b14_relu, scale4b15_branch2c])
+        res4b15_relu = mx.symbol.Activation(name='res4b15_relu', data=res4b15, act_type='relu')
+        res4b16_branch2a = mx.symbol.Convolution(name='res4b16_branch2a', data=res4b15_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2a = mx.symbol.BatchNorm(name='bn4b16_branch2a', data=res4b16_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2a = bn4b16_branch2a
+        res4b16_branch2a_relu = mx.symbol.Activation(name='res4b16_branch2a_relu', data=scale4b16_branch2a, act_type='relu')
+        res4b16_branch2b = mx.symbol.Convolution(name='res4b16_branch2b', data=res4b16_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b16_branch2b = mx.symbol.BatchNorm(name='bn4b16_branch2b', data=res4b16_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2b = bn4b16_branch2b
+        res4b16_branch2b_relu = mx.symbol.Activation(name='res4b16_branch2b_relu', data=scale4b16_branch2b, act_type='relu')
+        res4b16_branch2c = mx.symbol.Convolution(name='res4b16_branch2c', data=res4b16_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2c = mx.symbol.BatchNorm(name='bn4b16_branch2c', data=res4b16_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2c = bn4b16_branch2c
+        res4b16 = mx.symbol.broadcast_add(name='res4b16', *[res4b15_relu, scale4b16_branch2c])
+        res4b16_relu = mx.symbol.Activation(name='res4b16_relu', data=res4b16, act_type='relu')
+        res4b17_branch2a = mx.symbol.Convolution(name='res4b17_branch2a', data=res4b16_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2a = mx.symbol.BatchNorm(name='bn4b17_branch2a', data=res4b17_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2a = bn4b17_branch2a
+        res4b17_branch2a_relu = mx.symbol.Activation(name='res4b17_branch2a_relu', data=scale4b17_branch2a, act_type='relu')
+        res4b17_branch2b = mx.symbol.Convolution(name='res4b17_branch2b', data=res4b17_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b17_branch2b = mx.symbol.BatchNorm(name='bn4b17_branch2b', data=res4b17_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2b = bn4b17_branch2b
+        res4b17_branch2b_relu = mx.symbol.Activation(name='res4b17_branch2b_relu', data=scale4b17_branch2b, act_type='relu')
+        res4b17_branch2c = mx.symbol.Convolution(name='res4b17_branch2c', data=res4b17_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2c = mx.symbol.BatchNorm(name='bn4b17_branch2c', data=res4b17_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2c = bn4b17_branch2c
+        res4b17 = mx.symbol.broadcast_add(name='res4b17', *[res4b16_relu, scale4b17_branch2c])
+        res4b17_relu = mx.symbol.Activation(name='res4b17_relu', data=res4b17, act_type='relu')
+        res4b18_branch2a = mx.symbol.Convolution(name='res4b18_branch2a', data=res4b17_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2a = mx.symbol.BatchNorm(name='bn4b18_branch2a', data=res4b18_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2a = bn4b18_branch2a
+        res4b18_branch2a_relu = mx.symbol.Activation(name='res4b18_branch2a_relu', data=scale4b18_branch2a, act_type='relu')
+        res4b18_branch2b = mx.symbol.Convolution(name='res4b18_branch2b', data=res4b18_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b18_branch2b = mx.symbol.BatchNorm(name='bn4b18_branch2b', data=res4b18_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2b = bn4b18_branch2b
+        res4b18_branch2b_relu = mx.symbol.Activation(name='res4b18_branch2b_relu', data=scale4b18_branch2b, act_type='relu')
+        res4b18_branch2c = mx.symbol.Convolution(name='res4b18_branch2c', data=res4b18_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2c = mx.symbol.BatchNorm(name='bn4b18_branch2c', data=res4b18_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2c = bn4b18_branch2c
+        res4b18 = mx.symbol.broadcast_add(name='res4b18', *[res4b17_relu, scale4b18_branch2c])
+        res4b18_relu = mx.symbol.Activation(name='res4b18_relu', data=res4b18, act_type='relu')
+        res4b19_branch2a = mx.symbol.Convolution(name='res4b19_branch2a', data=res4b18_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2a = mx.symbol.BatchNorm(name='bn4b19_branch2a', data=res4b19_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2a = bn4b19_branch2a
+        res4b19_branch2a_relu = mx.symbol.Activation(name='res4b19_branch2a_relu', data=scale4b19_branch2a, act_type='relu')
+        res4b19_branch2b = mx.symbol.Convolution(name='res4b19_branch2b', data=res4b19_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b19_branch2b = mx.symbol.BatchNorm(name='bn4b19_branch2b', data=res4b19_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2b = bn4b19_branch2b
+        res4b19_branch2b_relu = mx.symbol.Activation(name='res4b19_branch2b_relu', data=scale4b19_branch2b, act_type='relu')
+        res4b19_branch2c = mx.symbol.Convolution(name='res4b19_branch2c', data=res4b19_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2c = mx.symbol.BatchNorm(name='bn4b19_branch2c', data=res4b19_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2c = bn4b19_branch2c
+        res4b19 = mx.symbol.broadcast_add(name='res4b19', *[res4b18_relu, scale4b19_branch2c])
+        res4b19_relu = mx.symbol.Activation(name='res4b19_relu', data=res4b19, act_type='relu')
+        res4b20_branch2a = mx.symbol.Convolution(name='res4b20_branch2a', data=res4b19_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2a = mx.symbol.BatchNorm(name='bn4b20_branch2a', data=res4b20_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2a = bn4b20_branch2a
+        res4b20_branch2a_relu = mx.symbol.Activation(name='res4b20_branch2a_relu', data=scale4b20_branch2a, act_type='relu')
+        res4b20_branch2b = mx.symbol.Convolution(name='res4b20_branch2b', data=res4b20_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b20_branch2b = mx.symbol.BatchNorm(name='bn4b20_branch2b', data=res4b20_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2b = bn4b20_branch2b
+        res4b20_branch2b_relu = mx.symbol.Activation(name='res4b20_branch2b_relu', data=scale4b20_branch2b, act_type='relu')
+        res4b20_branch2c = mx.symbol.Convolution(name='res4b20_branch2c', data=res4b20_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2c = mx.symbol.BatchNorm(name='bn4b20_branch2c', data=res4b20_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2c = bn4b20_branch2c
+        res4b20 = mx.symbol.broadcast_add(name='res4b20', *[res4b19_relu, scale4b20_branch2c])
+        res4b20_relu = mx.symbol.Activation(name='res4b20_relu', data=res4b20, act_type='relu')
+        res4b21_branch2a = mx.symbol.Convolution(name='res4b21_branch2a', data=res4b20_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2a = mx.symbol.BatchNorm(name='bn4b21_branch2a', data=res4b21_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2a = bn4b21_branch2a
+        res4b21_branch2a_relu = mx.symbol.Activation(name='res4b21_branch2a_relu', data=scale4b21_branch2a, act_type='relu')
+        res4b21_branch2b = mx.symbol.Convolution(name='res4b21_branch2b', data=res4b21_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b21_branch2b = mx.symbol.BatchNorm(name='bn4b21_branch2b', data=res4b21_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2b = bn4b21_branch2b
+        res4b21_branch2b_relu = mx.symbol.Activation(name='res4b21_branch2b_relu', data=scale4b21_branch2b, act_type='relu')
+        res4b21_branch2c = mx.symbol.Convolution(name='res4b21_branch2c', data=res4b21_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2c = mx.symbol.BatchNorm(name='bn4b21_branch2c', data=res4b21_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2c = bn4b21_branch2c
+        res4b21 = mx.symbol.broadcast_add(name='res4b21', *[res4b20_relu, scale4b21_branch2c])
+        res4b21_relu = mx.symbol.Activation(name='res4b21_relu', data=res4b21, act_type='relu')
+        res4b22_branch2a = mx.symbol.Convolution(name='res4b22_branch2a', data=res4b21_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2a = mx.symbol.BatchNorm(name='bn4b22_branch2a', data=res4b22_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2a = bn4b22_branch2a
+        res4b22_branch2a_relu = mx.symbol.Activation(name='res4b22_branch2a_relu', data=scale4b22_branch2a, act_type='relu')
+        res4b22_branch2b = mx.symbol.Convolution(name='res4b22_branch2b', data=res4b22_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b22_branch2b = mx.symbol.BatchNorm(name='bn4b22_branch2b', data=res4b22_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2b = bn4b22_branch2b
+        res4b22_branch2b_relu = mx.symbol.Activation(name='res4b22_branch2b_relu', data=scale4b22_branch2b, act_type='relu')
+        res4b22_branch2c = mx.symbol.Convolution(name='res4b22_branch2c', data=res4b22_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2c = mx.symbol.BatchNorm(name='bn4b22_branch2c', data=res4b22_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2c = bn4b22_branch2c
+        res4b22 = mx.symbol.broadcast_add(name='res4b22', *[res4b21_relu, scale4b22_branch2c])
+        res4b22_relu = mx.symbol.Activation(name='res4b22_relu', data=res4b22, act_type='relu')
+        return res4b22_relu
+
+    def get_resnet_v1_conv5(self, conv_feat):
+        res5a_branch1 = mx.symbol.Convolution(name='res5a_branch1', data=conv_feat, num_filter=2048, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch1 = mx.symbol.BatchNorm(name='bn5a_branch1', data=res5a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale5a_branch1 = bn5a_branch1
+        res5a_branch2a = mx.symbol.Convolution(name='res5a_branch2a', data=conv_feat, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2a = mx.symbol.BatchNorm(name='bn5a_branch2a', data=res5a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2a = bn5a_branch2a
+        res5a_branch2a_relu = mx.symbol.Activation(name='res5a_branch2a_relu', data=scale5a_branch2a, act_type='relu')
+        res5a_branch2b = mx.symbol.Convolution(name='res5a_branch2b', data=res5a_branch2a_relu, num_filter=512, pad=(2, 2),
+                                               kernel=(3, 3), stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5a_branch2b = mx.symbol.BatchNorm(name='bn5a_branch2b', data=res5a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2b = bn5a_branch2b
+        res5a_branch2b_relu = mx.symbol.Activation(name='res5a_branch2b_relu', data=scale5a_branch2b, act_type='relu')
+        res5a_branch2c = mx.symbol.Convolution(name='res5a_branch2c', data=res5a_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2c = mx.symbol.BatchNorm(name='bn5a_branch2c', data=res5a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2c = bn5a_branch2c
+        res5a = mx.symbol.broadcast_add(name='res5a', *[scale5a_branch1, scale5a_branch2c])
+        res5a_relu = mx.symbol.Activation(name='res5a_relu', data=res5a, act_type='relu')
+        res5b_branch2a = mx.symbol.Convolution(name='res5b_branch2a', data=res5a_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2a = mx.symbol.BatchNorm(name='bn5b_branch2a', data=res5b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2a = bn5b_branch2a
+        res5b_branch2a_relu = mx.symbol.Activation(name='res5b_branch2a_relu', data=scale5b_branch2a, act_type='relu')
+        res5b_branch2b = mx.symbol.Convolution(name='res5b_branch2b', data=res5b_branch2a_relu, num_filter=512, pad=(2, 2),
+                                               kernel=(3, 3), stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5b_branch2b = mx.symbol.BatchNorm(name='bn5b_branch2b', data=res5b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2b = bn5b_branch2b
+        res5b_branch2b_relu = mx.symbol.Activation(name='res5b_branch2b_relu', data=scale5b_branch2b, act_type='relu')
+        res5b_branch2c = mx.symbol.Convolution(name='res5b_branch2c', data=res5b_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2c = mx.symbol.BatchNorm(name='bn5b_branch2c', data=res5b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2c = bn5b_branch2c
+        res5b = mx.symbol.broadcast_add(name='res5b', *[res5a_relu, scale5b_branch2c])
+        res5b_relu = mx.symbol.Activation(name='res5b_relu', data=res5b, act_type='relu')
+        res5c_branch2a = mx.symbol.Convolution(name='res5c_branch2a', data=res5b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2a = mx.symbol.BatchNorm(name='bn5c_branch2a', data=res5c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2a = bn5c_branch2a
+        res5c_branch2a_relu = mx.symbol.Activation(name='res5c_branch2a_relu', data=scale5c_branch2a, act_type='relu')
+        res5c_branch2b = mx.symbol.Convolution(name='res5c_branch2b', data=res5c_branch2a_relu, num_filter=512, pad=(2, 2),
+                                               kernel=(3, 3), stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5c_branch2b = mx.symbol.BatchNorm(name='bn5c_branch2b', data=res5c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2b = bn5c_branch2b
+        res5c_branch2b_relu = mx.symbol.Activation(name='res5c_branch2b_relu', data=scale5c_branch2b, act_type='relu')
+        res5c_branch2c = mx.symbol.Convolution(name='res5c_branch2c', data=res5c_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2c = mx.symbol.BatchNorm(name='bn5c_branch2c', data=res5c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2c = bn5c_branch2c
+        res5c = mx.symbol.broadcast_add(name='res5c', *[res5b_relu, scale5c_branch2c])
+        res5c_relu = mx.symbol.Activation(name='res5c_relu', data=res5c, act_type='relu')
+        return res5c_relu
+
+    def get_rpn(self, conv_feat, num_anchors):
+        rpn_conv = mx.sym.Convolution(
+            data=conv_feat, kernel=(3, 3), pad=(1, 1), num_filter=512, name="rpn_conv_3x3")
+        rpn_relu = mx.sym.Activation(data=rpn_conv, act_type="relu", name="rpn_relu")
+        rpn_cls_score = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=2 * num_anchors, name="rpn_cls_score")
+        rpn_bbox_pred = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=4 * num_anchors, name="rpn_bbox_pred")
+        return rpn_cls_score, rpn_bbox_pred
+
+    def get_symbol(self, cfg, is_train=True):
+
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+            gt_boxes = mx.sym.Variable(name="gt_boxes")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        # res5
+        relu1 = self.get_resnet_v1_conv5(conv_feat)
+
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                normalization='valid', use_ignore=True, ignore_label=-1, name="rpn_cls_prob")
+
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0, data=(rpn_bbox_pred - rpn_bbox_target))
+
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_, grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+
+            # ROI proposal
+            rpn_cls_act = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_act")
+            rpn_cls_act_reshape = mx.sym.Reshape(
+                data=rpn_cls_act, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_act_reshape')
+            if cfg.TRAIN.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            # ROI proposal target
+            gt_boxes_reshape = mx.sym.Reshape(data=gt_boxes, shape=(-1, 5), name='gt_boxes_reshape')
+            rois, label, bbox_target, bbox_weight = mx.sym.Custom(rois=rois, gt_boxes=gt_boxes_reshape,
+                                                                  op_type='proposal_target',
+                                                                  num_classes=num_reg_classes,
+                                                                  batch_images=cfg.TRAIN.BATCH_IMAGES,
+                                                                  batch_rois=cfg.TRAIN.BATCH_ROIS,
+                                                                  cfg=cPickle.dumps(cfg),
+                                                                  fg_fraction=cfg.TRAIN.FG_FRACTION)
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=256, name="conv_new_1")
+        conv_new_1_relu = mx.sym.Activation(data=conv_new_1, act_type='relu', name='conv_new_1_relu')
+
+        roi_pool = mx.symbol.ROIPooling(
+            name='roi_pool', data=conv_new_1_relu, rois=rois, pooled_size=(7, 7), spatial_scale=0.0625)
+
+        # two fully connected layers
+        fc_new_1 = mx.symbol.FullyConnected(name='fc_new_1', data=roi_pool, num_hidden=1024)
+        fc_new_1_relu = mx.sym.Activation(data=fc_new_1, act_type='relu', name='fc_new_1_relu')
+
+        fc_new_2 = mx.symbol.FullyConnected(name='fc_new_2', data=fc_new_1_relu, num_hidden=1024)
+        fc_new_2_relu = mx.sym.Activation(data=fc_new_2, act_type='relu', name='fc_new_2_relu')
+
+        # cls_score/bbox_pred
+        cls_score = mx.symbol.FullyConnected(name='cls_score', data=fc_new_2_relu, num_hidden=num_classes)
+        bbox_pred = mx.symbol.FullyConnected(name='bbox_pred', data=fc_new_2_relu, num_hidden=num_reg_classes * 4)
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes,
+                                                               roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem,
+                                                normalization='valid', use_ignore=True, ignore_label=-1)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                                  data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_,
+                                            grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+                rcnn_label = labels_ohem
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid')
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                            data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+                rcnn_label = label
+
+            # reshape output
+            rcnn_label = mx.sym.Reshape(data=rcnn_label, shape=(cfg.TRAIN.BATCH_IMAGES, -1), name='label_reshape')
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_loss_reshape')
+            group = mx.sym.Group([rpn_cls_prob, rpn_bbox_loss, cls_prob, bbox_loss, mx.sym.BlockGrad(rcnn_label)])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([rois, cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def get_symbol_rpn(self, cfg, is_train=True):
+        # config aliases for convenience
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                normalization='valid', use_ignore=True, ignore_label=-1,
+                                                name="rpn_cls_prob",
+                                                grad_scale=1.0)
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0,
+                                                                data=(rpn_bbox_pred - rpn_bbox_target))
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_,
+                                            grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+            group = mx.symbol.Group([rpn_cls_prob, rpn_bbox_loss])
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois, score = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    output_score=True,
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois, score = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    output_score=True,
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            group = mx.symbol.Group([rois, score])
+        self.sym = group
+        return group
+
+    def get_symbol_rcnn(self, cfg, is_train=True):
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+
+        # input init
+        if is_train:
+            data = mx.symbol.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            label = mx.symbol.Variable(name='label')
+            bbox_target = mx.symbol.Variable(name='bbox_target')
+            bbox_weight = mx.symbol.Variable(name='bbox_weight')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+            label = mx.symbol.Reshape(data=label, shape=(-1,), name='label_reshape')
+            bbox_target = mx.symbol.Reshape(data=bbox_target, shape=(-1, 4 * num_reg_classes), name='bbox_target_reshape')
+            bbox_weight = mx.symbol.Reshape(data=bbox_weight, shape=(-1, 4 * num_reg_classes), name='bbox_weight_reshape')
+        else:
+            data = mx.sym.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        # res5
+        relu1 = self.get_resnet_v1_conv5(conv_feat)
+
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=256, name="conv_new_1")
+        conv_new_1_relu = mx.sym.Activation(data=conv_new_1, act_type='relu', name='conv_new_1_relu')
+
+        roi_pool = mx.symbol.ROIPooling(
+            name='roi_pool', data=conv_new_1_relu, rois=rois, pooled_size=(7, 7), spatial_scale=0.0625)
+
+        # two fully connected layers
+        fc_new_1 = mx.symbol.FullyConnected(name='fc_new_1', data=roi_pool, num_hidden=1024)
+        fc_new_1_relu = mx.sym.Activation(data=fc_new_1, act_type='relu', name='fc_new_1_relu')
+
+        fc_new_2 = mx.symbol.FullyConnected(name='fc_new_2', data=fc_new_1_relu, num_hidden=1024)
+        fc_new_2_relu = mx.sym.Activation(data=fc_new_2, act_type='relu', name='fc_new_2_relu')
+
+        # cls_score/bbox_pred
+        cls_score = mx.symbol.FullyConnected(name='cls_score', data=fc_new_2_relu, num_hidden=num_classes)
+        bbox_pred = mx.symbol.FullyConnected(name='bbox_pred', data=fc_new_2_relu, num_hidden=num_reg_classes * 4)
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes,
+                                                               roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem,
+                                                normalization='valid', use_ignore=True, ignore_label=-1, grad_scale=1.0)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                                  data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_,
+                                            grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid',
+                                                grad_scale=1.0)
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                            data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+
+            # reshape output
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_loss_reshape')
+            group = mx.sym.Group([cls_prob, bbox_loss])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def init_weight_rcnn(self, cfg, arg_params, aux_params):
+        arg_params['conv_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['conv_new_1_weight'])
+        arg_params['conv_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['conv_new_1_bias'])
+        arg_params['fc_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc_new_1_weight'])
+        arg_params['fc_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc_new_1_bias'])
+        arg_params['fc_new_2_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc_new_2_weight'])
+        arg_params['fc_new_2_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc_new_2_bias'])
+        arg_params['cls_score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['cls_score_weight'])
+        arg_params['cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['cls_score_bias'])
+        arg_params['bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['bbox_pred_weight'])
+        arg_params['bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['bbox_pred_bias'])
+
+    def init_weight_rpn(self, cfg, arg_params, aux_params):
+        arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_conv_3x3_weight'])
+        arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_conv_3x3_bias'])
+        arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01,
+                                                              shape=self.arg_shape_dict['rpn_cls_score_weight'])
+        arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_cls_score_bias'])
+        arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01,
+                                                              shape=self.arg_shape_dict['rpn_bbox_pred_weight'])
+        arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_bbox_pred_bias'])
+
+    def init_weight(self, cfg, arg_params, aux_params):
+        self.init_weight_rpn(cfg, arg_params, aux_params)
+        self.init_weight_rcnn(cfg, arg_params, aux_params)
+
diff --git a/faster_rcnn/symbols/resnet_v1_101_rcnn_dcn.py b/faster_rcnn/symbols/resnet_v1_101_rcnn_dcn.py
new file mode 100644
index 0000000..376a359
--- /dev/null
+++ b/faster_rcnn/symbols/resnet_v1_101_rcnn_dcn.py
@@ -0,0 +1,1127 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Guodong Zhang
+# --------------------------------------------------------
+
+import cPickle
+import mxnet as mx
+from utils.symbol import Symbol
+from operator_py.proposal import *
+from operator_py.proposal_target import *
+from operator_py.box_annotator_ohem import *
+
+
+class resnet_v1_101_rcnn_dcn(Symbol):
+    def __init__(self):
+        """
+        Define the parameters the network needs.
+        """
+        self.eps = 1e-5
+        self.use_global_stats = True
+        self.workspace = 512
+        self.units = (3, 4, 23, 3)  # residual units per stage for ResNet-101
+        self.filter_list = [256, 512, 1024, 2048]
+
+    def get_resnet_v1_conv4(self, data):
+        conv1 = mx.symbol.Convolution(name='conv1', data=data, num_filter=64, pad=(3, 3), kernel=(7, 7), stride=(2, 2),
+                                      no_bias=True)
+        bn_conv1 = mx.symbol.BatchNorm(name='bn_conv1', data=conv1, use_global_stats=True, fix_gamma=False,
+                                       eps=self.eps)
+        scale_conv1 = bn_conv1
+        conv1_relu = mx.symbol.Activation(name='conv1_relu', data=scale_conv1, act_type='relu')
+        pool1 = mx.symbol.Pooling(name='pool1', data=conv1_relu, pooling_convention='full', pad=(0, 0), kernel=(3, 3),
+                                  stride=(2, 2), pool_type='max')
+        res2a_branch1 = mx.symbol.Convolution(name='res2a_branch1', data=pool1, num_filter=256, pad=(0, 0),
+                                              kernel=(1, 1),
+                                              stride=(1, 1), no_bias=True)
+        bn2a_branch1 = mx.symbol.BatchNorm(name='bn2a_branch1', data=res2a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps=self.eps)
+        scale2a_branch1 = bn2a_branch1
+        res2a_branch2a = mx.symbol.Convolution(name='res2a_branch2a', data=pool1, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1),
+                                               stride=(1, 1), no_bias=True)
+        bn2a_branch2a = mx.symbol.BatchNorm(name='bn2a_branch2a', data=res2a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2a = bn2a_branch2a
+        res2a_branch2a_relu = mx.symbol.Activation(name='res2a_branch2a_relu', data=scale2a_branch2a, act_type='relu')
+        res2a_branch2b = mx.symbol.Convolution(name='res2a_branch2b', data=res2a_branch2a_relu, num_filter=64,
+                                               pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2a_branch2b = mx.symbol.BatchNorm(name='bn2a_branch2b', data=res2a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2b = bn2a_branch2b
+        res2a_branch2b_relu = mx.symbol.Activation(name='res2a_branch2b_relu', data=scale2a_branch2b, act_type='relu')
+        res2a_branch2c = mx.symbol.Convolution(name='res2a_branch2c', data=res2a_branch2b_relu, num_filter=256,
+                                               pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2c = mx.symbol.BatchNorm(name='bn2a_branch2c', data=res2a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2c = bn2a_branch2c
+        res2a = mx.symbol.broadcast_add(name='res2a', *[scale2a_branch1, scale2a_branch2c])
+        res2a_relu = mx.symbol.Activation(name='res2a_relu', data=res2a, act_type='relu')
+        res2b_branch2a = mx.symbol.Convolution(name='res2b_branch2a', data=res2a_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2a = mx.symbol.BatchNorm(name='bn2b_branch2a', data=res2b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2a = bn2b_branch2a
+        res2b_branch2a_relu = mx.symbol.Activation(name='res2b_branch2a_relu', data=scale2b_branch2a, act_type='relu')
+        res2b_branch2b = mx.symbol.Convolution(name='res2b_branch2b', data=res2b_branch2a_relu, num_filter=64,
+                                               pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2b_branch2b = mx.symbol.BatchNorm(name='bn2b_branch2b', data=res2b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2b = bn2b_branch2b
+        res2b_branch2b_relu = mx.symbol.Activation(name='res2b_branch2b_relu', data=scale2b_branch2b, act_type='relu')
+        res2b_branch2c = mx.symbol.Convolution(name='res2b_branch2c', data=res2b_branch2b_relu, num_filter=256,
+                                               pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2c = mx.symbol.BatchNorm(name='bn2b_branch2c', data=res2b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2c = bn2b_branch2c
+        res2b = mx.symbol.broadcast_add(name='res2b', *[res2a_relu, scale2b_branch2c])
+        res2b_relu = mx.symbol.Activation(name='res2b_relu', data=res2b, act_type='relu')
+        res2c_branch2a = mx.symbol.Convolution(name='res2c_branch2a', data=res2b_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2a = mx.symbol.BatchNorm(name='bn2c_branch2a', data=res2c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2a = bn2c_branch2a
+        res2c_branch2a_relu = mx.symbol.Activation(name='res2c_branch2a_relu', data=scale2c_branch2a, act_type='relu')
+        res2c_branch2b = mx.symbol.Convolution(name='res2c_branch2b', data=res2c_branch2a_relu, num_filter=64,
+                                               pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2c_branch2b = mx.symbol.BatchNorm(name='bn2c_branch2b', data=res2c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2b = bn2c_branch2b
+        res2c_branch2b_relu = mx.symbol.Activation(name='res2c_branch2b_relu', data=scale2c_branch2b, act_type='relu')
+        res2c_branch2c = mx.symbol.Convolution(name='res2c_branch2c', data=res2c_branch2b_relu, num_filter=256,
+                                               pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2c = mx.symbol.BatchNorm(name='bn2c_branch2c', data=res2c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2c = bn2c_branch2c
+        res2c = mx.symbol.broadcast_add(name='res2c', *[res2b_relu, scale2c_branch2c])
+        res2c_relu = mx.symbol.Activation(name='res2c_relu', data=res2c, act_type='relu')
+        res3a_branch1 = mx.symbol.Convolution(name='res3a_branch1', data=res2c_relu, num_filter=512, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch1 = mx.symbol.BatchNorm(name='bn3a_branch1', data=res3a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps=self.eps)
+        scale3a_branch1 = bn3a_branch1
+        res3a_branch2a = mx.symbol.Convolution(name='res3a_branch2a', data=res2c_relu, num_filter=128, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch2a = mx.symbol.BatchNorm(name='bn3a_branch2a', data=res3a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2a = bn3a_branch2a
+        res3a_branch2a_relu = mx.symbol.Activation(name='res3a_branch2a_relu', data=scale3a_branch2a, act_type='relu')
+        res3a_branch2b = mx.symbol.Convolution(name='res3a_branch2b', data=res3a_branch2a_relu, num_filter=128,
+                                               pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3a_branch2b = mx.symbol.BatchNorm(name='bn3a_branch2b', data=res3a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2b = bn3a_branch2b
+        res3a_branch2b_relu = mx.symbol.Activation(name='res3a_branch2b_relu', data=scale3a_branch2b, act_type='relu')
+        res3a_branch2c = mx.symbol.Convolution(name='res3a_branch2c', data=res3a_branch2b_relu, num_filter=512,
+                                               pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3a_branch2c = mx.symbol.BatchNorm(name='bn3a_branch2c', data=res3a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2c = bn3a_branch2c
+        res3a = mx.symbol.broadcast_add(name='res3a', *[scale3a_branch1, scale3a_branch2c])
+        res3a_relu = mx.symbol.Activation(name='res3a_relu', data=res3a, act_type='relu')
+        res3b1_branch2a = mx.symbol.Convolution(name='res3b1_branch2a', data=res3a_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2a = mx.symbol.BatchNorm(name='bn3b1_branch2a', data=res3b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2a = bn3b1_branch2a
+        res3b1_branch2a_relu = mx.symbol.Activation(name='res3b1_branch2a_relu', data=scale3b1_branch2a,
+                                                    act_type='relu')
+        res3b1_branch2b = mx.symbol.Convolution(name='res3b1_branch2b', data=res3b1_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b1_branch2b = mx.symbol.BatchNorm(name='bn3b1_branch2b', data=res3b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2b = bn3b1_branch2b
+        res3b1_branch2b_relu = mx.symbol.Activation(name='res3b1_branch2b_relu', data=scale3b1_branch2b,
+                                                    act_type='relu')
+        res3b1_branch2c = mx.symbol.Convolution(name='res3b1_branch2c', data=res3b1_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2c = mx.symbol.BatchNorm(name='bn3b1_branch2c', data=res3b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2c = bn3b1_branch2c
+        res3b1 = mx.symbol.broadcast_add(name='res3b1', *[res3a_relu, scale3b1_branch2c])
+        res3b1_relu = mx.symbol.Activation(name='res3b1_relu', data=res3b1, act_type='relu')
+        res3b2_branch2a = mx.symbol.Convolution(name='res3b2_branch2a', data=res3b1_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2a = mx.symbol.BatchNorm(name='bn3b2_branch2a', data=res3b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2a = bn3b2_branch2a
+        res3b2_branch2a_relu = mx.symbol.Activation(name='res3b2_branch2a_relu', data=scale3b2_branch2a,
+                                                    act_type='relu')
+        res3b2_branch2b = mx.symbol.Convolution(name='res3b2_branch2b', data=res3b2_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b2_branch2b = mx.symbol.BatchNorm(name='bn3b2_branch2b', data=res3b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2b = bn3b2_branch2b
+        res3b2_branch2b_relu = mx.symbol.Activation(name='res3b2_branch2b_relu', data=scale3b2_branch2b,
+                                                    act_type='relu')
+        res3b2_branch2c = mx.symbol.Convolution(name='res3b2_branch2c', data=res3b2_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2c = mx.symbol.BatchNorm(name='bn3b2_branch2c', data=res3b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2c = bn3b2_branch2c
+        res3b2 = mx.symbol.broadcast_add(name='res3b2', *[res3b1_relu, scale3b2_branch2c])
+        res3b2_relu = mx.symbol.Activation(name='res3b2_relu', data=res3b2, act_type='relu')
+        res3b3_branch2a = mx.symbol.Convolution(name='res3b3_branch2a', data=res3b2_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2a = mx.symbol.BatchNorm(name='bn3b3_branch2a', data=res3b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2a = bn3b3_branch2a
+        res3b3_branch2a_relu = mx.symbol.Activation(name='res3b3_branch2a_relu', data=scale3b3_branch2a,
+                                                    act_type='relu')
+        res3b3_branch2b = mx.symbol.Convolution(name='res3b3_branch2b', data=res3b3_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b3_branch2b = mx.symbol.BatchNorm(name='bn3b3_branch2b', data=res3b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2b = bn3b3_branch2b
+        res3b3_branch2b_relu = mx.symbol.Activation(name='res3b3_branch2b_relu', data=scale3b3_branch2b,
+                                                    act_type='relu')
+        res3b3_branch2c = mx.symbol.Convolution(name='res3b3_branch2c', data=res3b3_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2c = mx.symbol.BatchNorm(name='bn3b3_branch2c', data=res3b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2c = bn3b3_branch2c
+        res3b3 = mx.symbol.broadcast_add(name='res3b3', *[res3b2_relu, scale3b3_branch2c])
+        res3b3_relu = mx.symbol.Activation(name='res3b3_relu', data=res3b3, act_type='relu')
+        res4a_branch1 = mx.symbol.Convolution(name='res4a_branch1', data=res3b3_relu, num_filter=1024, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch1 = mx.symbol.BatchNorm(name='bn4a_branch1', data=res4a_branch1, use_global_stats=True,
+                                           fix_gamma=False, eps=self.eps)
+        scale4a_branch1 = bn4a_branch1
+        res4a_branch2a = mx.symbol.Convolution(name='res4a_branch2a', data=res3b3_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch2a = mx.symbol.BatchNorm(name='bn4a_branch2a', data=res4a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2a = bn4a_branch2a
+        res4a_branch2a_relu = mx.symbol.Activation(name='res4a_branch2a_relu', data=scale4a_branch2a, act_type='relu')
+        res4a_branch2b = mx.symbol.Convolution(name='res4a_branch2b', data=res4a_branch2a_relu, num_filter=256,
+                                               pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4a_branch2b = mx.symbol.BatchNorm(name='bn4a_branch2b', data=res4a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2b = bn4a_branch2b
+        res4a_branch2b_relu = mx.symbol.Activation(name='res4a_branch2b_relu', data=scale4a_branch2b, act_type='relu')
+        res4a_branch2c = mx.symbol.Convolution(name='res4a_branch2c', data=res4a_branch2b_relu, num_filter=1024,
+                                               pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4a_branch2c = mx.symbol.BatchNorm(name='bn4a_branch2c', data=res4a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2c = bn4a_branch2c
+        res4a = mx.symbol.broadcast_add(name='res4a', *[scale4a_branch1, scale4a_branch2c])
+        res4a_relu = mx.symbol.Activation(name='res4a_relu', data=res4a, act_type='relu')
+        res4b1_branch2a = mx.symbol.Convolution(name='res4b1_branch2a', data=res4a_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2a = mx.symbol.BatchNorm(name='bn4b1_branch2a', data=res4b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2a = bn4b1_branch2a
+        res4b1_branch2a_relu = mx.symbol.Activation(name='res4b1_branch2a_relu', data=scale4b1_branch2a,
+                                                    act_type='relu')
+        res4b1_branch2b = mx.symbol.Convolution(name='res4b1_branch2b', data=res4b1_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b1_branch2b = mx.symbol.BatchNorm(name='bn4b1_branch2b', data=res4b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2b = bn4b1_branch2b
+        res4b1_branch2b_relu = mx.symbol.Activation(name='res4b1_branch2b_relu', data=scale4b1_branch2b,
+                                                    act_type='relu')
+        res4b1_branch2c = mx.symbol.Convolution(name='res4b1_branch2c', data=res4b1_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2c = mx.symbol.BatchNorm(name='bn4b1_branch2c', data=res4b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2c = bn4b1_branch2c
+        res4b1 = mx.symbol.broadcast_add(name='res4b1', *[res4a_relu, scale4b1_branch2c])
+        res4b1_relu = mx.symbol.Activation(name='res4b1_relu', data=res4b1, act_type='relu')
+        res4b2_branch2a = mx.symbol.Convolution(name='res4b2_branch2a', data=res4b1_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2a = mx.symbol.BatchNorm(name='bn4b2_branch2a', data=res4b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2a = bn4b2_branch2a
+        res4b2_branch2a_relu = mx.symbol.Activation(name='res4b2_branch2a_relu', data=scale4b2_branch2a,
+                                                    act_type='relu')
+        res4b2_branch2b = mx.symbol.Convolution(name='res4b2_branch2b', data=res4b2_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b2_branch2b = mx.symbol.BatchNorm(name='bn4b2_branch2b', data=res4b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2b = bn4b2_branch2b
+        res4b2_branch2b_relu = mx.symbol.Activation(name='res4b2_branch2b_relu', data=scale4b2_branch2b,
+                                                    act_type='relu')
+        res4b2_branch2c = mx.symbol.Convolution(name='res4b2_branch2c', data=res4b2_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2c = mx.symbol.BatchNorm(name='bn4b2_branch2c', data=res4b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2c = bn4b2_branch2c
+        res4b2 = mx.symbol.broadcast_add(name='res4b2', *[res4b1_relu, scale4b2_branch2c])
+        res4b2_relu = mx.symbol.Activation(name='res4b2_relu', data=res4b2, act_type='relu')
+        res4b3_branch2a = mx.symbol.Convolution(name='res4b3_branch2a', data=res4b2_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2a = mx.symbol.BatchNorm(name='bn4b3_branch2a', data=res4b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2a = bn4b3_branch2a
+        res4b3_branch2a_relu = mx.symbol.Activation(name='res4b3_branch2a_relu', data=scale4b3_branch2a,
+                                                    act_type='relu')
+        res4b3_branch2b = mx.symbol.Convolution(name='res4b3_branch2b', data=res4b3_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b3_branch2b = mx.symbol.BatchNorm(name='bn4b3_branch2b', data=res4b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2b = bn4b3_branch2b
+        res4b3_branch2b_relu = mx.symbol.Activation(name='res4b3_branch2b_relu', data=scale4b3_branch2b,
+                                                    act_type='relu')
+        res4b3_branch2c = mx.symbol.Convolution(name='res4b3_branch2c', data=res4b3_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2c = mx.symbol.BatchNorm(name='bn4b3_branch2c', data=res4b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2c = bn4b3_branch2c
+        res4b3 = mx.symbol.broadcast_add(name='res4b3', *[res4b2_relu, scale4b3_branch2c])
+        res4b3_relu = mx.symbol.Activation(name='res4b3_relu', data=res4b3, act_type='relu')
+        res4b4_branch2a = mx.symbol.Convolution(name='res4b4_branch2a', data=res4b3_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2a = mx.symbol.BatchNorm(name='bn4b4_branch2a', data=res4b4_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2a = bn4b4_branch2a
+        res4b4_branch2a_relu = mx.symbol.Activation(name='res4b4_branch2a_relu', data=scale4b4_branch2a,
+                                                    act_type='relu')
+        res4b4_branch2b = mx.symbol.Convolution(name='res4b4_branch2b', data=res4b4_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b4_branch2b = mx.symbol.BatchNorm(name='bn4b4_branch2b', data=res4b4_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2b = bn4b4_branch2b
+        res4b4_branch2b_relu = mx.symbol.Activation(name='res4b4_branch2b_relu', data=scale4b4_branch2b,
+                                                    act_type='relu')
+        res4b4_branch2c = mx.symbol.Convolution(name='res4b4_branch2c', data=res4b4_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2c = mx.symbol.BatchNorm(name='bn4b4_branch2c', data=res4b4_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2c = bn4b4_branch2c
+        res4b4 = mx.symbol.broadcast_add(name='res4b4', *[res4b3_relu, scale4b4_branch2c])
+        res4b4_relu = mx.symbol.Activation(name='res4b4_relu', data=res4b4, act_type='relu')
+        res4b5_branch2a = mx.symbol.Convolution(name='res4b5_branch2a', data=res4b4_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2a = mx.symbol.BatchNorm(name='bn4b5_branch2a', data=res4b5_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2a = bn4b5_branch2a
+        res4b5_branch2a_relu = mx.symbol.Activation(name='res4b5_branch2a_relu', data=scale4b5_branch2a,
+                                                    act_type='relu')
+        res4b5_branch2b = mx.symbol.Convolution(name='res4b5_branch2b', data=res4b5_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b5_branch2b = mx.symbol.BatchNorm(name='bn4b5_branch2b', data=res4b5_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2b = bn4b5_branch2b
+        res4b5_branch2b_relu = mx.symbol.Activation(name='res4b5_branch2b_relu', data=scale4b5_branch2b,
+                                                    act_type='relu')
+        res4b5_branch2c = mx.symbol.Convolution(name='res4b5_branch2c', data=res4b5_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2c = mx.symbol.BatchNorm(name='bn4b5_branch2c', data=res4b5_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2c = bn4b5_branch2c
+        res4b5 = mx.symbol.broadcast_add(name='res4b5', *[res4b4_relu, scale4b5_branch2c])
+        res4b5_relu = mx.symbol.Activation(name='res4b5_relu', data=res4b5, act_type='relu')
+        res4b6_branch2a = mx.symbol.Convolution(name='res4b6_branch2a', data=res4b5_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2a = mx.symbol.BatchNorm(name='bn4b6_branch2a', data=res4b6_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2a = bn4b6_branch2a
+        res4b6_branch2a_relu = mx.symbol.Activation(name='res4b6_branch2a_relu', data=scale4b6_branch2a,
+                                                    act_type='relu')
+        res4b6_branch2b = mx.symbol.Convolution(name='res4b6_branch2b', data=res4b6_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b6_branch2b = mx.symbol.BatchNorm(name='bn4b6_branch2b', data=res4b6_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2b = bn4b6_branch2b
+        res4b6_branch2b_relu = mx.symbol.Activation(name='res4b6_branch2b_relu', data=scale4b6_branch2b,
+                                                    act_type='relu')
+        res4b6_branch2c = mx.symbol.Convolution(name='res4b6_branch2c', data=res4b6_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2c = mx.symbol.BatchNorm(name='bn4b6_branch2c', data=res4b6_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2c = bn4b6_branch2c
+        res4b6 = mx.symbol.broadcast_add(name='res4b6', *[res4b5_relu, scale4b6_branch2c])
+        res4b6_relu = mx.symbol.Activation(name='res4b6_relu', data=res4b6, act_type='relu')
+        res4b7_branch2a = mx.symbol.Convolution(name='res4b7_branch2a', data=res4b6_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2a = mx.symbol.BatchNorm(name='bn4b7_branch2a', data=res4b7_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2a = bn4b7_branch2a
+        res4b7_branch2a_relu = mx.symbol.Activation(name='res4b7_branch2a_relu', data=scale4b7_branch2a,
+                                                    act_type='relu')
+        res4b7_branch2b = mx.symbol.Convolution(name='res4b7_branch2b', data=res4b7_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b7_branch2b = mx.symbol.BatchNorm(name='bn4b7_branch2b', data=res4b7_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2b = bn4b7_branch2b
+        res4b7_branch2b_relu = mx.symbol.Activation(name='res4b7_branch2b_relu', data=scale4b7_branch2b,
+                                                    act_type='relu')
+        res4b7_branch2c = mx.symbol.Convolution(name='res4b7_branch2c', data=res4b7_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2c = mx.symbol.BatchNorm(name='bn4b7_branch2c', data=res4b7_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2c = bn4b7_branch2c
+        res4b7 = mx.symbol.broadcast_add(name='res4b7', *[res4b6_relu, scale4b7_branch2c])
+        res4b7_relu = mx.symbol.Activation(name='res4b7_relu', data=res4b7, act_type='relu')
+        res4b8_branch2a = mx.symbol.Convolution(name='res4b8_branch2a', data=res4b7_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2a = mx.symbol.BatchNorm(name='bn4b8_branch2a', data=res4b8_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2a = bn4b8_branch2a
+        res4b8_branch2a_relu = mx.symbol.Activation(name='res4b8_branch2a_relu', data=scale4b8_branch2a,
+                                                    act_type='relu')
+        res4b8_branch2b = mx.symbol.Convolution(name='res4b8_branch2b', data=res4b8_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b8_branch2b = mx.symbol.BatchNorm(name='bn4b8_branch2b', data=res4b8_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2b = bn4b8_branch2b
+        res4b8_branch2b_relu = mx.symbol.Activation(name='res4b8_branch2b_relu', data=scale4b8_branch2b,
+                                                    act_type='relu')
+        res4b8_branch2c = mx.symbol.Convolution(name='res4b8_branch2c', data=res4b8_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2c = mx.symbol.BatchNorm(name='bn4b8_branch2c', data=res4b8_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2c = bn4b8_branch2c
+        res4b8 = mx.symbol.broadcast_add(name='res4b8', *[res4b7_relu, scale4b8_branch2c])
+        res4b8_relu = mx.symbol.Activation(name='res4b8_relu', data=res4b8, act_type='relu')
+        res4b9_branch2a = mx.symbol.Convolution(name='res4b9_branch2a', data=res4b8_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2a = mx.symbol.BatchNorm(name='bn4b9_branch2a', data=res4b9_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2a = bn4b9_branch2a
+        res4b9_branch2a_relu = mx.symbol.Activation(name='res4b9_branch2a_relu', data=scale4b9_branch2a,
+                                                    act_type='relu')
+        res4b9_branch2b = mx.symbol.Convolution(name='res4b9_branch2b', data=res4b9_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b9_branch2b = mx.symbol.BatchNorm(name='bn4b9_branch2b', data=res4b9_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2b = bn4b9_branch2b
+        res4b9_branch2b_relu = mx.symbol.Activation(name='res4b9_branch2b_relu', data=scale4b9_branch2b,
+                                                    act_type='relu')
+        res4b9_branch2c = mx.symbol.Convolution(name='res4b9_branch2c', data=res4b9_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2c = mx.symbol.BatchNorm(name='bn4b9_branch2c', data=res4b9_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2c = bn4b9_branch2c
+        res4b9 = mx.symbol.broadcast_add(name='res4b9', *[res4b8_relu, scale4b9_branch2c])
+        res4b9_relu = mx.symbol.Activation(name='res4b9_relu', data=res4b9, act_type='relu')
+        res4b10_branch2a = mx.symbol.Convolution(name='res4b10_branch2a', data=res4b9_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2a = mx.symbol.BatchNorm(name='bn4b10_branch2a', data=res4b10_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2a = bn4b10_branch2a
+        res4b10_branch2a_relu = mx.symbol.Activation(name='res4b10_branch2a_relu', data=scale4b10_branch2a,
+                                                     act_type='relu')
+        res4b10_branch2b = mx.symbol.Convolution(name='res4b10_branch2b', data=res4b10_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b10_branch2b = mx.symbol.BatchNorm(name='bn4b10_branch2b', data=res4b10_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2b = bn4b10_branch2b
+        res4b10_branch2b_relu = mx.symbol.Activation(name='res4b10_branch2b_relu', data=scale4b10_branch2b,
+                                                     act_type='relu')
+        res4b10_branch2c = mx.symbol.Convolution(name='res4b10_branch2c', data=res4b10_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2c = mx.symbol.BatchNorm(name='bn4b10_branch2c', data=res4b10_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2c = bn4b10_branch2c
+        res4b10 = mx.symbol.broadcast_add(name='res4b10', *[res4b9_relu, scale4b10_branch2c])
+        res4b10_relu = mx.symbol.Activation(name='res4b10_relu', data=res4b10, act_type='relu')
+        res4b11_branch2a = mx.symbol.Convolution(name='res4b11_branch2a', data=res4b10_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2a = mx.symbol.BatchNorm(name='bn4b11_branch2a', data=res4b11_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2a = bn4b11_branch2a
+        res4b11_branch2a_relu = mx.symbol.Activation(name='res4b11_branch2a_relu', data=scale4b11_branch2a,
+                                                     act_type='relu')
+        res4b11_branch2b = mx.symbol.Convolution(name='res4b11_branch2b', data=res4b11_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b11_branch2b = mx.symbol.BatchNorm(name='bn4b11_branch2b', data=res4b11_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2b = bn4b11_branch2b
+        res4b11_branch2b_relu = mx.symbol.Activation(name='res4b11_branch2b_relu', data=scale4b11_branch2b,
+                                                     act_type='relu')
+        res4b11_branch2c = mx.symbol.Convolution(name='res4b11_branch2c', data=res4b11_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2c = mx.symbol.BatchNorm(name='bn4b11_branch2c', data=res4b11_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2c = bn4b11_branch2c
+        res4b11 = mx.symbol.broadcast_add(name='res4b11', *[res4b10_relu, scale4b11_branch2c])
+        res4b11_relu = mx.symbol.Activation(name='res4b11_relu', data=res4b11, act_type='relu')
+        res4b12_branch2a = mx.symbol.Convolution(name='res4b12_branch2a', data=res4b11_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2a = mx.symbol.BatchNorm(name='bn4b12_branch2a', data=res4b12_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2a = bn4b12_branch2a
+        res4b12_branch2a_relu = mx.symbol.Activation(name='res4b12_branch2a_relu', data=scale4b12_branch2a,
+                                                     act_type='relu')
+        res4b12_branch2b = mx.symbol.Convolution(name='res4b12_branch2b', data=res4b12_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b12_branch2b = mx.symbol.BatchNorm(name='bn4b12_branch2b', data=res4b12_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2b = bn4b12_branch2b
+        res4b12_branch2b_relu = mx.symbol.Activation(name='res4b12_branch2b_relu', data=scale4b12_branch2b,
+                                                     act_type='relu')
+        res4b12_branch2c = mx.symbol.Convolution(name='res4b12_branch2c', data=res4b12_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2c = mx.symbol.BatchNorm(name='bn4b12_branch2c', data=res4b12_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2c = bn4b12_branch2c
+        res4b12 = mx.symbol.broadcast_add(name='res4b12', *[res4b11_relu, scale4b12_branch2c])
+        res4b12_relu = mx.symbol.Activation(name='res4b12_relu', data=res4b12, act_type='relu')
+        res4b13_branch2a = mx.symbol.Convolution(name='res4b13_branch2a', data=res4b12_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2a = mx.symbol.BatchNorm(name='bn4b13_branch2a', data=res4b13_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2a = bn4b13_branch2a
+        res4b13_branch2a_relu = mx.symbol.Activation(name='res4b13_branch2a_relu', data=scale4b13_branch2a,
+                                                     act_type='relu')
+        res4b13_branch2b = mx.symbol.Convolution(name='res4b13_branch2b', data=res4b13_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b13_branch2b = mx.symbol.BatchNorm(name='bn4b13_branch2b', data=res4b13_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2b = bn4b13_branch2b
+        res4b13_branch2b_relu = mx.symbol.Activation(name='res4b13_branch2b_relu', data=scale4b13_branch2b,
+                                                     act_type='relu')
+        res4b13_branch2c = mx.symbol.Convolution(name='res4b13_branch2c', data=res4b13_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2c = mx.symbol.BatchNorm(name='bn4b13_branch2c', data=res4b13_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2c = bn4b13_branch2c
+        res4b13 = mx.symbol.broadcast_add(name='res4b13', *[res4b12_relu, scale4b13_branch2c])
+        res4b13_relu = mx.symbol.Activation(name='res4b13_relu', data=res4b13, act_type='relu')
+        res4b14_branch2a = mx.symbol.Convolution(name='res4b14_branch2a', data=res4b13_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2a = mx.symbol.BatchNorm(name='bn4b14_branch2a', data=res4b14_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2a = bn4b14_branch2a
+        res4b14_branch2a_relu = mx.symbol.Activation(name='res4b14_branch2a_relu', data=scale4b14_branch2a,
+                                                     act_type='relu')
+        res4b14_branch2b = mx.symbol.Convolution(name='res4b14_branch2b', data=res4b14_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b14_branch2b = mx.symbol.BatchNorm(name='bn4b14_branch2b', data=res4b14_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2b = bn4b14_branch2b
+        res4b14_branch2b_relu = mx.symbol.Activation(name='res4b14_branch2b_relu', data=scale4b14_branch2b,
+                                                     act_type='relu')
+        res4b14_branch2c = mx.symbol.Convolution(name='res4b14_branch2c', data=res4b14_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2c = mx.symbol.BatchNorm(name='bn4b14_branch2c', data=res4b14_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2c = bn4b14_branch2c
+        res4b14 = mx.symbol.broadcast_add(name='res4b14', *[res4b13_relu, scale4b14_branch2c])
+        res4b14_relu = mx.symbol.Activation(name='res4b14_relu', data=res4b14, act_type='relu')
+        res4b15_branch2a = mx.symbol.Convolution(name='res4b15_branch2a', data=res4b14_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2a = mx.symbol.BatchNorm(name='bn4b15_branch2a', data=res4b15_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2a = bn4b15_branch2a
+        res4b15_branch2a_relu = mx.symbol.Activation(name='res4b15_branch2a_relu', data=scale4b15_branch2a,
+                                                     act_type='relu')
+        res4b15_branch2b = mx.symbol.Convolution(name='res4b15_branch2b', data=res4b15_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b15_branch2b = mx.symbol.BatchNorm(name='bn4b15_branch2b', data=res4b15_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2b = bn4b15_branch2b
+        res4b15_branch2b_relu = mx.symbol.Activation(name='res4b15_branch2b_relu', data=scale4b15_branch2b,
+                                                     act_type='relu')
+        res4b15_branch2c = mx.symbol.Convolution(name='res4b15_branch2c', data=res4b15_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2c = mx.symbol.BatchNorm(name='bn4b15_branch2c', data=res4b15_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2c = bn4b15_branch2c
+        res4b15 = mx.symbol.broadcast_add(name='res4b15', *[res4b14_relu, scale4b15_branch2c])
+        res4b15_relu = mx.symbol.Activation(name='res4b15_relu', data=res4b15, act_type='relu')
+        res4b16_branch2a = mx.symbol.Convolution(name='res4b16_branch2a', data=res4b15_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2a = mx.symbol.BatchNorm(name='bn4b16_branch2a', data=res4b16_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2a = bn4b16_branch2a
+        res4b16_branch2a_relu = mx.symbol.Activation(name='res4b16_branch2a_relu', data=scale4b16_branch2a,
+                                                     act_type='relu')
+        res4b16_branch2b = mx.symbol.Convolution(name='res4b16_branch2b', data=res4b16_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b16_branch2b = mx.symbol.BatchNorm(name='bn4b16_branch2b', data=res4b16_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2b = bn4b16_branch2b
+        res4b16_branch2b_relu = mx.symbol.Activation(name='res4b16_branch2b_relu', data=scale4b16_branch2b,
+                                                     act_type='relu')
+        res4b16_branch2c = mx.symbol.Convolution(name='res4b16_branch2c', data=res4b16_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2c = mx.symbol.BatchNorm(name='bn4b16_branch2c', data=res4b16_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2c = bn4b16_branch2c
+        res4b16 = mx.symbol.broadcast_add(name='res4b16', *[res4b15_relu, scale4b16_branch2c])
+        res4b16_relu = mx.symbol.Activation(name='res4b16_relu', data=res4b16, act_type='relu')
+        res4b17_branch2a = mx.symbol.Convolution(name='res4b17_branch2a', data=res4b16_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2a = mx.symbol.BatchNorm(name='bn4b17_branch2a', data=res4b17_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2a = bn4b17_branch2a
+        res4b17_branch2a_relu = mx.symbol.Activation(name='res4b17_branch2a_relu', data=scale4b17_branch2a,
+                                                     act_type='relu')
+        res4b17_branch2b = mx.symbol.Convolution(name='res4b17_branch2b', data=res4b17_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b17_branch2b = mx.symbol.BatchNorm(name='bn4b17_branch2b', data=res4b17_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2b = bn4b17_branch2b
+        res4b17_branch2b_relu = mx.symbol.Activation(name='res4b17_branch2b_relu', data=scale4b17_branch2b,
+                                                     act_type='relu')
+        res4b17_branch2c = mx.symbol.Convolution(name='res4b17_branch2c', data=res4b17_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2c = mx.symbol.BatchNorm(name='bn4b17_branch2c', data=res4b17_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2c = bn4b17_branch2c
+        res4b17 = mx.symbol.broadcast_add(name='res4b17', *[res4b16_relu, scale4b17_branch2c])
+        res4b17_relu = mx.symbol.Activation(name='res4b17_relu', data=res4b17, act_type='relu')
+        res4b18_branch2a = mx.symbol.Convolution(name='res4b18_branch2a', data=res4b17_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2a = mx.symbol.BatchNorm(name='bn4b18_branch2a', data=res4b18_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2a = bn4b18_branch2a
+        res4b18_branch2a_relu = mx.symbol.Activation(name='res4b18_branch2a_relu', data=scale4b18_branch2a,
+                                                     act_type='relu')
+        res4b18_branch2b = mx.symbol.Convolution(name='res4b18_branch2b', data=res4b18_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b18_branch2b = mx.symbol.BatchNorm(name='bn4b18_branch2b', data=res4b18_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2b = bn4b18_branch2b
+        res4b18_branch2b_relu = mx.symbol.Activation(name='res4b18_branch2b_relu', data=scale4b18_branch2b,
+                                                     act_type='relu')
+        res4b18_branch2c = mx.symbol.Convolution(name='res4b18_branch2c', data=res4b18_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2c = mx.symbol.BatchNorm(name='bn4b18_branch2c', data=res4b18_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2c = bn4b18_branch2c
+        res4b18 = mx.symbol.broadcast_add(name='res4b18', *[res4b17_relu, scale4b18_branch2c])
+        res4b18_relu = mx.symbol.Activation(name='res4b18_relu', data=res4b18, act_type='relu')
+        res4b19_branch2a = mx.symbol.Convolution(name='res4b19_branch2a', data=res4b18_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2a = mx.symbol.BatchNorm(name='bn4b19_branch2a', data=res4b19_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2a = bn4b19_branch2a
+        res4b19_branch2a_relu = mx.symbol.Activation(name='res4b19_branch2a_relu', data=scale4b19_branch2a,
+                                                     act_type='relu')
+        res4b19_branch2b = mx.symbol.Convolution(name='res4b19_branch2b', data=res4b19_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b19_branch2b = mx.symbol.BatchNorm(name='bn4b19_branch2b', data=res4b19_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2b = bn4b19_branch2b
+        res4b19_branch2b_relu = mx.symbol.Activation(name='res4b19_branch2b_relu', data=scale4b19_branch2b,
+                                                     act_type='relu')
+        res4b19_branch2c = mx.symbol.Convolution(name='res4b19_branch2c', data=res4b19_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2c = mx.symbol.BatchNorm(name='bn4b19_branch2c', data=res4b19_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2c = bn4b19_branch2c
+        res4b19 = mx.symbol.broadcast_add(name='res4b19', *[res4b18_relu, scale4b19_branch2c])
+        res4b19_relu = mx.symbol.Activation(name='res4b19_relu', data=res4b19, act_type='relu')
+        res4b20_branch2a = mx.symbol.Convolution(name='res4b20_branch2a', data=res4b19_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2a = mx.symbol.BatchNorm(name='bn4b20_branch2a', data=res4b20_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2a = bn4b20_branch2a
+        res4b20_branch2a_relu = mx.symbol.Activation(name='res4b20_branch2a_relu', data=scale4b20_branch2a,
+                                                     act_type='relu')
+        res4b20_branch2b = mx.symbol.Convolution(name='res4b20_branch2b', data=res4b20_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b20_branch2b = mx.symbol.BatchNorm(name='bn4b20_branch2b', data=res4b20_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2b = bn4b20_branch2b
+        res4b20_branch2b_relu = mx.symbol.Activation(name='res4b20_branch2b_relu', data=scale4b20_branch2b,
+                                                     act_type='relu')
+        res4b20_branch2c = mx.symbol.Convolution(name='res4b20_branch2c', data=res4b20_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2c = mx.symbol.BatchNorm(name='bn4b20_branch2c', data=res4b20_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2c = bn4b20_branch2c
+        res4b20 = mx.symbol.broadcast_add(name='res4b20', *[res4b19_relu, scale4b20_branch2c])
+        res4b20_relu = mx.symbol.Activation(name='res4b20_relu', data=res4b20, act_type='relu')
+        res4b21_branch2a = mx.symbol.Convolution(name='res4b21_branch2a', data=res4b20_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2a = mx.symbol.BatchNorm(name='bn4b21_branch2a', data=res4b21_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2a = bn4b21_branch2a
+        res4b21_branch2a_relu = mx.symbol.Activation(name='res4b21_branch2a_relu', data=scale4b21_branch2a,
+                                                     act_type='relu')
+        res4b21_branch2b = mx.symbol.Convolution(name='res4b21_branch2b', data=res4b21_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b21_branch2b = mx.symbol.BatchNorm(name='bn4b21_branch2b', data=res4b21_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2b = bn4b21_branch2b
+        res4b21_branch2b_relu = mx.symbol.Activation(name='res4b21_branch2b_relu', data=scale4b21_branch2b,
+                                                     act_type='relu')
+        res4b21_branch2c = mx.symbol.Convolution(name='res4b21_branch2c', data=res4b21_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2c = mx.symbol.BatchNorm(name='bn4b21_branch2c', data=res4b21_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2c = bn4b21_branch2c
+        res4b21 = mx.symbol.broadcast_add(name='res4b21', *[res4b20_relu, scale4b21_branch2c])
+        res4b21_relu = mx.symbol.Activation(name='res4b21_relu', data=res4b21, act_type='relu')
+        res4b22_branch2a = mx.symbol.Convolution(name='res4b22_branch2a', data=res4b21_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2a = mx.symbol.BatchNorm(name='bn4b22_branch2a', data=res4b22_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2a = bn4b22_branch2a
+        res4b22_branch2a_relu = mx.symbol.Activation(name='res4b22_branch2a_relu', data=scale4b22_branch2a,
+                                                     act_type='relu')
+        res4b22_branch2b = mx.symbol.Convolution(name='res4b22_branch2b', data=res4b22_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b22_branch2b = mx.symbol.BatchNorm(name='bn4b22_branch2b', data=res4b22_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2b = bn4b22_branch2b
+        res4b22_branch2b_relu = mx.symbol.Activation(name='res4b22_branch2b_relu', data=scale4b22_branch2b,
+                                                     act_type='relu')
+        res4b22_branch2c = mx.symbol.Convolution(name='res4b22_branch2c', data=res4b22_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2c = mx.symbol.BatchNorm(name='bn4b22_branch2c', data=res4b22_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2c = bn4b22_branch2c
+        res4b22 = mx.symbol.broadcast_add(name='res4b22', *[res4b21_relu, scale4b22_branch2c])
+        res4b22_relu = mx.symbol.Activation(name='res4b22_relu', data=res4b22, act_type='relu')
+        return res4b22_relu
+
+    def get_resnet_v1_conv5(self, conv_feat):
+        res5a_branch1 = mx.symbol.Convolution(name='res5a_branch1', data=conv_feat, num_filter=2048, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch1 = mx.symbol.BatchNorm(name='bn5a_branch1', data=res5a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale5a_branch1 = bn5a_branch1
+        res5a_branch2a = mx.symbol.Convolution(name='res5a_branch2a', data=conv_feat, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2a = mx.symbol.BatchNorm(name='bn5a_branch2a', data=res5a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2a = bn5a_branch2a
+        res5a_branch2a_relu = mx.symbol.Activation(name='res5a_branch2a_relu', data=scale5a_branch2a, act_type='relu')
+        res5a_branch2b_offset = mx.symbol.Convolution(name='res5a_branch2b_offset', data=res5a_branch2a_relu,
+                                                      num_filter=72, pad=(2, 2), kernel=(3, 3), stride=(1, 1), dilate=(2, 2), cudnn_off=True)  # 72 = 2 (x, y) * 3*3 taps * 4 deformable groups
+        res5a_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5a_branch2b', data=res5a_branch2a_relu, offset=res5a_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=4,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5a_branch2b = mx.symbol.BatchNorm(name='bn5a_branch2b', data=res5a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2b = bn5a_branch2b
+        res5a_branch2b_relu = mx.symbol.Activation(name='res5a_branch2b_relu', data=scale5a_branch2b, act_type='relu')
+        res5a_branch2c = mx.symbol.Convolution(name='res5a_branch2c', data=res5a_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2c = mx.symbol.BatchNorm(name='bn5a_branch2c', data=res5a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2c = bn5a_branch2c
+        res5a = mx.symbol.broadcast_add(name='res5a', *[scale5a_branch1, scale5a_branch2c])
+        res5a_relu = mx.symbol.Activation(name='res5a_relu', data=res5a, act_type='relu')
+        res5b_branch2a = mx.symbol.Convolution(name='res5b_branch2a', data=res5a_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2a = mx.symbol.BatchNorm(name='bn5b_branch2a', data=res5b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2a = bn5b_branch2a
+        res5b_branch2a_relu = mx.symbol.Activation(name='res5b_branch2a_relu', data=scale5b_branch2a, act_type='relu')
+        res5b_branch2b_offset = mx.symbol.Convolution(name='res5b_branch2b_offset', data = res5b_branch2a_relu,
+                                                      num_filter=72, pad=(2, 2), kernel=(3, 3), stride=(1, 1), dilate=(2, 2), cudnn_off=True)
+        res5b_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5b_branch2b', data=res5b_branch2a_relu, offset=res5b_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=4,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5b_branch2b = mx.symbol.BatchNorm(name='bn5b_branch2b', data=res5b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2b = bn5b_branch2b
+        res5b_branch2b_relu = mx.symbol.Activation(name='res5b_branch2b_relu', data=scale5b_branch2b, act_type='relu')
+        res5b_branch2c = mx.symbol.Convolution(name='res5b_branch2c', data=res5b_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2c = mx.symbol.BatchNorm(name='bn5b_branch2c', data=res5b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2c = bn5b_branch2c
+        res5b = mx.symbol.broadcast_add(name='res5b', *[res5a_relu, scale5b_branch2c])
+        res5b_relu = mx.symbol.Activation(name='res5b_relu', data=res5b, act_type='relu')
+        res5c_branch2a = mx.symbol.Convolution(name='res5c_branch2a', data=res5b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2a = mx.symbol.BatchNorm(name='bn5c_branch2a', data=res5c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2a = bn5c_branch2a
+        res5c_branch2a_relu = mx.symbol.Activation(name='res5c_branch2a_relu', data=scale5c_branch2a, act_type='relu')
+        res5c_branch2b_offset = mx.symbol.Convolution(name='res5c_branch2b_offset', data = res5c_branch2a_relu,
+                                                      num_filter=72, pad=(2, 2), kernel=(3, 3), stride=(1, 1), dilate=(2, 2), cudnn_off=True)
+        res5c_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5c_branch2b', data=res5c_branch2a_relu, offset=res5c_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=4,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5c_branch2b = mx.symbol.BatchNorm(name='bn5c_branch2b', data=res5c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2b = bn5c_branch2b
+        res5c_branch2b_relu = mx.symbol.Activation(name='res5c_branch2b_relu', data=scale5c_branch2b, act_type='relu')
+        res5c_branch2c = mx.symbol.Convolution(name='res5c_branch2c', data=res5c_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2c = mx.symbol.BatchNorm(name='bn5c_branch2c', data=res5c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2c = bn5c_branch2c
+        res5c = mx.symbol.broadcast_add(name='res5c', *[res5b_relu, scale5c_branch2c])
+        res5c_relu = mx.symbol.Activation(name='res5c_relu', data=res5c, act_type='relu')
+        return res5c_relu
+
+
+    def get_rpn(self, conv_feat, num_anchors):
+        rpn_conv = mx.sym.Convolution(
+            data=conv_feat, kernel=(3, 3), pad=(1, 1), num_filter=512, name="rpn_conv_3x3")
+        rpn_relu = mx.sym.Activation(data=rpn_conv, act_type="relu", name="rpn_relu")
+        rpn_cls_score = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=2 * num_anchors, name="rpn_cls_score")
+        rpn_bbox_pred = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=4 * num_anchors, name="rpn_bbox_pred")
+        return rpn_cls_score, rpn_bbox_pred
+
+    def get_symbol(self, cfg, is_train=True):
+
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+            gt_boxes = mx.sym.Variable(name="gt_boxes")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        # res5
+        relu1 = self.get_resnet_v1_conv5(conv_feat)
+
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                normalization='valid', use_ignore=True, ignore_label=-1,
+                                                name="rpn_cls_prob")
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0,
+                                                                data=(rpn_bbox_pred - rpn_bbox_target))
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_,
+                                            grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+
+            # ROI proposal
+            rpn_cls_act = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_act")
+            rpn_cls_act_reshape = mx.sym.Reshape(
+                data=rpn_cls_act, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_act_reshape')
+            if cfg.TRAIN.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            # ROI proposal target
+            gt_boxes_reshape = mx.sym.Reshape(data=gt_boxes, shape=(-1, 5), name='gt_boxes_reshape')
+            rois, label, bbox_target, bbox_weight = mx.sym.Custom(rois=rois, gt_boxes=gt_boxes_reshape,
+                                                                  op_type='proposal_target',
+                                                                  num_classes=num_reg_classes,
+                                                                  batch_images=cfg.TRAIN.BATCH_IMAGES,
+                                                                  batch_rois=cfg.TRAIN.BATCH_ROIS,
+                                                                  cfg=cPickle.dumps(cfg),
+                                                                  fg_fraction=cfg.TRAIN.FG_FRACTION)
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=256, name="conv_new_1")
+        conv_new_1_relu = mx.sym.Activation(data=conv_new_1, act_type='relu', name='conv_new_1_relu')
+
+        offset_t = mx.contrib.sym.DeformablePSROIPooling(name='offset_t', data=conv_new_1_relu, rois=rois, group_size=1, pooled_size=7,
+                                                         sample_per_part=4, no_trans=True, part_size=7, output_dim=256, spatial_scale=0.0625)
+        offset = mx.sym.FullyConnected(name='offset', data=offset_t, num_hidden=7 * 7 * 2, lr_mult=0.01)
+        offset_reshape = mx.sym.Reshape(data=offset, shape=(-1, 2, 7, 7), name="offset_reshape")
+
+        deformable_roi_pool = mx.contrib.sym.DeformablePSROIPooling(name='deformable_roi_pool', data=conv_new_1_relu, rois=rois,
+                                                                    trans=offset_reshape, group_size=1, pooled_size=7, sample_per_part=4,
+                                                                    no_trans=False, part_size=7, output_dim=256, spatial_scale=0.0625, trans_std=0.1)
+        # 2 fc
+        fc_new_1 = mx.sym.FullyConnected(name='fc_new_1', data=deformable_roi_pool, num_hidden=1024)
+        fc_new_1_relu = mx.sym.Activation(data=fc_new_1, act_type='relu', name='fc_new_1_relu')
+
+        fc_new_2 = mx.sym.FullyConnected(name='fc_new_2', data=fc_new_1_relu, num_hidden=1024)
+        fc_new_2_relu = mx.sym.Activation(data=fc_new_2, act_type='relu', name='fc_new_2_relu')
+
+        # cls_score/bbox_pred
+        cls_score = mx.sym.FullyConnected(name='cls_score', data=fc_new_2_relu, num_hidden=num_classes)
+        bbox_pred = mx.sym.FullyConnected(name='bbox_pred', data=fc_new_2_relu, num_hidden=num_reg_classes * 4)
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes,
+                                                               roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem,
+                                                normalization='valid', use_ignore=True, ignore_label=-1)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                                  data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_,
+                                            grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+                rcnn_label = labels_ohem
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid')
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                            data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+                rcnn_label = label
+
+            # reshape output
+            rcnn_label = mx.sym.Reshape(data=rcnn_label, shape=(cfg.TRAIN.BATCH_IMAGES, -1), name='label_reshape')
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_loss_reshape')
+            group = mx.sym.Group([rpn_cls_prob, rpn_bbox_loss, cls_prob, bbox_loss, mx.sym.BlockGrad(rcnn_label)])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([rois, cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def get_symbol_rpn(self, cfg, is_train=True):
+        # config aliases for convenience
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                normalization='valid', use_ignore=True, ignore_label=-1,
+                                                name="rpn_cls_prob",
+                                                grad_scale=1.0)
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0,
+                                                                data=(rpn_bbox_pred - rpn_bbox_target))
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_,
+                                            grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+            group = mx.symbol.Group([rpn_cls_prob, rpn_bbox_loss])
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois, score = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    output_score=True,
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois, score = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    output_score=True,
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            group = mx.symbol.Group([rois, score])
+        self.sym = group
+        return group
+
+    def get_symbol_rcnn(self, cfg, is_train=True):
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+
+        # input init
+        if is_train:
+            data = mx.symbol.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            label = mx.symbol.Variable(name='label')
+            bbox_target = mx.symbol.Variable(name='bbox_target')
+            bbox_weight = mx.symbol.Variable(name='bbox_weight')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+            label = mx.symbol.Reshape(data=label, shape=(-1,), name='label_reshape')
+            bbox_target = mx.symbol.Reshape(data=bbox_target, shape=(-1, 4 * num_reg_classes), name='bbox_target_reshape')
+            bbox_weight = mx.symbol.Reshape(data=bbox_weight, shape=(-1, 4 * num_reg_classes), name='bbox_weight_reshape')
+        else:
+            data = mx.sym.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        # res5
+        relu1 = self.get_resnet_v1_conv5(conv_feat)
+
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=256, name="conv_new_1")
+        conv_new_1_relu = mx.sym.Activation(data=conv_new_1, act_type='relu', name='conv_new_1_relu')
+
+        offset_t = mx.contrib.sym.DeformablePSROIPooling(name='offset_t', data=conv_new_1_relu, rois=rois, group_size=1, pooled_size=7,
+                                                         sample_per_part=4, no_trans=True, part_size=7, output_dim=256, spatial_scale=0.0625)
+        offset = mx.sym.FullyConnected(name='offset', data=offset_t, num_hidden=7 * 7 * 2, lr_mult=0.01)
+        offset_reshape = mx.sym.Reshape(data=offset, shape=(-1, 2, 7, 7), name="offset_reshape")
+
+        deformable_roi_pool = mx.contrib.sym.DeformablePSROIPooling(name='deformable_roi_pool', data=conv_new_1_relu, rois=rois,
+                                                                    trans=offset_reshape, group_size=1, pooled_size=7, sample_per_part=4,
+                                                                    no_trans=False, part_size=7, output_dim=256, spatial_scale=0.0625, trans_std=0.1)
+
+        # 2 fc
+        fc_new_1 = mx.sym.FullyConnected(name='fc_new_1', data=deformable_roi_pool, num_hidden=1024)
+        fc_new_1_relu = mx.sym.Activation(data=fc_new_1, act_type='relu', name='fc_new_1_relu')
+
+        fc_new_2 = mx.sym.FullyConnected(name='fc_new_2', data=fc_new_1_relu, num_hidden=1024)
+        fc_new_2_relu = mx.sym.Activation(data=fc_new_2, act_type='relu', name='fc_new_2_relu')
+
+        # cls_score/bbox_pred
+        cls_score = mx.sym.FullyConnected(name='cls_score', data=fc_new_2_relu, num_hidden=num_classes)
+        bbox_pred = mx.sym.FullyConnected(name='bbox_pred', data=fc_new_2_relu, num_hidden=num_reg_classes * 4)
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes,
+                                                               roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem,
+                                                normalization='valid', use_ignore=True, ignore_label=-1, grad_scale=1.0)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                                  data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_,
+                                            grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid',
+                                                grad_scale=1.0)
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0,
+                                                            data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+
+            # reshape output
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_loss_reshape')
+            group = mx.sym.Group([cls_prob, bbox_loss])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def init_weight(self, cfg, arg_params, aux_params):
+        arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_conv_3x3_weight'])
+        arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_conv_3x3_bias'])
+        arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_cls_score_weight'])
+        arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_cls_score_bias'])
+        arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_bbox_pred_weight'])
+        arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_bbox_pred_bias'])
+        arg_params['res5a_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_weight'])
+        arg_params['res5a_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_bias'])
+        arg_params['res5b_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_weight'])
+        arg_params['res5b_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_bias'])
+        arg_params['res5c_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_weight'])
+        arg_params['res5c_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_bias'])
+        arg_params['conv_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['conv_new_1_weight'])
+        arg_params['conv_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['conv_new_1_bias'])
+        arg_params['offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['offset_weight'])
+        arg_params['offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['offset_bias'])
+        arg_params['fc_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc_new_1_weight'])
+        arg_params['fc_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc_new_1_bias'])
+        arg_params['fc_new_2_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc_new_2_weight'])
+        arg_params['fc_new_2_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc_new_2_bias'])
+        arg_params['cls_score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['cls_score_weight'])
+        arg_params['cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['cls_score_bias'])
+        arg_params['bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['bbox_pred_weight'])
+        arg_params['bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['bbox_pred_bias'])
+
+    def init_weight_rpn(self, cfg, arg_params, aux_params):
+        arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_conv_3x3_weight'])
+        arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_conv_3x3_bias'])
+        arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01,
+                                                              shape=self.arg_shape_dict['rpn_cls_score_weight'])
+        arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_cls_score_bias'])
+        arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01,
+                                                              shape=self.arg_shape_dict['rpn_bbox_pred_weight'])
+        arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_bbox_pred_bias'])
+
+    def init_weight_rcnn(self, cfg, arg_params, aux_params):
+        arg_params['res5a_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_weight'])
+        arg_params['res5a_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_bias'])
+        arg_params['res5b_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_weight'])
+        arg_params['res5b_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_bias'])
+        arg_params['res5c_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_weight'])
+        arg_params['res5c_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_bias'])
+        arg_params['conv_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['conv_new_1_weight'])
+        arg_params['conv_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['conv_new_1_weight'])
+        arg_params['conv_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['conv_new_1_bias'])
+        arg_params['offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['offset_weight'])
+        arg_params['offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['offset_bias'])
+        arg_params['fc_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc_new_1_weight'])
+        arg_params['fc_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc_new_1_bias'])
+        arg_params['fc_new_2_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['fc_new_2_weight'])
+        arg_params['fc_new_2_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['fc_new_2_bias'])
+        arg_params['cls_score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['cls_score_weight'])
+        arg_params['cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['cls_score_bias'])
+        arg_params['bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['bbox_pred_weight'])
+        arg_params['bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['bbox_pred_bias'])
+
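Note on the constants in the symbol above: each 3x3 offset branch uses `num_filter=72` with `num_deformable_group=4`, the deformable layers pair `pad=(2, 2)` with `dilate=(2, 2)`, and the ROI-pooling offset layer uses `num_hidden=7 * 7 * 2`. The sketch below shows where these numbers plausibly come from, assuming one (y, x) offset pair per sampling location per deformable group. The helper names are hypothetical and are not part of this patch or of MXNet.

```python
def deform_conv_offset_channels(kernel, num_deformable_group):
    """Offset channels for a deformable conv: 2 coordinates per kernel tap per group."""
    kh, kw = kernel
    return 2 * kh * kw * num_deformable_group


def conv_output_size(size, kernel, pad, dilate, stride=1):
    """Spatial output size of a (dilated) convolution along one axis."""
    effective_kernel = dilate * (kernel - 1) + 1
    return (size + 2 * pad - effective_kernel) // stride + 1


def roi_pool_offset_units(part_size):
    """Units needed to predict one (y, x) offset per ROI part (class-agnostic)."""
    return part_size * part_size * 2


# 3x3 kernel with 4 deformable groups, as in res5a/res5b/res5c_branch2b_offset
print(deform_conv_offset_channels((3, 3), 4))
# pad=2 with dilate=2 keeps the feature map size unchanged for a 3x3 kernel
print(conv_output_size(38, kernel=3, pad=2, dilate=2))
# part_size=7, matching the reshape to (-1, 2, 7, 7)
print(roi_pool_offset_units(7))
```

The feature map size of 38 is an arbitrary illustrative value; any size gives the same "unchanged" result for this pad/dilate combination at stride 1.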
diff --git a/faster_rcnn/test.py b/faster_rcnn/test.py
new file mode 100644
index 0000000..fff50a8
--- /dev/null
+++ b/faster_rcnn/test.py
@@ -0,0 +1,54 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import _init_paths
+
+import argparse
+import os
+import sys
+import time
+import logging
+from config.config import config, update_config
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Test a Faster R-CNN network')
+    # general
+    parser.add_argument('--cfg', help='experiment configuration file name', required=True, type=str)
+
+    args, rest = parser.parse_known_args()
+    update_config(args.cfg)
+
+    # rcnn
+    parser.add_argument('--vis', help='turn on visualization', action='store_true')
+    parser.add_argument('--ignore_cache', help='ignore cached results boxes', action='store_true')
+    parser.add_argument('--thresh', help='valid detection threshold', default=1e-3, type=float)
+    parser.add_argument('--shuffle', help='shuffle data on visualization', action='store_true')
+    args = parser.parse_args()
+    return args
+
+args = parse_args()
+curr_path = os.path.abspath(os.path.dirname(__file__))
+sys.path.insert(0, os.path.join(curr_path, '../external/mxnet', config.MXNET_VERSION))
+
+import mxnet as mx
+from function.test_rcnn import test_rcnn
+from utils.create_logger import create_logger
+
+
+def main():
+    ctx = [mx.gpu(int(i)) for i in config.gpus.split(',')]
+    print args
+
+    logger, final_output_path = create_logger(config.output_path, args.cfg, config.dataset.test_image_set)
+
+    test_rcnn(config, config.dataset.dataset, config.dataset.test_image_set, config.dataset.root_path, config.dataset.dataset_path,
+              ctx, os.path.join(final_output_path, '..', '_'.join(config.dataset.image_set.split('+')), config.TRAIN.model_prefix), config.TEST.test_epoch,
+              args.vis, args.ignore_cache, args.shuffle, config.TEST.HAS_RPN, config.dataset.proposal, args.thresh, logger=logger, output_path=final_output_path)
+
+if __name__ == '__main__':
+    main()
diff --git a/faster_rcnn/train_end2end.py b/faster_rcnn/train_end2end.py
new file mode 100644
index 0000000..69dc285
--- /dev/null
+++ b/faster_rcnn/train_end2end.py
@@ -0,0 +1,166 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+import _init_paths
+
+import time
+import argparse
+import logging
+import pprint
+import os
+import sys
+from config.config import config, update_config
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Train Faster-RCNN network')
+    # general
+    parser.add_argument('--cfg', help='experiment configuration file name', required=True, type=str)
+
+    args, rest = parser.parse_known_args()
+    # update config
+    update_config(args.cfg)
+
+    # training
+    parser.add_argument('--frequent', help='frequency of logging', default=config.default.frequent, type=int)
+    args = parser.parse_args()
+    return args
+
+args = parse_args()
+curr_path = os.path.abspath(os.path.dirname(__file__))
+sys.path.insert(0, os.path.join(curr_path, '../external/mxnet', config.MXNET_VERSION))
+
+import shutil
+import numpy as np
+import mxnet as mx
+
+from symbols import *
+from core import callback, metric
+from core.loader import AnchorLoader
+from core.module import MutableModule
+from utils.create_logger import create_logger
+from utils.load_data import load_gt_roidb, merge_roidb, filter_roidb
+from utils.load_model import load_param
+from utils.PrefetchingIter import PrefetchingIter
+from utils.lr_scheduler import WarmupMultiFactorScheduler
+
+
+def train_net(args, ctx, pretrained, epoch, prefix, begin_epoch, end_epoch, lr, lr_step):
+    logger, final_output_path = create_logger(config.output_path, args.cfg, config.dataset.image_set)
+    prefix = os.path.join(final_output_path, prefix)
+
+    # load symbol
+    shutil.copy2(os.path.join(curr_path, 'symbols', config.symbol + '.py'), final_output_path)
+    sym_instance = eval(config.symbol + '.' + config.symbol)()
+    sym = sym_instance.get_symbol(config, is_train=True)
+    feat_sym = sym.get_internals()['rpn_cls_score_output']
+
+    # setup multi-gpu
+    batch_size = len(ctx)
+    input_batch_size = config.TRAIN.BATCH_IMAGES * batch_size
+
+    # print config
+    pprint.pprint(config)
+    logger.info('training config:{}\n'.format(pprint.pformat(config)))
+
+    # load dataset and prepare imdb for training
+    image_sets = config.dataset.image_set.split('+')
+    roidbs = [load_gt_roidb(config.dataset.dataset, image_set, config.dataset.root_path, config.dataset.dataset_path,
+                            flip=config.TRAIN.FLIP)
+              for image_set in image_sets]
+    roidb = merge_roidb(roidbs)
+    roidb = filter_roidb(roidb, config)
+
+    # load training data
+    train_data = AnchorLoader(feat_sym, roidb, config, batch_size=input_batch_size, shuffle=config.TRAIN.SHUFFLE, ctx=ctx,
+                              feat_stride=config.network.RPN_FEAT_STRIDE, anchor_scales=config.network.ANCHOR_SCALES,
+                              anchor_ratios=config.network.ANCHOR_RATIOS, aspect_grouping=config.TRAIN.ASPECT_GROUPING)
+
+    # infer max shape
+    max_data_shape = [('data', (config.TRAIN.BATCH_IMAGES, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]
+    max_data_shape, max_label_shape = train_data.infer_shape(max_data_shape)
+    max_data_shape.append(('gt_boxes', (config.TRAIN.BATCH_IMAGES, 100, 5)))
+    print 'providing maximum shape', max_data_shape, max_label_shape
+
+    data_shape_dict = dict(train_data.provide_data_single + train_data.provide_label_single)
+    pprint.pprint(data_shape_dict)
+    sym_instance.infer_shape(data_shape_dict)
+
+    # load and initialize params
+    if config.TRAIN.RESUME:
+        print('continue training from ', begin_epoch)
+        arg_params, aux_params = load_param(prefix, begin_epoch, convert=True)
+    else:
+        arg_params, aux_params = load_param(pretrained, epoch, convert=True)
+        sym_instance.init_weight(config, arg_params, aux_params)
+
+    # check parameter shapes
+    sym_instance.check_parameter_shapes(arg_params, aux_params, data_shape_dict)
+
+    # create solver
+    fixed_param_prefix = config.network.FIXED_PARAMS
+    data_names = [k[0] for k in train_data.provide_data_single]
+    label_names = [k[0] for k in train_data.provide_label_single]
+
+    mod = MutableModule(sym, data_names=data_names, label_names=label_names,
+                        logger=logger, context=ctx, max_data_shapes=[max_data_shape for _ in range(batch_size)],
+                        max_label_shapes=[max_label_shape for _ in range(batch_size)], fixed_param_prefix=fixed_param_prefix)
+
+    if config.TRAIN.RESUME:
+        mod._preload_opt_states = '%s-%04d.states'%(prefix, begin_epoch)
+
+    # decide training params
+    # metric
+    rpn_eval_metric = metric.RPNAccMetric()
+    rpn_cls_metric = metric.RPNLogLossMetric()
+    rpn_bbox_metric = metric.RPNL1LossMetric()
+    eval_metric = metric.RCNNAccMetric(config)
+    cls_metric = metric.RCNNLogLossMetric(config)
+    bbox_metric = metric.RCNNL1LossMetric(config)
+    eval_metrics = mx.metric.CompositeEvalMetric()
+    # rpn_eval_metric, rpn_cls_metric, rpn_bbox_metric, eval_metric, cls_metric, bbox_metric
+    for child_metric in [rpn_eval_metric, rpn_cls_metric, rpn_bbox_metric, eval_metric, cls_metric, bbox_metric]:
+        eval_metrics.add(child_metric)
+    # callback
+    batch_end_callback = callback.Speedometer(train_data.batch_size, frequent=args.frequent)
+    means = np.tile(np.array(config.TRAIN.BBOX_MEANS), 2 if config.CLASS_AGNOSTIC else config.dataset.NUM_CLASSES)
+    stds = np.tile(np.array(config.TRAIN.BBOX_STDS), 2 if config.CLASS_AGNOSTIC else config.dataset.NUM_CLASSES)
+    epoch_end_callback = [mx.callback.module_checkpoint(mod, prefix, period=1, save_optimizer_states=True), callback.do_checkpoint(prefix, means, stds)]
+    # decide learning rate
+    base_lr = lr
+    lr_factor = config.TRAIN.lr_factor
+    lr_epoch = [float(epoch) for epoch in lr_step.split(',')]
+    lr_epoch_diff = [epoch - begin_epoch for epoch in lr_epoch if epoch > begin_epoch]
+    lr = base_lr * (lr_factor ** (len(lr_epoch) - len(lr_epoch_diff)))
+    lr_iters = [int(epoch * len(roidb) / batch_size) for epoch in lr_epoch_diff]
+    print('lr', lr, 'lr_epoch_diff', lr_epoch_diff, 'lr_iters', lr_iters)
+    lr_scheduler = WarmupMultiFactorScheduler(lr_iters, lr_factor, config.TRAIN.warmup, config.TRAIN.warmup_lr, config.TRAIN.warmup_step)
+    # optimizer
+    optimizer_params = {'momentum': config.TRAIN.momentum,
+                        'wd': config.TRAIN.wd,
+                        'learning_rate': lr,
+                        'lr_scheduler': lr_scheduler,
+                        'rescale_grad': 1.0,
+                        'clip_gradient': None}
+
+    if not isinstance(train_data, PrefetchingIter):
+        train_data = PrefetchingIter(train_data)
+
+    # train
+    mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
+            batch_end_callback=batch_end_callback, kvstore=config.default.kvstore,
+            optimizer='sgd', optimizer_params=optimizer_params,
+            arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
+
+
+def main():
+    print('Called with argument:', args)
+    ctx = [mx.gpu(int(i)) for i in config.gpus.split(',')]
+    train_net(args, ctx, config.network.pretrained, config.network.pretrained_epoch, config.TRAIN.model_prefix,
+              config.TRAIN.begin_epoch, config.TRAIN.end_epoch, config.TRAIN.lr, config.TRAIN.lr_step)
+
+if __name__ == '__main__':
+    main()
diff --git a/faster_rcnn/train_rcnn.py b/faster_rcnn/train_rcnn.py
new file mode 100644
index 0000000..8e1c5f0
--- /dev/null
+++ b/faster_rcnn/train_rcnn.py
@@ -0,0 +1,63 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Modified by Yuwen Xiong
+# --------------------------------------------------------
+
+import _init_paths
+
+import time
+import argparse
+import logging
+import pprint
+import os
+import sys
+from config.config import config, update_config
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Train Faster-RCNN network')
+    # general
+    parser.add_argument('--cfg', help='experiment configuration file name', required=True, type=str)
+
+    args, rest = parser.parse_known_args()
+    # update config
+    update_config(args.cfg)
+
+    # training
+    parser.add_argument('--frequent', help='frequency of logging', default=config.default.frequent, type=int)
+    args = parser.parse_args()
+    return args
+
+args = parse_args()
+curr_path = os.path.abspath(os.path.dirname(__file__))
+sys.path.insert(0, os.path.join(curr_path, '../external/mxnet', config.MXNET_VERSION))
+
+import shutil
+import numpy as np
+import mxnet as mx
+
+from function.train_rpn import train_rpn
+from function.test_rpn import test_rpn
+from function.train_rcnn import train_rcnn
+from utils.combine_model import combine_model
+from utils.create_logger import create_logger
+
+
+def main():
+    print ('Called with argument:', args)
+    ctx = [mx.gpu(int(i)) for i in config.gpus.split(',')]
+    logger, output_path = create_logger(config.output_path, args.cfg, config.dataset.image_set)
+    shutil.copy2(os.path.join(curr_path, 'symbols', config.symbol + '.py'), output_path)
+
+    prefix = os.path.join(output_path, 'rcnn')
+    logging.info('########## TRAIN rcnn WITH IMAGENET INIT AND RPN DETECTION')
+    train_rcnn(config, config.dataset.dataset, config.dataset.image_set, config.dataset.root_path, config.dataset.dataset_path,
+               args.frequent, config.default.kvstore, config.TRAIN.FLIP, config.TRAIN.SHUFFLE, config.TRAIN.RESUME,
+               ctx, config.network.pretrained, config.network.pretrained_epoch, prefix, config.TRAIN.begin_epoch,
+               config.TRAIN.end_epoch, train_shared=False, lr=config.TRAIN.lr, lr_step=config.TRAIN.lr_step,
+               proposal=config.dataset.proposal, logger=logger)
+
+if __name__ == '__main__':
+    main()
diff --git a/lib/dataset/__init__.py b/lib/dataset/__init__.py
index e0c2e55..984cc7a 100644
--- a/lib/dataset/__init__.py
+++ b/lib/dataset/__init__.py
@@ -1,5 +1,4 @@
 from imdb import IMDB
 from pascal_voc import PascalVOC
-from pascal_voc_segmentation import PascalVOC_Segmentation
-from cityscape_segmentation import CityScape_Segmentation
+from cityscape import CityScape
 from coco import coco
diff --git a/lib/dataset/cityscape_segmentation.py b/lib/dataset/cityscape.py
similarity index 81%
rename from lib/dataset/cityscape_segmentation.py
rename to lib/dataset/cityscape.py
index 7d793e0..76134a5 100644
--- a/lib/dataset/cityscape_segmentation.py
+++ b/lib/dataset/cityscape.py
@@ -1,3 +1,11 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2016 by Contributors
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Zheng Zhang
+# --------------------------------------------------------
+
 import cPickle
 import os
 import cv2
@@ -6,9 +14,8 @@
 
 from imdb import IMDB
 from PIL import Image
-from utils import image
 
-class CityScape_Segmentation(IMDB):
+class CityScape(IMDB):
     def __init__(self, image_set, root_path, dataset_path, result_path=None):
         """
         fill basic information to initialize imdb
@@ -18,7 +25,7 @@ def __init__(self, image_set, root_path, dataset_path, result_path=None):
         :return: imdb object
         """
         image_set_main_folder, image_set_sub_folder= image_set.split('_', 1)
-        super(CityScape_Segmentation, self).__init__('cityscape', image_set, root_path, dataset_path, result_path)  # set self.name
+        super(CityScape, self).__init__('cityscape', image_set, root_path, dataset_path, result_path)  # set self.name
 
         self.image_set_main_folder = image_set_main_folder
         self.image_set_sub_folder = image_set_sub_folder
@@ -163,19 +170,16 @@ def getpallete(self, num_cls):
 
         return pallete
 
-    def evaluate_segmentations(self, segmentations):
+    def evaluate_segmentations(self, pred_segmentations=None):
         """
         top level evaluations
         :param pred_segmentations: the pred segmentation result
         :return: the evaluation results
         """
+        if pred_segmentations is not None:
+            self.write_segmentation_result(pred_segmentations)
 
-        res_file_folder = os.path.join(self.result_path, 'results', 'Segmentation')
-        if not os.path.exists(res_file_folder):
-            os.mkdir(res_file_folder)
-
-        info = self.do_python_eval(segmentations)
-        self.write_segmentation_result(segmentations, res_file_folder)
+        info = self._py_evaluate_segmentation()
         return info
 
 
@@ -199,20 +203,29 @@ def get_confusion_matrix(self, gt_label, pred_label, class_num):
 
         return confusion_matrix
 
-    def do_python_eval(self, pred_segmentations):
+    def _py_evaluate_segmentation(self):
         """
         This function is a wrapper to calculte the metrics for given pred_segmentation results
-        :param pred_segmentations: the pred segmentation result
         :return: the evaluation metrics
         """
+        res_file_folder = os.path.join(self.result_path, 'results')
+
         confusion_matrix = np.zeros((self.num_classes,self.num_classes))
         for i, index in enumerate(self.image_set_index):
             seg_gt_info = self.load_segdb_from_index(index)
+
             seg_gt = np.array(Image.open(seg_gt_info['seg_cls_path'])).astype('float32')
-            seg_pred = np.squeeze(pred_segmentations[i])
+
+            seg_pathes = os.path.split(seg_gt_info['seg_cls_path'])
+            res_image_name = seg_pathes[1][:-len('_gtFine_labelTrainIds.png')]
+            res_subfolder_name = os.path.split(seg_pathes[0])[-1]
+            res_save_folder = os.path.join(res_file_folder, res_subfolder_name)
+            res_save_path = os.path.join(res_save_folder, res_image_name + '.png')
+
+            seg_pred = np.array(Image.open(res_save_path)).astype('float32')
+            #seg_pred = np.squeeze(pred_segmentations[i])
 
             seg_pred = cv2.resize(seg_pred, (seg_gt.shape[1], seg_gt.shape[0]), interpolation=cv2.INTER_NEAREST)
-            seg_gt[seg_gt == 0] = 255
             ignore_index = seg_gt != 255
             seg_gt = seg_gt[ignore_index]
             seg_pred = seg_pred[ignore_index]
@@ -228,17 +241,32 @@ def do_python_eval(self, pred_segmentations):
 
         return {'meanIU':mean_IU, 'IU_array':IU_array}
 
-    def write_segmentation_result(self, segmentation_results, result_file_folder):
+    def write_segmentation_result(self, segmentation_results):
         """
         Write the segmentation result to result_file_folder
         :param segmentation_results: the prediction result
         :param result_file_folder: the saving folder
         :return: [None]
         """
+        res_file_folder = os.path.join(self.result_path, 'results')
+        if not os.path.exists(res_file_folder):
+            os.mkdir(res_file_folder)
+
         pallete = self.getpallete(256)
-        for i in range(len(segmentation_results)):
+        for i, index in enumerate(self.image_set_index):
+            seg_gt_info = self.load_segdb_from_index(index)
+
+            seg_pathes = os.path.split(seg_gt_info['seg_cls_path'])
+            res_image_name = seg_pathes[1][:-len('_gtFine_labelTrainIds.png')]
+            res_subfolder_name = os.path.split(seg_pathes[0])[-1]
+            res_save_folder = os.path.join(res_file_folder, res_subfolder_name)
+            res_save_path = os.path.join(res_save_folder, res_image_name + '.png')
+
+            if not os.path.exists(res_save_folder):
+                os.makedirs(res_save_folder)
+
             segmentation_result = np.uint8(np.squeeze(np.copy(segmentation_results[i])))
             segmentation_result = Image.fromarray(segmentation_result)
             segmentation_result.putpalette(pallete)
-            segmentation_result.save(os.path.join(result_file_folder, '%d_result.png'%(i)))
+            segmentation_result.save(res_save_path)
 
diff --git a/lib/dataset/coco.py b/lib/dataset/coco.py
index 59757c6..6bed171 100644
--- a/lib/dataset/coco.py
+++ b/lib/dataset/coco.py
@@ -16,6 +16,7 @@
 from bbox.bbox_transform import clip_boxes
 import multiprocessing as mp
 
+
 def coco_results_one_category_kernel(data_pack):
     cat_id = data_pack['cat_id']
     ann_type = data_pack['ann_type']
@@ -56,26 +57,6 @@ def coco_results_one_category_kernel(data_pack):
         cat_results.extend(result)
     return cat_results
 
-def generate_cache_seg_inst_kernel(annWithObjs):
-    """
-    generate cache_seg_inst
-    :param annWithObjs: tuple of anns and objs
-    """
-    ann = annWithObjs[0]
-    objs = annWithObjs[1]
-    gt_mask_file = ann['cache_seg_inst']
-    if not gt_mask_file:
-        return
-    gt_mask_flip_file = os.path.join(os.path.splitext(gt_mask_file)[0] + '_flip.hkl')
-    if os.path.exists(gt_mask_file) and os.path.exists(gt_mask_flip_file):
-        return
-    gt_mask_encode = [x['segmentation'] for x in objs]
-    gt_mask = mask_coco2voc(gt_mask_encode, ann['height'], ann['width'])
-    if not os.path.exists(gt_mask_file):
-        hkl.dump(gt_mask.astype('bool'), gt_mask_file, mode='w', compression='gzip')
-    # cache flip gt_masks
-    if not os.path.exists(gt_mask_flip_file):
-        hkl.dump(gt_mask[:, :, ::-1].astype('bool'), gt_mask_flip_file, mode='w', compression='gzip')
 
 class coco(IMDB):
     def __init__(self, image_set, root_path, data_path, result_path=None, mask_size=-1, binary_thresh=None):
@@ -145,40 +126,6 @@ def gt_roidb(self):
 
         return gt_roidb
 
-    def gt_sdsdb(self):
-        """
-        :return:
-        """
-        cache_file = os.path.join(self.cache_path, self.name + '_gt_sdsdb.pkl')
-        """
-        if os.path.exists(cache_file):
-            with open(cache_file, 'rb') as fid:
-                sdsdb = cPickle.load(fid)
-            print '{} gt sdsdb loaded from {}'.format(self.name, cache_file)
-            return sdsdb
-        """
-        # for internal useage
-        tic();
-        gt_sdsdb_temp = [self.load_coco_sds_annotation(index) for index in self.image_set_index]
-        gt_sdsdb = [x[0] for x in gt_sdsdb_temp]
-        print 'prepare gt_sdsdb using', toc(), 'seconds';
-        #objs = [x[1] for x in gt_sdsdb_temp]
-        tic();
-        generate_cache_seg_inst_kernel(gt_sdsdb_temp[0])
-        pool = mp.Pool(mp.cpu_count())
-        pool.map(generate_cache_seg_inst_kernel, gt_sdsdb_temp)
-        pool.close()
-        pool.join()
-        print 'generate cache_seg_inst using', toc(), 'seconds';
-
-        """
-        with open(cache_file, 'wb') as fid:
-            cPickle.dump(gt_sdsdb, fid, cPickle.HIGHEST_PROTOCOL)
-        """
-        # for future release usage
-        # need to implement load sbd data
-        return gt_sdsdb
-
     def _load_coco_annotation(self, index):
         """
         coco ann: [u'segmentation', u'area', u'iscrowd', u'image_id', u'bbox', u'category_id', u'id']
diff --git a/lib/dataset/pascal_voc.py b/lib/dataset/pascal_voc.py
index 481e030..d4ad56e 100644
--- a/lib/dataset/pascal_voc.py
+++ b/lib/dataset/pascal_voc.py
@@ -23,7 +23,6 @@
 from pascal_voc_eval import voc_eval, voc_eval_sds
 from ds_utils import unique_boxes, filter_small_boxes
 
-
 class PascalVOC(IMDB):
     def __init__(self, image_set, root_path, devkit_path, result_path=None, mask_size=-1, binary_thresh=None):
         """
@@ -33,7 +32,8 @@ def __init__(self, image_set, root_path, devkit_path, result_path=None, mask_siz
         :param devkit_path: data and results
         :return: imdb object
         """
-        year, image_set = image_set.split('_')
+        year = image_set.split('_')[0]
+        image_set = image_set[len(year) + 1:]
         super(PascalVOC, self).__init__('voc_' + year, image_set, root_path, devkit_path, result_path)  # set self.name
 
         self.year = year
@@ -79,26 +79,16 @@ def image_path_from_index(self, index):
         assert os.path.exists(image_file), 'Path does not exist: {}'.format(image_file)
         return image_file
 
-    def mask_path_from_index(self, index, gt_mask):
+    def segmentation_path_from_index(self, index):
         """
-        given image index, cache high resolution mask and return full path of masks
+        given image index, find out the full path of segmentation class
         :param index: index of a specific image
-        :return: full path of this mask
-        """
-        if self.image_set == 'val':
-            return []
-        cache_file = os.path.join(self.cache_path, 'VOCMask')
-        if not os.path.exists(cache_file):
-            os.makedirs(cache_file)
-        # instance level segmentation
-        gt_mask_file = os.path.join(cache_file, index + '.hkl')
-        if not os.path.exists(gt_mask_file):
-            hkl.dump(gt_mask.astype('bool'), gt_mask_file, mode='w', compression='gzip')
-        # cache flip gt_masks
-        gt_mask_flip_file = os.path.join(cache_file, index + '_flip.hkl')
-        if not os.path.exists(gt_mask_flip_file):
-            hkl.dump(gt_mask[:, :, ::-1].astype('bool'), gt_mask_flip_file, mode='w', compression='gzip')
-        return gt_mask_file
+        :return: full path of segmentation class
+        """
+        seg_class_file = os.path.join(self.data_path, 'SegmentationClass', index + '.png')
+        assert os.path.exists(seg_class_file), 'Path does not exist: {}'.format(seg_class_file)
+        return seg_class_file
+
     def gt_roidb(self):
         """
         return ground truth image regions database
@@ -118,23 +108,24 @@ def gt_roidb(self):
 
         return gt_roidb
 
-    def gt_sdsdb(self):
+    def gt_segdb(self):
         """
-        :return:
+        return ground truth segmentation database
+        :return: segdb[image_index]['image', 'height', 'width', 'seg_cls_path', 'flipped']
         """
-        cache_file = os.path.join(self.cache_path, self.name + '_gt_sdsdb.pkl')
+        cache_file = os.path.join(self.cache_path, self.name + '_gt_segdb.pkl')
         if os.path.exists(cache_file):
             with open(cache_file, 'rb') as fid:
-                sdsdb = cPickle.load(fid)
-            print '{} gt sdsdb loaded from {}'.format(self.name, cache_file)
-            return sdsdb
-        # for internal useage
-        gt_sdsdb = [self.load_pascal_sds_annotation(index) for index in self.image_set_index]
+                segdb = cPickle.load(fid)
+            print '{} gt segdb loaded from {}'.format(self.name, cache_file)
+            return segdb
+
+        gt_segdb = [self.load_pascal_segmentation_annotation(index) for index in self.image_set_index]
         with open(cache_file, 'wb') as fid:
-            cPickle.dump(gt_sdsdb, fid, cPickle.HIGHEST_PROTOCOL)
-        # for future release usage
-        # need to implement load sbd data
-        return gt_sdsdb
+            cPickle.dump(gt_segdb, fid, cPickle.HIGHEST_PROTOCOL)
+        print 'wrote gt segdb to {}'.format(cache_file)
+
+        return gt_segdb
 
     def load_pascal_annotation(self, index):
         """
@@ -234,67 +225,23 @@ def selective_search_roidb(self, gt_roidb, append_gt=False):
 
         return roidb
 
-    def load_pascal_sds_annotation(self, index):
-        print index
-        sds_rec = dict()
-        sds_rec['image'] = self.image_path_from_index(index)
-        size = cv2.imread(sds_rec['image']).shape
-        sds_rec['height'] = size[0]
-        sds_rec['width'] = size[1]
-        # class level segmentation
-        seg_cls_name = os.path.join(self.data_path, 'SegmentationClass', index + '.png')
-        seg_cls_data = PIL.Image.open(seg_cls_name)
-        seg_cls_data = np.array(seg_cls_data.getdata(), np.uint8).reshape(seg_cls_data.size[1], seg_cls_data.size[0])
-        # instance level segmentation
-        seg_obj_name = os.path.join(self.data_path, 'SegmentationObject', index + '.png')
-        seg_obj_data = PIL.Image.open(seg_obj_name)
-        seg_obj_data = np.array(seg_obj_data.getdata(), np.uint8).reshape(seg_obj_data.size[1], seg_obj_data.size[0])
-        # check unique instance
-        unique_inst = np.unique(seg_obj_data)
-        bg_inds = np.where(unique_inst == 0)[0]
-        unique_inst = np.delete(unique_inst, bg_inds)
-        border_inds = np.where(unique_inst == 255)[0]
-        unique_inst = np.delete(unique_inst, border_inds)
-
-        num_objs = len(unique_inst)
-        boxes = np.zeros((num_objs, 4), dtype=np.uint16)
-        gt_classes = np.zeros(num_objs, dtype=np.int32)
-        overlaps = np.zeros((num_objs, self.num_classes), dtype=np.float32)
-        # TODO: figure out a way to pass mask size
-        gt_masks = np.zeros((num_objs, size[0], size[1]))
-
-        for idx, inst_id in enumerate(unique_inst):
-            [r, c] = np.where(seg_obj_data == inst_id)
-            x1 = np.min(c)
-            x2 = np.max(c)
-            y1 = np.min(r)
-            y2 = np.max(r)
-            cur_gt_mask = (seg_obj_data == inst_id)
-            cur_gt_mask_cls = seg_cls_data[cur_gt_mask]
-            #roi_mask = (seg_obj_data[y1:y2+1, x1:x2+1] == inst_id)
-            #roi_mask_cls = seg_cls_data[y1:y2+1, x1:x2+1]
-            #roi_mask_cls = roi_mask_cls[roi_mask]
-            assert np.unique(cur_gt_mask_cls).shape[0] == 1
-            cur_inst_cls = np.unique(cur_gt_mask_cls)[0]
-            # resize mask to fixed size and convert to boolean
-            # roi_mask = cv2.resize(roi_mask.astype(np.float32), (self.mask_size, self.mask_size))
-
-            boxes[idx, :] = [x1, y1, x2, y2]
-            gt_classes[idx] = cur_inst_cls
-            gt_masks[idx, :, :] = cur_gt_mask
-            overlaps[idx, cur_inst_cls] = 1.0
-
-        sds_rec.update({
-            'boxes': boxes,
-            'gt_classes': gt_classes,
-            'gt_overlaps': overlaps,
-            'max_classes': overlaps.argmax(axis=1),
-            'max_overlaps': overlaps.max(axis=1),
-
-            'cache_seg_inst': self.mask_path_from_index(index, gt_masks),
-            'flipped': False
-        })
-        return sds_rec
+    def load_pascal_segmentation_annotation(self, index):
+        """
+        for a given index, load the image size and the path of its class segmentation label
+        :param index: index of a specific image
+        :return: record['image', 'height', 'width', 'seg_cls_path', 'flipped']
+        """
+        import xml.etree.ElementTree as ET
+        seg_rec = dict()
+        seg_rec['image'] = self.image_path_from_index(index)
+        size = cv2.imread(seg_rec['image']).shape
+        seg_rec['height'] = size[0]
+        seg_rec['width'] = size[1]
+
+        seg_rec['seg_cls_path'] = self.segmentation_path_from_index(index)
+        seg_rec['flipped'] = False
+
+        return seg_rec
 
     def evaluate_detections(self, detections):
         """
@@ -317,99 +264,121 @@ def evaluate_detections(self, detections):
         info = self.do_python_eval()
         return info
 
-    def evaluate_sds(self, all_boxes, all_masks):
-        self._write_voc_seg_results_file(all_boxes, all_masks)
+    def evaluate_segmentations(self, pred_segmentations=None):
+        """
+        top level evaluations
+        :param pred_segmentations: the pred segmentation result
+        :return: the evaluation results
+        """
+        # write the predictions to disk first (if given); evaluation reads them back from the result folder
+        if pred_segmentations is not None:
+            self.write_pascal_segmentation_result(pred_segmentations)
+
         info = self._py_evaluate_segmentation()
         return info
 
-    def _write_voc_seg_results_file(self, all_boxes, all_masks):
+    def write_pascal_segmentation_result(self, pred_segmentations):
         """
-        Write results as a pkl file, note this is different from
-        detection task since it's difficult to write masks to txt
+        Write the predicted segmentations as paletted PNG files to
+        <result_path>/results/VOC<year>/Segmentation
+        :param pred_segmentations: the predicted segmentation results, in the order of self.image_set_index
+        :return: [None]
         """
-        # make all these folders for results
         result_dir = os.path.join(self.result_path, 'results')
         if not os.path.exists(result_dir):
             os.mkdir(result_dir)
-        # Always reformat result in case of sometimes masks are not
-        # binary or is in shape (n, sz*sz) instead of (n, sz, sz)
-        all_boxes, all_masks = self._reformat_result(all_boxes, all_masks)
-        for cls_inds, cls in enumerate(self.classes):
-            if cls == '__background__':
-                continue
-            print 'Writing {} VOC results file'.format(cls)
-            filename = os.path.join(result_dir, cls + '_det.pkl')
-            print filename
-            with open(filename, 'wb') as f:
-                cPickle.dump(all_boxes[cls_inds], f, cPickle.HIGHEST_PROTOCOL)
-            filename = os.path.join(result_dir, cls + '_seg.pkl')
-            with open(filename, 'wb') as f:
-                cPickle.dump(all_masks[cls_inds], f, cPickle.HIGHEST_PROTOCOL)
-
-    def _reformat_result(self, boxes, masks):
-        num_images = self.num_images
-        num_class = len(self.classes)
-        reformat_masks = [[[] for _ in xrange(num_images)]
-                          for _ in xrange(num_class)]
-        for cls_inds in xrange(1, num_class):
-            for img_inds in xrange(num_images):
-                if len(masks[cls_inds][img_inds]) == 0:
-                    continue
-                num_inst = masks[cls_inds][img_inds].shape[0]
-                #print num_inst
-                #print masks[cls_inds][img_inds].shape
-                reformat_masks[cls_inds][img_inds] = masks[cls_inds][img_inds]\
-                    .reshape(num_inst, self.mask_size, self.mask_size)
-                # reformat_masks[cls_inds][img_inds] = reformat_masks[cls_inds][img_inds] >= 0.4
-        all_masks = reformat_masks
-        return boxes, all_masks
+        year_folder = os.path.join(self.result_path, 'results', 'VOC' + self.year)
+        if not os.path.exists(year_folder):
+            os.mkdir(year_folder)
+        res_file_folder = os.path.join(self.result_path, 'results', 'VOC' + self.year, 'Segmentation')
+        if not os.path.exists(res_file_folder):
+            os.mkdir(res_file_folder)
+
+        result_dir = os.path.join(self.result_path, 'results', 'VOC' + self.year, 'Segmentation')
+        if not os.path.exists(result_dir):
+            os.mkdir(result_dir)
+
+        pallete = self.get_pallete(256)
+
+        for i, index in enumerate(self.image_set_index):
+            segmentation_result = np.uint8(np.squeeze(np.copy(pred_segmentations[i])))
+            segmentation_result = PIL.Image.fromarray(segmentation_result)
+            segmentation_result.putpalette(pallete)
+            segmentation_result.save(os.path.join(result_dir, '%s.png'%(index)))
+
+    def get_pallete(self, num_cls):
+        """
+        this function is to get the colormap for visualizing the segmentation mask
+        :param num_cls: the number of visualized classes
+        :return: the palette
+        """
+        n = num_cls
+        pallete = [0]*(n*3)
+        for j in xrange(0,n):
+                lab = j
+                pallete[j*3+0] = 0
+                pallete[j*3+1] = 0
+                pallete[j*3+2] = 0
+                i = 0
+                while (lab > 0):
+                        pallete[j*3+0] |= (((lab >> 0) & 1) << (7-i))
+                        pallete[j*3+1] |= (((lab >> 1) & 1) << (7-i))
+                        pallete[j*3+2] |= (((lab >> 2) & 1) << (7-i))
+                        i = i + 1
+                        lab >>= 3
+        return pallete
+
+    def get_confusion_matrix(self, gt_label, pred_label, class_num):
+        """
+        Calculate the confusion matrix from the given label and prediction
+        :param gt_label: the ground truth label
+        :param pred_label: the predicted label
+        :param class_num: the number of classes
+        :return: the confusion matrix
+        """
+        index = (gt_label * class_num + pred_label).astype('int32')
+        label_count = np.bincount(index)
+        confusion_matrix = np.zeros((class_num, class_num))
+
+        for i_label in range(class_num):
+            for i_pred_label in range(class_num):
+                cur_index = i_label * class_num + i_pred_label
+                if cur_index < len(label_count):
+                    confusion_matrix[i_label, i_pred_label] = label_count[cur_index]
+
+        return confusion_matrix
 
     def _py_evaluate_segmentation(self):
-        info_str = ''
-        gt_dir = self.data_path
-        imageset_file = os.path.join(self.data_path, 'ImageSets', 'Main', self.image_set + '.txt')
-        cache_dir = os.path.join(self.devkit_path, 'annotations_cache')
-        output_dir = os.path.join(self.result_path, 'results')
-        aps = []
-        # define this as true according to SDS's evaluation protocol
-        use_07_metric = True
-        print 'VOC07 metric? ' + ('Yes' if use_07_metric else 'No')
-        info_str += 'VOC07 metric? ' + ('Y' if use_07_metric else 'No')
-        info_str += '\n'
-        if not os.path.isdir(output_dir):
-            os.mkdir(output_dir)
-        print '~~~~~~ Evaluation use min overlap = 0.5 ~~~~~~'
-        info_str += '~~~~~~ Evaluation use min overlap = 0.5 ~~~~~~'
-        info_str += '\n'
-        for i, cls in enumerate(self.classes):
-            if cls == '__background__':
-                continue
-            det_filename = os.path.join(output_dir, cls + '_det.pkl')
-            seg_filename = os.path.join(output_dir, cls + '_seg.pkl')
-            ap = voc_eval_sds(det_filename, seg_filename, gt_dir,
-                              imageset_file, cls, cache_dir, self.classes, self.mask_size, self.binary_thresh, ov_thresh=0.5)
-            aps += [ap]
-            print('AP for {} = {:.2f}'.format(cls, ap*100))
-            info_str += 'AP for {} = {:.2f}\n'.format(cls, ap*100)
-        print('Mean AP@0.5 = {:.2f}'.format(np.mean(aps)*100))
-        info_str += 'Mean AP@0.5 = {:.2f}\n'.format(np.mean(aps)*100)
-        print '~~~~~~ Evaluation use min overlap = 0.7 ~~~~~~'
-        info_str += '~~~~~~ Evaluation use min overlap = 0.7 ~~~~~~\n'
-        aps = []
-        for i, cls in enumerate(self.classes):
-            if cls == '__background__':
-                continue
-            det_filename = os.path.join(output_dir, cls + '_det.pkl')
-            seg_filename = os.path.join(output_dir, cls + '_seg.pkl')
-            ap = voc_eval_sds(det_filename, seg_filename, gt_dir,
-                              imageset_file, cls, cache_dir, self.classes, self.mask_size, self.binary_thresh, ov_thresh=0.7)
-            aps += [ap]
-            print('AP for {} = {:.2f}'.format(cls, ap*100))
-            info_str += 'AP for {} = {:.2f}\n'.format(cls, ap*100)
-        print('Mean AP@0.7 = {:.2f}'.format(np.mean(aps)*100))
-        info_str += 'Mean AP@0.7 = {:.2f}\n'.format(np.mean(aps)*100)
+        """
+        This function is a wrapper to calculate the metrics for the predicted segmentation
+        results, which are read back from the PNG files previously written under result_dir
+        :return: the evaluation metrics, a dict with keys 'meanIU' and 'IU_array'
+        """
+        confusion_matrix = np.zeros((self.num_classes,self.num_classes))
+        result_dir = os.path.join(self.result_path, 'results', 'VOC' + self.year, 'Segmentation')
 
-        return info_str
+        for i, index in enumerate(self.image_set_index):
+            seg_gt_info = self.load_pascal_segmentation_annotation(index)
+            seg_gt_path = seg_gt_info['seg_cls_path']
+            seg_gt = np.array(PIL.Image.open(seg_gt_path)).astype('float32')
+            seg_pred_path = os.path.join(result_dir, '%s.png'%(index))
+            seg_pred = np.array(PIL.Image.open(seg_pred_path)).astype('float32')
+
+            seg_gt = cv2.resize(seg_gt, (seg_pred.shape[1], seg_pred.shape[0]), interpolation=cv2.INTER_NEAREST)
+            ignore_index = seg_gt != 255
+            seg_gt = seg_gt[ignore_index]
+            seg_pred = seg_pred[ignore_index]
+
+            confusion_matrix += self.get_confusion_matrix(seg_gt, seg_pred, self.num_classes)
+
+        pos = confusion_matrix.sum(1)
+        res = confusion_matrix.sum(0)
+        tp = np.diag(confusion_matrix)
+
+        IU_array = (tp / np.maximum(1.0, pos + res - tp))
+        mean_IU = IU_array.mean()
+
+        return {'meanIU':mean_IU, 'IU_array':IU_array}
 
     def get_result_file_template(self):
         """
diff --git a/lib/dataset/pascal_voc_eval.py b/lib/dataset/pascal_voc_eval.py
index d779a9a..f5cc106 100644
--- a/lib/dataset/pascal_voc_eval.py
+++ b/lib/dataset/pascal_voc_eval.py
@@ -360,4 +360,4 @@ def check_voc_sds_cache(cache_dir, devkit_path, image_names, class_names):
                 continue
             cachefile = os.path.join(cache_dir, name + '_mask_gt.pkl')
             with open(cachefile, 'wb') as f:
-                cPickle.dump(record_list[cls_ind], f)
+                cPickle.dump(record_list[cls_ind], f)
\ No newline at end of file
diff --git a/lib/dataset/pascal_voc_segmentation.py b/lib/dataset/pascal_voc_segmentation.py
deleted file mode 100644
index de53cf3..0000000
--- a/lib/dataset/pascal_voc_segmentation.py
+++ /dev/null
@@ -1,221 +0,0 @@
-"""
-Pascal VOC Segmentation database
-This class loads ground truth notations from standard Pascal VOC XML data formats
-and transform them into IMDB format. Selective search is used for proposals, see segdb
-function. Results are written as the Pascal VOC format. Evaluation is based on mAP
-criterion.
-"""
-
-import cPickle
-import os
-import cv2
-import numpy as np
-
-from imdb import IMDB
-from PIL import Image
-from utils import image
-
-class PascalVOC_Segmentation(IMDB):
-    def __init__(self, image_set, root_path, devkit_path, result_path=None):
-        """
-        fill basic information to initialize imdb
-        :param image_set: 2007_trainval, 2007_test, etc
-        :param root_path: 'selective_search_data' and 'cache'
-        :param devkit_path: data and results
-        :return: imdb object
-        """
-        year, image_set = image_set.split('_', 1)
-        super(PascalVOC_Segmentation, self).__init__('voc_' + year, image_set, root_path, devkit_path, result_path)  # set self.name
-
-        self.year = year
-        self.root_path = root_path
-        self.devkit_path = devkit_path
-        self.data_path = os.path.join(devkit_path, 'VOC' + year)
-
-        self.classes = ['__background__',  # always index 0
-                        'aeroplane', 'bicycle', 'bird', 'boat',
-                        'bottle', 'bus', 'car', 'cat', 'chair',
-                        'cow', 'diningtable', 'dog', 'horse',
-                        'motorbike', 'person', 'pottedplant',
-                        'sheep', 'sofa', 'train', 'tvmonitor']
-        self.num_classes = len(self.classes)
-        self.image_set_index = self.load_image_set_index()
-        self.num_images = len(self.image_set_index)
-        print 'num_images', self.num_images
-
-        self.config = {'comp_id': 'comp4',
-                       'use_diff': False,
-                       'min_size': 2}
-
-    def load_image_set_index(self):
-        """
-        find out which indexes correspond to given image set (train or val)
-        :return:
-        """
-        image_set_index_file = os.path.join(self.data_path, 'ImageSets', 'Segmentation', self.image_set + '.txt')
-        assert os.path.exists(image_set_index_file), 'Path does not exist: {}'.format(image_set_index_file)
-        with open(image_set_index_file) as f:
-            image_set_index = [x.strip() for x in f.readlines()]
-        return image_set_index
-
-    def image_path_from_index(self, index):
-        """
-        given image index, find out full path
-        :param index: index of a specific image
-        :return: full path of this image
-        """
-        image_file = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
-        assert os.path.exists(image_file), 'Path does not exist: {}'.format(image_file)
-        return image_file
-
-    def segmentation_class_path_from_index(self, index):
-        """
-        given image index, find out the full path of segmentation class
-        :param index: index of a specific image
-        :return: full path of segmentation class
-        """
-        seg_class_file = os.path.join(self.data_path, 'SegmentationClass', index + '.png')
-        assert os.path.exists(seg_class_file), 'Path does not exist: {}'.format(seg_class_file)
-        return seg_class_file
-
-    def gt_segdb(self):
-        """
-        return ground truth image regions database
-        :return: imdb[image_index]['boxes', 'gt_classes', 'gt_overlaps', 'flipped']
-        """
-        cache_file = os.path.join(self.cache_path, self.name + '_gt_segdb.pkl')
-        if os.path.exists(cache_file):
-            with open(cache_file, 'rb') as fid:
-                segdb = cPickle.load(fid)
-            print '{} gt segdb loaded from {}'.format(self.name, cache_file)
-            return segdb
-
-        gt_segdb = [self.load_pascal_annotation(index) for index in self.image_set_index]
-        with open(cache_file, 'wb') as fid:
-            cPickle.dump(gt_segdb, fid, cPickle.HIGHEST_PROTOCOL)
-        print 'wrote gt segdb to {}'.format(cache_file)
-
-        return gt_segdb
-
-    def load_pascal_annotation(self, index):
-        """
-        for a given index, load image and bounding boxes info from XML file
-        :param index: index of a specific image
-        :return: record['seg_cls_path', 'flipped']
-        """
-        import xml.etree.ElementTree as ET
-        seg_rec = dict()
-        seg_rec['image'] = self.image_path_from_index(index)
-        size = cv2.imread(seg_rec['image']).shape
-        seg_rec['height'] = size[0]
-        seg_rec['width'] = size[1]
-
-        seg_rec['seg_cls_path'] = self.segmentation_class_path_from_index(index)
-        seg_rec['flipped'] = False
-
-        return seg_rec
-
-    def getpallete(self, num_cls):
-        """
-        this function is to get the colormap for visualizing the segmentation mask
-        :param num_cls: the number of visulized class
-        :return: the pallete
-        """
-        n = num_cls
-        pallete = [0]*(n*3)
-        for j in xrange(0,n):
-                lab = j
-                pallete[j*3+0] = 0
-                pallete[j*3+1] = 0
-                pallete[j*3+2] = 0
-                i = 0
-                while (lab > 0):
-                        pallete[j*3+0] |= (((lab >> 0) & 1) << (7-i))
-                        pallete[j*3+1] |= (((lab >> 1) & 1) << (7-i))
-                        pallete[j*3+2] |= (((lab >> 2) & 1) << (7-i))
-                        i = i + 1
-                        lab >>= 3
-        return pallete
-
-    def evaluate_segmentations(self, pred_segmentations):
-        """
-        top level evaluations
-        :param pred_segmentations: the pred segmentation result
-        :return: the evaluation results
-        """
-        # make all these folders for results
-        result_dir = os.path.join(self.result_path, 'results')
-        if not os.path.exists(result_dir):
-            os.mkdir(result_dir)
-        year_folder = os.path.join(self.result_path, 'results', 'VOC' + self.year)
-        if not os.path.exists(year_folder):
-            os.mkdir(year_folder)
-        res_file_folder = os.path.join(self.result_path, 'results', 'VOC' + self.year, 'Segmentation')
-        if not os.path.exists(res_file_folder):
-            os.mkdir(res_file_folder)
-
-        info = self.do_python_eval(pred_segmentations)
-        self.write_segmentation_result(pred_segmentations, res_file_folder)
-        return info
-
-    def write_segmentation_result(self, pred_segmentations, res_file_folder):
-        """
-        Write pred segmentation to res_file_folder
-        :param pred_segmentations: the pred segmentation results
-        :param res_file_folder: the saving folder
-        :return: [None]
-        """
-        pallete = self.getpallete(256)
-
-        for i in range(len(pred_segmentations)):
-            segmentation_result = np.uint8(np.squeeze(np.copy(pred_segmentations[i])))
-            segmentation_result = Image.fromarray(segmentation_result)
-            segmentation_result.putpalette(pallete)
-            segmentation_result.save(os.path.join(res_file_folder, '%d_result.png'%(i)))
-
-    def get_confusion_matrix(self, gt_label, pred_label, class_num):
-        """
-        Calcute the confusion matrix by given label and pred
-        :param gt_label: the ground truth label
-        :param pred_label: the pred label
-        :param class_num: the nunber of class
-        :return: the confusion matrix
-        """
-        index = (gt_label * class_num + pred_label).astype('int32')
-        label_count = np.bincount(index)
-        confusion_matrix = np.zeros((class_num, class_num))
-
-        for i_label in range(class_num):
-            for i_pred_label in range(class_num):
-                cur_index = i_label * class_num + i_pred_label
-                if cur_index < len(label_count):
-                    confusion_matrix[i_label, i_pred_label] = label_count[cur_index]
-
-        return confusion_matrix
-
-    def do_python_eval(self, pred_segmentations):
-        """
-        This function is a wrapper to calculte the metrics for given pred_segmentation results
-        :param pred_segmentations: the pred segmentation result
-        :return: the evaluation metrics
-        """
-        confusion_matrix = np.zeros((self.num_classes,self.num_classes))
-        for i, index in enumerate(self.image_set_index):
-            seg_gt_info = self.load_pascal_annotation(index)
-            seg_gt = np.array(Image.open(seg_gt_info['seg_cls_path'])).astype('float32')
-            seg_pred = np.squeeze(pred_segmentations[i])
-
-            seg_gt = cv2.resize(seg_gt, (seg_pred.shape[1], seg_pred.shape[0]), interpolation=cv2.INTER_NEAREST)
-            ignore_index = seg_gt != 255
-            seg_gt = seg_gt[ignore_index]
-            seg_pred = seg_pred[ignore_index]
-
-            confusion_matrix += self.get_confusion_matrix(seg_gt, seg_pred, self.num_classes)
-
-        pos = confusion_matrix.sum(1)
-        res = confusion_matrix.sum(0)
-        tp = np.diag(confusion_matrix)
-
-        mean_IU = (tp / np.maximum(1.0, pos + res - tp)).mean()
-
-        return {'meanIU':mean_IU}
diff --git a/lib/mask/mask_transform.py b/lib/mask/mask_transform.py
index 05e76b3..84e6ad5 100644
--- a/lib/mask/mask_transform.py
+++ b/lib/mask/mask_transform.py
@@ -6,24 +6,6 @@
 # --------------------------------------------------------
 
 import numpy as np
-import os
-import cv2
-
-
-
-def get_gt_masks(gt_mask_file, size):
-    """
-    This function load cached gt_masks from .hkl
-    :param roidb:
-    :return:
-    """
-    assert os.path.exists(gt_mask_file), '%s does not exist'.format(gt_mask_file)
-    gt_masks = hkl.load(gt_mask_file)
-    num_mask = gt_masks.shape[0]
-    processed_masks = np.zeros((num_mask, size[0], size[1]))
-    for i in range(num_mask):
-        processed_masks[i,:,:] = cv2.resize(gt_masks[i].astype('float'), (size[1], size[0]))
-    return processed_masks
 
 
 def intersect_box_mask(ex_box, gt_box, gt_mask):
diff --git a/lib/nms/cpu_nms.pyx b/lib/nms/cpu_nms.pyx
index 6b8df57..c1266bc 100644
--- a/lib/nms/cpu_nms.pyx
+++ b/lib/nms/cpu_nms.pyx
@@ -22,7 +22,7 @@ def cpu_nms(np.ndarray[np.float32_t, ndim=2] dets, np.float thresh):
     cdef np.ndarray[np.float32_t, ndim=1] scores = dets[:, 4]
 
     cdef np.ndarray[np.float32_t, ndim=1] areas = (x2 - x1 + 1) * (y2 - y1 + 1)
-    cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1].astype('f')
+    cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1].astype(np.int_)
 
     cdef int ndets = dets.shape[0]
     cdef np.ndarray[np.int_t, ndim=1] suppressed = \
diff --git a/lib/utils/load_data.py b/lib/utils/load_data.py
index 768ced2..9cbf045 100644
--- a/lib/utils/load_data.py
+++ b/lib/utils/load_data.py
@@ -12,17 +12,6 @@ def load_gt_roidb(dataset_name, image_set_name, root_path, dataset_path, result_
     return roidb
 
 
-def load_gt_sdsdb(dataset_name, image_set_name, root_path, dataset_path,
-                  result_path=None, flip=False, mask_size=21, binary_thresh=0.4):
-    """ load ground truth sdsdb """
-    imdb = eval(dataset_name)(image_set_name, root_path, dataset_path, result_path,
-                              mask_size=mask_size, binary_thresh=binary_thresh)
-    sdsdb = imdb.gt_sdsdb()
-    if flip:
-        sdsdb = imdb.append_flipped_images(sdsdb)
-    return sdsdb
-
-
 def load_proposal_roidb(dataset_name, image_set_name, root_path, dataset_path, result_path=None,
                         proposal='rpn', append_gt=True, flip=False):
     """ load proposal roidb (append_gt when training) """
@@ -61,6 +50,7 @@ def is_valid(entry):
 
     return filtered_roidb
 
+
 def load_gt_segdb(dataset_name, image_set_name, root_path, dataset_path, result_path=None,
                   flip=False):
     """ load ground truth segdb """
@@ -70,6 +60,7 @@ def load_gt_segdb(dataset_name, image_set_name, root_path, dataset_path, result_
         segdb = imdb.append_flipped_images_for_segmentation(segdb)
     return segdb
 
+
 def merge_segdb(segdbs):
     """ segdb are list, concat them together """
     segdb = segdbs[0]
diff --git a/lib/utils/show_offset.py b/lib/utils/show_offset.py
new file mode 100644
index 0000000..88f40f0
--- /dev/null
+++ b/lib/utils/show_offset.py
@@ -0,0 +1,136 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Guodong Zhang
+# --------------------------------------------------------
+
+import matplotlib.pyplot as plt
+import numpy as np
+
+def show_boxes_simple(bbox, color='r', lw=2):
+    rect = plt.Rectangle((bbox[0], bbox[1]), bbox[2] - bbox[0],
+                          bbox[3] - bbox[1], fill=False, edgecolor=color, linewidth=lw)
+    plt.gca().add_patch(rect)
+
+def kernel_inv_map(vis_attr, target_point, map_h, map_w):
+    pos_shift = [vis_attr['dilation'] * 0 - vis_attr['pad'],
+                 vis_attr['dilation'] * 1 - vis_attr['pad'],
+                 vis_attr['dilation'] * 2 - vis_attr['pad']]
+    source_point = []
+    for idx in range(vis_attr['filter_size']**2):
+        cur_source_point = np.array([target_point[0] + pos_shift[idx // 3],
+                                     target_point[1] + pos_shift[idx % 3]])
+        if cur_source_point[0] < 0 or cur_source_point[1] < 0 \
+                or cur_source_point[0] > map_h - 1 or cur_source_point[1] > map_w - 1:
+            continue
+        source_point.append(cur_source_point.astype('f'))
+    return source_point
+
+def offset_inv_map(source_points, offset):
+    for idx, _ in enumerate(source_points):
+        source_points[idx][0] += offset[2*idx]
+        source_points[idx][1] += offset[2*idx + 1]
+    return source_points
+
+def get_bottom_position(vis_attr, top_points, all_offset):
+    map_h = all_offset[0].shape[2]
+    map_w = all_offset[0].shape[3]
+
+    for level in range(vis_attr['plot_level']):
+        source_points = []
+        for idx, cur_top_point in enumerate(top_points):
+            cur_top_point = np.round(cur_top_point)
+            if cur_top_point[0] < 0 or cur_top_point[1] < 0 \
+                or cur_top_point[0] > map_h-1 or cur_top_point[1] > map_w-1:
+                continue
+            cur_source_point = kernel_inv_map(vis_attr, cur_top_point, map_h, map_w)
+            cur_offset = np.squeeze(all_offset[level][:, :, int(cur_top_point[0]), int(cur_top_point[1])])
+            cur_source_point = offset_inv_map(cur_source_point, cur_offset)
+            source_points = source_points + cur_source_point
+        top_points = source_points
+    return source_points
+
+def plot_according_to_point(vis_attr, im, source_points, map_h, map_w, color=[255,0,0]):
+    plot_area = vis_attr['plot_area']
+    for idx, cur_source_point in enumerate(source_points):
+        y = np.round((cur_source_point[0] + 0.5) * im.shape[0] / map_h).astype('i')
+        x = np.round((cur_source_point[1] + 0.5) * im.shape[1] / map_w).astype('i')
+
+        if x < 0 or y < 0 or x > im.shape[1]-1 or y > im.shape[0]-1:
+            continue
+        y = min(y, im.shape[0] - vis_attr['plot_area'] - 1)
+        x = min(x, im.shape[1] - vis_attr['plot_area'] - 1)
+        y = max(y, vis_attr['plot_area'])
+        x = max(x, vis_attr['plot_area'])
+        im[y-plot_area:y+plot_area+1, x-plot_area:x+plot_area+1, :] = np.tile(
+            np.reshape(color, (1, 1, 3)), (2*plot_area+1, 2*plot_area+1, 1)
+        )
+    return im
+
+
+
+def show_dpsroi_offset(im, boxes, offset, classes, trans_std=0.1):
+    plt.cla()
+    for idx, bbox in enumerate(boxes):
+        plt.figure(idx+1)
+        plt.axis("off")
+        plt.imshow(im)
+
+        offset_w = np.squeeze(offset[idx, classes[idx]*2, :, :]) * trans_std
+        offset_h = np.squeeze(offset[idx, classes[idx]*2+1, :, :]) * trans_std
+        x1 = int(bbox[0])
+        y1 = int(bbox[1])
+        x2 = int(bbox[2])
+        y2 = int(bbox[3])
+        roi_width = x2-x1+1
+        roi_height = y2-y1+1
+        part_size = offset_w.shape[0]
+        bin_size_w = float(roi_width) / part_size
+        bin_size_h = float(roi_height) / part_size
+        show_boxes_simple(bbox, color='b')
+        for ih in range(part_size):
+            for iw in range(part_size):
+                sub_box = np.array([x1+iw*bin_size_w, y1+ih*bin_size_h,
+                                    x1+(iw+1)*bin_size_w, y1+(ih+1)*bin_size_h])
+                sub_offset = offset_h[ih, iw] * np.array([0, 1, 0, 1]) * roi_height \
+                             + offset_w[ih, iw] * np.array([1, 0, 1, 0]) * roi_width
+                sub_box = sub_box + sub_offset
+                show_boxes_simple(sub_box)
+        plt.show()
+
+def show_dconv_offset(im, all_offset, step=[2, 2], filter_size=3,
+                      dilation=2, pad=2, plot_area=2, plot_level=3):
+    vis_attr = {'filter_size': filter_size, 'dilation': dilation, 'pad': pad,
+                'plot_area': plot_area, 'plot_level': plot_level}
+
+    map_h = all_offset[0].shape[2]
+    map_w = all_offset[0].shape[3]
+
+    step_h = step[0]
+    step_w = step[1]
+    start_h = np.round(step_h / 2)
+    start_w = np.round(step_w / 2)
+
+    plt.figure()
+    for im_h in range(start_h, map_h, step_h):
+        for im_w in range(start_w, map_w, step_w):
+            target_point = np.array([im_h, im_w])
+            source_y = np.round(target_point[0] * im.shape[0] / map_h)
+            source_x = np.round(target_point[1] * im.shape[1] / map_w)
+            if source_y < plot_area or source_x < plot_area \
+                    or source_y >= im.shape[0] - plot_area or source_x >= im.shape[1] - plot_area:
+                continue
+
+            cur_im = np.copy(im)
+            source_points = get_bottom_position(vis_attr, [target_point], all_offset)
+            cur_im = plot_according_to_point(vis_attr, cur_im, source_points, map_h, map_w)
+            cur_im[source_y-plot_area:source_y+plot_area+1, source_x-plot_area:source_x+plot_area+1, :] = \
+                np.tile(np.reshape([0, 255, 0], (1, 1, 3)), (2*plot_area+1, 2*plot_area+1, 1))
+
+
+            plt.axis("off")
+            plt.imshow(cur_im)
+            plt.show(block=False)
+            plt.pause(0.01)
+            plt.clf()
diff --git a/rfcn/core/module.py b/rfcn/core/module.py
index 42b067e..25924fb 100644
--- a/rfcn/core/module.py
+++ b/rfcn/core/module.py
@@ -741,7 +741,7 @@ def __init__(self, symbol, data_names, label_names,
         if fixed_param_prefix is not None:
             for name in self._symbol.list_arguments():
                 for prefix in self._fixed_param_prefix:
-                    if name.startswith(prefix):
+                    if prefix in name:
                         fixed_param_names.append(name)
         self._fixed_param_names = fixed_param_names
         self._preload_opt_states = None
diff --git a/rfcn/deform_conv_demo.py b/rfcn/deform_conv_demo.py
new file mode 100644
index 0000000..0cdd15d
--- /dev/null
+++ b/rfcn/deform_conv_demo.py
@@ -0,0 +1,81 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Yi Li, Haocheng Zhang
+# --------------------------------------------------------
+
+import _init_paths
+import os
+import sys
+import pprint
+import cv2
+from config.config import config, update_config
+from utils.image import resize, transform
+import numpy as np
+# get config
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+cur_path = os.path.abspath(os.path.dirname(__file__))
+update_config(cur_path + '/../experiments/rfcn/cfgs/deform_conv_demo.yaml')
+
+sys.path.insert(0, os.path.join(cur_path, '../external/mxnet', config.MXNET_VERSION))
+import mxnet as mx
+from core.tester import Predictor
+from symbols import *
+from utils.load_model import load_param
+from utils.show_offset import show_dconv_offset
+
+def main():
+    # get symbol
+    pprint.pprint(config)
+    sym_instance = eval(config.symbol + '.' + config.symbol)()
+    sym = sym_instance.get_symbol(config, is_train=False)
+
+    # load demo data
+    image_names = ['000240.jpg', '000437.jpg', '004072.jpg', '007912.jpg']
+    image_all = []
+    data = []
+    for im_name in image_names:
+        assert os.path.exists(cur_path + '/../demo/deform_conv/' + im_name), \
+            ('{} does not exist'.format('../demo/deform_conv/' + im_name))
+        im = cv2.imread(cur_path + '/../demo/deform_conv/' + im_name, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)
+        image_all.append(im)
+        target_size = config.SCALES[0][0]
+        max_size = config.SCALES[0][1]
+        im, im_scale = resize(im, target_size, max_size, stride=config.network.IMAGE_STRIDE)
+        im_tensor = transform(im, config.network.PIXEL_MEANS)
+        im_info = np.array([[im_tensor.shape[2], im_tensor.shape[3], im_scale]], dtype=np.float32)
+        data.append({'data': im_tensor, 'im_info': im_info})
+
+    # get predictor
+    data_names = ['data', 'im_info']
+    label_names = []
+    data = [[mx.nd.array(data[i][name]) for name in data_names] for i in xrange(len(data))]
+    max_data_shape = [[('data', (1, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]]
+    provide_data = [[(k, v.shape) for k, v in zip(data_names, data[i])] for i in xrange(len(data))]
+    provide_label = [None for i in xrange(len(data))]
+    arg_params, aux_params = load_param(cur_path + '/../model/deform_conv', 0, process=True)
+    predictor = Predictor(sym, data_names, label_names,
+                          context=[mx.gpu(0)], max_data_shapes=max_data_shape,
+                          provide_data=provide_data, provide_label=provide_label,
+                          arg_params=arg_params, aux_params=aux_params)
+
+    # test
+    for idx, _ in enumerate(image_names):
+        data_batch = mx.io.DataBatch(data=[data[idx]], label=[], pad=0, index=idx,
+                                     provide_data=[[(k, v.shape) for k, v in zip(data_names, data[idx])]],
+                                     provide_label=[None])
+
+        output = predictor.predict(data_batch)
+        res5a_offset = output[0]['res5a_branch2b_offset_output'].asnumpy()
+        res5b_offset = output[0]['res5b_branch2b_offset_output'].asnumpy()
+        res5c_offset = output[0]['res5c_branch2b_offset_output'].asnumpy()
+
+        im = image_all[idx]
+        im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
+        show_dconv_offset(im, [res5c_offset, res5b_offset, res5a_offset])
+
+if __name__ == '__main__':
+    main()
diff --git a/rfcn/deform_psroi_demo.py b/rfcn/deform_psroi_demo.py
new file mode 100644
index 0000000..4c02a4c
--- /dev/null
+++ b/rfcn/deform_psroi_demo.py
@@ -0,0 +1,79 @@
+import _init_paths
+
+import os
+import sys
+import pprint
+import cv2
+from config.config import config, update_config
+from utils.image import resize, transform
+import numpy as np
+# get config
+os.environ['PYTHONUNBUFFERED'] = '1'
+os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
+os.environ['MXNET_ENABLE_GPU_P2P'] = '0'
+cur_path = os.path.abspath(os.path.dirname(__file__))
+update_config(cur_path + '/../experiments/rfcn/cfgs/deform_psroi_demo.yaml')
+
+sys.path.insert(0, os.path.join(cur_path, '../external/mxnet', config.MXNET_VERSION))
+import mxnet as mx
+from core.tester import Predictor
+from symbols import *
+from utils.load_model import load_param
+from utils.show_offset import show_dpsroi_offset
+
+def main():
+    # get symbol
+    pprint.pprint(config)
+    sym_instance = eval(config.symbol + '.' + config.symbol)()
+    sym = sym_instance.get_symbol_rfcn(config, is_train=False)
+
+    # load demo data
+    image_names = ['000057.jpg', '000149.jpg', '000351.jpg', '002535.jpg']
+    image_all = []
+    # ground truth boxes
+    gt_boxes_all = [np.array([[132, 52, 384, 357]]), np.array([[113, 1, 350, 360]]),
+                    np.array([[0, 27, 329, 155]]), np.array([[8, 40, 499, 289]])]
+    gt_classes_all = [np.array([3]), np.array([16]), np.array([7]), np.array([12])]
+    data = []
+    for idx, im_name in enumerate(image_names):
+        assert os.path.exists(cur_path + '/../demo/deform_psroi/' + im_name), \
+            ('{} does not exist'.format('../demo/deform_psroi/' + im_name))
+        im = cv2.imread(cur_path + '/../demo/deform_psroi/' + im_name, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION)
+        image_all.append(im)
+        target_size = config.SCALES[0][0]
+        max_size = config.SCALES[0][1]
+        im, im_scale = resize(im, target_size, max_size, stride=config.network.IMAGE_STRIDE)
+        im_tensor = transform(im, config.network.PIXEL_MEANS)
+        gt_boxes = gt_boxes_all[idx]
+        gt_boxes = np.round(gt_boxes * im_scale)
+        data.append({'data': im_tensor, 'rois': np.hstack((np.zeros((gt_boxes.shape[0], 1)), gt_boxes))})
+
+    # get predictor
+    data_names = ['data', 'rois']
+    label_names = []
+    data = [[mx.nd.array(data[i][name]) for name in data_names] for i in xrange(len(data))]
+    max_data_shape = [[('data', (1, 3, max([v[0] for v in config.SCALES]), max([v[1] for v in config.SCALES])))]]
+    provide_data = [[(k, v.shape) for k, v in zip(data_names, data[i])] for i in xrange(len(data))]
+    provide_label = [None for i in xrange(len(data))]
+    arg_params, aux_params = load_param(cur_path + '/../model/deform_psroi', 0, process=True)
+    predictor = Predictor(sym, data_names, label_names,
+                          context=[mx.gpu(0)], max_data_shapes=max_data_shape,
+                          provide_data=provide_data, provide_label=provide_label,
+                          arg_params=arg_params, aux_params=aux_params)
+
+    # test
+    for idx, _ in enumerate(image_names):
+        data_batch = mx.io.DataBatch(data=[data[idx]], label=[], pad=0, index=idx,
+                                     provide_data=[[(k, v.shape) for k, v in zip(data_names, data[idx])]],
+                                     provide_label=[None])
+
+        output = predictor.predict(data_batch)
+        cls_offset = output[0]['rfcn_cls_offset_output'].asnumpy()
+
+        im = image_all[idx]
+        im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
+        boxes = gt_boxes_all[idx]
+        show_dpsroi_offset(im, boxes, cls_offset, gt_classes_all[idx])
+
+if __name__ == '__main__':
+    main()
diff --git a/rfcn/operator_cxx/deformable_convolution-inl.h b/rfcn/operator_cxx/deformable_convolution-inl.h
index 56d3aed..16ee102 100644
--- a/rfcn/operator_cxx/deformable_convolution-inl.h
+++ b/rfcn/operator_cxx/deformable_convolution-inl.h
@@ -189,9 +189,6 @@ namespace mxnet {
         Tensor<xpu, 1, DType> data_grad = in_grad[conv::kData].FlatTo1D<xpu, DType>(s);
         data_grad = 0;
 
-        //Tensor<xpu, 1, DType> coord_grad = in_grad[conv::kOffset].FlatTo1D<xpu, DType>(s);
-        //coord_grad = 1;
-
 
         for (index_t n = 0; n < num_; ++n) {
           Tensor<xpu, 3, DType> out_grad_3d = out_grad_4d[n];
@@ -268,7 +265,7 @@ namespace mxnet {
         col_buffer_size_ = kernel_dim_ * group_ * conv_out_spatial_dim_;
         // input/output image size (#channels * height * width)
         input_dim_ = ishape.ProdShape(1, ishape.ndim());
-				input_offset_dim_ = ishape.ProdShape(1, offset_shape.ndim());
+        input_offset_dim_ = offset_shape.ProdShape(1, offset_shape.ndim());
         output_dim_ = oshape.ProdShape(1, oshape.ndim());
         num_kernels_im2col_ = conv_in_channels_ * conv_out_spatial_dim_;
         num_kernels_col2im_ = input_dim_;
diff --git a/rfcn/operator_cxx/deformable_convolution.cc b/rfcn/operator_cxx/deformable_convolution.cc
index a5c3880..a5916a5 100644
--- a/rfcn/operator_cxx/deformable_convolution.cc
+++ b/rfcn/operator_cxx/deformable_convolution.cc
@@ -18,7 +18,6 @@ Operator* CreateOp<cpu>(DeformableConvolutionParam param, int dtype,
                         std::vector<TShape> *out_shape,
                         Context ctx) {
   Operator *op = NULL;
-  // If 1D convolution, use MXNet implementation
   MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
     op = new DeformableConvolutionOp<cpu, DType>(param);
   })
@@ -76,38 +75,8 @@ along the first dimension. Next compute the convolution on the *i*-th part of
 the data with the *i*-th weight part. The output is obtained by concating all
 the *g* results.
 
-1-D convolution does not have *height* dimension but only *width* in space.
-
-- **data**: *(batch_size, channel, width)*
-- **weight**: *(num_filter, channel, kernel[0])*
-- **bias**: *(num_filter,)*
-- **out**: *(batch_size, num_filter, out_width)*.
-
-3-D convolution adds an additional *depth* dimension besides *height* and
-*width*. The shapes are
-
-- **data**: *(batch_size, channel, depth, height, width)*
-- **weight**: *(num_filter, channel, kernel[0], kernel[1], kernel[2])*
-- **bias**: *(num_filter,)*
-- **out**: *(batch_size, num_filter, out_depth, out_height, out_width)*.
-
 Both ``weight`` and ``bias`` are learnable parameters.
 
-There are other options to tune the performance.
-
-- **cudnn_tune**: enable this option leads to higher startup time but may give
-  faster speed. Options are
-
-  - **off**: no tuning
-  - **limited_workspace**:run test and pick the fastest algorithm that doesn't
-    exceed workspace limit.
-  - **fastest**: pick the fastest algorithm and ignore workspace limit.
-  - **None** (default): the behavior is determined by environment variable
-    ``MXNET_CUDNN_AUTOTUNE_DEFAULT``. 0 for off, 1 for limited workspace
-    (default), 2 for fastest.
-
-- **workspace**: A large number leads to more (GPU) memory usage but may improve
-  the performance.
 
 )code" ADD_FILELINE)
 .add_argument("data", "NDArray-or-Symbol", "Input data to the DeformableConvolutionOp.")
diff --git a/rfcn/operator_cxx/deformable_convolution.cu b/rfcn/operator_cxx/deformable_convolution.cu
index db5ce55..59948fd 100644
--- a/rfcn/operator_cxx/deformable_convolution.cu
+++ b/rfcn/operator_cxx/deformable_convolution.cu
@@ -10,20 +10,20 @@
 #include <vector>
 
 namespace mxnet {
-	namespace op {
+  namespace op {
 
-		template<>
-		Operator* CreateOp<gpu>(DeformableConvolutionParam param, int dtype,
-			std::vector<TShape> *in_shape,
-			std::vector<TShape> *out_shape,
-			Context ctx) {
-			Operator *op = NULL;
-			MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
-				op = new DeformableConvolutionOp<gpu, DType>(param);
-			})
-				return op;
-		}
+    template<>
+    Operator* CreateOp<gpu>(DeformableConvolutionParam param, int dtype,
+      std::vector<TShape> *in_shape,
+      std::vector<TShape> *out_shape,
+      Context ctx) {
+      Operator *op = NULL;
+      MSHADOW_REAL_TYPE_SWITCH(dtype, DType, {
+        op = new DeformableConvolutionOp<gpu, DType>(param);
+      })
+        return op;
+    }
 
-	}  // namespace op
+  }  // namespace op
 }  // namespace mxnet
 
diff --git a/rfcn/operator_cxx/deformable_psroi_pooling.cu b/rfcn/operator_cxx/deformable_psroi_pooling.cu
index 279080a..5b8f361 100644
--- a/rfcn/operator_cxx/deformable_psroi_pooling.cu
+++ b/rfcn/operator_cxx/deformable_psroi_pooling.cu
@@ -75,12 +75,12 @@ namespace mshadow {
         int n = index / pooled_width / pooled_height / output_dim;
 
         // [start, end) interval for spatial sampling
-        bottom_rois += n * 5;
-        int roi_batch_ind = bottom_rois[0];
-        DType roi_start_w = static_cast<DType>(round(bottom_rois[1])) * spatial_scale - 0.5;
-        DType roi_start_h = static_cast<DType>(round(bottom_rois[2])) * spatial_scale - 0.5;
-        DType roi_end_w = static_cast<DType>(round(bottom_rois[3]) + 1.) * spatial_scale - 0.5;
-        DType roi_end_h = static_cast<DType>(round(bottom_rois[4]) + 1.) * spatial_scale - 0.5;
+        const DType* offset_bottom_rois = bottom_rois + n * 5;
+        int roi_batch_ind = offset_bottom_rois[0];
+        DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
+        DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
+        DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
+        DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
 
         // Force too small ROIs to be 1x1
         DType roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
@@ -115,7 +115,7 @@ namespace mshadow {
         gw = min(max(gw, 0), group_size - 1);
         gh = min(max(gh, 0), group_size - 1);
 
-        bottom_data += (roi_batch_ind * channels) * height * width;
+        const DType* offset_bottom_data = bottom_data + (roi_batch_ind * channels) * height * width;
         for (int ih = 0; ih < sample_per_part; ih++) {
           for (int iw = 0; iw < sample_per_part; iw++) {
             DType w = wstart + iw*sub_bin_size_w;
@@ -127,7 +127,7 @@ namespace mshadow {
             w = min(max(w, 0.), width - 1.);
             h = min(max(h, 0.), height - 1.);
             int c = (ctop*group_size + gh)*group_size + gw;
-            DType val = bilinear_interp(bottom_data + c*height*width, w, h, width, height);
+            DType val = bilinear_interp(offset_bottom_data + c*height*width, w, h, width, height);
             sum += val;
             count++;
           }
@@ -206,12 +206,12 @@ namespace mshadow {
         int n = index / pooled_width / pooled_height / output_dim;
 
         // [start, end) interval for spatial sampling
-        bottom_rois += n * 5;
-        int roi_batch_ind = bottom_rois[0];
-        DType roi_start_w = static_cast<DType>(round(bottom_rois[1])) * spatial_scale - 0.5;
-        DType roi_start_h = static_cast<DType>(round(bottom_rois[2])) * spatial_scale - 0.5;
-        DType roi_end_w = static_cast<DType>(round(bottom_rois[3]) + 1.) * spatial_scale - 0.5;
-        DType roi_end_h = static_cast<DType>(round(bottom_rois[4]) + 1.) * spatial_scale - 0.5;
+        const DType* offset_bottom_rois = bottom_rois + n * 5;
+        int roi_batch_ind = offset_bottom_rois[0];
+        DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;
+        DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;
+        DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;
+        DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;
 
         // Force too small ROIs to be 1x1
         DType roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
@@ -243,8 +243,8 @@ namespace mshadow {
           continue;
         }
         DType diff_val = top_diff[index] / top_count[index];
-        bottom_data += roi_batch_ind * channels * height * width;
-        bottom_data_diff += roi_batch_ind * channels * height * width;
+        const DType* offset_bottom_data = bottom_data + roi_batch_ind * channels * height * width;
+        DType* offset_bottom_data_diff = bottom_data_diff + roi_batch_ind * channels * height * width;
         int gw = floor(static_cast<DType>(pw)* group_size / pooled_width);
         int gh = floor(static_cast<DType>(ph)* group_size / pooled_height);
         gw = min(max(gw, 0), group_size - 1);
@@ -271,21 +271,19 @@ namespace mshadow {
             DType q01 = (1 - dist_x)*dist_y;
             DType q10 = dist_x*(1 - dist_y);
             DType q11 = dist_x*dist_y;
-            DType* offset_bottom_data_diff = bottom_data_diff + c * height * width;
-            // mxnet_gpu_atomic_add(diff_val, offset_bottom_diff + bottom_index);
-            atomicAdd(offset_bottom_data_diff + y0*width + x0, q00*diff_val);
-            atomicAdd(offset_bottom_data_diff + y1*width + x0, q01*diff_val);
-            atomicAdd(offset_bottom_data_diff + y0*width + x1, q10*diff_val);
-            atomicAdd(offset_bottom_data_diff + y1*width + x1, q11*diff_val);
+            int bottom_index_base = c * height * width;
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y0*width + x0, q00*diff_val);
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y1*width + x0, q01*diff_val);
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y0*width + x1, q10*diff_val);
+            atomicAdd(offset_bottom_data_diff + bottom_index_base + y1*width + x1, q11*diff_val);
 
             if (no_trans) {
               continue;
             }
-            const DType* offset_bottom_data = bottom_data + c * height * width;
-            DType U00 = offset_bottom_data[y0*width + x0];
-            DType U01 = offset_bottom_data[y1*width + x0];
-            DType U10 = offset_bottom_data[y0*width + x1];
-            DType U11 = offset_bottom_data[y1*width + x1];
+            DType U00 = offset_bottom_data[bottom_index_base + y0*width + x0];
+            DType U01 = offset_bottom_data[bottom_index_base + y1*width + x0];
+            DType U10 = offset_bottom_data[bottom_index_base + y0*width + x1];
+            DType U11 = offset_bottom_data[bottom_index_base + y1*width + x1];
             DType diff_x = (U11*dist_y + U10*(1 - dist_y) - U01*dist_y - U00*(1 - dist_y))
               *trans_std*diff_val;
             diff_x *= roi_width;
diff --git a/rfcn/operator_cxx/nn/deformable_im2col.cuh b/rfcn/operator_cxx/nn/deformable_im2col.cuh
index 71d699a..d9e7b97 100644
--- a/rfcn/operator_cxx/nn/deformable_im2col.cuh
+++ b/rfcn/operator_cxx/nn/deformable_im2col.cuh
@@ -48,18 +48,18 @@
  *
  ***************** END Caffe Copyright Notice and Disclaimer ********************
  *
- * Copyright (c) 2017 by Contributors
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
  * \file deformable_im2col.cuh
  * \brief Function definitions of converting an image to
- * column matrix based on kernel, padding, and dilation.
- * These functions are mainly used in convolution operators.
- * The implementation of the im2col and col2im algorithms
- * are copied from Caffe with minor interface modifications
- * adapting to MXNet data structures.
+ * column matrix based on kernel, padding, dilation, and offset.
+ * These functions are mainly used in deformable convolution operators.
+ * \ref: https://arxiv.org/abs/1703.06211
+ * \author Yuwen Xiong, Haozhi Qi, Jifeng Dai
  */
 
-#ifndef MXNET_OPERATOR_NN_CONTRIB_DEFORMABLE_IM2COL_CUH_
-#define MXNET_OPERATOR_NN_CONTRIB_DEFORMABLE_IM2COL_CUH_
+#ifndef MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_CUH_
+#define MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_CUH_
 
 #include <mxnet/base.h>
 #include <mxnet/operator.h>
@@ -209,7 +209,7 @@ __device__ DType get_coordinate_weight(DType argmax_h, DType argmax_w,
 
 
 /*!
- * \brief im2col gpu kernel.
+ * \brief deformable_im2col gpu kernel.
  * DO NOT call this directly. Use wrapper function im2col() instead;
  */
 template <typename DType>
@@ -266,14 +266,18 @@ __global__ void deformable_im2col_gpu_kernel(const int n, const DType* data_im,
 
 
 
-/*!\brief im2col gpu version
+/*!\brief
+ * gpu function of deformable_im2col algorithm
  * \param s device stream
  * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W,)
  * \param col_shape column buffer shape (#channels, output_im_height, output_im_width, ...)
  * \param kernel_shape kernel filter shape
  * \param pad pad shape
  * \param stride stride shape
  * \param dilation dilation shape
+ * \param deformable_group number of offset groups that deformable convolution uses
  * \param data_col column buffer pointer
  */
 template <typename DType>
@@ -306,6 +310,7 @@ inline void deformable_im2col(mshadow::Stream<gpu>* s,
 
 
 /*!
+* \brief deformable_col2im gpu kernel.
 * \brief DO NOT call this directly. Use wrapper function deformable_col2im() instead;
 */
 template <typename DType>
@@ -360,16 +365,18 @@ __global__ void deformable_col2im_gpu_kernel(const int n, const DType* data_col,
 
 
 /*!\brief
- * gpu function of col2im algorithm
+ * gpu function of deformable_col2im algorithm
  * \param s device stream
  * \param data_col start pointer of the column buffer to be filled
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
  * \param im_shape input image shape in dimensions (N, C, H, W,)
  * \param col_shape column buffer shape
  * \param kernel_shape kernel filter shape
  * \param pad pad shape
  * \param stride stride shape
  * \param dilation dilation shape
- * \param data_im pointer of a image (C, H, W,...) in the image batch
+ * \param deformable_group number of offset groups that deformable convolution uses
+ * \param grad_im pointer of an image (C, H, W,...) in the image batch
  */
 template <typename DType>
 inline void deformable_col2im(mshadow::Stream<gpu>* s,
@@ -405,8 +412,9 @@ inline void deformable_col2im(mshadow::Stream<gpu>* s,
 
 
 /*!
-* \brief DO NOT call this directly. Use wrapper function deformable_col2im_coord() instead;
-*/
+ * \brief deformable_col2im_coord gpu kernel.
+ * \brief DO NOT call this directly. Use wrapper function deformable_col2im_coord() instead;
+ */
 template <typename DType>
 __global__ void deformable_col2im_coord_gpu_kernel(const int n, const DType* data_col, 
   const DType* data_im, const DType* data_offset,
@@ -464,7 +472,21 @@ __global__ void deformable_col2im_coord_gpu_kernel(const int n, const DType* dat
   }
 }
 
-
+/*!\brief
+ * gpu function of deformable_col2im_coord algorithm
+ * \param s device stream
+ * \param data_col start pointer of the column buffer to be filled
+ * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W,)
+ * \param col_shape column buffer shape
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups that deformable convolution uses
+ * \param grad_offset pointer of the offset (C, H, W,...) in the offset batch
+ */
 template <typename DType>
 inline void deformable_col2im_coord(mshadow::Stream<gpu>* s,
   const DType* data_col, const DType* data_im, const DType* data_offset, const TShape& im_shape,
@@ -500,4 +522,4 @@ inline void deformable_col2im_coord(mshadow::Stream<gpu>* s,
 }  // namespace op
 }  // namespace mxnet
 
-#endif  // MXNET_OPERATOR_NN_IM2COL_CUH_
+#endif  // MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_CUH_
diff --git a/rfcn/operator_cxx/nn/deformable_im2col.h b/rfcn/operator_cxx/nn/deformable_im2col.h
index fa9c8ef..93a5551 100644
--- a/rfcn/operator_cxx/nn/deformable_im2col.h
+++ b/rfcn/operator_cxx/nn/deformable_im2col.h
@@ -48,18 +48,18 @@
  *
  ***************** END Caffe Copyright Notice and Disclaimer ********************
  *
- * Copyright (c) 2017 by Contributors
+ * Copyright (c) 2017 Microsoft
+ * Licensed under The Apache-2.0 License [see LICENSE for details]
  * \file deformable_im2col.h
  * \brief Function definitions of converting an image to
- * column matrix based on kernel, padding, and dilation.
- * These functions are mainly used in convolution operators.
- * The implementation of the im2col and col2im algorithms
- * are copied from Caffe with minor interface modifications
- * adapting to MXNet data structures.
+ * column matrix based on kernel, padding, dilation, and offset.
+ * These functions are mainly used in deformable convolution operators.
+ * \ref: https://arxiv.org/abs/1703.06211
+ * \author Yuwen Xiong, Haozhi Qi, Jifeng Dai
  */
 
-#ifndef MXNET_OPERATOR_NN_CONTRIB_DEFORMABLE_IM2COL_H_
-#define MXNET_OPERATOR_NN_CONTRIB_DEFORMABLE_IM2COL_H_
+#ifndef MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_H_
+#define MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_H_
 
 #include <mxnet/base.h>
 #include <mxnet/operator.h>
@@ -70,16 +70,19 @@
 namespace mxnet {
 namespace op {
 
-/*!
- * \brief cpu function of im2col algorithm
- * \param data_im pointer of a image (C, H, W,...) in the image batch
+/*!\brief
+ * cpu function of deformable_im2col algorithm
+ * \param s device stream
+ * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
  * \param im_shape input image shape in dimensions (N, C, H, W,)
- * \param col_shape column buffer shape
+ * \param col_shape column buffer shape (#channels, output_im_height, output_im_width, ...)
  * \param kernel_shape kernel filter shape
  * \param pad pad shape
  * \param stride stride shape
  * \param dilation dilation shape
- * \param data_col start pointer of the column buffer to be filled
+ * \param deformable_group number of offset groups that deformable convolution uses
+ * \param data_col column buffer pointer
  */
 template <typename DType>
 inline void deformable_im2col(mshadow::Stream<cpu>* s,
@@ -96,17 +99,17 @@ inline void deformable_im2col(mshadow::Stream<cpu>* s,
 
 
 /*!\brief
- * cpu function of col2im algorithm
+ * cpu function of deformable_col2im algorithm
  * \param s device stream
  * \param data_col start pointer of the column buffer to be filled
- * \param data_im start pointer of the image data
- * \param data_offset start pointer of the offset data
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
  * \param im_shape input image shape in dimensions (N, C, H, W,)
  * \param col_shape column buffer shape
  * \param kernel_shape kernel filter shape
  * \param pad pad shape
  * \param stride stride shape
  * \param dilation dilation shape
+ * \param deformable_group number of offset groups that deformable convolution uses
  * \param grad_im pointer of a image (C, H, W,...) in the image batch
  */
 template <typename DType>
@@ -121,6 +124,22 @@ inline void deformable_col2im(mshadow::Stream<cpu>* s,
 }
 
 
+/*!\brief
+ * cpu function of deformable_col2im_coord algorithm
+ * \param s device stream
+ * \param data_col start pointer of the column buffer to be filled
+ * \param data_im pointer of an image (C, H, W, ...) in the image batch
+ * \param data_offset pointer of offset (C, H, W, ...) in the offset batch
+ * \param im_shape input image shape in dimensions (N, C, H, W,)
+ * \param col_shape column buffer shape
+ * \param kernel_shape kernel filter shape
+ * \param pad pad shape
+ * \param stride stride shape
+ * \param dilation dilation shape
+ * \param deformable_group number of offset groups that deformable convolution uses
+ * \param grad_offset pointer of the offset (C, H, W,...) in the offset batch
+ */
+
 template <typename DType>
 inline void deformable_col2im_coord(mshadow::Stream<cpu>* s,
   const DType* data_col, const DType* data_im, const DType* data_offset, const TShape& im_shape,
@@ -135,4 +154,4 @@ inline void deformable_col2im_coord(mshadow::Stream<cpu>* s,
 #ifdef __CUDACC__
 #include "./deformable_im2col.cuh"
 #endif
-#endif  // MXNET_OPERATOR_NN_DEFORMABLE_IM2COL_H_
+#endif  // MXNET_OPERATOR_CONTRIB_NN_DEFORMABLE_IM2COL_H_
diff --git a/rfcn/operator_cxx/psroi_pooling.cu b/rfcn/operator_cxx/psroi_pooling.cu
index 92cf223..43c57ee 100644
--- a/rfcn/operator_cxx/psroi_pooling.cu
+++ b/rfcn/operator_cxx/psroi_pooling.cu
@@ -49,12 +49,12 @@ __global__ void PSROIPoolForwardKernel(
     int n = index / pooled_width / pooled_height / output_dim;
 
     // [start, end) interval for spatial sampling
-    bottom_rois += n * 5;
-    int roi_batch_ind = bottom_rois[0];
-    DType roi_start_w = static_cast<DType>(round(bottom_rois[1])) * spatial_scale;
-    DType roi_start_h = static_cast<DType>(round(bottom_rois[2])) * spatial_scale;
-    DType roi_end_w = static_cast<DType>(round(bottom_rois[3]) + 1.) * spatial_scale;
-    DType roi_end_h = static_cast<DType>(round(bottom_rois[4]) + 1.) * spatial_scale;
+    const DType* offset_bottom_rois = bottom_rois + n * 5;
+    int roi_batch_ind = offset_bottom_rois[0];
+    DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale;
+    DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale;
+    DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale;
+    DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale;
 
     // Force too small ROIs to be 1x1
     DType roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
@@ -85,12 +85,12 @@ __global__ void PSROIPoolForwardKernel(
     gh = min(max(gh, 0), group_size - 1);
     int c = (ctop*group_size + gh)*group_size + gw;
 
-    bottom_data += (roi_batch_ind * channels + c) * height * width;
+    const DType* offset_bottom_data = bottom_data + (roi_batch_ind * channels + c) * height * width;
     DType out_sum = 0;
     for (int h = hstart; h < hend; ++h){
       for (int w = wstart; w < wend; ++w){
         int bottom_index = h*width + w;
-        out_sum += bottom_data[bottom_index];
+        out_sum += offset_bottom_data[bottom_index];
       }
     }
 
@@ -148,12 +148,12 @@ __global__ void PSROIPoolBackwardAccKernel(
     int n = index / pooled_width / pooled_height / output_dim;
 
     // [start, end) interval for spatial sampling
-    bottom_rois += n * 5;
-    int roi_batch_ind = bottom_rois[0];
-    DType roi_start_w = static_cast<DType>(round(bottom_rois[1])) * spatial_scale;
-    DType roi_start_h = static_cast<DType>(round(bottom_rois[2])) * spatial_scale;
-    DType roi_end_w = static_cast<DType>(round(bottom_rois[3]) + 1.) * spatial_scale;
-    DType roi_end_h = static_cast<DType>(round(bottom_rois[4]) + 1.) * spatial_scale;
+    const DType* offset_bottom_rois = bottom_rois + n * 5;
+    int roi_batch_ind = offset_bottom_rois[0];
+    DType roi_start_w = static_cast<DType>(round(offset_bottom_rois[1])) * spatial_scale;
+    DType roi_start_h = static_cast<DType>(round(offset_bottom_rois[2])) * spatial_scale;
+    DType roi_end_w = static_cast<DType>(round(offset_bottom_rois[3]) + 1.) * spatial_scale;
+    DType roi_end_h = static_cast<DType>(round(offset_bottom_rois[4]) + 1.) * spatial_scale;
 
     // Force too small ROIs to be 1x1
     DType roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
@@ -260,4 +260,4 @@ Operator* CreateOp<gpu>(PSROIPoolingParam param, int dtype) {
 }
 
 }  // namespace op
-}  // namespace mxnet
\ No newline at end of file
+}  // namespace mxnet
diff --git a/rfcn/symbols/__init__.py b/rfcn/symbols/__init__.py
index de546e1..b8ddc54 100644
--- a/rfcn/symbols/__init__.py
+++ b/rfcn/symbols/__init__.py
@@ -1,2 +1,4 @@
 import resnet_v1_101_rfcn
 import resnet_v1_101_rfcn_dcn
+import deform_conv_demo
+import deform_psroi_demo
diff --git a/rfcn/symbols/deform_conv_demo.py b/rfcn/symbols/deform_conv_demo.py
new file mode 100644
index 0000000..b036c12
--- /dev/null
+++ b/rfcn/symbols/deform_conv_demo.py
@@ -0,0 +1,995 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Yuwen Xiong, Xizhou Zhu
+# --------------------------------------------------------
+
+import cPickle
+import mxnet as mx
+from utils.symbol import Symbol
+from operator_py.proposal import *
+from operator_py.proposal_target import *
+from operator_py.box_annotator_ohem import *
+
+
+class deform_conv_demo(Symbol):
+
+    def __init__(self):
+        """
+        Use __init__ to define parameter network needs
+        """
+        self.eps = 1e-5
+        self.use_global_stats = True
+        self.workspace = 512
+        self.units = (3, 4, 23, 3) # use for 101
+        self.filter_list = [256, 512, 1024, 2048]
+
+    def get_resnet_v1_conv4(self, data):
+        conv1 = mx.symbol.Convolution(name='conv1', data=data, num_filter=64, pad=(3, 3), kernel=(7, 7), stride=(2, 2),
+                                      no_bias=True)
+        bn_conv1 = mx.symbol.BatchNorm(name='bn_conv1', data=conv1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale_conv1 = bn_conv1
+        conv1_relu = mx.symbol.Activation(name='conv1_relu', data=scale_conv1, act_type='relu')
+        pool1 = mx.symbol.Pooling(name='pool1', data=conv1_relu, pooling_convention='full', pad=(0, 0), kernel=(3, 3),
+                                  stride=(2, 2), pool_type='max')
+        res2a_branch1 = mx.symbol.Convolution(name='res2a_branch1', data=pool1, num_filter=256, pad=(0, 0), kernel=(1, 1),
+                                              stride=(1, 1), no_bias=True)
+        bn2a_branch1 = mx.symbol.BatchNorm(name='bn2a_branch1', data=res2a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale2a_branch1 = bn2a_branch1
+        res2a_branch2a = mx.symbol.Convolution(name='res2a_branch2a', data=pool1, num_filter=64, pad=(0, 0), kernel=(1, 1),
+                                               stride=(1, 1), no_bias=True)
+        bn2a_branch2a = mx.symbol.BatchNorm(name='bn2a_branch2a', data=res2a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2a = bn2a_branch2a
+        res2a_branch2a_relu = mx.symbol.Activation(name='res2a_branch2a_relu', data=scale2a_branch2a, act_type='relu')
+        res2a_branch2b = mx.symbol.Convolution(name='res2a_branch2b', data=res2a_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2a_branch2b = mx.symbol.BatchNorm(name='bn2a_branch2b', data=res2a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2b = bn2a_branch2b
+        res2a_branch2b_relu = mx.symbol.Activation(name='res2a_branch2b_relu', data=scale2a_branch2b, act_type='relu')
+        res2a_branch2c = mx.symbol.Convolution(name='res2a_branch2c', data=res2a_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2c = mx.symbol.BatchNorm(name='bn2a_branch2c', data=res2a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2c = bn2a_branch2c
+        res2a = mx.symbol.broadcast_add(name='res2a', *[scale2a_branch1, scale2a_branch2c])
+        res2a_relu = mx.symbol.Activation(name='res2a_relu', data=res2a, act_type='relu')
+        res2b_branch2a = mx.symbol.Convolution(name='res2b_branch2a', data=res2a_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2a = mx.symbol.BatchNorm(name='bn2b_branch2a', data=res2b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2a = bn2b_branch2a
+        res2b_branch2a_relu = mx.symbol.Activation(name='res2b_branch2a_relu', data=scale2b_branch2a, act_type='relu')
+        res2b_branch2b = mx.symbol.Convolution(name='res2b_branch2b', data=res2b_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2b_branch2b = mx.symbol.BatchNorm(name='bn2b_branch2b', data=res2b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2b = bn2b_branch2b
+        res2b_branch2b_relu = mx.symbol.Activation(name='res2b_branch2b_relu', data=scale2b_branch2b, act_type='relu')
+        res2b_branch2c = mx.symbol.Convolution(name='res2b_branch2c', data=res2b_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2c = mx.symbol.BatchNorm(name='bn2b_branch2c', data=res2b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2c = bn2b_branch2c
+        res2b = mx.symbol.broadcast_add(name='res2b', *[res2a_relu, scale2b_branch2c])
+        res2b_relu = mx.symbol.Activation(name='res2b_relu', data=res2b, act_type='relu')
+        res2c_branch2a = mx.symbol.Convolution(name='res2c_branch2a', data=res2b_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2a = mx.symbol.BatchNorm(name='bn2c_branch2a', data=res2c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2a = bn2c_branch2a
+        res2c_branch2a_relu = mx.symbol.Activation(name='res2c_branch2a_relu', data=scale2c_branch2a, act_type='relu')
+        res2c_branch2b = mx.symbol.Convolution(name='res2c_branch2b', data=res2c_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2c_branch2b = mx.symbol.BatchNorm(name='bn2c_branch2b', data=res2c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2b = bn2c_branch2b
+        res2c_branch2b_relu = mx.symbol.Activation(name='res2c_branch2b_relu', data=scale2c_branch2b, act_type='relu')
+        res2c_branch2c = mx.symbol.Convolution(name='res2c_branch2c', data=res2c_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2c = mx.symbol.BatchNorm(name='bn2c_branch2c', data=res2c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2c = bn2c_branch2c
+        res2c = mx.symbol.broadcast_add(name='res2c', *[res2b_relu, scale2c_branch2c])
+        res2c_relu = mx.symbol.Activation(name='res2c_relu', data=res2c, act_type='relu')
+        res3a_branch1 = mx.symbol.Convolution(name='res3a_branch1', data=res2c_relu, num_filter=512, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch1 = mx.symbol.BatchNorm(name='bn3a_branch1', data=res3a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale3a_branch1 = bn3a_branch1
+        res3a_branch2a = mx.symbol.Convolution(name='res3a_branch2a', data=res2c_relu, num_filter=128, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch2a = mx.symbol.BatchNorm(name='bn3a_branch2a', data=res3a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2a = bn3a_branch2a
+        res3a_branch2a_relu = mx.symbol.Activation(name='res3a_branch2a_relu', data=scale3a_branch2a, act_type='relu')
+        res3a_branch2b = mx.symbol.Convolution(name='res3a_branch2b', data=res3a_branch2a_relu, num_filter=128, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3a_branch2b = mx.symbol.BatchNorm(name='bn3a_branch2b', data=res3a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2b = bn3a_branch2b
+        res3a_branch2b_relu = mx.symbol.Activation(name='res3a_branch2b_relu', data=scale3a_branch2b, act_type='relu')
+        res3a_branch2c = mx.symbol.Convolution(name='res3a_branch2c', data=res3a_branch2b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3a_branch2c = mx.symbol.BatchNorm(name='bn3a_branch2c', data=res3a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2c = bn3a_branch2c
+        res3a = mx.symbol.broadcast_add(name='res3a', *[scale3a_branch1, scale3a_branch2c])
+        res3a_relu = mx.symbol.Activation(name='res3a_relu', data=res3a, act_type='relu')
+        res3b1_branch2a = mx.symbol.Convolution(name='res3b1_branch2a', data=res3a_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2a = mx.symbol.BatchNorm(name='bn3b1_branch2a', data=res3b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2a = bn3b1_branch2a
+        res3b1_branch2a_relu = mx.symbol.Activation(name='res3b1_branch2a_relu', data=scale3b1_branch2a, act_type='relu')
+        res3b1_branch2b = mx.symbol.Convolution(name='res3b1_branch2b', data=res3b1_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b1_branch2b = mx.symbol.BatchNorm(name='bn3b1_branch2b', data=res3b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2b = bn3b1_branch2b
+        res3b1_branch2b_relu = mx.symbol.Activation(name='res3b1_branch2b_relu', data=scale3b1_branch2b, act_type='relu')
+        res3b1_branch2c = mx.symbol.Convolution(name='res3b1_branch2c', data=res3b1_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2c = mx.symbol.BatchNorm(name='bn3b1_branch2c', data=res3b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2c = bn3b1_branch2c
+        res3b1 = mx.symbol.broadcast_add(name='res3b1', *[res3a_relu, scale3b1_branch2c])
+        res3b1_relu = mx.symbol.Activation(name='res3b1_relu', data=res3b1, act_type='relu')
+        res3b2_branch2a = mx.symbol.Convolution(name='res3b2_branch2a', data=res3b1_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2a = mx.symbol.BatchNorm(name='bn3b2_branch2a', data=res3b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2a = bn3b2_branch2a
+        res3b2_branch2a_relu = mx.symbol.Activation(name='res3b2_branch2a_relu', data=scale3b2_branch2a, act_type='relu')
+        res3b2_branch2b = mx.symbol.Convolution(name='res3b2_branch2b', data=res3b2_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b2_branch2b = mx.symbol.BatchNorm(name='bn3b2_branch2b', data=res3b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2b = bn3b2_branch2b
+        res3b2_branch2b_relu = mx.symbol.Activation(name='res3b2_branch2b_relu', data=scale3b2_branch2b, act_type='relu')
+        res3b2_branch2c = mx.symbol.Convolution(name='res3b2_branch2c', data=res3b2_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2c = mx.symbol.BatchNorm(name='bn3b2_branch2c', data=res3b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2c = bn3b2_branch2c
+        res3b2 = mx.symbol.broadcast_add(name='res3b2', *[res3b1_relu, scale3b2_branch2c])
+        res3b2_relu = mx.symbol.Activation(name='res3b2_relu', data=res3b2, act_type='relu')
+        res3b3_branch2a = mx.symbol.Convolution(name='res3b3_branch2a', data=res3b2_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2a = mx.symbol.BatchNorm(name='bn3b3_branch2a', data=res3b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2a = bn3b3_branch2a
+        res3b3_branch2a_relu = mx.symbol.Activation(name='res3b3_branch2a_relu', data=scale3b3_branch2a, act_type='relu')
+        res3b3_branch2b = mx.symbol.Convolution(name='res3b3_branch2b', data=res3b3_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b3_branch2b = mx.symbol.BatchNorm(name='bn3b3_branch2b', data=res3b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2b = bn3b3_branch2b
+        res3b3_branch2b_relu = mx.symbol.Activation(name='res3b3_branch2b_relu', data=scale3b3_branch2b, act_type='relu')
+        res3b3_branch2c = mx.symbol.Convolution(name='res3b3_branch2c', data=res3b3_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2c = mx.symbol.BatchNorm(name='bn3b3_branch2c', data=res3b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2c = bn3b3_branch2c
+        res3b3 = mx.symbol.broadcast_add(name='res3b3', *[res3b2_relu, scale3b3_branch2c])
+        res3b3_relu = mx.symbol.Activation(name='res3b3_relu', data=res3b3, act_type='relu')
+        res4a_branch1 = mx.symbol.Convolution(name='res4a_branch1', data=res3b3_relu, num_filter=1024, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch1 = mx.symbol.BatchNorm(name='bn4a_branch1', data=res4a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale4a_branch1 = bn4a_branch1
+        res4a_branch2a = mx.symbol.Convolution(name='res4a_branch2a', data=res3b3_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch2a = mx.symbol.BatchNorm(name='bn4a_branch2a', data=res4a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2a = bn4a_branch2a
+        res4a_branch2a_relu = mx.symbol.Activation(name='res4a_branch2a_relu', data=scale4a_branch2a, act_type='relu')
+        res4a_branch2b = mx.symbol.Convolution(name='res4a_branch2b', data=res4a_branch2a_relu, num_filter=256, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4a_branch2b = mx.symbol.BatchNorm(name='bn4a_branch2b', data=res4a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2b = bn4a_branch2b
+        res4a_branch2b_relu = mx.symbol.Activation(name='res4a_branch2b_relu', data=scale4a_branch2b, act_type='relu')
+        res4a_branch2c = mx.symbol.Convolution(name='res4a_branch2c', data=res4a_branch2b_relu, num_filter=1024, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4a_branch2c = mx.symbol.BatchNorm(name='bn4a_branch2c', data=res4a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2c = bn4a_branch2c
+        res4a = mx.symbol.broadcast_add(name='res4a', *[scale4a_branch1, scale4a_branch2c])
+        res4a_relu = mx.symbol.Activation(name='res4a_relu', data=res4a, act_type='relu')
+        res4b1_branch2a = mx.symbol.Convolution(name='res4b1_branch2a', data=res4a_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2a = mx.symbol.BatchNorm(name='bn4b1_branch2a', data=res4b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2a = bn4b1_branch2a
+        res4b1_branch2a_relu = mx.symbol.Activation(name='res4b1_branch2a_relu', data=scale4b1_branch2a, act_type='relu')
+        res4b1_branch2b = mx.symbol.Convolution(name='res4b1_branch2b', data=res4b1_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b1_branch2b = mx.symbol.BatchNorm(name='bn4b1_branch2b', data=res4b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2b = bn4b1_branch2b
+        res4b1_branch2b_relu = mx.symbol.Activation(name='res4b1_branch2b_relu', data=scale4b1_branch2b, act_type='relu')
+        res4b1_branch2c = mx.symbol.Convolution(name='res4b1_branch2c', data=res4b1_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2c = mx.symbol.BatchNorm(name='bn4b1_branch2c', data=res4b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2c = bn4b1_branch2c
+        res4b1 = mx.symbol.broadcast_add(name='res4b1', *[res4a_relu, scale4b1_branch2c])
+        res4b1_relu = mx.symbol.Activation(name='res4b1_relu', data=res4b1, act_type='relu')
+        res4b2_branch2a = mx.symbol.Convolution(name='res4b2_branch2a', data=res4b1_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2a = mx.symbol.BatchNorm(name='bn4b2_branch2a', data=res4b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2a = bn4b2_branch2a
+        res4b2_branch2a_relu = mx.symbol.Activation(name='res4b2_branch2a_relu', data=scale4b2_branch2a, act_type='relu')
+        res4b2_branch2b = mx.symbol.Convolution(name='res4b2_branch2b', data=res4b2_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b2_branch2b = mx.symbol.BatchNorm(name='bn4b2_branch2b', data=res4b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2b = bn4b2_branch2b
+        res4b2_branch2b_relu = mx.symbol.Activation(name='res4b2_branch2b_relu', data=scale4b2_branch2b, act_type='relu')
+        res4b2_branch2c = mx.symbol.Convolution(name='res4b2_branch2c', data=res4b2_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2c = mx.symbol.BatchNorm(name='bn4b2_branch2c', data=res4b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2c = bn4b2_branch2c
+        res4b2 = mx.symbol.broadcast_add(name='res4b2', *[res4b1_relu, scale4b2_branch2c])
+        res4b2_relu = mx.symbol.Activation(name='res4b2_relu', data=res4b2, act_type='relu')
+        res4b3_branch2a = mx.symbol.Convolution(name='res4b3_branch2a', data=res4b2_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2a = mx.symbol.BatchNorm(name='bn4b3_branch2a', data=res4b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2a = bn4b3_branch2a
+        res4b3_branch2a_relu = mx.symbol.Activation(name='res4b3_branch2a_relu', data=scale4b3_branch2a, act_type='relu')
+        res4b3_branch2b = mx.symbol.Convolution(name='res4b3_branch2b', data=res4b3_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b3_branch2b = mx.symbol.BatchNorm(name='bn4b3_branch2b', data=res4b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2b = bn4b3_branch2b
+        res4b3_branch2b_relu = mx.symbol.Activation(name='res4b3_branch2b_relu', data=scale4b3_branch2b, act_type='relu')
+        res4b3_branch2c = mx.symbol.Convolution(name='res4b3_branch2c', data=res4b3_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2c = mx.symbol.BatchNorm(name='bn4b3_branch2c', data=res4b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2c = bn4b3_branch2c
+        res4b3 = mx.symbol.broadcast_add(name='res4b3', *[res4b2_relu, scale4b3_branch2c])
+        res4b3_relu = mx.symbol.Activation(name='res4b3_relu', data=res4b3, act_type='relu')
+        res4b4_branch2a = mx.symbol.Convolution(name='res4b4_branch2a', data=res4b3_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2a = mx.symbol.BatchNorm(name='bn4b4_branch2a', data=res4b4_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2a = bn4b4_branch2a
+        res4b4_branch2a_relu = mx.symbol.Activation(name='res4b4_branch2a_relu', data=scale4b4_branch2a, act_type='relu')
+        res4b4_branch2b = mx.symbol.Convolution(name='res4b4_branch2b', data=res4b4_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b4_branch2b = mx.symbol.BatchNorm(name='bn4b4_branch2b', data=res4b4_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2b = bn4b4_branch2b
+        res4b4_branch2b_relu = mx.symbol.Activation(name='res4b4_branch2b_relu', data=scale4b4_branch2b, act_type='relu')
+        res4b4_branch2c = mx.symbol.Convolution(name='res4b4_branch2c', data=res4b4_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2c = mx.symbol.BatchNorm(name='bn4b4_branch2c', data=res4b4_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2c = bn4b4_branch2c
+        res4b4 = mx.symbol.broadcast_add(name='res4b4', *[res4b3_relu, scale4b4_branch2c])
+        res4b4_relu = mx.symbol.Activation(name='res4b4_relu', data=res4b4, act_type='relu')
+        res4b5_branch2a = mx.symbol.Convolution(name='res4b5_branch2a', data=res4b4_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2a = mx.symbol.BatchNorm(name='bn4b5_branch2a', data=res4b5_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2a = bn4b5_branch2a
+        res4b5_branch2a_relu = mx.symbol.Activation(name='res4b5_branch2a_relu', data=scale4b5_branch2a, act_type='relu')
+        res4b5_branch2b = mx.symbol.Convolution(name='res4b5_branch2b', data=res4b5_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b5_branch2b = mx.symbol.BatchNorm(name='bn4b5_branch2b', data=res4b5_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2b = bn4b5_branch2b
+        res4b5_branch2b_relu = mx.symbol.Activation(name='res4b5_branch2b_relu', data=scale4b5_branch2b, act_type='relu')
+        res4b5_branch2c = mx.symbol.Convolution(name='res4b5_branch2c', data=res4b5_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2c = mx.symbol.BatchNorm(name='bn4b5_branch2c', data=res4b5_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2c = bn4b5_branch2c
+        res4b5 = mx.symbol.broadcast_add(name='res4b5', *[res4b4_relu, scale4b5_branch2c])
+        res4b5_relu = mx.symbol.Activation(name='res4b5_relu', data=res4b5, act_type='relu')
+        res4b6_branch2a = mx.symbol.Convolution(name='res4b6_branch2a', data=res4b5_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2a = mx.symbol.BatchNorm(name='bn4b6_branch2a', data=res4b6_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2a = bn4b6_branch2a
+        res4b6_branch2a_relu = mx.symbol.Activation(name='res4b6_branch2a_relu', data=scale4b6_branch2a, act_type='relu')
+        res4b6_branch2b = mx.symbol.Convolution(name='res4b6_branch2b', data=res4b6_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b6_branch2b = mx.symbol.BatchNorm(name='bn4b6_branch2b', data=res4b6_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2b = bn4b6_branch2b
+        res4b6_branch2b_relu = mx.symbol.Activation(name='res4b6_branch2b_relu', data=scale4b6_branch2b, act_type='relu')
+        res4b6_branch2c = mx.symbol.Convolution(name='res4b6_branch2c', data=res4b6_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2c = mx.symbol.BatchNorm(name='bn4b6_branch2c', data=res4b6_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2c = bn4b6_branch2c
+        res4b6 = mx.symbol.broadcast_add(name='res4b6', *[res4b5_relu, scale4b6_branch2c])
+        res4b6_relu = mx.symbol.Activation(name='res4b6_relu', data=res4b6, act_type='relu')
+        res4b7_branch2a = mx.symbol.Convolution(name='res4b7_branch2a', data=res4b6_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2a = mx.symbol.BatchNorm(name='bn4b7_branch2a', data=res4b7_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2a = bn4b7_branch2a
+        res4b7_branch2a_relu = mx.symbol.Activation(name='res4b7_branch2a_relu', data=scale4b7_branch2a, act_type='relu')
+        res4b7_branch2b = mx.symbol.Convolution(name='res4b7_branch2b', data=res4b7_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b7_branch2b = mx.symbol.BatchNorm(name='bn4b7_branch2b', data=res4b7_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2b = bn4b7_branch2b
+        res4b7_branch2b_relu = mx.symbol.Activation(name='res4b7_branch2b_relu', data=scale4b7_branch2b, act_type='relu')
+        res4b7_branch2c = mx.symbol.Convolution(name='res4b7_branch2c', data=res4b7_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2c = mx.symbol.BatchNorm(name='bn4b7_branch2c', data=res4b7_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2c = bn4b7_branch2c
+        res4b7 = mx.symbol.broadcast_add(name='res4b7', *[res4b6_relu, scale4b7_branch2c])
+        res4b7_relu = mx.symbol.Activation(name='res4b7_relu', data=res4b7, act_type='relu')
+        res4b8_branch2a = mx.symbol.Convolution(name='res4b8_branch2a', data=res4b7_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2a = mx.symbol.BatchNorm(name='bn4b8_branch2a', data=res4b8_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2a = bn4b8_branch2a
+        res4b8_branch2a_relu = mx.symbol.Activation(name='res4b8_branch2a_relu', data=scale4b8_branch2a, act_type='relu')
+        res4b8_branch2b = mx.symbol.Convolution(name='res4b8_branch2b', data=res4b8_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b8_branch2b = mx.symbol.BatchNorm(name='bn4b8_branch2b', data=res4b8_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2b = bn4b8_branch2b
+        res4b8_branch2b_relu = mx.symbol.Activation(name='res4b8_branch2b_relu', data=scale4b8_branch2b, act_type='relu')
+        res4b8_branch2c = mx.symbol.Convolution(name='res4b8_branch2c', data=res4b8_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2c = mx.symbol.BatchNorm(name='bn4b8_branch2c', data=res4b8_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2c = bn4b8_branch2c
+        res4b8 = mx.symbol.broadcast_add(name='res4b8', *[res4b7_relu, scale4b8_branch2c])
+        res4b8_relu = mx.symbol.Activation(name='res4b8_relu', data=res4b8, act_type='relu')
+        res4b9_branch2a = mx.symbol.Convolution(name='res4b9_branch2a', data=res4b8_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2a = mx.symbol.BatchNorm(name='bn4b9_branch2a', data=res4b9_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2a = bn4b9_branch2a
+        res4b9_branch2a_relu = mx.symbol.Activation(name='res4b9_branch2a_relu', data=scale4b9_branch2a, act_type='relu')
+        res4b9_branch2b = mx.symbol.Convolution(name='res4b9_branch2b', data=res4b9_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b9_branch2b = mx.symbol.BatchNorm(name='bn4b9_branch2b', data=res4b9_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2b = bn4b9_branch2b
+        res4b9_branch2b_relu = mx.symbol.Activation(name='res4b9_branch2b_relu', data=scale4b9_branch2b, act_type='relu')
+        res4b9_branch2c = mx.symbol.Convolution(name='res4b9_branch2c', data=res4b9_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2c = mx.symbol.BatchNorm(name='bn4b9_branch2c', data=res4b9_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2c = bn4b9_branch2c
+        res4b9 = mx.symbol.broadcast_add(name='res4b9', *[res4b8_relu, scale4b9_branch2c])
+        res4b9_relu = mx.symbol.Activation(name='res4b9_relu', data=res4b9, act_type='relu')
+        res4b10_branch2a = mx.symbol.Convolution(name='res4b10_branch2a', data=res4b9_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2a = mx.symbol.BatchNorm(name='bn4b10_branch2a', data=res4b10_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2a = bn4b10_branch2a
+        res4b10_branch2a_relu = mx.symbol.Activation(name='res4b10_branch2a_relu', data=scale4b10_branch2a, act_type='relu')
+        res4b10_branch2b = mx.symbol.Convolution(name='res4b10_branch2b', data=res4b10_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b10_branch2b = mx.symbol.BatchNorm(name='bn4b10_branch2b', data=res4b10_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2b = bn4b10_branch2b
+        res4b10_branch2b_relu = mx.symbol.Activation(name='res4b10_branch2b_relu', data=scale4b10_branch2b, act_type='relu')
+        res4b10_branch2c = mx.symbol.Convolution(name='res4b10_branch2c', data=res4b10_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2c = mx.symbol.BatchNorm(name='bn4b10_branch2c', data=res4b10_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2c = bn4b10_branch2c
+        res4b10 = mx.symbol.broadcast_add(name='res4b10', *[res4b9_relu, scale4b10_branch2c])
+        res4b10_relu = mx.symbol.Activation(name='res4b10_relu', data=res4b10, act_type='relu')
+        res4b11_branch2a = mx.symbol.Convolution(name='res4b11_branch2a', data=res4b10_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2a = mx.symbol.BatchNorm(name='bn4b11_branch2a', data=res4b11_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2a = bn4b11_branch2a
+        res4b11_branch2a_relu = mx.symbol.Activation(name='res4b11_branch2a_relu', data=scale4b11_branch2a, act_type='relu')
+        res4b11_branch2b = mx.symbol.Convolution(name='res4b11_branch2b', data=res4b11_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b11_branch2b = mx.symbol.BatchNorm(name='bn4b11_branch2b', data=res4b11_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2b = bn4b11_branch2b
+        res4b11_branch2b_relu = mx.symbol.Activation(name='res4b11_branch2b_relu', data=scale4b11_branch2b, act_type='relu')
+        res4b11_branch2c = mx.symbol.Convolution(name='res4b11_branch2c', data=res4b11_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2c = mx.symbol.BatchNorm(name='bn4b11_branch2c', data=res4b11_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2c = bn4b11_branch2c
+        res4b11 = mx.symbol.broadcast_add(name='res4b11', *[res4b10_relu, scale4b11_branch2c])
+        res4b11_relu = mx.symbol.Activation(name='res4b11_relu', data=res4b11, act_type='relu')
+        res4b12_branch2a = mx.symbol.Convolution(name='res4b12_branch2a', data=res4b11_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2a = mx.symbol.BatchNorm(name='bn4b12_branch2a', data=res4b12_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2a = bn4b12_branch2a
+        res4b12_branch2a_relu = mx.symbol.Activation(name='res4b12_branch2a_relu', data=scale4b12_branch2a, act_type='relu')
+        res4b12_branch2b = mx.symbol.Convolution(name='res4b12_branch2b', data=res4b12_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b12_branch2b = mx.symbol.BatchNorm(name='bn4b12_branch2b', data=res4b12_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2b = bn4b12_branch2b
+        res4b12_branch2b_relu = mx.symbol.Activation(name='res4b12_branch2b_relu', data=scale4b12_branch2b, act_type='relu')
+        res4b12_branch2c = mx.symbol.Convolution(name='res4b12_branch2c', data=res4b12_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2c = mx.symbol.BatchNorm(name='bn4b12_branch2c', data=res4b12_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2c = bn4b12_branch2c
+        res4b12 = mx.symbol.broadcast_add(name='res4b12', *[res4b11_relu, scale4b12_branch2c])
+        res4b12_relu = mx.symbol.Activation(name='res4b12_relu', data=res4b12, act_type='relu')
+        res4b13_branch2a = mx.symbol.Convolution(name='res4b13_branch2a', data=res4b12_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2a = mx.symbol.BatchNorm(name='bn4b13_branch2a', data=res4b13_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2a = bn4b13_branch2a
+        res4b13_branch2a_relu = mx.symbol.Activation(name='res4b13_branch2a_relu', data=scale4b13_branch2a, act_type='relu')
+        res4b13_branch2b = mx.symbol.Convolution(name='res4b13_branch2b', data=res4b13_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b13_branch2b = mx.symbol.BatchNorm(name='bn4b13_branch2b', data=res4b13_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2b = bn4b13_branch2b
+        res4b13_branch2b_relu = mx.symbol.Activation(name='res4b13_branch2b_relu', data=scale4b13_branch2b, act_type='relu')
+        res4b13_branch2c = mx.symbol.Convolution(name='res4b13_branch2c', data=res4b13_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2c = mx.symbol.BatchNorm(name='bn4b13_branch2c', data=res4b13_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2c = bn4b13_branch2c
+        res4b13 = mx.symbol.broadcast_add(name='res4b13', *[res4b12_relu, scale4b13_branch2c])
+        res4b13_relu = mx.symbol.Activation(name='res4b13_relu', data=res4b13, act_type='relu')
+        res4b14_branch2a = mx.symbol.Convolution(name='res4b14_branch2a', data=res4b13_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2a = mx.symbol.BatchNorm(name='bn4b14_branch2a', data=res4b14_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2a = bn4b14_branch2a
+        res4b14_branch2a_relu = mx.symbol.Activation(name='res4b14_branch2a_relu', data=scale4b14_branch2a, act_type='relu')
+        res4b14_branch2b = mx.symbol.Convolution(name='res4b14_branch2b', data=res4b14_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b14_branch2b = mx.symbol.BatchNorm(name='bn4b14_branch2b', data=res4b14_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2b = bn4b14_branch2b
+        res4b14_branch2b_relu = mx.symbol.Activation(name='res4b14_branch2b_relu', data=scale4b14_branch2b, act_type='relu')
+        res4b14_branch2c = mx.symbol.Convolution(name='res4b14_branch2c', data=res4b14_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2c = mx.symbol.BatchNorm(name='bn4b14_branch2c', data=res4b14_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2c = bn4b14_branch2c
+        res4b14 = mx.symbol.broadcast_add(name='res4b14', *[res4b13_relu, scale4b14_branch2c])
+        res4b14_relu = mx.symbol.Activation(name='res4b14_relu', data=res4b14, act_type='relu')
+        res4b15_branch2a = mx.symbol.Convolution(name='res4b15_branch2a', data=res4b14_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2a = mx.symbol.BatchNorm(name='bn4b15_branch2a', data=res4b15_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2a = bn4b15_branch2a
+        res4b15_branch2a_relu = mx.symbol.Activation(name='res4b15_branch2a_relu', data=scale4b15_branch2a, act_type='relu')
+        res4b15_branch2b = mx.symbol.Convolution(name='res4b15_branch2b', data=res4b15_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b15_branch2b = mx.symbol.BatchNorm(name='bn4b15_branch2b', data=res4b15_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2b = bn4b15_branch2b
+        res4b15_branch2b_relu = mx.symbol.Activation(name='res4b15_branch2b_relu', data=scale4b15_branch2b, act_type='relu')
+        res4b15_branch2c = mx.symbol.Convolution(name='res4b15_branch2c', data=res4b15_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2c = mx.symbol.BatchNorm(name='bn4b15_branch2c', data=res4b15_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2c = bn4b15_branch2c
+        res4b15 = mx.symbol.broadcast_add(name='res4b15', *[res4b14_relu, scale4b15_branch2c])
+        res4b15_relu = mx.symbol.Activation(name='res4b15_relu', data=res4b15, act_type='relu')
+        res4b16_branch2a = mx.symbol.Convolution(name='res4b16_branch2a', data=res4b15_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2a = mx.symbol.BatchNorm(name='bn4b16_branch2a', data=res4b16_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2a = bn4b16_branch2a
+        res4b16_branch2a_relu = mx.symbol.Activation(name='res4b16_branch2a_relu', data=scale4b16_branch2a, act_type='relu')
+        res4b16_branch2b = mx.symbol.Convolution(name='res4b16_branch2b', data=res4b16_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b16_branch2b = mx.symbol.BatchNorm(name='bn4b16_branch2b', data=res4b16_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2b = bn4b16_branch2b
+        res4b16_branch2b_relu = mx.symbol.Activation(name='res4b16_branch2b_relu', data=scale4b16_branch2b, act_type='relu')
+        res4b16_branch2c = mx.symbol.Convolution(name='res4b16_branch2c', data=res4b16_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2c = mx.symbol.BatchNorm(name='bn4b16_branch2c', data=res4b16_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2c = bn4b16_branch2c
+        res4b16 = mx.symbol.broadcast_add(name='res4b16', *[res4b15_relu, scale4b16_branch2c])
+        res4b16_relu = mx.symbol.Activation(name='res4b16_relu', data=res4b16, act_type='relu')
+        res4b17_branch2a = mx.symbol.Convolution(name='res4b17_branch2a', data=res4b16_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2a = mx.symbol.BatchNorm(name='bn4b17_branch2a', data=res4b17_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2a = bn4b17_branch2a
+        res4b17_branch2a_relu = mx.symbol.Activation(name='res4b17_branch2a_relu', data=scale4b17_branch2a, act_type='relu')
+        res4b17_branch2b = mx.symbol.Convolution(name='res4b17_branch2b', data=res4b17_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b17_branch2b = mx.symbol.BatchNorm(name='bn4b17_branch2b', data=res4b17_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2b = bn4b17_branch2b
+        res4b17_branch2b_relu = mx.symbol.Activation(name='res4b17_branch2b_relu', data=scale4b17_branch2b, act_type='relu')
+        res4b17_branch2c = mx.symbol.Convolution(name='res4b17_branch2c', data=res4b17_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2c = mx.symbol.BatchNorm(name='bn4b17_branch2c', data=res4b17_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2c = bn4b17_branch2c
+        res4b17 = mx.symbol.broadcast_add(name='res4b17', *[res4b16_relu, scale4b17_branch2c])
+        res4b17_relu = mx.symbol.Activation(name='res4b17_relu', data=res4b17, act_type='relu')
+        res4b18_branch2a = mx.symbol.Convolution(name='res4b18_branch2a', data=res4b17_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2a = mx.symbol.BatchNorm(name='bn4b18_branch2a', data=res4b18_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2a = bn4b18_branch2a
+        res4b18_branch2a_relu = mx.symbol.Activation(name='res4b18_branch2a_relu', data=scale4b18_branch2a, act_type='relu')
+        res4b18_branch2b = mx.symbol.Convolution(name='res4b18_branch2b', data=res4b18_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b18_branch2b = mx.symbol.BatchNorm(name='bn4b18_branch2b', data=res4b18_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2b = bn4b18_branch2b
+        res4b18_branch2b_relu = mx.symbol.Activation(name='res4b18_branch2b_relu', data=scale4b18_branch2b, act_type='relu')
+        res4b18_branch2c = mx.symbol.Convolution(name='res4b18_branch2c', data=res4b18_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2c = mx.symbol.BatchNorm(name='bn4b18_branch2c', data=res4b18_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2c = bn4b18_branch2c
+        res4b18 = mx.symbol.broadcast_add(name='res4b18', *[res4b17_relu, scale4b18_branch2c])
+        res4b18_relu = mx.symbol.Activation(name='res4b18_relu', data=res4b18, act_type='relu')
+        res4b19_branch2a = mx.symbol.Convolution(name='res4b19_branch2a', data=res4b18_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2a = mx.symbol.BatchNorm(name='bn4b19_branch2a', data=res4b19_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2a = bn4b19_branch2a
+        res4b19_branch2a_relu = mx.symbol.Activation(name='res4b19_branch2a_relu', data=scale4b19_branch2a, act_type='relu')
+        res4b19_branch2b = mx.symbol.Convolution(name='res4b19_branch2b', data=res4b19_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b19_branch2b = mx.symbol.BatchNorm(name='bn4b19_branch2b', data=res4b19_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2b = bn4b19_branch2b
+        res4b19_branch2b_relu = mx.symbol.Activation(name='res4b19_branch2b_relu', data=scale4b19_branch2b, act_type='relu')
+        res4b19_branch2c = mx.symbol.Convolution(name='res4b19_branch2c', data=res4b19_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2c = mx.symbol.BatchNorm(name='bn4b19_branch2c', data=res4b19_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2c = bn4b19_branch2c
+        res4b19 = mx.symbol.broadcast_add(name='res4b19', *[res4b18_relu, scale4b19_branch2c])
+        res4b19_relu = mx.symbol.Activation(name='res4b19_relu', data=res4b19, act_type='relu')
+        res4b20_branch2a = mx.symbol.Convolution(name='res4b20_branch2a', data=res4b19_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2a = mx.symbol.BatchNorm(name='bn4b20_branch2a', data=res4b20_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2a = bn4b20_branch2a
+        res4b20_branch2a_relu = mx.symbol.Activation(name='res4b20_branch2a_relu', data=scale4b20_branch2a, act_type='relu')
+        res4b20_branch2b = mx.symbol.Convolution(name='res4b20_branch2b', data=res4b20_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b20_branch2b = mx.symbol.BatchNorm(name='bn4b20_branch2b', data=res4b20_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2b = bn4b20_branch2b
+        res4b20_branch2b_relu = mx.symbol.Activation(name='res4b20_branch2b_relu', data=scale4b20_branch2b, act_type='relu')
+        res4b20_branch2c = mx.symbol.Convolution(name='res4b20_branch2c', data=res4b20_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2c = mx.symbol.BatchNorm(name='bn4b20_branch2c', data=res4b20_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2c = bn4b20_branch2c
+        res4b20 = mx.symbol.broadcast_add(name='res4b20', *[res4b19_relu, scale4b20_branch2c])
+        res4b20_relu = mx.symbol.Activation(name='res4b20_relu', data=res4b20, act_type='relu')
+        res4b21_branch2a = mx.symbol.Convolution(name='res4b21_branch2a', data=res4b20_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2a = mx.symbol.BatchNorm(name='bn4b21_branch2a', data=res4b21_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2a = bn4b21_branch2a
+        res4b21_branch2a_relu = mx.symbol.Activation(name='res4b21_branch2a_relu', data=scale4b21_branch2a, act_type='relu')
+        res4b21_branch2b = mx.symbol.Convolution(name='res4b21_branch2b', data=res4b21_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b21_branch2b = mx.symbol.BatchNorm(name='bn4b21_branch2b', data=res4b21_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2b = bn4b21_branch2b
+        res4b21_branch2b_relu = mx.symbol.Activation(name='res4b21_branch2b_relu', data=scale4b21_branch2b, act_type='relu')
+        res4b21_branch2c = mx.symbol.Convolution(name='res4b21_branch2c', data=res4b21_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2c = mx.symbol.BatchNorm(name='bn4b21_branch2c', data=res4b21_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2c = bn4b21_branch2c
+        res4b21 = mx.symbol.broadcast_add(name='res4b21', *[res4b20_relu, scale4b21_branch2c])
+        res4b21_relu = mx.symbol.Activation(name='res4b21_relu', data=res4b21, act_type='relu')
+        res4b22_branch2a = mx.symbol.Convolution(name='res4b22_branch2a', data=res4b21_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2a = mx.symbol.BatchNorm(name='bn4b22_branch2a', data=res4b22_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2a = bn4b22_branch2a
+        res4b22_branch2a_relu = mx.symbol.Activation(name='res4b22_branch2a_relu', data=scale4b22_branch2a, act_type='relu')
+        res4b22_branch2b = mx.symbol.Convolution(name='res4b22_branch2b', data=res4b22_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b22_branch2b = mx.symbol.BatchNorm(name='bn4b22_branch2b', data=res4b22_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2b = bn4b22_branch2b
+        res4b22_branch2b_relu = mx.symbol.Activation(name='res4b22_branch2b_relu', data=scale4b22_branch2b, act_type='relu')
+        res4b22_branch2c = mx.symbol.Convolution(name='res4b22_branch2c', data=res4b22_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2c = mx.symbol.BatchNorm(name='bn4b22_branch2c', data=res4b22_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2c = bn4b22_branch2c
+        res4b22 = mx.symbol.broadcast_add(name='res4b22', *[res4b21_relu, scale4b22_branch2c])
+        res4b22_relu = mx.symbol.Activation(name='res4b22_relu', data=res4b22, act_type='relu')
+        return res4b22_relu
+
+    def get_resnet_v1_conv5(self, conv_feat):
+        res5a_branch1 = mx.symbol.Convolution(name='res5a_branch1', data=conv_feat, num_filter=2048, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch1 = mx.symbol.BatchNorm(name='bn5a_branch1', data=res5a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale5a_branch1 = bn5a_branch1
+        res5a_branch2a = mx.symbol.Convolution(name='res5a_branch2a', data=conv_feat, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2a = mx.symbol.BatchNorm(name='bn5a_branch2a', data=res5a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2a = bn5a_branch2a
+        res5a_branch2a_relu = mx.symbol.Activation(name='res5a_branch2a_relu', data=scale5a_branch2a, act_type='relu')
+        res5a_branch2b_offset = mx.symbol.Convolution(name='res5a_branch2b_offset', data=res5a_branch2a_relu,
+                                                      num_filter=18, pad=(1, 1), kernel=(3, 3), stride=(1, 1), cudnn_off=True)
+        res5a_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5a_branch2b', data=res5a_branch2a_relu, offset=res5a_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=1,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5a_branch2b = mx.symbol.BatchNorm(name='bn5a_branch2b', data=res5a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2b = bn5a_branch2b
+        res5a_branch2b_relu = mx.symbol.Activation(name='res5a_branch2b_relu', data=scale5a_branch2b, act_type='relu')
+        res5a_branch2c = mx.symbol.Convolution(name='res5a_branch2c', data=res5a_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2c = mx.symbol.BatchNorm(name='bn5a_branch2c', data=res5a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2c = bn5a_branch2c
+        res5a = mx.symbol.broadcast_add(name='res5a', *[scale5a_branch1, scale5a_branch2c])
+        res5a_relu = mx.symbol.Activation(name='res5a_relu', data=res5a, act_type='relu')
+        res5b_branch2a = mx.symbol.Convolution(name='res5b_branch2a', data=res5a_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2a = mx.symbol.BatchNorm(name='bn5b_branch2a', data=res5b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2a = bn5b_branch2a
+        res5b_branch2a_relu = mx.symbol.Activation(name='res5b_branch2a_relu', data=scale5b_branch2a, act_type='relu')
+        res5b_branch2b_offset = mx.symbol.Convolution(name='res5b_branch2b_offset', data=res5b_branch2a_relu,
+                                                      num_filter=18, pad=(1, 1), kernel=(3, 3), stride=(1, 1), cudnn_off=True)
+        res5b_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5b_branch2b', data=res5b_branch2a_relu, offset=res5b_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=1,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5b_branch2b = mx.symbol.BatchNorm(name='bn5b_branch2b', data=res5b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2b = bn5b_branch2b
+        res5b_branch2b_relu = mx.symbol.Activation(name='res5b_branch2b_relu', data=scale5b_branch2b, act_type='relu')
+        res5b_branch2c = mx.symbol.Convolution(name='res5b_branch2c', data=res5b_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2c = mx.symbol.BatchNorm(name='bn5b_branch2c', data=res5b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2c = bn5b_branch2c
+        res5b = mx.symbol.broadcast_add(name='res5b', *[res5a_relu, scale5b_branch2c])
+        res5b_relu = mx.symbol.Activation(name='res5b_relu', data=res5b, act_type='relu')
+        res5c_branch2a = mx.symbol.Convolution(name='res5c_branch2a', data=res5b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2a = mx.symbol.BatchNorm(name='bn5c_branch2a', data=res5c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2a = bn5c_branch2a
+        res5c_branch2a_relu = mx.symbol.Activation(name='res5c_branch2a_relu', data=scale5c_branch2a, act_type='relu')
+        res5c_branch2b_offset = mx.symbol.Convolution(name='res5c_branch2b_offset', data=res5c_branch2a_relu,
+                                                      num_filter=18, pad=(1, 1), kernel=(3, 3), stride=(1, 1), cudnn_off=True)
+        res5c_branch2b = mx.contrib.symbol.DeformableConvolution(name='res5c_branch2b', data=res5c_branch2a_relu, offset=res5c_branch2b_offset,
+                                                                 num_filter=512, pad=(2, 2), kernel=(3, 3), num_deformable_group=1,
+                                                                 stride=(1, 1), dilate=(2, 2), no_bias=True)
+        bn5c_branch2b = mx.symbol.BatchNorm(name='bn5c_branch2b', data=res5c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2b = bn5c_branch2b
+        res5c_branch2b_relu = mx.symbol.Activation(name='res5c_branch2b_relu', data=scale5c_branch2b, act_type='relu')
+        res5c_branch2c = mx.symbol.Convolution(name='res5c_branch2c', data=res5c_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2c = mx.symbol.BatchNorm(name='bn5c_branch2c', data=res5c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2c = bn5c_branch2c
+        res5c = mx.symbol.broadcast_add(name='res5c', *[res5b_relu, scale5c_branch2c])
+        res5c_relu = mx.symbol.Activation(name='res5c_relu', data=res5c, act_type='relu')
+        return res5c_relu, res5a_branch2b_offset, res5b_branch2b_offset, res5c_branch2b_offset
+
+    def get_rpn(self, conv_feat, num_anchors):
+        rpn_conv = mx.sym.Convolution(
+            data=conv_feat, kernel=(3, 3), pad=(1, 1), num_filter=512, name="rpn_conv_3x3")
+        rpn_relu = mx.sym.Activation(data=rpn_conv, act_type="relu", name="rpn_relu")
+        rpn_cls_score = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=2 * num_anchors, name="rpn_cls_score")
+        rpn_bbox_pred = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=4 * num_anchors, name="rpn_bbox_pred")
+        return rpn_cls_score, rpn_bbox_pred
+
+    def get_symbol(self, cfg, is_train=True):
+
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+            gt_boxes = mx.sym.Variable(name="gt_boxes")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        # res5
+        relu1, res5a_branch2b_offset, res5b_branch2b_offset, res5c_branch2b_offset = self.get_resnet_v1_conv5(conv_feat)
+
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                   normalization='valid', use_ignore=True, ignore_label=-1, name="rpn_cls_prob")
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0, data=(rpn_bbox_pred - rpn_bbox_target))
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_, grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+
+            # ROI proposal
+            rpn_cls_act = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_act")
+            rpn_cls_act_reshape = mx.sym.Reshape(
+                data=rpn_cls_act, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_act_reshape')
+            if cfg.TRAIN.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            # ROI proposal target
+            gt_boxes_reshape = mx.sym.Reshape(data=gt_boxes, shape=(-1, 5), name='gt_boxes_reshape')
+            rois, label, bbox_target, bbox_weight = mx.sym.Custom(rois=rois, gt_boxes=gt_boxes_reshape,
+                                                                  op_type='proposal_target',
+                                                                  num_classes=num_reg_classes,
+                                                                  batch_images=cfg.TRAIN.BATCH_IMAGES,
+                                                                  batch_rois=cfg.TRAIN.BATCH_ROIS,
+                                                                  cfg=cPickle.dumps(cfg),
+                                                                  fg_fraction=cfg.TRAIN.FG_FRACTION)
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+
+
+
+        # conv_new_1
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=1024, name="conv_new_1", lr_mult=3.0)
+        relu_new_1 = mx.sym.Activation(data=conv_new_1, act_type='relu', name='relu1')
+
+        # rfcn_cls/rfcn_bbox
+        rfcn_cls = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7*7*num_classes, name="rfcn_cls")
+        rfcn_bbox = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7*7*4*num_reg_classes, name="rfcn_bbox")
+        psroipooled_cls_rois = mx.contrib.sym.PSROIPooling(name='psroipooled_cls_rois', data=rfcn_cls, rois=rois, group_size=7, pooled_size=7,
+                                                   output_dim=num_classes, spatial_scale=0.0625)
+        psroipooled_loc_rois = mx.contrib.sym.PSROIPooling(name='psroipooled_loc_rois', data=rfcn_bbox, rois=rois, group_size=7, pooled_size=7,
+                                                   output_dim=4 * num_reg_classes, spatial_scale=0.0625)
+        cls_score = mx.sym.Pooling(name='ave_cls_scors_rois', data=psroipooled_cls_rois, pool_type='avg', global_pool=True, kernel=(7, 7))
+        bbox_pred = mx.sym.Pooling(name='ave_bbox_pred_rois', data=psroipooled_loc_rois, pool_type='avg', global_pool=True, kernel=(7, 7))
+        cls_score = mx.sym.Reshape(name='cls_score_reshape', data=cls_score, shape=(-1, num_classes))
+        bbox_pred = mx.sym.Reshape(name='bbox_pred_reshape', data=bbox_pred, shape=(-1, 4 * num_reg_classes))
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes, roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem, normalization='valid', use_ignore=True, ignore_label=-1)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+                rcnn_label = labels_ohem
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid')
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+                rcnn_label = label
+
+            # reshape output
+            rcnn_label = mx.sym.Reshape(data=rcnn_label, shape=(cfg.TRAIN.BATCH_IMAGES, -1), name='label_reshape')
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes), name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes), name='bbox_loss_reshape')
+            group = mx.sym.Group([rpn_cls_prob, rpn_bbox_loss, cls_prob, bbox_loss, mx.sym.BlockGrad(rcnn_label)])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([res5a_branch2b_offset, res5b_branch2b_offset, res5c_branch2b_offset, rois, cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def get_symbol_rpn(self, cfg, is_train=True):
+        # config aliases for convenience
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                normalization='valid', use_ignore=True, ignore_label=-1, name="rpn_cls_prob",
+                                                grad_scale=1.0)
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0, data=(rpn_bbox_pred - rpn_bbox_target))
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_, grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+            group = mx.symbol.Group([rpn_cls_prob, rpn_bbox_loss])
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois, score = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois', output_score=True,
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois, score = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois', output_score=True,
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            group = mx.symbol.Group([rois, score])
+        self.sym = group
+        return group
+
+    def get_symbol_rfcn(self, cfg, is_train=True):
+
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+
+        # input init
+        if is_train:
+            data = mx.symbol.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            label = mx.symbol.Variable(name='label')
+            bbox_target = mx.symbol.Variable(name='bbox_target')
+            bbox_weight = mx.symbol.Variable(name='bbox_weight')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+            label = mx.symbol.Reshape(data=label, shape=(-1,), name='label_reshape')
+            bbox_target = mx.symbol.Reshape(data=bbox_target, shape=(-1, 4 * num_reg_classes), name='bbox_target_reshape')
+            bbox_weight = mx.symbol.Reshape(data=bbox_weight, shape=(-1, 4 * num_reg_classes), name='bbox_weight_reshape')
+        else:
+            data = mx.sym.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        relu1, res5a_branch2b_offset, res5b_branch2b_offset, res5c_branch2b_offset = self.get_resnet_v1_conv5(conv_feat)
+
+        # conv_new_1
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=1024, name="conv_new_1", lr_mult=3.0)
+        relu_new_1 = mx.sym.Activation(data=conv_new_1, act_type='relu', name='relu1')
+
+        # rfcn_cls/rfcn_bbox
+        rfcn_cls = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7*7*num_classes, name="rfcn_cls")
+        rfcn_bbox = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7*7*4*num_reg_classes, name="rfcn_bbox")
+        psroipooled_cls_rois = mx.contrib.sym.PSROIPooling(name='psroipooled_cls_rois', data=rfcn_cls, rois=rois, group_size=7, pooled_size=7,
+                                                   output_dim=num_classes, spatial_scale=0.0625)
+        psroipooled_loc_rois = mx.contrib.sym.PSROIPooling(name='psroipooled_loc_rois', data=rfcn_bbox, rois=rois, group_size=7, pooled_size=7,
+                                                   output_dim=4 * num_reg_classes, spatial_scale=0.0625)
+        cls_score = mx.sym.Pooling(name='ave_cls_scors_rois', data=psroipooled_cls_rois, pool_type='avg', global_pool=True, kernel=(7, 7))
+        bbox_pred = mx.sym.Pooling(name='ave_bbox_pred_rois', data=psroipooled_loc_rois, pool_type='avg', global_pool=True, kernel=(7, 7))
+        cls_score = mx.sym.Reshape(name='cls_score_reshape', data=cls_score, shape=(-1, num_classes))
+        bbox_pred = mx.sym.Reshape(name='bbox_pred_reshape', data=bbox_pred, shape=(-1, 4 * num_reg_classes))
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes, roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem, normalization='valid', use_ignore=True, ignore_label=-1, grad_scale=1.0)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+                label = labels_ohem
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid', grad_scale=1.0)
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+
+            # reshape output
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes), name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes), name='bbox_loss_reshape')
+            group = mx.sym.Group([cls_prob, bbox_loss, mx.sym.BlockGrad(label)]) if cfg.TRAIN.ENABLE_OHEM else mx.sym.Group([cls_prob, bbox_loss])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([res5a_branch2b_offset, res5b_branch2b_offset, res5c_branch2b_offset, cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def init_weight_rpn(self, cfg, arg_params, aux_params):
+        arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_conv_3x3_weight'])
+        arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_conv_3x3_bias'])
+        arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_cls_score_weight'])
+        arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_cls_score_bias'])
+        arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_bbox_pred_weight'])
+        arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_bbox_pred_bias'])
+
+    def init_weight_rfcn(self, cfg, arg_params, aux_params):
+        arg_params['res5a_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_weight'])
+        arg_params['res5a_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5a_branch2b_offset_bias'])
+        arg_params['res5b_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_weight'])
+        arg_params['res5b_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5b_branch2b_offset_bias'])
+        arg_params['res5c_branch2b_offset_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_weight'])
+        arg_params['res5c_branch2b_offset_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['res5c_branch2b_offset_bias'])
+        arg_params['conv_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['conv_new_1_weight'])
+        arg_params['conv_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['conv_new_1_bias'])
+        arg_params['rfcn_cls_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rfcn_cls_weight'])
+        arg_params['rfcn_cls_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_cls_bias'])
+        arg_params['rfcn_bbox_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rfcn_bbox_weight'])
+        arg_params['rfcn_bbox_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_bbox_bias'])
+
+    def init_weight(self, cfg, arg_params, aux_params):
+        self.init_weight_rpn(cfg, arg_params, aux_params)
+        self.init_weight_rfcn(cfg, arg_params, aux_params)
diff --git a/rfcn/symbols/deform_psroi_demo.py b/rfcn/symbols/deform_psroi_demo.py
new file mode 100644
index 0000000..1a9cc58
--- /dev/null
+++ b/rfcn/symbols/deform_psroi_demo.py
@@ -0,0 +1,1007 @@
+# --------------------------------------------------------
+# Deformable Convolutional Networks
+# Copyright (c) 2017 Microsoft
+# Licensed under The Apache-2.0 License [see LICENSE for details]
+# Written by Yuwen Xiong, Xizhou Zhu
+# --------------------------------------------------------
+
+import cPickle
+import mxnet as mx
+from utils.symbol import Symbol
+from operator_py.proposal import *
+from operator_py.proposal_target import *
+from operator_py.box_annotator_ohem import *
+
+
+class deform_psroi_demo(Symbol):
+
+    def __init__(self):
+        """
+        Use __init__ to define the parameters the network needs
+        """
+        self.eps = 1e-5
+        self.use_global_stats = True
+        self.workspace = 512
+        self.units = (3, 4, 23, 3)  # residual units per stage for ResNet-101
+        self.filter_list = [256, 512, 1024, 2048]
+
+    def get_resnet_v1_conv4(self, data):
+        conv1 = mx.symbol.Convolution(name='conv1', data=data, num_filter=64, pad=(3, 3), kernel=(7, 7), stride=(2, 2),
+                                      no_bias=True)
+        bn_conv1 = mx.symbol.BatchNorm(name='bn_conv1', data=conv1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale_conv1 = bn_conv1
+        conv1_relu = mx.symbol.Activation(name='conv1_relu', data=scale_conv1, act_type='relu')
+        pool1 = mx.symbol.Pooling(name='pool1', data=conv1_relu, pooling_convention='full', pad=(0, 0), kernel=(3, 3),
+                                  stride=(2, 2), pool_type='max')
+        res2a_branch1 = mx.symbol.Convolution(name='res2a_branch1', data=pool1, num_filter=256, pad=(0, 0), kernel=(1, 1),
+                                              stride=(1, 1), no_bias=True)
+        bn2a_branch1 = mx.symbol.BatchNorm(name='bn2a_branch1', data=res2a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale2a_branch1 = bn2a_branch1
+        res2a_branch2a = mx.symbol.Convolution(name='res2a_branch2a', data=pool1, num_filter=64, pad=(0, 0), kernel=(1, 1),
+                                               stride=(1, 1), no_bias=True)
+        bn2a_branch2a = mx.symbol.BatchNorm(name='bn2a_branch2a', data=res2a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2a = bn2a_branch2a
+        res2a_branch2a_relu = mx.symbol.Activation(name='res2a_branch2a_relu', data=scale2a_branch2a, act_type='relu')
+        res2a_branch2b = mx.symbol.Convolution(name='res2a_branch2b', data=res2a_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2a_branch2b = mx.symbol.BatchNorm(name='bn2a_branch2b', data=res2a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2b = bn2a_branch2b
+        res2a_branch2b_relu = mx.symbol.Activation(name='res2a_branch2b_relu', data=scale2a_branch2b, act_type='relu')
+        res2a_branch2c = mx.symbol.Convolution(name='res2a_branch2c', data=res2a_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2a_branch2c = mx.symbol.BatchNorm(name='bn2a_branch2c', data=res2a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2a_branch2c = bn2a_branch2c
+        res2a = mx.symbol.broadcast_add(name='res2a', *[scale2a_branch1, scale2a_branch2c])
+        res2a_relu = mx.symbol.Activation(name='res2a_relu', data=res2a, act_type='relu')
+        res2b_branch2a = mx.symbol.Convolution(name='res2b_branch2a', data=res2a_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2a = mx.symbol.BatchNorm(name='bn2b_branch2a', data=res2b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2a = bn2b_branch2a
+        res2b_branch2a_relu = mx.symbol.Activation(name='res2b_branch2a_relu', data=scale2b_branch2a, act_type='relu')
+        res2b_branch2b = mx.symbol.Convolution(name='res2b_branch2b', data=res2b_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2b_branch2b = mx.symbol.BatchNorm(name='bn2b_branch2b', data=res2b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2b = bn2b_branch2b
+        res2b_branch2b_relu = mx.symbol.Activation(name='res2b_branch2b_relu', data=scale2b_branch2b, act_type='relu')
+        res2b_branch2c = mx.symbol.Convolution(name='res2b_branch2c', data=res2b_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2b_branch2c = mx.symbol.BatchNorm(name='bn2b_branch2c', data=res2b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2b_branch2c = bn2b_branch2c
+        res2b = mx.symbol.broadcast_add(name='res2b', *[res2a_relu, scale2b_branch2c])
+        res2b_relu = mx.symbol.Activation(name='res2b_relu', data=res2b, act_type='relu')
+        res2c_branch2a = mx.symbol.Convolution(name='res2c_branch2a', data=res2b_relu, num_filter=64, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2a = mx.symbol.BatchNorm(name='bn2c_branch2a', data=res2c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2a = bn2c_branch2a
+        res2c_branch2a_relu = mx.symbol.Activation(name='res2c_branch2a_relu', data=scale2c_branch2a, act_type='relu')
+        res2c_branch2b = mx.symbol.Convolution(name='res2c_branch2b', data=res2c_branch2a_relu, num_filter=64, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn2c_branch2b = mx.symbol.BatchNorm(name='bn2c_branch2b', data=res2c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2b = bn2c_branch2b
+        res2c_branch2b_relu = mx.symbol.Activation(name='res2c_branch2b_relu', data=scale2c_branch2b, act_type='relu')
+        res2c_branch2c = mx.symbol.Convolution(name='res2c_branch2c', data=res2c_branch2b_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn2c_branch2c = mx.symbol.BatchNorm(name='bn2c_branch2c', data=res2c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale2c_branch2c = bn2c_branch2c
+        res2c = mx.symbol.broadcast_add(name='res2c', *[res2b_relu, scale2c_branch2c])
+        res2c_relu = mx.symbol.Activation(name='res2c_relu', data=res2c, act_type='relu')
+        res3a_branch1 = mx.symbol.Convolution(name='res3a_branch1', data=res2c_relu, num_filter=512, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch1 = mx.symbol.BatchNorm(name='bn3a_branch1', data=res3a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale3a_branch1 = bn3a_branch1
+        res3a_branch2a = mx.symbol.Convolution(name='res3a_branch2a', data=res2c_relu, num_filter=128, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn3a_branch2a = mx.symbol.BatchNorm(name='bn3a_branch2a', data=res3a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2a = bn3a_branch2a
+        res3a_branch2a_relu = mx.symbol.Activation(name='res3a_branch2a_relu', data=scale3a_branch2a, act_type='relu')
+        res3a_branch2b = mx.symbol.Convolution(name='res3a_branch2b', data=res3a_branch2a_relu, num_filter=128, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3a_branch2b = mx.symbol.BatchNorm(name='bn3a_branch2b', data=res3a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2b = bn3a_branch2b
+        res3a_branch2b_relu = mx.symbol.Activation(name='res3a_branch2b_relu', data=scale3a_branch2b, act_type='relu')
+        res3a_branch2c = mx.symbol.Convolution(name='res3a_branch2c', data=res3a_branch2b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3a_branch2c = mx.symbol.BatchNorm(name='bn3a_branch2c', data=res3a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale3a_branch2c = bn3a_branch2c
+        res3a = mx.symbol.broadcast_add(name='res3a', *[scale3a_branch1, scale3a_branch2c])
+        res3a_relu = mx.symbol.Activation(name='res3a_relu', data=res3a, act_type='relu')
+        res3b1_branch2a = mx.symbol.Convolution(name='res3b1_branch2a', data=res3a_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2a = mx.symbol.BatchNorm(name='bn3b1_branch2a', data=res3b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2a = bn3b1_branch2a
+        res3b1_branch2a_relu = mx.symbol.Activation(name='res3b1_branch2a_relu', data=scale3b1_branch2a, act_type='relu')
+        res3b1_branch2b = mx.symbol.Convolution(name='res3b1_branch2b', data=res3b1_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b1_branch2b = mx.symbol.BatchNorm(name='bn3b1_branch2b', data=res3b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2b = bn3b1_branch2b
+        res3b1_branch2b_relu = mx.symbol.Activation(name='res3b1_branch2b_relu', data=scale3b1_branch2b, act_type='relu')
+        res3b1_branch2c = mx.symbol.Convolution(name='res3b1_branch2c', data=res3b1_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b1_branch2c = mx.symbol.BatchNorm(name='bn3b1_branch2c', data=res3b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b1_branch2c = bn3b1_branch2c
+        res3b1 = mx.symbol.broadcast_add(name='res3b1', *[res3a_relu, scale3b1_branch2c])
+        res3b1_relu = mx.symbol.Activation(name='res3b1_relu', data=res3b1, act_type='relu')
+        res3b2_branch2a = mx.symbol.Convolution(name='res3b2_branch2a', data=res3b1_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2a = mx.symbol.BatchNorm(name='bn3b2_branch2a', data=res3b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2a = bn3b2_branch2a
+        res3b2_branch2a_relu = mx.symbol.Activation(name='res3b2_branch2a_relu', data=scale3b2_branch2a, act_type='relu')
+        res3b2_branch2b = mx.symbol.Convolution(name='res3b2_branch2b', data=res3b2_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b2_branch2b = mx.symbol.BatchNorm(name='bn3b2_branch2b', data=res3b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2b = bn3b2_branch2b
+        res3b2_branch2b_relu = mx.symbol.Activation(name='res3b2_branch2b_relu', data=scale3b2_branch2b, act_type='relu')
+        res3b2_branch2c = mx.symbol.Convolution(name='res3b2_branch2c', data=res3b2_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b2_branch2c = mx.symbol.BatchNorm(name='bn3b2_branch2c', data=res3b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b2_branch2c = bn3b2_branch2c
+        res3b2 = mx.symbol.broadcast_add(name='res3b2', *[res3b1_relu, scale3b2_branch2c])
+        res3b2_relu = mx.symbol.Activation(name='res3b2_relu', data=res3b2, act_type='relu')
+        res3b3_branch2a = mx.symbol.Convolution(name='res3b3_branch2a', data=res3b2_relu, num_filter=128, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2a = mx.symbol.BatchNorm(name='bn3b3_branch2a', data=res3b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2a = bn3b3_branch2a
+        res3b3_branch2a_relu = mx.symbol.Activation(name='res3b3_branch2a_relu', data=scale3b3_branch2a, act_type='relu')
+        res3b3_branch2b = mx.symbol.Convolution(name='res3b3_branch2b', data=res3b3_branch2a_relu, num_filter=128,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn3b3_branch2b = mx.symbol.BatchNorm(name='bn3b3_branch2b', data=res3b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2b = bn3b3_branch2b
+        res3b3_branch2b_relu = mx.symbol.Activation(name='res3b3_branch2b_relu', data=scale3b3_branch2b, act_type='relu')
+        res3b3_branch2c = mx.symbol.Convolution(name='res3b3_branch2c', data=res3b3_branch2b_relu, num_filter=512,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn3b3_branch2c = mx.symbol.BatchNorm(name='bn3b3_branch2c', data=res3b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale3b3_branch2c = bn3b3_branch2c
+        res3b3 = mx.symbol.broadcast_add(name='res3b3', *[res3b2_relu, scale3b3_branch2c])
+        res3b3_relu = mx.symbol.Activation(name='res3b3_relu', data=res3b3, act_type='relu')
+        res4a_branch1 = mx.symbol.Convolution(name='res4a_branch1', data=res3b3_relu, num_filter=1024, pad=(0, 0),
+                                              kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch1 = mx.symbol.BatchNorm(name='bn4a_branch1', data=res4a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale4a_branch1 = bn4a_branch1
+        res4a_branch2a = mx.symbol.Convolution(name='res4a_branch2a', data=res3b3_relu, num_filter=256, pad=(0, 0),
+                                               kernel=(1, 1), stride=(2, 2), no_bias=True)
+        bn4a_branch2a = mx.symbol.BatchNorm(name='bn4a_branch2a', data=res4a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2a = bn4a_branch2a
+        res4a_branch2a_relu = mx.symbol.Activation(name='res4a_branch2a_relu', data=scale4a_branch2a, act_type='relu')
+        res4a_branch2b = mx.symbol.Convolution(name='res4a_branch2b', data=res4a_branch2a_relu, num_filter=256, pad=(1, 1),
+                                               kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4a_branch2b = mx.symbol.BatchNorm(name='bn4a_branch2b', data=res4a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2b = bn4a_branch2b
+        res4a_branch2b_relu = mx.symbol.Activation(name='res4a_branch2b_relu', data=scale4a_branch2b, act_type='relu')
+        res4a_branch2c = mx.symbol.Convolution(name='res4a_branch2c', data=res4a_branch2b_relu, num_filter=1024, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4a_branch2c = mx.symbol.BatchNorm(name='bn4a_branch2c', data=res4a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale4a_branch2c = bn4a_branch2c
+        res4a = mx.symbol.broadcast_add(name='res4a', *[scale4a_branch1, scale4a_branch2c])
+        res4a_relu = mx.symbol.Activation(name='res4a_relu', data=res4a, act_type='relu')
+        res4b1_branch2a = mx.symbol.Convolution(name='res4b1_branch2a', data=res4a_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2a = mx.symbol.BatchNorm(name='bn4b1_branch2a', data=res4b1_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2a = bn4b1_branch2a
+        res4b1_branch2a_relu = mx.symbol.Activation(name='res4b1_branch2a_relu', data=scale4b1_branch2a, act_type='relu')
+        res4b1_branch2b = mx.symbol.Convolution(name='res4b1_branch2b', data=res4b1_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b1_branch2b = mx.symbol.BatchNorm(name='bn4b1_branch2b', data=res4b1_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2b = bn4b1_branch2b
+        res4b1_branch2b_relu = mx.symbol.Activation(name='res4b1_branch2b_relu', data=scale4b1_branch2b, act_type='relu')
+        res4b1_branch2c = mx.symbol.Convolution(name='res4b1_branch2c', data=res4b1_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b1_branch2c = mx.symbol.BatchNorm(name='bn4b1_branch2c', data=res4b1_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b1_branch2c = bn4b1_branch2c
+        res4b1 = mx.symbol.broadcast_add(name='res4b1', *[res4a_relu, scale4b1_branch2c])
+        res4b1_relu = mx.symbol.Activation(name='res4b1_relu', data=res4b1, act_type='relu')
+        res4b2_branch2a = mx.symbol.Convolution(name='res4b2_branch2a', data=res4b1_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2a = mx.symbol.BatchNorm(name='bn4b2_branch2a', data=res4b2_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2a = bn4b2_branch2a
+        res4b2_branch2a_relu = mx.symbol.Activation(name='res4b2_branch2a_relu', data=scale4b2_branch2a, act_type='relu')
+        res4b2_branch2b = mx.symbol.Convolution(name='res4b2_branch2b', data=res4b2_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b2_branch2b = mx.symbol.BatchNorm(name='bn4b2_branch2b', data=res4b2_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2b = bn4b2_branch2b
+        res4b2_branch2b_relu = mx.symbol.Activation(name='res4b2_branch2b_relu', data=scale4b2_branch2b, act_type='relu')
+        res4b2_branch2c = mx.symbol.Convolution(name='res4b2_branch2c', data=res4b2_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b2_branch2c = mx.symbol.BatchNorm(name='bn4b2_branch2c', data=res4b2_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b2_branch2c = bn4b2_branch2c
+        res4b2 = mx.symbol.broadcast_add(name='res4b2', *[res4b1_relu, scale4b2_branch2c])
+        res4b2_relu = mx.symbol.Activation(name='res4b2_relu', data=res4b2, act_type='relu')
+        res4b3_branch2a = mx.symbol.Convolution(name='res4b3_branch2a', data=res4b2_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2a = mx.symbol.BatchNorm(name='bn4b3_branch2a', data=res4b3_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2a = bn4b3_branch2a
+        res4b3_branch2a_relu = mx.symbol.Activation(name='res4b3_branch2a_relu', data=scale4b3_branch2a, act_type='relu')
+        res4b3_branch2b = mx.symbol.Convolution(name='res4b3_branch2b', data=res4b3_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b3_branch2b = mx.symbol.BatchNorm(name='bn4b3_branch2b', data=res4b3_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2b = bn4b3_branch2b
+        res4b3_branch2b_relu = mx.symbol.Activation(name='res4b3_branch2b_relu', data=scale4b3_branch2b, act_type='relu')
+        res4b3_branch2c = mx.symbol.Convolution(name='res4b3_branch2c', data=res4b3_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b3_branch2c = mx.symbol.BatchNorm(name='bn4b3_branch2c', data=res4b3_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b3_branch2c = bn4b3_branch2c
+        res4b3 = mx.symbol.broadcast_add(name='res4b3', *[res4b2_relu, scale4b3_branch2c])
+        res4b3_relu = mx.symbol.Activation(name='res4b3_relu', data=res4b3, act_type='relu')
+        res4b4_branch2a = mx.symbol.Convolution(name='res4b4_branch2a', data=res4b3_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2a = mx.symbol.BatchNorm(name='bn4b4_branch2a', data=res4b4_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2a = bn4b4_branch2a
+        res4b4_branch2a_relu = mx.symbol.Activation(name='res4b4_branch2a_relu', data=scale4b4_branch2a, act_type='relu')
+        res4b4_branch2b = mx.symbol.Convolution(name='res4b4_branch2b', data=res4b4_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b4_branch2b = mx.symbol.BatchNorm(name='bn4b4_branch2b', data=res4b4_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2b = bn4b4_branch2b
+        res4b4_branch2b_relu = mx.symbol.Activation(name='res4b4_branch2b_relu', data=scale4b4_branch2b, act_type='relu')
+        res4b4_branch2c = mx.symbol.Convolution(name='res4b4_branch2c', data=res4b4_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b4_branch2c = mx.symbol.BatchNorm(name='bn4b4_branch2c', data=res4b4_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b4_branch2c = bn4b4_branch2c
+        res4b4 = mx.symbol.broadcast_add(name='res4b4', *[res4b3_relu, scale4b4_branch2c])
+        res4b4_relu = mx.symbol.Activation(name='res4b4_relu', data=res4b4, act_type='relu')
+        res4b5_branch2a = mx.symbol.Convolution(name='res4b5_branch2a', data=res4b4_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2a = mx.symbol.BatchNorm(name='bn4b5_branch2a', data=res4b5_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2a = bn4b5_branch2a
+        res4b5_branch2a_relu = mx.symbol.Activation(name='res4b5_branch2a_relu', data=scale4b5_branch2a, act_type='relu')
+        res4b5_branch2b = mx.symbol.Convolution(name='res4b5_branch2b', data=res4b5_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b5_branch2b = mx.symbol.BatchNorm(name='bn4b5_branch2b', data=res4b5_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2b = bn4b5_branch2b
+        res4b5_branch2b_relu = mx.symbol.Activation(name='res4b5_branch2b_relu', data=scale4b5_branch2b, act_type='relu')
+        res4b5_branch2c = mx.symbol.Convolution(name='res4b5_branch2c', data=res4b5_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b5_branch2c = mx.symbol.BatchNorm(name='bn4b5_branch2c', data=res4b5_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b5_branch2c = bn4b5_branch2c
+        res4b5 = mx.symbol.broadcast_add(name='res4b5', *[res4b4_relu, scale4b5_branch2c])
+        res4b5_relu = mx.symbol.Activation(name='res4b5_relu', data=res4b5, act_type='relu')
+        res4b6_branch2a = mx.symbol.Convolution(name='res4b6_branch2a', data=res4b5_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2a = mx.symbol.BatchNorm(name='bn4b6_branch2a', data=res4b6_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2a = bn4b6_branch2a
+        res4b6_branch2a_relu = mx.symbol.Activation(name='res4b6_branch2a_relu', data=scale4b6_branch2a, act_type='relu')
+        res4b6_branch2b = mx.symbol.Convolution(name='res4b6_branch2b', data=res4b6_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b6_branch2b = mx.symbol.BatchNorm(name='bn4b6_branch2b', data=res4b6_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2b = bn4b6_branch2b
+        res4b6_branch2b_relu = mx.symbol.Activation(name='res4b6_branch2b_relu', data=scale4b6_branch2b, act_type='relu')
+        res4b6_branch2c = mx.symbol.Convolution(name='res4b6_branch2c', data=res4b6_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b6_branch2c = mx.symbol.BatchNorm(name='bn4b6_branch2c', data=res4b6_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b6_branch2c = bn4b6_branch2c
+        res4b6 = mx.symbol.broadcast_add(name='res4b6', *[res4b5_relu, scale4b6_branch2c])
+        res4b6_relu = mx.symbol.Activation(name='res4b6_relu', data=res4b6, act_type='relu')
+        res4b7_branch2a = mx.symbol.Convolution(name='res4b7_branch2a', data=res4b6_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2a = mx.symbol.BatchNorm(name='bn4b7_branch2a', data=res4b7_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2a = bn4b7_branch2a
+        res4b7_branch2a_relu = mx.symbol.Activation(name='res4b7_branch2a_relu', data=scale4b7_branch2a, act_type='relu')
+        res4b7_branch2b = mx.symbol.Convolution(name='res4b7_branch2b', data=res4b7_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b7_branch2b = mx.symbol.BatchNorm(name='bn4b7_branch2b', data=res4b7_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2b = bn4b7_branch2b
+        res4b7_branch2b_relu = mx.symbol.Activation(name='res4b7_branch2b_relu', data=scale4b7_branch2b, act_type='relu')
+        res4b7_branch2c = mx.symbol.Convolution(name='res4b7_branch2c', data=res4b7_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b7_branch2c = mx.symbol.BatchNorm(name='bn4b7_branch2c', data=res4b7_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b7_branch2c = bn4b7_branch2c
+        res4b7 = mx.symbol.broadcast_add(name='res4b7', *[res4b6_relu, scale4b7_branch2c])
+        res4b7_relu = mx.symbol.Activation(name='res4b7_relu', data=res4b7, act_type='relu')
+        res4b8_branch2a = mx.symbol.Convolution(name='res4b8_branch2a', data=res4b7_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2a = mx.symbol.BatchNorm(name='bn4b8_branch2a', data=res4b8_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2a = bn4b8_branch2a
+        res4b8_branch2a_relu = mx.symbol.Activation(name='res4b8_branch2a_relu', data=scale4b8_branch2a, act_type='relu')
+        res4b8_branch2b = mx.symbol.Convolution(name='res4b8_branch2b', data=res4b8_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b8_branch2b = mx.symbol.BatchNorm(name='bn4b8_branch2b', data=res4b8_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2b = bn4b8_branch2b
+        res4b8_branch2b_relu = mx.symbol.Activation(name='res4b8_branch2b_relu', data=scale4b8_branch2b, act_type='relu')
+        res4b8_branch2c = mx.symbol.Convolution(name='res4b8_branch2c', data=res4b8_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b8_branch2c = mx.symbol.BatchNorm(name='bn4b8_branch2c', data=res4b8_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b8_branch2c = bn4b8_branch2c
+        res4b8 = mx.symbol.broadcast_add(name='res4b8', *[res4b7_relu, scale4b8_branch2c])
+        res4b8_relu = mx.symbol.Activation(name='res4b8_relu', data=res4b8, act_type='relu')
+        res4b9_branch2a = mx.symbol.Convolution(name='res4b9_branch2a', data=res4b8_relu, num_filter=256, pad=(0, 0),
+                                                kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2a = mx.symbol.BatchNorm(name='bn4b9_branch2a', data=res4b9_branch2a, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2a = bn4b9_branch2a
+        res4b9_branch2a_relu = mx.symbol.Activation(name='res4b9_branch2a_relu', data=scale4b9_branch2a, act_type='relu')
+        res4b9_branch2b = mx.symbol.Convolution(name='res4b9_branch2b', data=res4b9_branch2a_relu, num_filter=256,
+                                                pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b9_branch2b = mx.symbol.BatchNorm(name='bn4b9_branch2b', data=res4b9_branch2b, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2b = bn4b9_branch2b
+        res4b9_branch2b_relu = mx.symbol.Activation(name='res4b9_branch2b_relu', data=scale4b9_branch2b, act_type='relu')
+        res4b9_branch2c = mx.symbol.Convolution(name='res4b9_branch2c', data=res4b9_branch2b_relu, num_filter=1024,
+                                                pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b9_branch2c = mx.symbol.BatchNorm(name='bn4b9_branch2c', data=res4b9_branch2c, use_global_stats=True,
+                                             fix_gamma=False, eps=self.eps)
+        scale4b9_branch2c = bn4b9_branch2c
+        res4b9 = mx.symbol.broadcast_add(name='res4b9', *[res4b8_relu, scale4b9_branch2c])
+        res4b9_relu = mx.symbol.Activation(name='res4b9_relu', data=res4b9, act_type='relu')
+        res4b10_branch2a = mx.symbol.Convolution(name='res4b10_branch2a', data=res4b9_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2a = mx.symbol.BatchNorm(name='bn4b10_branch2a', data=res4b10_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2a = bn4b10_branch2a
+        res4b10_branch2a_relu = mx.symbol.Activation(name='res4b10_branch2a_relu', data=scale4b10_branch2a, act_type='relu')
+        res4b10_branch2b = mx.symbol.Convolution(name='res4b10_branch2b', data=res4b10_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b10_branch2b = mx.symbol.BatchNorm(name='bn4b10_branch2b', data=res4b10_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2b = bn4b10_branch2b
+        res4b10_branch2b_relu = mx.symbol.Activation(name='res4b10_branch2b_relu', data=scale4b10_branch2b, act_type='relu')
+        res4b10_branch2c = mx.symbol.Convolution(name='res4b10_branch2c', data=res4b10_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b10_branch2c = mx.symbol.BatchNorm(name='bn4b10_branch2c', data=res4b10_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b10_branch2c = bn4b10_branch2c
+        res4b10 = mx.symbol.broadcast_add(name='res4b10', *[res4b9_relu, scale4b10_branch2c])
+        res4b10_relu = mx.symbol.Activation(name='res4b10_relu', data=res4b10, act_type='relu')
+        res4b11_branch2a = mx.symbol.Convolution(name='res4b11_branch2a', data=res4b10_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2a = mx.symbol.BatchNorm(name='bn4b11_branch2a', data=res4b11_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2a = bn4b11_branch2a
+        res4b11_branch2a_relu = mx.symbol.Activation(name='res4b11_branch2a_relu', data=scale4b11_branch2a, act_type='relu')
+        res4b11_branch2b = mx.symbol.Convolution(name='res4b11_branch2b', data=res4b11_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b11_branch2b = mx.symbol.BatchNorm(name='bn4b11_branch2b', data=res4b11_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2b = bn4b11_branch2b
+        res4b11_branch2b_relu = mx.symbol.Activation(name='res4b11_branch2b_relu', data=scale4b11_branch2b, act_type='relu')
+        res4b11_branch2c = mx.symbol.Convolution(name='res4b11_branch2c', data=res4b11_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b11_branch2c = mx.symbol.BatchNorm(name='bn4b11_branch2c', data=res4b11_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b11_branch2c = bn4b11_branch2c
+        res4b11 = mx.symbol.broadcast_add(name='res4b11', *[res4b10_relu, scale4b11_branch2c])
+        res4b11_relu = mx.symbol.Activation(name='res4b11_relu', data=res4b11, act_type='relu')
+        res4b12_branch2a = mx.symbol.Convolution(name='res4b12_branch2a', data=res4b11_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2a = mx.symbol.BatchNorm(name='bn4b12_branch2a', data=res4b12_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2a = bn4b12_branch2a
+        res4b12_branch2a_relu = mx.symbol.Activation(name='res4b12_branch2a_relu', data=scale4b12_branch2a, act_type='relu')
+        res4b12_branch2b = mx.symbol.Convolution(name='res4b12_branch2b', data=res4b12_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b12_branch2b = mx.symbol.BatchNorm(name='bn4b12_branch2b', data=res4b12_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2b = bn4b12_branch2b
+        res4b12_branch2b_relu = mx.symbol.Activation(name='res4b12_branch2b_relu', data=scale4b12_branch2b, act_type='relu')
+        res4b12_branch2c = mx.symbol.Convolution(name='res4b12_branch2c', data=res4b12_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b12_branch2c = mx.symbol.BatchNorm(name='bn4b12_branch2c', data=res4b12_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b12_branch2c = bn4b12_branch2c
+        res4b12 = mx.symbol.broadcast_add(name='res4b12', *[res4b11_relu, scale4b12_branch2c])
+        res4b12_relu = mx.symbol.Activation(name='res4b12_relu', data=res4b12, act_type='relu')
+        res4b13_branch2a = mx.symbol.Convolution(name='res4b13_branch2a', data=res4b12_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2a = mx.symbol.BatchNorm(name='bn4b13_branch2a', data=res4b13_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2a = bn4b13_branch2a
+        res4b13_branch2a_relu = mx.symbol.Activation(name='res4b13_branch2a_relu', data=scale4b13_branch2a, act_type='relu')
+        res4b13_branch2b = mx.symbol.Convolution(name='res4b13_branch2b', data=res4b13_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b13_branch2b = mx.symbol.BatchNorm(name='bn4b13_branch2b', data=res4b13_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2b = bn4b13_branch2b
+        res4b13_branch2b_relu = mx.symbol.Activation(name='res4b13_branch2b_relu', data=scale4b13_branch2b, act_type='relu')
+        res4b13_branch2c = mx.symbol.Convolution(name='res4b13_branch2c', data=res4b13_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b13_branch2c = mx.symbol.BatchNorm(name='bn4b13_branch2c', data=res4b13_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b13_branch2c = bn4b13_branch2c
+        res4b13 = mx.symbol.broadcast_add(name='res4b13', *[res4b12_relu, scale4b13_branch2c])
+        res4b13_relu = mx.symbol.Activation(name='res4b13_relu', data=res4b13, act_type='relu')
+        res4b14_branch2a = mx.symbol.Convolution(name='res4b14_branch2a', data=res4b13_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2a = mx.symbol.BatchNorm(name='bn4b14_branch2a', data=res4b14_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2a = bn4b14_branch2a
+        res4b14_branch2a_relu = mx.symbol.Activation(name='res4b14_branch2a_relu', data=scale4b14_branch2a, act_type='relu')
+        res4b14_branch2b = mx.symbol.Convolution(name='res4b14_branch2b', data=res4b14_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b14_branch2b = mx.symbol.BatchNorm(name='bn4b14_branch2b', data=res4b14_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2b = bn4b14_branch2b
+        res4b14_branch2b_relu = mx.symbol.Activation(name='res4b14_branch2b_relu', data=scale4b14_branch2b, act_type='relu')
+        res4b14_branch2c = mx.symbol.Convolution(name='res4b14_branch2c', data=res4b14_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b14_branch2c = mx.symbol.BatchNorm(name='bn4b14_branch2c', data=res4b14_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b14_branch2c = bn4b14_branch2c
+        res4b14 = mx.symbol.broadcast_add(name='res4b14', *[res4b13_relu, scale4b14_branch2c])
+        res4b14_relu = mx.symbol.Activation(name='res4b14_relu', data=res4b14, act_type='relu')
+        res4b15_branch2a = mx.symbol.Convolution(name='res4b15_branch2a', data=res4b14_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2a = mx.symbol.BatchNorm(name='bn4b15_branch2a', data=res4b15_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2a = bn4b15_branch2a
+        res4b15_branch2a_relu = mx.symbol.Activation(name='res4b15_branch2a_relu', data=scale4b15_branch2a, act_type='relu')
+        res4b15_branch2b = mx.symbol.Convolution(name='res4b15_branch2b', data=res4b15_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b15_branch2b = mx.symbol.BatchNorm(name='bn4b15_branch2b', data=res4b15_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2b = bn4b15_branch2b
+        res4b15_branch2b_relu = mx.symbol.Activation(name='res4b15_branch2b_relu', data=scale4b15_branch2b, act_type='relu')
+        res4b15_branch2c = mx.symbol.Convolution(name='res4b15_branch2c', data=res4b15_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b15_branch2c = mx.symbol.BatchNorm(name='bn4b15_branch2c', data=res4b15_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b15_branch2c = bn4b15_branch2c
+        res4b15 = mx.symbol.broadcast_add(name='res4b15', *[res4b14_relu, scale4b15_branch2c])
+        res4b15_relu = mx.symbol.Activation(name='res4b15_relu', data=res4b15, act_type='relu')
+        res4b16_branch2a = mx.symbol.Convolution(name='res4b16_branch2a', data=res4b15_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2a = mx.symbol.BatchNorm(name='bn4b16_branch2a', data=res4b16_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2a = bn4b16_branch2a
+        res4b16_branch2a_relu = mx.symbol.Activation(name='res4b16_branch2a_relu', data=scale4b16_branch2a, act_type='relu')
+        res4b16_branch2b = mx.symbol.Convolution(name='res4b16_branch2b', data=res4b16_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b16_branch2b = mx.symbol.BatchNorm(name='bn4b16_branch2b', data=res4b16_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2b = bn4b16_branch2b
+        res4b16_branch2b_relu = mx.symbol.Activation(name='res4b16_branch2b_relu', data=scale4b16_branch2b, act_type='relu')
+        res4b16_branch2c = mx.symbol.Convolution(name='res4b16_branch2c', data=res4b16_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b16_branch2c = mx.symbol.BatchNorm(name='bn4b16_branch2c', data=res4b16_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b16_branch2c = bn4b16_branch2c
+        res4b16 = mx.symbol.broadcast_add(name='res4b16', *[res4b15_relu, scale4b16_branch2c])
+        res4b16_relu = mx.symbol.Activation(name='res4b16_relu', data=res4b16, act_type='relu')
+        res4b17_branch2a = mx.symbol.Convolution(name='res4b17_branch2a', data=res4b16_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2a = mx.symbol.BatchNorm(name='bn4b17_branch2a', data=res4b17_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2a = bn4b17_branch2a
+        res4b17_branch2a_relu = mx.symbol.Activation(name='res4b17_branch2a_relu', data=scale4b17_branch2a, act_type='relu')
+        res4b17_branch2b = mx.symbol.Convolution(name='res4b17_branch2b', data=res4b17_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b17_branch2b = mx.symbol.BatchNorm(name='bn4b17_branch2b', data=res4b17_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2b = bn4b17_branch2b
+        res4b17_branch2b_relu = mx.symbol.Activation(name='res4b17_branch2b_relu', data=scale4b17_branch2b, act_type='relu')
+        res4b17_branch2c = mx.symbol.Convolution(name='res4b17_branch2c', data=res4b17_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b17_branch2c = mx.symbol.BatchNorm(name='bn4b17_branch2c', data=res4b17_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b17_branch2c = bn4b17_branch2c
+        res4b17 = mx.symbol.broadcast_add(name='res4b17', *[res4b16_relu, scale4b17_branch2c])
+        res4b17_relu = mx.symbol.Activation(name='res4b17_relu', data=res4b17, act_type='relu')
+        res4b18_branch2a = mx.symbol.Convolution(name='res4b18_branch2a', data=res4b17_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2a = mx.symbol.BatchNorm(name='bn4b18_branch2a', data=res4b18_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2a = bn4b18_branch2a
+        res4b18_branch2a_relu = mx.symbol.Activation(name='res4b18_branch2a_relu', data=scale4b18_branch2a, act_type='relu')
+        res4b18_branch2b = mx.symbol.Convolution(name='res4b18_branch2b', data=res4b18_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b18_branch2b = mx.symbol.BatchNorm(name='bn4b18_branch2b', data=res4b18_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2b = bn4b18_branch2b
+        res4b18_branch2b_relu = mx.symbol.Activation(name='res4b18_branch2b_relu', data=scale4b18_branch2b, act_type='relu')
+        res4b18_branch2c = mx.symbol.Convolution(name='res4b18_branch2c', data=res4b18_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b18_branch2c = mx.symbol.BatchNorm(name='bn4b18_branch2c', data=res4b18_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b18_branch2c = bn4b18_branch2c
+        res4b18 = mx.symbol.broadcast_add(name='res4b18', *[res4b17_relu, scale4b18_branch2c])
+        res4b18_relu = mx.symbol.Activation(name='res4b18_relu', data=res4b18, act_type='relu')
+        res4b19_branch2a = mx.symbol.Convolution(name='res4b19_branch2a', data=res4b18_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2a = mx.symbol.BatchNorm(name='bn4b19_branch2a', data=res4b19_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2a = bn4b19_branch2a
+        res4b19_branch2a_relu = mx.symbol.Activation(name='res4b19_branch2a_relu', data=scale4b19_branch2a, act_type='relu')
+        res4b19_branch2b = mx.symbol.Convolution(name='res4b19_branch2b', data=res4b19_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b19_branch2b = mx.symbol.BatchNorm(name='bn4b19_branch2b', data=res4b19_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2b = bn4b19_branch2b
+        res4b19_branch2b_relu = mx.symbol.Activation(name='res4b19_branch2b_relu', data=scale4b19_branch2b, act_type='relu')
+        res4b19_branch2c = mx.symbol.Convolution(name='res4b19_branch2c', data=res4b19_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b19_branch2c = mx.symbol.BatchNorm(name='bn4b19_branch2c', data=res4b19_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b19_branch2c = bn4b19_branch2c
+        res4b19 = mx.symbol.broadcast_add(name='res4b19', *[res4b18_relu, scale4b19_branch2c])
+        res4b19_relu = mx.symbol.Activation(name='res4b19_relu', data=res4b19, act_type='relu')
+        res4b20_branch2a = mx.symbol.Convolution(name='res4b20_branch2a', data=res4b19_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2a = mx.symbol.BatchNorm(name='bn4b20_branch2a', data=res4b20_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2a = bn4b20_branch2a
+        res4b20_branch2a_relu = mx.symbol.Activation(name='res4b20_branch2a_relu', data=scale4b20_branch2a, act_type='relu')
+        res4b20_branch2b = mx.symbol.Convolution(name='res4b20_branch2b', data=res4b20_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b20_branch2b = mx.symbol.BatchNorm(name='bn4b20_branch2b', data=res4b20_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2b = bn4b20_branch2b
+        res4b20_branch2b_relu = mx.symbol.Activation(name='res4b20_branch2b_relu', data=scale4b20_branch2b, act_type='relu')
+        res4b20_branch2c = mx.symbol.Convolution(name='res4b20_branch2c', data=res4b20_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b20_branch2c = mx.symbol.BatchNorm(name='bn4b20_branch2c', data=res4b20_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b20_branch2c = bn4b20_branch2c
+        res4b20 = mx.symbol.broadcast_add(name='res4b20', *[res4b19_relu, scale4b20_branch2c])
+        res4b20_relu = mx.symbol.Activation(name='res4b20_relu', data=res4b20, act_type='relu')
+        res4b21_branch2a = mx.symbol.Convolution(name='res4b21_branch2a', data=res4b20_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2a = mx.symbol.BatchNorm(name='bn4b21_branch2a', data=res4b21_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2a = bn4b21_branch2a
+        res4b21_branch2a_relu = mx.symbol.Activation(name='res4b21_branch2a_relu', data=scale4b21_branch2a, act_type='relu')
+        res4b21_branch2b = mx.symbol.Convolution(name='res4b21_branch2b', data=res4b21_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b21_branch2b = mx.symbol.BatchNorm(name='bn4b21_branch2b', data=res4b21_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2b = bn4b21_branch2b
+        res4b21_branch2b_relu = mx.symbol.Activation(name='res4b21_branch2b_relu', data=scale4b21_branch2b, act_type='relu')
+        res4b21_branch2c = mx.symbol.Convolution(name='res4b21_branch2c', data=res4b21_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b21_branch2c = mx.symbol.BatchNorm(name='bn4b21_branch2c', data=res4b21_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b21_branch2c = bn4b21_branch2c
+        res4b21 = mx.symbol.broadcast_add(name='res4b21', *[res4b20_relu, scale4b21_branch2c])
+        res4b21_relu = mx.symbol.Activation(name='res4b21_relu', data=res4b21, act_type='relu')
+        res4b22_branch2a = mx.symbol.Convolution(name='res4b22_branch2a', data=res4b21_relu, num_filter=256, pad=(0, 0),
+                                                 kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2a = mx.symbol.BatchNorm(name='bn4b22_branch2a', data=res4b22_branch2a, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2a = bn4b22_branch2a
+        res4b22_branch2a_relu = mx.symbol.Activation(name='res4b22_branch2a_relu', data=scale4b22_branch2a, act_type='relu')
+        res4b22_branch2b = mx.symbol.Convolution(name='res4b22_branch2b', data=res4b22_branch2a_relu, num_filter=256,
+                                                 pad=(1, 1), kernel=(3, 3), stride=(1, 1), no_bias=True)
+        bn4b22_branch2b = mx.symbol.BatchNorm(name='bn4b22_branch2b', data=res4b22_branch2b, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2b = bn4b22_branch2b
+        res4b22_branch2b_relu = mx.symbol.Activation(name='res4b22_branch2b_relu', data=scale4b22_branch2b, act_type='relu')
+        res4b22_branch2c = mx.symbol.Convolution(name='res4b22_branch2c', data=res4b22_branch2b_relu, num_filter=1024,
+                                                 pad=(0, 0), kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn4b22_branch2c = mx.symbol.BatchNorm(name='bn4b22_branch2c', data=res4b22_branch2c, use_global_stats=True,
+                                              fix_gamma=False, eps=self.eps)
+        scale4b22_branch2c = bn4b22_branch2c
+        res4b22 = mx.symbol.broadcast_add(name='res4b22', *[res4b21_relu, scale4b22_branch2c])
+        res4b22_relu = mx.symbol.Activation(name='res4b22_relu', data=res4b22, act_type='relu')
+        return res4b22_relu
+
+    def get_resnet_v1_conv5(self, conv_feat):
+        res5a_branch1 = mx.symbol.Convolution(name='res5a_branch1', data=conv_feat, num_filter=2048, pad=(0, 0),
+                                              kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch1 = mx.symbol.BatchNorm(name='bn5a_branch1', data=res5a_branch1, use_global_stats=True, fix_gamma=False, eps=self.eps)
+        scale5a_branch1 = bn5a_branch1
+        res5a_branch2a = mx.symbol.Convolution(name='res5a_branch2a', data=conv_feat, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2a = mx.symbol.BatchNorm(name='bn5a_branch2a', data=res5a_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2a = bn5a_branch2a
+        res5a_branch2a_relu = mx.symbol.Activation(name='res5a_branch2a_relu', data=scale5a_branch2a, act_type='relu')
+        res5a_branch2b = mx.symbol.Convolution(name='res5a_branch2b', data=res5a_branch2a_relu, num_filter=512, pad=(2, 2),
+                                               kernel=(3, 3), stride=(1, 1), dilate=(2, 2), no_bias=True, cudnn_off=True)
+        bn5a_branch2b = mx.symbol.BatchNorm(name='bn5a_branch2b', data=res5a_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2b = bn5a_branch2b
+        res5a_branch2b_relu = mx.symbol.Activation(name='res5a_branch2b_relu', data=scale5a_branch2b, act_type='relu')
+        res5a_branch2c = mx.symbol.Convolution(name='res5a_branch2c', data=res5a_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5a_branch2c = mx.symbol.BatchNorm(name='bn5a_branch2c', data=res5a_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5a_branch2c = bn5a_branch2c
+        res5a = mx.symbol.broadcast_add(name='res5a', *[scale5a_branch1, scale5a_branch2c])
+        res5a_relu = mx.symbol.Activation(name='res5a_relu', data=res5a, act_type='relu')
+        res5b_branch2a = mx.symbol.Convolution(name='res5b_branch2a', data=res5a_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2a = mx.symbol.BatchNorm(name='bn5b_branch2a', data=res5b_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2a = bn5b_branch2a
+        res5b_branch2a_relu = mx.symbol.Activation(name='res5b_branch2a_relu', data=scale5b_branch2a, act_type='relu')
+        res5b_branch2b = mx.symbol.Convolution(name='res5b_branch2b', data=res5b_branch2a_relu, num_filter=512, pad=(2, 2),
+                                               kernel=(3, 3), stride=(1, 1), dilate=(2, 2), no_bias=True, cudnn_off=True)
+        bn5b_branch2b = mx.symbol.BatchNorm(name='bn5b_branch2b', data=res5b_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2b = bn5b_branch2b
+        res5b_branch2b_relu = mx.symbol.Activation(name='res5b_branch2b_relu', data=scale5b_branch2b, act_type='relu')
+        res5b_branch2c = mx.symbol.Convolution(name='res5b_branch2c', data=res5b_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5b_branch2c = mx.symbol.BatchNorm(name='bn5b_branch2c', data=res5b_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5b_branch2c = bn5b_branch2c
+        res5b = mx.symbol.broadcast_add(name='res5b', *[res5a_relu, scale5b_branch2c])
+        res5b_relu = mx.symbol.Activation(name='res5b_relu', data=res5b, act_type='relu')
+        res5c_branch2a = mx.symbol.Convolution(name='res5c_branch2a', data=res5b_relu, num_filter=512, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2a = mx.symbol.BatchNorm(name='bn5c_branch2a', data=res5c_branch2a, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2a = bn5c_branch2a
+        res5c_branch2a_relu = mx.symbol.Activation(name='res5c_branch2a_relu', data=scale5c_branch2a, act_type='relu')
+        res5c_branch2b = mx.symbol.Convolution(name='res5c_branch2b', data=res5c_branch2a_relu, num_filter=512, pad=(2, 2),
+                                               kernel=(3, 3), stride=(1, 1), dilate=(2, 2), no_bias=True, cudnn_off=True)
+        bn5c_branch2b = mx.symbol.BatchNorm(name='bn5c_branch2b', data=res5c_branch2b, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2b = bn5c_branch2b
+        res5c_branch2b_relu = mx.symbol.Activation(name='res5c_branch2b_relu', data=scale5c_branch2b, act_type='relu')
+        res5c_branch2c = mx.symbol.Convolution(name='res5c_branch2c', data=res5c_branch2b_relu, num_filter=2048, pad=(0, 0),
+                                               kernel=(1, 1), stride=(1, 1), no_bias=True)
+        bn5c_branch2c = mx.symbol.BatchNorm(name='bn5c_branch2c', data=res5c_branch2c, use_global_stats=True,
+                                            fix_gamma=False, eps=self.eps)
+        scale5c_branch2c = bn5c_branch2c
+        res5c = mx.symbol.broadcast_add(name='res5c', *[res5b_relu, scale5c_branch2c])
+        res5c_relu = mx.symbol.Activation(name='res5c_relu', data=res5c, act_type='relu')
+        return res5c_relu
+
+    def get_rpn(self, conv_feat, num_anchors):
+        rpn_conv = mx.sym.Convolution(
+            data=conv_feat, kernel=(3, 3), pad=(1, 1), num_filter=512, name="rpn_conv_3x3")
+        rpn_relu = mx.sym.Activation(data=rpn_conv, act_type="relu", name="rpn_relu")
+        rpn_cls_score = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=2 * num_anchors, name="rpn_cls_score")
+        rpn_bbox_pred = mx.sym.Convolution(
+            data=rpn_relu, kernel=(1, 1), pad=(0, 0), num_filter=4 * num_anchors, name="rpn_bbox_pred")
+        return rpn_cls_score, rpn_bbox_pred
+
+    def get_symbol(self, cfg, is_train=True):
+
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+            gt_boxes = mx.sym.Variable(name="gt_boxes")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        # res5
+        relu1 = self.get_resnet_v1_conv5(conv_feat)
+
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                normalization='valid', use_ignore=True, ignore_label=-1, name="rpn_cls_prob")
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0, data=(rpn_bbox_pred - rpn_bbox_target))
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_, grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+
+            # ROI proposal
+            rpn_cls_act = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_act")
+            rpn_cls_act_reshape = mx.sym.Reshape(
+                data=rpn_cls_act, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_act_reshape')
+            if cfg.TRAIN.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_act_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TRAIN.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TRAIN.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TRAIN.RPN_NMS_THRESH, rpn_min_size=cfg.TRAIN.RPN_MIN_SIZE)
+            # ROI proposal target
+            gt_boxes_reshape = mx.sym.Reshape(data=gt_boxes, shape=(-1, 5), name='gt_boxes_reshape')
+            rois, label, bbox_target, bbox_weight = mx.sym.Custom(rois=rois, gt_boxes=gt_boxes_reshape,
+                                                                  op_type='proposal_target',
+                                                                  num_classes=num_reg_classes,
+                                                                  batch_images=cfg.TRAIN.BATCH_IMAGES,
+                                                                  batch_rois=cfg.TRAIN.BATCH_ROIS,
+                                                                  cfg=cPickle.dumps(cfg),
+                                                                  fg_fraction=cfg.TRAIN.FG_FRACTION)
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois',
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+
+
+
+        # conv_new_1
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=1024, name="conv_new_1", lr_mult=3.0)
+        relu_new_1 = mx.sym.Activation(data=conv_new_1, act_type='relu', name='relu1')
+
+        # rfcn_cls/rfcn_bbox
+        rfcn_cls = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=3*3*num_classes, name="rfcn_cls")
+        rfcn_bbox = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=3*3*4*num_reg_classes, name="rfcn_bbox")
+        # trans_cls / trans_bbox
+        rfcn_cls_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=2 * 3 * 3 * num_classes, name="rfcn_cls_offset_t")
+        rfcn_bbox_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=3 * 3 * 2, name="rfcn_bbox_offset_t")
+
+        rfcn_cls_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_cls_offset', data=rfcn_cls_offset_t, rois=rois, group_size=3, pooled_size=3,
+                                                                sample_per_part=4, no_trans=True, part_size=3, output_dim=2 * num_classes, spatial_scale=0.0625)
+        rfcn_bbox_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_bbox_offset', data=rfcn_bbox_offset_t, rois=rois, group_size=3, pooled_size=3,
+                                                                 sample_per_part=4, no_trans=True, part_size=3, output_dim=2, spatial_scale=0.0625)
+
+        psroipooled_cls_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_cls_rois', data=rfcn_cls, rois=rois, trans=rfcn_cls_offset,
+                                                                     group_size=3, pooled_size=3, sample_per_part=4, no_trans=False, trans_std=0.1,
+                                                                     output_dim=num_classes, spatial_scale=0.0625, part_size=3)
+        psroipooled_loc_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_loc_rois', data=rfcn_bbox, rois=rois, trans=rfcn_bbox_offset,
+                                                                     group_size=3, pooled_size=3, sample_per_part=4, no_trans=False, trans_std=0.1,
+                                                                     output_dim=8, spatial_scale=0.0625, part_size=3)
+        cls_score = mx.sym.Pooling(name='ave_cls_scors_rois', data=psroipooled_cls_rois, pool_type='avg', global_pool=True, kernel=(3, 3))
+        bbox_pred = mx.sym.Pooling(name='ave_bbox_pred_rois', data=psroipooled_loc_rois, pool_type='avg', global_pool=True, kernel=(3, 3))
+        cls_score = mx.sym.Reshape(name='cls_score_reshape', data=cls_score, shape=(-1, num_classes))
+        bbox_pred = mx.sym.Reshape(name='bbox_pred_reshape', data=bbox_pred, shape=(-1, 4 * num_reg_classes))
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes, roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem, normalization='valid', use_ignore=True, ignore_label=-1)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+                rcnn_label = labels_ohem
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid')
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+                rcnn_label = label
+
+            # reshape output
+            rcnn_label = mx.sym.Reshape(data=rcnn_label, shape=(cfg.TRAIN.BATCH_IMAGES, -1), name='label_reshape')
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes), name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes), name='bbox_loss_reshape')
+            group = mx.sym.Group([rpn_cls_prob, rpn_bbox_loss, cls_prob, bbox_loss, mx.sym.BlockGrad(rcnn_label)])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([rfcn_cls_offset, rois, cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def get_symbol_rpn(self, cfg, is_train=True):
+        # config aliases for convenience
+        num_anchors = cfg.network.NUM_ANCHORS
+
+        # input init
+        if is_train:
+            data = mx.sym.Variable(name="data")
+            rpn_label = mx.sym.Variable(name='label')
+            rpn_bbox_target = mx.sym.Variable(name='bbox_target')
+            rpn_bbox_weight = mx.sym.Variable(name='bbox_weight')
+        else:
+            data = mx.sym.Variable(name="data")
+            im_info = mx.sym.Variable(name="im_info")
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        rpn_cls_score, rpn_bbox_pred = self.get_rpn(conv_feat, num_anchors)
+        if is_train:
+            # prepare rpn data
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+
+            # classification
+            rpn_cls_prob = mx.sym.SoftmaxOutput(data=rpn_cls_score_reshape, label=rpn_label, multi_output=True,
+                                                normalization='valid', use_ignore=True, ignore_label=-1, name="rpn_cls_prob",
+                                                grad_scale=1.0)
+            # bounding box regression
+            rpn_bbox_loss_ = rpn_bbox_weight * mx.sym.smooth_l1(name='rpn_bbox_loss_', scalar=3.0, data=(rpn_bbox_pred - rpn_bbox_target))
+            rpn_bbox_loss = mx.sym.MakeLoss(name='rpn_bbox_loss', data=rpn_bbox_loss_, grad_scale=1.0 / cfg.TRAIN.RPN_BATCH_SIZE)
+            group = mx.symbol.Group([rpn_cls_prob, rpn_bbox_loss])
+        else:
+            # ROI Proposal
+            rpn_cls_score_reshape = mx.sym.Reshape(
+                data=rpn_cls_score, shape=(0, 2, -1, 0), name="rpn_cls_score_reshape")
+            rpn_cls_prob = mx.sym.SoftmaxActivation(
+                data=rpn_cls_score_reshape, mode="channel", name="rpn_cls_prob")
+            rpn_cls_prob_reshape = mx.sym.Reshape(
+                data=rpn_cls_prob, shape=(0, 2 * num_anchors, -1, 0), name='rpn_cls_prob_reshape')
+            if cfg.TEST.CXX_PROPOSAL:
+                rois, score = mx.contrib.sym.Proposal(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois', output_score=True,
+                    feature_stride=cfg.network.RPN_FEAT_STRIDE, scales=tuple(cfg.network.ANCHOR_SCALES),
+                    ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            else:
+                rois, score = mx.sym.Custom(
+                    cls_prob=rpn_cls_prob_reshape, bbox_pred=rpn_bbox_pred, im_info=im_info, name='rois', output_score=True,
+                    op_type='proposal', feat_stride=cfg.network.RPN_FEAT_STRIDE,
+                    scales=tuple(cfg.network.ANCHOR_SCALES), ratios=tuple(cfg.network.ANCHOR_RATIOS),
+                    rpn_pre_nms_top_n=cfg.TEST.RPN_PRE_NMS_TOP_N, rpn_post_nms_top_n=cfg.TEST.RPN_POST_NMS_TOP_N,
+                    threshold=cfg.TEST.RPN_NMS_THRESH, rpn_min_size=cfg.TEST.RPN_MIN_SIZE)
+            group = mx.symbol.Group([rois, score])
+        self.sym = group
+        return group
+
+    def get_symbol_rfcn(self, cfg, is_train=True):
+
+        # config aliases for convenience
+        num_classes = cfg.dataset.NUM_CLASSES
+        num_reg_classes = (2 if cfg.CLASS_AGNOSTIC else num_classes)
+
+        # input init
+        if is_train:
+            data = mx.symbol.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            label = mx.symbol.Variable(name='label')
+            bbox_target = mx.symbol.Variable(name='bbox_target')
+            bbox_weight = mx.symbol.Variable(name='bbox_weight')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+            label = mx.symbol.Reshape(data=label, shape=(-1,), name='label_reshape')
+            bbox_target = mx.symbol.Reshape(data=bbox_target, shape=(-1, 4 * num_reg_classes), name='bbox_target_reshape')
+            bbox_weight = mx.symbol.Reshape(data=bbox_weight, shape=(-1, 4 * num_reg_classes), name='bbox_weight_reshape')
+        else:
+            data = mx.sym.Variable(name="data")
+            rois = mx.symbol.Variable(name='rois')
+            # reshape input
+            rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
+
+        # shared convolutional layers
+        conv_feat = self.get_resnet_v1_conv4(data)
+        relu1 = self.get_resnet_v1_conv5(conv_feat)
+
+        # conv_new_1
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=1024, name="conv_new_1", lr_mult=3.0)
+        relu_new_1 = mx.sym.Activation(data=conv_new_1, act_type='relu', name='relu1')
+
+        # rfcn_cls/rfcn_bbox
+        rfcn_cls = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=3*3*num_classes, name="rfcn_cls")
+        rfcn_bbox = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=3*3*4*num_reg_classes, name="rfcn_bbox")
+        # trans_cls / trans_bbox
+        rfcn_cls_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=2 * 3 * 3 * num_classes, name="rfcn_cls_offset_t")
+        rfcn_bbox_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=3 * 3 * 2, name="rfcn_bbox_offset_t")
+
+        rfcn_cls_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_cls_offset', data=rfcn_cls_offset_t, rois=rois, group_size=3, pooled_size=3,
+                                                                sample_per_part=4, no_trans=True, part_size=3, output_dim=2 * num_classes, spatial_scale=0.0625)
+        rfcn_bbox_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_bbox_offset', data=rfcn_bbox_offset_t, rois=rois, group_size=3, pooled_size=3,
+                                                                 sample_per_part=4, no_trans=True, part_size=3, output_dim=2, spatial_scale=0.0625)
+
+        psroipooled_cls_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_cls_rois', data=rfcn_cls, rois=rois, trans=rfcn_cls_offset,
+                                                                     group_size=3, pooled_size=3, sample_per_part=4, no_trans=False, trans_std=0.1,
+                                                                     output_dim=num_classes, spatial_scale=0.0625, part_size=3)
+        psroipooled_loc_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_loc_rois', data=rfcn_bbox, rois=rois, trans=rfcn_bbox_offset,
+                                                                     group_size=3, pooled_size=3, sample_per_part=4, no_trans=False, trans_std=0.1,
+                                                                     output_dim=8, spatial_scale=0.0625, part_size=3)
+        cls_score = mx.sym.Pooling(name='ave_cls_scors_rois', data=psroipooled_cls_rois, pool_type='avg', global_pool=True, kernel=(3, 3))
+        bbox_pred = mx.sym.Pooling(name='ave_bbox_pred_rois', data=psroipooled_loc_rois, pool_type='avg', global_pool=True, kernel=(3, 3))
+        cls_score = mx.sym.Reshape(name='cls_score_reshape', data=cls_score, shape=(-1, num_classes))
+        bbox_pred = mx.sym.Reshape(name='bbox_pred_reshape', data=bbox_pred, shape=(-1, 4 * num_reg_classes))
+
+        if is_train:
+            if cfg.TRAIN.ENABLE_OHEM:
+                labels_ohem, bbox_weights_ohem = mx.sym.Custom(op_type='BoxAnnotatorOHEM', num_classes=num_classes,
+                                                               num_reg_classes=num_reg_classes, roi_per_img=cfg.TRAIN.BATCH_ROIS_OHEM,
+                                                               cls_score=cls_score, bbox_pred=bbox_pred, labels=label,
+                                                               bbox_targets=bbox_target, bbox_weights=bbox_weight)
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=labels_ohem, normalization='valid', use_ignore=True, ignore_label=-1, grad_scale=1.0)
+                bbox_loss_ = bbox_weights_ohem * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS_OHEM)
+                label = labels_ohem
+            else:
+                cls_prob = mx.sym.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='valid', grad_scale=1.0)
+                bbox_loss_ = bbox_weight * mx.sym.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
+                bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / cfg.TRAIN.BATCH_ROIS)
+
+            # reshape output
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TRAIN.BATCH_IMAGES, -1, num_classes), name='cls_prob_reshape')
+            bbox_loss = mx.sym.Reshape(data=bbox_loss, shape=(cfg.TRAIN.BATCH_IMAGES, -1, 4 * num_reg_classes), name='bbox_loss_reshape')
+            group = mx.sym.Group([cls_prob, bbox_loss, mx.sym.BlockGrad(label)]) if cfg.TRAIN.ENABLE_OHEM else mx.sym.Group([cls_prob, bbox_loss])
+        else:
+            cls_prob = mx.sym.SoftmaxActivation(name='cls_prob', data=cls_score)
+            cls_prob = mx.sym.Reshape(data=cls_prob, shape=(cfg.TEST.BATCH_IMAGES, -1, num_classes),
+                                      name='cls_prob_reshape')
+            bbox_pred = mx.sym.Reshape(data=bbox_pred, shape=(cfg.TEST.BATCH_IMAGES, -1, 4 * num_reg_classes),
+                                       name='bbox_pred_reshape')
+            group = mx.sym.Group([rfcn_cls_offset, cls_prob, bbox_pred])
+
+        self.sym = group
+        return group
+
+    def init_weight_rpn(self, cfg, arg_params, aux_params):
+        arg_params['rpn_conv_3x3_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_conv_3x3_weight'])
+        arg_params['rpn_conv_3x3_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_conv_3x3_bias'])
+        arg_params['rpn_cls_score_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_cls_score_weight'])
+        arg_params['rpn_cls_score_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_cls_score_bias'])
+        arg_params['rpn_bbox_pred_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rpn_bbox_pred_weight'])
+        arg_params['rpn_bbox_pred_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rpn_bbox_pred_bias'])
+
+    def init_weight_rfcn(self, cfg, arg_params, aux_params):
+        arg_params['conv_new_1_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['conv_new_1_weight'])
+        arg_params['conv_new_1_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['conv_new_1_bias'])
+        arg_params['rfcn_cls_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rfcn_cls_weight'])
+        arg_params['rfcn_cls_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_cls_bias'])
+        arg_params['rfcn_bbox_weight'] = mx.random.normal(0, 0.01, shape=self.arg_shape_dict['rfcn_bbox_weight'])
+        arg_params['rfcn_bbox_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_bbox_bias'])
+        arg_params['rfcn_cls_offset_t_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_cls_offset_t_weight'])
+        arg_params['rfcn_cls_offset_t_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_cls_offset_t_bias'])
+        arg_params['rfcn_bbox_offset_t_weight'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_bbox_offset_t_weight'])
+        arg_params['rfcn_bbox_offset_t_bias'] = mx.nd.zeros(shape=self.arg_shape_dict['rfcn_bbox_offset_t_bias'])
+
+    def init_weight(self, cfg, arg_params, aux_params):
+        self.init_weight_rpn(cfg, arg_params, aux_params)
+        self.init_weight_rfcn(cfg, arg_params, aux_params)
+
diff --git a/rfcn/symbols/resnet_v1_101_rfcn_dcn.py b/rfcn/symbols/resnet_v1_101_rfcn_dcn.py
index deed243..0bcf652 100644
--- a/rfcn/symbols/resnet_v1_101_rfcn_dcn.py
+++ b/rfcn/symbols/resnet_v1_101_rfcn_dcn.py
@@ -934,53 +934,26 @@ def get_symbol_rfcn(self, cfg, is_train=True):
         relu1 = self.get_resnet_v1_conv5(conv_feat)
 
         # conv_new_1
-        conv_new_1_weight = mx.symbol.Variable('conv_new_1_weight', lr_mult=3.0)
-        conv_new_1_bias = mx.symbol.Variable('conv_new_1_bias', lr_mult=6.0)
-        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=1024, name="conv_new_1",
-                                        weight=conv_new_1_weight, bias=conv_new_1_bias)
+        conv_new_1 = mx.sym.Convolution(data=relu1, kernel=(1, 1), num_filter=1024, name="conv_new_1", lr_mult=3.0)
         relu_new_1 = mx.sym.Activation(data=conv_new_1, act_type='relu', name='relu1')
 
         # rfcn_cls/rfcn_bbox
-        rfcn_cls_weight = mx.symbol.Variable('rfcn_cls_weight', lr_mult=1.0)
-        rfcn_cls_bias = mx.symbol.Variable('rfcn_cls_bias', lr_mult=2.0)
-        rfcn_cls = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7 * 7 * num_classes, name="rfcn_cls",
-                                      weight=rfcn_cls_weight, bias=rfcn_cls_bias)
-        rfcn_bbox_weight = mx.symbol.Variable('rfcn_bbox_weight', lr_mult=1.0)
-        rfcn_bbox_bias = mx.symbol.Variable('rfcn_bbox_bias', lr_mult=2.0)
-        rfcn_bbox = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7 * 7 * 4 * num_reg_classes,
-                                       name="rfcn_bbox",
-                                       weight=rfcn_bbox_weight, bias=rfcn_bbox_bias)
+        rfcn_cls = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7*7*num_classes, name="rfcn_cls")
+        rfcn_bbox = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7*7*4*num_reg_classes, name="rfcn_bbox")
         # trans_cls / trans_cls
-        rfcn_cls_offset_t_weight = mx.symbol.Variable('rfcn_cls_offset_t_weight', lr_mult=1.0)
-        rfcn_cls_offset_t_bias = mx.symbol.Variable('rfcn_cls_offset_t_bias', lr_mult=2.0)
-        rfcn_cls_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=2 * 7 * 7 * num_classes,
-                                               name="rfcn_cls_offset_t",
-                                               weight=rfcn_cls_offset_t_weight, bias=rfcn_cls_offset_t_bias)
-        rfcn_bbox_offset_t_weight = mx.symbol.Variable('rfcn_bbox_offset_t_weight', lr_mult=1.0)
-        rfcn_bbox_offset_t_bias = mx.symbol.Variable('rfcn_bbox_offset_t_bias', lr_mult=2.0)
-        rfcn_bbox_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7 * 7 * 2,
-                                                name="rfcn_bbox_offset_t",
-                                                weight=rfcn_bbox_offset_t_weight, bias=rfcn_bbox_offset_t_bias)
+        rfcn_cls_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=2 * 7 * 7 * num_classes, name="rfcn_cls_offset_t")
+        rfcn_bbox_offset_t = mx.sym.Convolution(data=relu_new_1, kernel=(1, 1), num_filter=7 * 7 * 2, name="rfcn_bbox_offset_t")
 
-        rfcn_cls_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_cls_offset', data=rfcn_cls_offset_t,
-                                                                rois=rois, group_size=7, pooled_size=7,
-                                                                sample_per_part=4, no_trans=True, part_size=7,
-                                                                output_dim=2 * num_classes, spatial_scale=0.0625)
-        rfcn_bbox_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_bbox_offset', data=rfcn_bbox_offset_t,
-                                                                 rois=rois, group_size=7, pooled_size=7,
-                                                                 sample_per_part=4, no_trans=True, part_size=7,
-                                                                 output_dim=2, spatial_scale=0.0625)
+        rfcn_cls_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_cls_offset', data=rfcn_cls_offset_t, rois=rois, group_size=7, pooled_size=7,
+                                                                sample_per_part=4, no_trans=True, part_size=7, output_dim=2 * num_classes, spatial_scale=0.0625)
+        rfcn_bbox_offset = mx.contrib.sym.DeformablePSROIPooling(name='rfcn_bbox_offset', data=rfcn_bbox_offset_t, rois=rois, group_size=7, pooled_size=7,
+                                                                 sample_per_part=4, no_trans=True, part_size=7, output_dim=2, spatial_scale=0.0625)
 
-        psroipooled_cls_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_cls_rois', data=rfcn_cls,
-                                                                     rois=rois, trans=rfcn_cls_offset,
-                                                                     group_size=7, pooled_size=7, sample_per_part=4,
-                                                                     no_trans=False, trans_std=0.1,
-                                                                     output_dim=num_classes, spatial_scale=0.0625,
-                                                                     part_size=7)
-        psroipooled_loc_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_loc_rois', data=rfcn_bbox,
-                                                                     rois=rois, trans=rfcn_bbox_offset,
-                                                                     group_size=7, pooled_size=7, sample_per_part=4,
-                                                                     no_trans=False, trans_std=0.1,
+        psroipooled_cls_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_cls_rois', data=rfcn_cls, rois=rois, trans=rfcn_cls_offset,
+                                                                     group_size=7, pooled_size=7, sample_per_part=4, no_trans=False, trans_std=0.1,
+                                                                     output_dim=num_classes, spatial_scale=0.0625, part_size=7)
+        psroipooled_loc_rois = mx.contrib.sym.DeformablePSROIPooling(name='psroipooled_loc_rois', data=rfcn_bbox, rois=rois, trans=rfcn_bbox_offset,
+                                                                     group_size=7, pooled_size=7, sample_per_part=4, no_trans=False, trans_std=0.1,
                                                                      output_dim=8, spatial_scale=0.0625, part_size=7)
         cls_score = mx.sym.Pooling(name='ave_cls_scors_rois', data=psroipooled_cls_rois, pool_type='avg', global_pool=True, kernel=(7, 7))
         bbox_pred = mx.sym.Pooling(name='ave_bbox_pred_rois', data=psroipooled_loc_rois, pool_type='avg', global_pool=True, kernel=(7, 7))