Skip to content

Latest commit

 

History

History
 
 

visionlan

English | 中文

VisionLAN

VisionLAN: From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

1. Introduction

1.1 VisionLAN

Visual Language Modeling Network (VisionLAN) [1] is a text recognion model that learns the visual and linguistic information simultaneously via character-wise occluded feature maps in the training stage. This model does not require an extra language model to extract linguistic information, since the visual and linguistic information can be learned as a union.

Figure 1. The architecture of visionlan [1]

As shown above, the training pipeline of VisionLAN consists of three modules:

  • The backbone extract visual feature maps from the input image;

  • The Masked Language-aware Module (MLM) takes the visual feature maps and a randomly selected character index as inputs, and generates position-aware character mask map to create character-wise occluded feature maps;

  • Finally, the Visual Reasonin Module (VRM) takes occluded feature maps as inputs and makes prediction under the complete word-level supervision.

While in the test stage, MLM is not used. Only the backbone and VRM are used for prediction.

2. Results

2.1 Accuracy

According to our experiments, the evaluation results on ten public benchmark datasets is as follow:

Model Context Backbone Train Dataset Model Params Avg Accuracy Train Time Per Step Time FPS Recipe Download
visionlan D910x4-MS2.0-G resnet45 MJ+ST 42.2M 90.61% 7718 s/epoch 417 ms/step 1,840 img/s yaml(LF_1) yaml(LF_2) yaml(LA) ckpt files | mindir(LA)
Detailed accuracy results for ten benchmark datasets
Model Context IC03_860 IC03_867 IC13_857 IC13_1015 IC15_1811 IC15_2077 IIIT5k_3000 SVT SVTP CUTE80 Average
visionlan D910x4-MS2.0-G 96.16% 95.16% 95.92% 94.19% 84.04% 77.46% 95.53% 92.27% 85.74% 89.58% 90.61%

Notes:

  • Context: Training context denoted as {device}x{pieces}-{MS version}-{MS mode}. Mindspore mode can be either G (graph mode) or F (pynative mode). For example, D910x4-MS2.0-G denotes training on 4 pieces of 910 NPUs using graph mode based on MindSpore version 2.0.0.
  • Train datasets: MJ+ST stands for the combination of two synthetic datasets, SynthText(800k) and MJSynth.
  • To reproduce the result on other contexts, please ensure the global batch size is the same.
  • The models are trained from scratch without any pre-training. For more dataset details of training and evaluation, please refer to 3.2 Dataset preparation section.
  • The input Shape of MindIR of VisionLAN is (1, 3, 64, 256).

3. Quick Start

3.1 Installation

Please refer to the installation instruction in MindOCR.

3.2 Dataset preparation

Training sets

The authors of VisionLAN used two synthetic text datasets for training: SynthText(800k) and MJSynth. Please follow the instructions of the original VisionLAN repository to download the training sets.

After download SynthText.zip and MJSynth.zip, please unzip and place them under ./datasets/train. The training set contain 14,200,701 samples in total. More details are as follows:

Validation sets

The authors of VisionLAN used six real text datasets for evaluation: IIIT5K Words (IIIT5K_3000) ICDAR 2013 (IC13_857), Street View Text (SVT), ICDAR 2015 (IC15), Street View Text-Perspective (SVTP), CUTE80 (CUTE). We used the sum of the six benchmarks as validation sets. Please follow the instructions of the original VisionLAN repository to download the validation sets.

After download evaluation.zip, please unzip this zip file, and place them under ./datasets. Under ./datasets/evaluation, there are seven folders:

  • IIIT5K: 50M, 3000 samples
  • IC13: 72M, 857 samples
  • SVT: 2.4M, 647 samples
  • IC15: 21M, 1811 samples
  • SVTP: 1.8M, 645 samples
  • CUTE: 8.8M, 288 samples
  • Sumof6benchmarks: 155M, 7248 samples

During training, we only used the data under ./datasets/evaluation/Sumof6benchmarks as the validation sets. Users can delete the other folders ./datasets/evaluation optionally.

Test Sets

We choose ten benchmarks as the test sets to evaluate the model's performance. Users can download the test sets from here (ref: deep-text-recognition-benchmark). Only the evaluation.zip is required for testing.

After downloading the evaluation.zip, please unzip it, and rename the folder name from evaluation to test. Please place this folder under ./datasets/.

The test sets contain 12,067 samples in total. The detailed information is as follows:

In the end of preparation, the file structure should be like:

datasets
├── test
│   ├── CUTE80
│   ├── IC03_860
│   ├── IC03_867
│   ├── IC13_857
│   ├── IC13_1015
│   ├── IC15_1811
│   ├── IC15_2077
│   ├── IIIT5k_3000
│   ├── SVT
│   ├── SVTP
├── evaluation
│   ├── Sumof6benchmarks
│   ├── ...
└── train
    ├── MJSynth
    └── SynText

3.3 Update yaml config file

If the datasets are placed under ./datasets, there is no need to change the train.dataset.dataset_root in the yaml configuration file configs/rec/visionlan/visionlan_L*.yaml.

Otherwise, change the following fields accordingly:

...
train:
  dataset_sink_mode: False
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/dataset          <--- Update
    data_dir: train                       <--- Update
...
eval:
  dataset_sink_mode: False
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/dataset          <--- Update
    data_dir: evaluation/Sumof6benchmarks <--- Update
...

Optionally, change train.loader.num_workers according to the cores of CPU.

Apart from the dataset setting, please also check the following important args: system.distribute, system.val_while_train, common.batch_size. Explanations of these important args:

system:
  distribute: True                                                    # `True` for distributed training, `False` for standalone training
  amp_level: 'O0'
  seed: 42
  val_while_train: True                                               # Validate while training
common:
  ...
  batch_size: &batch_size 192                                          # Batch size for training
...
  loader:
      shuffle: False
      batch_size: 64                                                  # Batch size for validation/evaluation
...

Notes:

  • As the global batch size (batch_size x num_devices) is important for reproducing the result, please adjust batch_size accordingly to keep the global batch size unchanged for a different number of NPUs, or adjust the learning rate linearly to a new global batch size.

3.4 Training

The training stages include Language-free (LF) and Language-aware (LA) process, and in total three steps for training:

LF_1: train backbone and VRM, without training MLM
LF_2: train MLM and finetune backbone and VRM
LA: using the mask generated by MLM to occlude feature maps, train backbone, MLM, and VRM

We used distributed training for the three steps. For standalone training, please refer to the recognition tutorial.

mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/visionlan/visionlan_resnet45_LF_1.yaml
mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/visionlan/visionlan_resnet45_LF_2.yaml
mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/visionlan/visionlan_resnet45_LA.yaml

The training result (including checkpoints, per-epoch performance and curves) will be saved in the directory parsed by the arg ckpt_save_dir in yaml config file. The default directory is ./tmp_visionlan.

3.5 Test

After all three steps training, change the system.distribute to False in configs/rec/visionlan/visionlan_resnet45_LA.yaml before testing.

To evaluate the model's accuracy, users can choose from two options:

  • Option 1: Repeat the evaluation step for all individual datasets: CUTE80, IC03_860, IC03_867, IC13_857, IC131015, IC15_1811, IC15_2077, IIIT5k_3000, SVT, SVTP. Then take the average score.

An example of evaluation script fort the CUTE80 dataset is shown below.

model_name="e8"
yaml_file="configs/rec/visionlan/visionlan_resnet45_LA.yaml"
training_step="LA"

python tools/eval.py --config $yaml_file --opt eval.dataset.data_dir=test/CUTE80 eval.ckpt_load_path="./tmp_visionlan/${training_step}/${model_name}.ckpt"
  • Option 2: Given that all the benchmark datasets folder are under the same directory, e.g. test/. And use the script tools/benchmarking/multi_dataset_eval.py. The example evaluation script is like:
model_name="e8"
yaml_file="configs/rec/visionlan/visionlan_resnet45_LA.yaml"
training_step="LA"

python tools/benchmarking/multi_dataset_eval.py --config $yaml_file --opt eval.dataset.data_dir="test" eval.ckpt_load_path="./tmp_visionlan/${training_step}/${model_name}.ckpt"

4. Inference

4.1 Prepare MINDIR file

Please download the MINDIR file from the table above, or you can use tools/export.py to manually convert any checkpoint file into a MINDIR file:

# For more parameter usage details, please execute `python tools/export.py -h`
python tools/export.py --model_name_or_config visionlan_resnet45 --data_shape 64 256 --local_ckpt_path /path/to/visionlan-ckpt

This command will save a visionlan_resnet45.mindir under the current working directory.

Learn more about Model Export.

4.2 Mindspore Lite Converter Tool

If you haven't downloaded MindSpore Lite, please download it via this link. More details on how to use MindSpore Lite in Linux Environment refer to this document.

converter_lite tool is an executable program under mindspore-lite-{version}-linux-x64/tools/converter/converter. Using converter_lite, we can convert the MindIR file to MindSpore Lite MindIR file.

Run the following command:

converter_lite \
     --saveType=MINDIR \
     --fmk=MINDIR \
     --optimize=ascend_oriented \
     --modelFile=path/to/mindir/file \
     --outputFile=visionlan_resnet45_lite

Running this command will save a visionlan_resnet45_lite.mindir under the current working directory. This is the MindSpore Lite MindIR file that we can run inference with on the Ascend310 or 310P platform. You can also define a different file name by changing the --outputFile argument.

Learn more about Model Conversion.

4.3 Inference on A Folder of Images

Taking SVT test set as an example, the data structure under the dataset folder is:

svt_dataset
├── test
│   ├── 16_16_0_ORPHEUM.jpg
│   ├── 12_13_4_STREET.jpg
│   ├── 17_08_TOWN.jpg
│   ├── ...
└── test_gt.txt

We run inference with the visionlan_resnet45_lite.mindir file with the command below:

python deploy/py_infer/infer.py  \
    --input_images_dir=/path/to/svt_dataset/test  \
    --rec_model_path=/path/to/visionlan_resnet45_lite.mindir  \
    --rec_model_name_or_config=configs/rec/visionlan/visionlan_resnet45_LA.yaml \
    --res_save_dir=rec_svt

Running this command will create a folder rec_svt under the current working directory, and save a prediction file rec_svt/rec_results.txt. Some examples of prediction in this file are shown below:

16_16_0_ORPHEUM.jpg "orpheum"
12_13_4_STREET.jpg  "street"
17_08_TOWN.jpg  "town"
...

Then, we can compute the prediction accuracy using:

python deploy/eval_utils/eval_rec.py  \
    --pred_path=rec_svt/rec_results.txt  \
    --gt_path=/path/to/svt_dataset/test_gt.txt

The evaluation results are shown below:

{'acc': 0.9227202534675598, 'norm_edit_distance': 0.9720136523246765}

5. References

[1] Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, Yongdong Zhang: From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network. ICCV 2021: 14174-14183