BLIP-2: short for Bootstrapping Language-Image Pre-training 2.
BLIP-2 is a multimodal model proposed by Salesforce in 2023. It bootstraps vision-language pre-training (contrastive_language_image_pretrain) from an off-the-shelf frozen pre-trained image encoder (ViT) and a frozen large language model (LLM), with a lightweight Querying Transformer (Q-Former) inserted in between to bridge the modality gap. The Q-Former is pre-trained in two stages:
- Stage 1 bootstraps vision-language representation learning from the frozen image encoder, forcing the Q-Former to learn the visual representations most relevant to the text.
- Stage 2 bootstraps vision-to-language generative learning from the frozen LLM: the Q-Former output is connected to the frozen LLM, and the Q-Former is trained so that its output visual representations can be interpreted by the LLM. A schematic sketch of this data flow is given after the list.
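The snippet below is a minimal, self-contained sketch of that data flow, using hypothetical numpy stubs with representative dimensions (ViT-g hidden size 1408, Q-Former hidden size 768, Llama-7B hidden size 4096); it is not the MindFormers implementation, only an illustration of how the frozen encoder, the learnable queries, and the frozen LLM fit together.

```python
# Conceptual data-flow sketch of BLIP-2 (hypothetical stubs, not the MindFormers code).
# A frozen image encoder produces patch features; 32 learnable queries pass through the
# Q-Former, are projected into the LLM embedding space, and are prepended to the text
# embeddings consumed by the frozen LLM.
import numpy as np

def frozen_image_encoder(image):        # stub for the frozen ViT
    return np.random.randn(257, 1408)   # [num_patches + cls, vit_hidden]

def q_former(queries, image_feats):     # stub for the Q-Former cross-attention
    attn = queries @ np.random.randn(768, 1408) @ image_feats.T
    return attn @ image_feats            # [num_queries, vit_hidden]

def project_to_llm(query_output):        # stub for the linear projection into LLM space
    return query_output @ np.random.randn(1408, 4096)  # [num_queries, llm_hidden]

image = np.zeros((224, 224, 3))
queries = np.random.randn(32, 768)       # 32 learnable query embeddings
visual_tokens = project_to_llm(q_former(queries, frozen_image_encoder(image)))
# visual_tokens (32 x llm_hidden) are prepended to the text token embeddings of the frozen LLM
print(visual_tokens.shape)               # (32, 4096)
```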
Paper: Junnan Li, et al., BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023
@inproceedings{li2023blip2,
title={{BLIP-2:} Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models},
author={Junnan Li and Dongxu Li and Silvio Savarese and Steven Hoi},
year={2023},
booktitle={ICML},
}
- Based on Ascend 910A
| config | task | Datasets | metric | score | train performance | predict performance |
|---|---|---|---|---|---|---|
| blip2_stage1_vit_g | BLIP-2 stage1 pretrain | [train]: coco [eval]: flickr30k (test) | itm_score | txt_r1: 89.8 img_r1: 77.08 | 49.97 samples/s | - |
| blip2_stage1_classification | image_classification | [eval]: cifar100 | accuracy | - | - | 5.61 iters/s |
| itt_blip2_stage2_vit_g_llama_7b | image_to_text_generation | - | - | - | - | 22 tokens/s (use_past=True) |
BLIP-2 is implemented on top of MindFormers. Stage-1 training, evaluation, and inference are currently supported, as well as stage-2 inference (with Llama7b or Baichuan7b as the language model). When running stage-2 inference with baichuan_7b as the language model, you need to set the corresponding baichuan-7b checkpoint file and vocabulary file in configs/blip2/run_blip2_stage2_vit_g_baichuan_7b_image_to_text_generation.yaml yourself; see the key configuration items section for how to set them (a rough sketch follows below), and refer to baichuan_7b for how to obtain and convert the checkpoint and vocabulary files.
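As a rough guide, the fields to fill in look roughly like the excerpt below. The field names follow the key configuration items described later in this document; the paths are placeholders that you must replace with your own converted baichuan-7b checkpoint and tokenizer files, and the actual config may contain additional entries.

```yaml
# Hypothetical excerpt of run_blip2_stage2_vit_g_baichuan_7b_image_to_text_generation.yaml;
# only the fields to fill in are shown, and the paths are placeholders.
model:
  model_config:
    text_config:
      checkpoint_name_or_path: "/path/to/baichuan_7b.ckpt"   # converted baichuan-7b weights
processor:
  tokenizer:
    vocab_file: "/path/to/baichuan_7b_tokenizer.model"       # baichuan-7b vocabulary file
```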
The main files involved in this implementation are:
- Model implementation: mindformers/models/blip2

```
blip2
 ├── __init__.py
 ├── convert_weight.py    # weight conversion script
 ├── blip2.py             # model base classes
 ├── qformer.py           # Q-Former implementation
 ├── blip2_qformer.py     # BLIP-2 stage-1 Q-Former implementation
 ├── blip2_llama.py       # BLIP-2 stage-2 integration with the Llama model
 ├── qformer_config.py    # Q-Former configuration
 ├── blip2_config.py      # BLIP-2 configuration, including the Q-Former configuration
 └── blip2_processor.py   # model pre-processing
```
- Model configs: configs/blip2

```
blip2
 ├── run_blip2_stage1_vit_g_qformer_pretrain.yaml                          # BLIP-2 stage-1 pre-training config
 ├── run_blip2_stage1_vit_g_retrieval_flickr30k.yaml                       # image retrieval with the stage-1 pre-trained model
 ├── run_blip2_stage1_vit_g_zero_shot_image_classification_cifar100.yaml   # zero-shot image classification with the stage-1 pre-trained model
 ├── run_blip2_stage2_vit_g_baichuan_7b.yaml                               # BLIP-2 stage-2 pre-training config (baichuan7b as the language model)
 ├── run_blip2_stage2_vit_g_baichuan_7b_image_to_text_generation.yaml      # image-to-text generation with the stage-2 model (baichuan7b as the language model)
 ├── run_blip2_stage2_vit_g_llama_7b.yaml                                  # BLIP-2 stage-2 pre-training config (llama7b as the language model)
 └── run_blip2_stage2_vit_g_llama_7b_image_to_text_generation.yaml         # image-to-text generation with the stage-2 model (llama7b as the language model)
```
- Key configuration items in the config files

```yaml
train_dataset:
  text_transforms:
    type: CaptionTransform
    prompt: ""       # prompt added for the language model in stage 2; currently only a fixed prompt is supported
    max_length: 32   # maximum token id length; for stage-2 training set this to 33, i.e. the expected maximum length plus 1, in order to build the labels
  tokenizer:
    type: LlamaTokenizer
    vocab_file: ""   # path to the vocabulary file
model:
  model_config:
    type: Blip2Config
    max_txt_len: 32               # maximum token id length; must be kept consistent with train_dataset.text_transforms.max_length
    checkpoint_name_or_path: ""   # pre-trained weights of the model; for stage-2 training, the stage-1 checkpoint can be set here
    prompt: False                 # whether to use a prompt in stage-2 training
    prompt_length: 0              # token length of the fixed prompt used in stage-2 training, derived from the tokenizer and train_dataset.text_transforms.prompt
    # the two items above must correspond to train_dataset.text_transforms.prompt
    vision_config:                # image encoder settings
      type: ViTConfig
      image_size: 224             # input image size
      checkpoint_name_or_path: "vit_g_p16"   # ViT checkpoint
    qformer_config:
      type: QformerConfig
      query_length: 32            # number of queries
    text_config:                  # stage-2 language model settings
      type: LlamaConfig
      seq_length: 64              # input seq_length of the language model; during training it must equal the number of Q-Former queries plus the text token id length,
                                  # i.e. seq_length = model.model_config.qformer_config.query_length + model.model_config.max_txt_len
      checkpoint_name_or_path: "" # language model checkpoint
  arch:
    type: XXXX   # depends on the BLIP-2 stage:
                 # stage 1: 1) pre-training or image retrieval: Blip2Qformer  2) zero-shot image classification: Blip2Classifier
                 # stage 2: inference / image-to-text generation: Blip2ImageToTextGeneration (with llama7b or baichuan7b as the language model)
# required for inference tasks
processor:
  type: Blip2Processor
  image_processor:
    type: Blip2ImageProcessor
    image_size: 224   # input image size
  tokenizer:
    type: LlamaTokenizer
    max_length: 32
    vocab_file: ""    # path to the vocabulary file
```
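To make the length relationships above concrete, here is a small worked check (a sketch using the sample values shown in the config; swap in your own values if you change them):

```python
# Worked check of the length relations described in the config comments above.
query_length = 32             # model.model_config.qformer_config.query_length
max_txt_len = 32              # model.model_config.max_txt_len
seq_length = query_length + max_txt_len   # model.model_config.text_config.seq_length
print(seq_length)             # 64, matching the sample config
# For stage-2 training, train_dataset.text_transforms.max_length is set to
# max_txt_len + 1 (i.e. 33) so that shifted labels can be built.
```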
Run mindformers/tools/hccl_tools.py to generate the RANK_TABLE_FILE json file:

```shell
# run the following command to generate the RANK_TABLE_FILE json file for the current machine
python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"
```

Note: in a ModelArts notebook environment, the rank table can be obtained directly from /user/config/jobstart_hccl.json and does not need to be generated manually.

Sample RANK_TABLE_FILE for a single node with 8 devices:

```json
{
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_id": "xx.xx.xx.xx",
"device": [
{"device_id": "0","device_ip": "192.1.27.6","rank_id": "0"},
{"device_id": "1","device_ip": "192.2.27.6","rank_id": "1"},
{"device_id": "2","device_ip": "192.3.27.6","rank_id": "2"},
{"device_id": "3","device_ip": "192.4.27.6","rank_id": "3"},
{"device_id": "4","device_ip": "192.1.27.7","rank_id": "4"},
{"device_id": "5","device_ip": "192.2.27.7","rank_id": "5"},
{"device_id": "6","device_ip": "192.3.27.7","rank_id": "6"},
{"device_id": "7","device_ip": "192.4.27.7","rank_id": "7"}],
"host_nic_ip": "reserve"
}
],
"status": "completed"
}
```
- Step 1. Following the section above, generate a RANK_TABLE_FILE on each machine, then copy all of the generated RANK_TABLE_FILE files onto a single machine.

```shell
# run the following command to generate the RANK_TABLE_FILE json file for the current machine
python ./mindformers/tools/hccl_tools.py --device_num "[0,8)" --server_ip xx.xx.xx.xx
```

Note: specify --server_ip according to each machine's own IP address, to avoid multi-node communication failures caused by inconsistent server_ip values across machines.
- Step 2. Run mindformers/tools/merge_hccl.py to merge the RANK_TABLE_FILE files generated on the different machines.

```shell
# run the following command to merge the RANK_TABLE_FILE json files from every machine
python ./mindformers/tools/merge_hccl.py hccl*.json
```
- Step 3. Copy the merged RANK_TABLE_FILE to every machine so that the RANK_TABLE_FILE is identical across machines.

Sample RANK_TABLE_FILE for two nodes with 16 devices:

```json
{
"version": "1.0",
"server_count": "2",
"server_list": [
{
"server_id": "xx.xx.xx.xx",
"device": [
{
"device_id": "0", "device_ip": "192.168.0.0", "rank_id": "0"
},
{
"device_id": "1", "device_ip": "192.168.1.0", "rank_id": "1"
},
{
"device_id": "2", "device_ip": "192.168.2.0", "rank_id": "2"
},
{
"device_id": "3", "device_ip": "192.168.3.0", "rank_id": "3"
},
{
"device_id": "4", "device_ip": "192.168.0.1", "rank_id": "4"
},
{
"device_id": "5", "device_ip": "192.168.1.1", "rank_id": "5"
},
{
"device_id": "6", "device_ip": "192.168.2.1", "rank_id": "6"
},
{
"device_id": "7", "device_ip": "192.168.3.1", "rank_id": "7"
}
],
"host_nic_ip": "reserve"
},
{
"server_id": "xx.xx.xx.xx",
"device": [
{
"device_id": "0", "device_ip": "192.168.0.1", "rank_id": "8"
},
{
"device_id": "1", "device_ip": "192.168.1.1", "rank_id": "9"
},
{
"device_id": "2", "device_ip": "192.168.2.1", "rank_id": "10"
},
{
"device_id": "3", "device_ip": "192.168.3.1", "rank_id": "11"
},
{
"device_id": "4", "device_ip": "192.168.0.2", "rank_id": "12"
},
{
"device_id": "5", "device_ip": "192.168.1.2", "rank_id": "13"
},
{
"device_id": "6", "device_ip": "192.168.2.2", "rank_id": "14"
},
{
"device_id": "7", "device_ip": "192.168.3.2", "rank_id": "15"
}
],
"host_nic_ip": "reserve"
}
],
"status": "completed"
}
```
The blip2_stage1_classification checkpoint in this repository comes from the LAVIS stage-1 pre-trained weights blip2_stage1_pretrained and is obtained with the following steps:
- Download the PyTorch weights of blip2_stage1_pretrained from this link; the file is named blip2_pretrained.pth.
- Run the conversion script to obtain the converted output file blip2_stage1_pretrained.ckpt.

Weight conversion command:

```shell
python mindformers/models/blip2/convert_weight.py --torch_path blip2_pretrained.pth --mindspore_path ./blip2_stage1_pretrained.ckpt
```
Not applicable at the moment.
The AutoClass APIs can be used to obtain the corresponding model/processor/tokenizer instances by model name, with the weights downloaded and loaded automatically.
The from_pretrained() API automatically downloads the pre-trained model from the cloud and stores it under mindformers/checkpoint_download/model_name.
```python
import mindspore
from mindformers import AutoModel, AutoProcessor
from mindformers.tools.image_tools import load_image

# set graph mode and the device id to use
mindspore.set_context(mode=0, device_id=0)

# create the stage-1 pre-training model via AutoClass
model = AutoModel.from_pretrained("blip2_stage1_vit_g")

# create the stage-2 image-to-text generation model via AutoClass
model = AutoModel.from_pretrained("itt_blip2_stage2_vit_g_llama_7b")
processor = AutoProcessor.from_pretrained("itt_blip2_stage2_vit_g_llama_7b")
tokenizer = processor.tokenizer

filepath = "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009448.jpg"
input_images = processor.image_processor(load_image(filepath))
input_ids = tokenizer("it is a photo of", padding="max_length", return_tensors="ms")["input_ids"]

outputs = model.generate_text_for_image(input_images, input_ids)
response = tokenizer.decode(outputs, skip_special_tokens=True)
print(response)
# ['it is a photo of a girl holding an umbrella']
```
Note: quick start usage is limited to a single device.
BLIP-2 stage-1 training

```python
import mindspore
from mindformers.trainer import Trainer

# set graph mode and the device id to use
mindspore.set_context(mode=0, device_id=0)

# initialize the image-text dataset configuration
data_loader = dict(
    type = 'MultiImgCapDataLoader',
    dataset_dir = "/data",
    annotation_files = [
      "vg/annotations/vg_caption.json",
      "coco2014/coco/annotations/coco_karpathy_train.json"
    ],
    image_dirs = [
      "vg/images",
      "coco2014/coco/images"
    ],
    stage = "train",
    column_names = ["image", "text"],
    shuffle = True,
)

# override the dataset settings in the config file via args, leaving the transform pipeline unchanged
dataset_config = dict(data_loader=data_loader)
train_dataset_task = dict(dataset_config=dataset_config)
args = dict(train_dataset_task=train_dataset_task,
            train_dataset=dataset_config)

# initialize the BLIP-2 stage-1 pre-training task
trainer = Trainer(task='contrastive_language_image_pretrain',
                  model='blip2_stage1_vit_g',
                  args=args)

# start pre-training
trainer.train()
```
BLIP-2 stage-1 evaluation

```python
import mindspore
from mindformers.trainer import Trainer

# set graph mode and the device id to use
mindspore.set_context(mode=0, device_id=0)

# initialize the image-text dataset configuration
data_loader = dict(
    type = 'MultiImgCapDataLoader',
    dataset_dir = "/data",
    annotation_files = [
      "flickr30k/annotations/test.json"
    ],
    image_dirs = [
      "flickr30k/images"
    ],
    stage = "eval",
    column_names = ["image", "text"],
    shuffle = False,
)

# override the dataset settings in the config file via args, leaving the transform pipeline unchanged
dataset_config = dict(data_loader=data_loader)
eval_dataset_task = dict(dataset_config=dataset_config)
args = dict(eval_dataset_task=eval_dataset_task,
            eval_dataset=dataset_config)

# initialize the evaluation task
trainer = Trainer(task='image_to_text_retrieval',
                  model='blip2_stage1_evaluator',
                  args=args)

# start evaluation
trainer.evaluate()
```
BLIP-2 stage-1 inference

```python
from mindformers.tools.image_tools import load_image
from mindformers import Trainer

cls_trainer = Trainer(task='zero_shot_image_classification',
                      model='blip2_stage1_classification',
                      candidate_labels=["sunflower", "tree", "dog", "cat", "toy"])
# load the input: a sunflower image
input_data = load_image("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")
# run inference with the default weights
predict_result = cls_trainer.predict(input_data=input_data)
print(predict_result)
# output
# [[{'score': 0.99999976, 'label': 'sunflower'}]]
```
BLIP-2 stage-2 inference

```python
import mindspore; mindspore.set_context(mode=0, device_id=0)
from mindformers.tools.image_tools import load_image
from mindformers import Trainer

generate_trainer = Trainer(task='image_to_text_generation',
                           model='itt_blip2_stage2_vit_g_llama_7b')
# load the input: a sunflower image
input_data = load_image(
    "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")
# run inference with the specified weights
predict_result = generate_trainer.predict(input_data=input_data,
                                          hypothesis_template="a picture of")
print(predict_result)
# output
# ['a picture of a yellow flower']
```
Note: no pre-trained model is provided for BLIP-2 based on Baichuan7b.
BLIP-2 stage-1 inference

```python
import mindspore
from mindformers.pipeline import pipeline
from mindformers.tools.image_tools import load_image

# set graph mode and the device id to use
mindspore.set_context(mode=0, device_id=0)

pipeline_task = pipeline(task="zero_shot_image_classification", model="blip2_stage1_classification")
input_data = load_image(
    "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")
pipeline_result = pipeline_task(input_data,
                                candidate_labels=["sunflower", "tree", "dog", "cat", "toy"],
                                hypothesis_template="This is a photo of {}.")
print(pipeline_result)
# output
# [[{'score': 0.99999714, 'label': 'sunflower'},
#   {'score': 1.315181e-06, 'label': 'tree'},
#   {'score': 7.0368844e-07, 'label': 'toy'},
#   {'score': 4.7594781e-07, 'label': 'dog'},
#   {'score': 3.93686e-07, 'label': 'cat'}]]
```
BLIP-2 stage-2 inference

```python
import mindspore as ms
from mindformers.pipeline import pipeline
from mindformers.tools.image_tools import load_image

# set graph mode and the device id to use
ms.set_context(mode=0, device_id=0)

pipeline_task = pipeline(task="image_to_text_generation", model="itt_blip2_stage2_vit_g_llama_7b")
# load the input: a sunflower image
input_data = load_image(
    "https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")
predict_result = pipeline_task({
    "image": input_data,
    "prompt": "a picture of"})
print(predict_result)
# output
# ['a picture of a yellow flower']
```
Note: quick start usage is limited to a single device.
- coco2014 dataset:

```
├── annotations
│   ├── coco_karpathy_test.json
│   ├── coco_karpathy_train.json
│   └── coco_karpathy_val.json
│
└── images
    ├── test2014
    ├── test2015
    ├── train2014
    └── val2014
```

The annotation files and images can be downloaded via the links provided in the LAVIS GitHub repository.
- Visual Genome dataset:

```
├── annotations
│   └── vg_caption.json
└── images
    └── VG_100K
```

As above, the annotation files and images can be downloaded via the links provided in the LAVIS GitHub repository.
- Launch training with python (BLIP-2 stage 1)

```shell
python run_mindformer.py --config configs/blip2/run_blip2_stage1_vit_g_qformer_pretrain.yaml --run_mode train
```
- Launch training with bash (BLIP-2 stage 1)

```shell
cd scripts
bash run_standalone.sh --config configs/blip2/run_blip2_stage1_vit_g_qformer_pretrain.yaml [DEVICE_ID] train
```

Multi-device runs require a RANK_TABLE_FILE; see Preparation - Generating the RANK_TABLE_FILE.
- Single-node multi-device training (BLIP-2 stage 1)

```shell
cd scripts
bash run_distribute.sh RANK_TABLE_FILE --config configs/blip2/run_blip2_stage1_vit_g_qformer_pretrain.yaml [0,8] train 8
```

Multi-node multi-device runs require merging the RANK_TABLE_FILE of each machine; see Preparation - Merging RANK_TABLE_FILE across machines.
- Multi-node multi-device training (BLIP-2 stage 1): launch bash run_distribute.sh on every machine.

```shell
server_count=12
device_num=8*$server_count
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE --config configs/blip2/run_blip2_stage1_vit_g_qformer_pretrain.yaml [0,8] train $device_num
# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
let rank_start=8*$idx
let rank_end=$rank_start+8
ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE --config configs/blip2/run_blip2_stage1_vit_g_qformer_pretrain.yaml [$rank_start,$rank_end] train $device_num"
done
```
Here RANK_TABLE_FILE is the merged rank table file produced and distributed in the previous steps, and IP_LIST holds the IP addresses of the 12 servers, e.g. for 192.168.0.[0-11]:

```shell
IP_LIST=("192.168.0.0", "192.168.0.1", ..., "192.168.0.11")
```
Not supported yet.
```
flickr30k
├── annotations
│   ├── test.json
│   ├── train.json
│   └── val.json
└── images
    └── flickr30k-images
        ├── 1000092795.jpg
        ├── 10002456.jpg
        ├── 1000268201.jpg
        ├── 1000344755.jpg
        └── ...
```
The parent directory of flickr30k corresponds to the eval_dataset.data_loader.dataset_dir setting in run_blip2_stage1_vit_g_retrieval_flickr30k.yaml; the rest of the layout can be adjusted through eval_dataset.data_loader.annotation_files and eval_dataset.data_loader.image_dirs (see the image-text pair dataset MultiImgCapDataLoader implementation for the details). The default config evaluates only on the flickr30k test split (1000 images, 5000 annotations). A rough example of the corresponding data_loader section is given below.
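For illustration, assuming the layout above is rooted at /data, the data_loader section could be set roughly as follows (a sketch mirroring the stage-1 evaluation snippet earlier, not the shipped config verbatim):

```yaml
# Hypothetical eval_dataset.data_loader excerpt matching the flickr30k layout shown above.
eval_dataset:
  data_loader:
    type: MultiImgCapDataLoader
    dataset_dir: "/data"                     # parent directory of flickr30k
    annotation_files:
      - "flickr30k/annotations/test.json"    # only the test split is evaluated by default
    image_dirs:
      - "flickr30k/images"
    stage: "eval"
    column_names: ["image", "text"]
    shuffle: False
```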
```shell
python run_mindformer.py --config run_blip2_stage1_vit_g_retrieval_flickr30k.yaml --run_mode eval --eval_dataset_dir {parent_dir of flickr30k} --otherargs
# output
# {'txt_r1': 95.5, 'txt_r5': 99.9, 'txt_r10': 99.9, 'txt_r_mean': 98.43333333333334, 'img_r1': 86.6, 'img_r5': 97.16, 'img_r10': 98.52, 'img_r_mean': 94.09333333333332, 'r_mean': 96.26333333333332, 'agg_metrics': 98.43333333333334}
```
BLIP-2 stage-2 inference supports generating image captions with the language model. This section provides a pipeline-based inference script example that runs the image_to_text_generation task with Llama7b as the language model; save the script as blip2_stage2_pipeline_test.py.
```python
import argparse
import mindspore as ms
from mindformers import AutoConfig, AutoModel
from mindformers import pipeline
def init_context(device_id):
ms.set_context(mode=0, device_target="Ascend", device_id=device_id)
def build_text_input(prompts, templates):
text_input = []
for i in range(len(prompts)):
text_input.append(templates[i].format(prompts[i]))
return text_input
def str2bool(v):
v_lower = v.lower()
if v_lower in ["false", "0"]:
output = False
elif v_lower in ["true", "1"]:
output = True
else:
raise ValueError("Invalid boolean value")
return output
DEFAULT_IMAGE_TEXT_PAIR = [
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/titanic.jpg",
"Question: What happened of this movie? Answer:"),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/elephant.jpg",
"it is a photo of"),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009400.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009483.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009448.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000010363.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009769.jpg", "")
]
def main(args):
if args.image_path is None:
image_filepath = [pair[0] for pair in DEFAULT_IMAGE_TEXT_PAIR]
else:
image_filepath = args.image_path.split(',')
if args.prompt is None:
if args.image_path is not None:
prompts = [""] * len(image_filepath)
else:
prompts = [pair[1] for pair in DEFAULT_IMAGE_TEXT_PAIR]
else:
prompts = args.prompt.split(',')
if len(prompts) != len(image_filepath):
raise ValueError("the number of prompts does not match the number of image paths, please check the args.")
init_context(device_id=args.device_id)
model_config = AutoConfig.from_pretrained(args.model_type)
model_config.max_txt_len = args.seq_length
model_config.batch_size = 1
model_config.text_config.batch_size = model_config.batch_size
model_config.text_config.seq_length = args.seq_length + model_config.qformer_config.query_length
model_config.text_config.do_sample = args.do_sample
model_config.text_config.top_p = args.top_p
model_config.text_config.top_k = args.top_k
model_config.text_config.use_past = args.use_past
model = AutoModel.from_config(model_config)
pipeline_task = pipeline("image_to_text_generation", model=model)
inputs = [{"image": image_filepath[index],
"prompt": prompts[index]}
for index in range(len(image_filepath))]
for _ in range(args.generate_repeat_time):
output = pipeline_task(inputs)
print(output)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--model_type',
default="itt_blip2_stage2_vit_g_llama_7b",
type=str,
required=False,
help='model type')
parser.add_argument(
'--device_id',
type=int,
default=0,
required=False,
help='device id')
parser.add_argument(
'--generate_repeat_time',
type=int,
default=5,
required=False,
help='generate repeat time')
parser.add_argument(
'--use_past',
type=str2bool,
default=True,
required=False,
help='whether use past')
parser.add_argument(
'--do_sample',
type=str2bool,
default=False,
required=False,
help='whether do sample')
parser.add_argument(
'--top_p',
type=float,
default=1,
required=False,
help='top p')
parser.add_argument(
'--top_k',
type=int,
default=0,
required=False,
help='top k')
parser.add_argument(
'--seq_length',
type=int,
default=32,
required=False,
help='seq length')
parser.add_argument(
'--image_path',
type=str,
default=None,
required=False,
help='image path')
parser.add_argument(
'--prompt',
type=str,
default=None,
required=False,
help='')
args_ = parser.parse_args()
print(args_)
main(args_)
```
```shell
# incremental inference, with sampling
python blip2_stage2_pipeline_test.py --device_id 0 --generate_repeat_time 3 --use_past True --do_sample True --top_k 3
# incremental inference, without sampling
python blip2_stage2_pipeline_test.py --device_id 0 --generate_repeat_time 3 --use_past True --do_sample False
# autoregressive inference, with sampling
python blip2_stage2_pipeline_test.py --device_id 0 --generate_repeat_time 3 --use_past False --do_sample True --top_k 3
# autoregressive inference, without sampling
python blip2_stage2_pipeline_test.py --device_id 0 --generate_repeat_time 3 --use_past False --do_sample False
# incremental inference, with sampling, custom input
python blip2_stage2_pipeline_test.py --device_id 0 --generate_repeat_time 3 --use_past True --do_sample True --top_k 3 --image_path /your/path/to/image --prompt your_prompt
# incremental inference, with sampling, custom inputs
python blip2_stage2_pipeline_test.py --device_id 0 --generate_repeat_time 3 --use_past True --do_sample True --top_k 3 --image_path /your/path/to/image1,/your/path/to/image2,/your/path/to/image3 --prompt your_prompt1,,your_prompt3
```
Special parameter notes:
- generate_repeat_time: number of repeated runs; the given image inputs are inferred several times so that the first-run graph compilation time does not skew the measured speed.
- image_path: path(s) of the image(s) to run inference on; multiple paths are supported, separated by commas, e.g. --image_path /your/path/to/image1,/your/path/to/image2,/your/path/to/image3. When no path is given, the default images are used.
- prompt: text prompt(s) paired with the image(s); multiple prompts are supported, separated by commas, with each prompt matching the corresponding path in image_path, e.g. --prompt your_prompt1,your_prompt2,your_prompt3. A prompt may be empty, but multiple prompts must still be separated by commas; when image_path is given but prompt is not, each image's prompt is automatically set to "".
Not supported yet.
BLIP-2 stage-2 inference supports generating image captions with the language model. This section provides a generate-based inference script example that runs the image_to_text_generation task with Llama7b as the language model; save the script as blip2_stage2_generation_test.py.
```python
import argparse
import mindspore as ms
from mindspore import ops
from mindformers import AutoConfig, AutoModel, AutoProcessor
from mindformers.tools.image_tools import load_image
def init_context(device_id):
ms.set_context(mode=0, device_target="Ascend", device_id=device_id)
def build_text_input(prompts, templates):
text_input = []
for i in range(len(prompts)):
text_input.append(templates[i].format(prompts[i]))
return text_input
def str2bool(v):
v_lower = v.lower()
if v_lower in ["false", "0"]:
output = False
elif v_lower in ["true", "1"]:
output = True
else:
raise ValueError("Invalid boolean value")
return output
DEFAULT_IMAGE_TEXT_PAIR = [
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/titanic.jpg",
"Question: What happened of this movie? Answer:"),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/elephant.jpg",
"it is a photo of"),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009400.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009483.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009448.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000010363.jpg", ""),
("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/blip2/images/000000009769.jpg", "")
]
def main(args):
if args.image_path is None:
image_filepath = [pair[0] for pair in DEFAULT_IMAGE_TEXT_PAIR]
else:
image_filepath = args.image_path.split(',')
if args.prompt is None:
if args.image_path is not None:
prompts = [""] * len(image_filepath)
else:
prompts = [pair[1] for pair in DEFAULT_IMAGE_TEXT_PAIR]
else:
prompts = args.prompt.split(',')
if len(prompts) != len(image_filepath):
raise ValueError("the number of prompts does not match the number of image paths, please check the args.")
init_context(device_id=args.device_id)
model_config = AutoConfig.from_pretrained(args.model_type)
model_config.max_txt_len = args.seq_length
if args.batch_infer:
model_config.batch_size = len(image_filepath)
else:
model_config.batch_size = 1
model_config.text_config.batch_size = model_config.batch_size
model_config.text_config.seq_length = args.seq_length + model_config.qformer_config.query_length
model_config.text_config.do_sample = args.do_sample
model_config.text_config.top_p = args.top_p
model_config.text_config.top_k = args.top_k
model_config.text_config.use_past = args.use_past
model = AutoModel.from_config(model_config)
processor = AutoProcessor.from_pretrained(args.model_type)
tokenizer = processor.tokenizer
for _ in range(args.generate_repeat_time):
if model_config.batch_size > 1:
input_images = processor.image_processor([load_image(filepath) for filepath in image_filepath])
input_ids = tokenizer(prompts,
max_length=args.seq_length,
padding="max_length",
return_tensors="ms")["input_ids"]
output = model.generate_text_for_image(input_images, input_ids)
print(tokenizer.decode(output, skip_special_tokens=True))
else:
batch_size = len(image_filepath)
for index in range(batch_size):
input_image = processor.image_processor(load_image(image_filepath[index]))
input_id = tokenizer(prompts[index],
max_length=args.seq_length,
padding="max_length",
return_tensors="ms")["input_ids"]
output = model.generate_text_for_image(input_image, input_id)
print(tokenizer.decode(output, skip_special_tokens=True))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--model_type',
default="itt_blip2_stage2_vit_g_llama_7b",
type=str,
required=False,
help='model type')
parser.add_argument(
'--device_id',
type=int,
default=0,
required=False,
help='device id')
parser.add_argument(
'--batch_infer',
type=str2bool,
default=False,
required=False,
help='whether to batch infer')
parser.add_argument(
'--generate_repeat_time',
type=int,
default=5,
required=False,
help='generate repeat time')
parser.add_argument(
'--use_past',
type=str2bool,
default=True,
required=False,
help='whether use past')
parser.add_argument(
'--do_sample',
type=str2bool,
default=False,
required=False,
help='whether do sample')
parser.add_argument(
'--top_p',
type=float,
default=1,
required=False,
help='top p')
parser.add_argument(
'--top_k',
type=int,
default=0,
required=False,
help='top k')
parser.add_argument(
'--seq_length',
type=int,
default=32,
required=False,
help='seq length')
parser.add_argument(
'--image_path',
type=str,
default=None,
required=False,
help='image path')
parser.add_argument(
'--prompt',
type=str,
default=None,
required=False,
help='')
args_ = parser.parse_args()
print(args_)
main(args_)
```
```shell
# single batch, incremental inference, with sampling
python blip2_stage2_generation_test.py --device_id 0 --batch_infer False --generate_repeat_time 3 --use_past True --do_sample True --top_k 3
# single batch, incremental inference, without sampling
python blip2_stage2_generation_test.py --device_id 0 --batch_infer False --generate_repeat_time 3 --use_past True --do_sample False
# single batch, autoregressive inference, with sampling
python blip2_stage2_generation_test.py --device_id 0 --batch_infer False --generate_repeat_time 3 --use_past False --do_sample True --top_k 3
# single batch, autoregressive inference, without sampling
python blip2_stage2_generation_test.py --device_id 0 --batch_infer False --generate_repeat_time 3 --use_past False --do_sample False
# batch inference, incremental inference, with sampling
python blip2_stage2_generation_test.py --device_id 0 --batch_infer True --generate_repeat_time 3 --use_past True --do_sample True --top_k 3
# single batch, incremental inference, with sampling, custom input
python blip2_stage2_generation_test.py --device_id 0 --batch_infer False --generate_repeat_time 3 --use_past True --do_sample True --top_k 3 --image_path /your/path/to/image --prompt your_prompt
# single batch, incremental inference, with sampling, custom inputs
python blip2_stage2_generation_test.py --device_id 0 --batch_infer False --generate_repeat_time 3 --use_past True --do_sample True --top_k 3 --image_path /your/path/to/image1,/your/path/to/image2,/your/path/to/image3 --prompt your_prompt1,,your_prompt3
```
Special parameter notes:
- batch_infer: whether to enable batch inference.
- generate_repeat_time: number of repeated runs; the given image inputs are inferred several times so that the first-run graph compilation time does not skew the measured speed.
- image_path: path(s) of the image(s) to run inference on; multiple paths are supported, separated by commas, e.g. --image_path /your/path/to/image1,/your/path/to/image2,/your/path/to/image3. When no path is given, the default images are used.
- prompt: text prompt(s) paired with the image(s); multiple prompts are supported, separated by commas, with each prompt matching the corresponding path in image_path, e.g. --prompt your_prompt1,your_prompt2,your_prompt3. A prompt may be empty, but multiple prompts must still be separated by commas; when image_path is given but prompt is not, each image's prompt is automatically set to "".
Not supported yet.
BLIP-2 stage-1 inference: zero-shot image classification

```shell
python run_mindformer.py --config configs/blip2/run_blip2_stage1_vit_g_zero_shot_image_classification_cifar100.yaml --run_mode predict --device_target Ascend --device_id 0 --predict_data /path/to/image
```
BLIP-2 stage-2 inference: image-to-text generation (image captioning)

```shell
python run_mindformer.py --config configs/blip2/run_blip2_stage2_vit_g_llama_7b_image_to_text_generation.yaml --run_mode predict --device_target Ascend --device_id 0 --predict_data /path/to/image
```

Note: to speed up inference, set model.model_config.text_config.use_past to True in the corresponding model config file, as in the snippet below.
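The excerpt below shows only where that switch lives in the config hierarchy; all other fields are omitted.

```yaml
# Minimal excerpt showing where to enable incremental (use_past) decoding.
model:
  model_config:
    text_config:
      use_past: True   # enable incremental decoding for faster generation
```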
Not supported yet.