Merge pull request #56 from MingxuanXia/mingxuanxia/multimodal

Add multi modal evaluations
microsoft · Mar 13, 2024 · e856221 · e856221
2 parents 4c68eb4 + e17968d
commit e856221
Show file tree

Hide file tree

Showing 15 changed files with 1,949 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -69,6 +69,7 @@
 <!-- News and Updates -->
 
 ## News and Updates
+- [13/03/2024] Add support for multi-modal models and datasets.
 - [05/01/2024] Add support for BigBench Hard, DROP, ARC datasets.
 - [16/12/2023] Add support for Gemini, Mistral, Mixtral, Baichuan, Yi models.
 - [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.) [examples/add_new_modules.md](examples/add_new_modules.md). 
@@ -161,7 +162,7 @@ import promptbench as pb
 
 We provide tutorials for:
 
-1. **evaluate models on existing benchmarks:** please refer to the [examples/basic.ipynb](examples/basic.ipynb) for constructing your evaluation pipeline.
+1. **evaluate models on existing benchmarks:** please refer to the [examples/basic.ipynb](examples/basic.ipynb) for constructing your evaluation pipeline. For a multi-modal evaluation pipeline, please refer to [examples/multimodal.ipynb](examples/multimodal.ipynb)
 2. **test the effects of different prompting techniques:** 
 3. **examine the robustness for prompt attacks**, please refer to [examples/prompt_attack.ipynb](examples/prompt_attack.ipynb) to construct the attacks.
 4. **use DyVal for evaluation:** please refer to [examples/dyval.ipynb](examples/dyval.ipynb) to construct DyVal datasets.
@@ -185,6 +186,13 @@ PromptBench currently supports different datasets, models, prompt engineering me
 - Numersense
 - QASC
 - Last Letter Concatenate
+- VQAv2
+- NoCaps
+- MMMU
+- MathVista
+- AI2D
+- ChartQA
+- ScienceQA
 
 ### Models
 
@@ -203,6 +211,18 @@ PromptBench currently supports different datasets, models, prompt engineering me
   - GPT-4
   - Gemini Pro
 
+### Models (Multi-Modal)
+
+- Open-source models:
+  - BLIP2
+  - LLaVA
+  - Qwen-VL, Qwen-VL-Chat
+  - InternLM-XComposer2-VL
+- Proprietary models
+  - GPT-4v
+  - GeminiProVision
+  - Qwen-VL-Max, Qwen-VL-Plus
+
 ### Prompt Engineering
 
 - Chain-of-thought (COT) [1]
@@ -239,10 +259,6 @@ PromptBench currently supports different datasets, models, prompt engineering me
 
 Please refer to our [benchmark website](https://llm-eval.github.io/) for benchmark results on Prompt Attacks, Prompt Engineering and Dynamic Evaluation DyVal.
 
-## TODO
-
-- [ ] Add support for multi-modal models such as LlaVa and BLIP2.
-
 ## Acknowledgements
 
 - [TextAttack](https://github.com/QData/TextAttack)

diff --git a/docs/examples/multimodal.md b/docs/examples/multimodal.md
@@ -0,0 +1,134 @@
+# Multi-Modal Models
+
+This example will walk you throught the basic usage of MULTI-MODAL models in PromptBench. We hope that you can get familiar with the APIs and use it in your own projects later.
+
+First, there is a unified import of `import promptbench as pb` that easily imports the package.
+
+
+```python
+import promptbench as pb
+```
+
+    /anaconda/envs/promptbench_1/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
+      from .autonotebook import tqdm as notebook_tqdm
+
+
+## Load dataset
+
+First, PromptBench supports easy load of datasets.
+
+
+```python
+# print all supported datasets in promptbench
+print('All supported datasets: ')
+print(pb.SUPPORTED_DATASETS_VLM)
+
+# load a dataset, MMMMU, for instance.
+# if the dataset is not available locally, it will be downloaded automatically.
+dataset = pb.DatasetLoader.load_dataset("mmmu")
+
+# print the first 5 examples
+dataset[:5]
+```
+
+    All supported datasets: 
+    ['vqav2', 'nocaps', 'science_qa', 'math_vista', 'ai2d', 'mmmu', 'chart_qa']
+
+
+
+
+
+    [{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=733x237>],
+      'answer': 'B',
+      'question': '<image 1> Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\nA: $6\nB: $7\nC: $8\nD: $9'},
+     {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=342x310>],
+      'answer': 'C',
+      'question': 'Assume accounts have normal balances, solve for the one missing account balance: Dividends. Equipment was recently purchased, so there is neither depreciation expense nor accumulated depreciation. <image 1>\nA: $194,815\nB: $182,815\nC: $12,000\nD: $9,000'},
+     {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=336x169>],
+      'answer': 'B',
+      'question': 'Maxwell Software, Inc., has the following mutually exclusive projects.Suppose the company uses the NPV rule to rank these two projects.<image 1> Which project should be chosen if the appropriate discount rate is 15 percent?\nA: Project A\nB: Project B'},
+     {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1222x237>],
+      'answer': 'D',
+      'question': "Each situation below relates to an independent company's Owners' Equity. <image 1> Calculate the missing values of company 2.\nA: $1,620\nB: $12,000\nC: $51,180\nD: $0"},
+     {'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1219x217>],
+      'answer': 'B',
+      'question': 'The following data show the units in beginning work in process inventory, the number of units started, the number of units transferred, and the percent completion of the ending work in process for conversion. Given that materials are added at the beginning of the process, what are the equivalent units for conversion costs for each quarter using the weighted-average method? Assume that the quarters are independent.<image 1>\nA: 132,625\nB: 134,485\nC: 135,332\nD: 132,685'}]
+
+
+
+## Load models
+
+Then, you can easily load VLM models via promptbench.
+
+
+```python
+# print all supported models in promptbench
+print('All supported models: ')
+print(pb.SUPPORTED_MODELS_VLM)
+
+# load a model, llava-1.5-7b, for instance.
+model = pb.VLMModel(model='llava-hf/llava-1.5-7b-hf', max_new_tokens=2048, temperature=0.0001, device='cuda')
+```
+
+    All supported models: 
+    ['Salesforce/blip2-opt-2.7b', 'Salesforce/blip2-opt-6.7b', 'Salesforce/blip2-flan-t5-xl', 'Salesforce/blip2-flan-t5-xxl', 'llava-hf/llava-1.5-7b-hf', 'llava-hf/llava-1.5-13b-hf', 'gemini-pro-vision', 'gpt-4-vision-preview', 'Qwen/Qwen-VL', 'Qwen/Qwen-VL-Chat', 'qwen-vl-plus', 'qwen-vl-max', 'internlm/internlm-xcomposer2-vl-7b']
+
+
+    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
+    Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00,  1.48s/it]
+
+
+## Construct prompts
+
+Prompts are the key interaction interface to VLMs. You can easily construct a prompt by call the Prompt API.
+
+
+```python
+# Prompt API supports a list, so you can pass multiple prompts at once.
+prompts = pb.Prompt([
+    "You are a helpful assistant. Here is the question:{question}\nANSWER:",
+    "USER:{question}\nANSWER:",
+])
+```
+
+## Perform evaluation using prompts, datasets, and models
+
+Finally, you can perform standard evaluation using the loaded prompts, datasets, and labels.
+
+
+```python
+from tqdm import tqdm
+for prompt in prompts:
+    preds = []
+    labels = []
+    for data in tqdm(dataset):
+        # process input
+        input_text = pb.InputProcess.basic_format(prompt, data)
+        input_images = data['images']
+        label = data['answer']
+        raw_pred = model(input_images, input_text)
+        # process output
+        pred = pb.OutputProcess.pattern_split(raw_pred, 'ANSWER:')
+        preds.append(pred)
+        labels.append(label)
+
+    # evaluate
+    score = pb.Eval.compute_cls_accuracy(preds, labels)
+    print(f"{score:.3f}, {repr(prompt)}")
+```
+
+      0%|          | 0/900 [00:00<?, ?it/s]
+
+    100%|██████████| 900/900 [17:35<00:00,  1.17s/it]  
+
+
+    0.333, 'You are a helpful assistant. Here is the question:{question}\nANSWER:'
+
+
+    100%|██████████| 900/900 [17:27<00:00,  1.16s/it]  
+
+    0.316, 'USER:{question}\nANSWER:'
+
+
+
+
diff --git a/docs/index.rst b/docs/index.rst
@@ -24,6 +24,7 @@ Welcome to promptbench's documentation!
    :caption: Examples
 
    examples/basic
+   examples/multimodal
    examples/dyval
    examples/prompt_attack
    examples/prompt_engineering

diff --git a/docs/start/intro.md b/docs/start/intro.md
@@ -11,7 +11,7 @@
 
 ## Where should I get started?
 If you want to
-1. **evaluate my model on existing benchmarks:** please refer to the `examples/basic.ipynb` for constructing your evaluation pipeline.
+1. **evaluate my model on existing benchmarks:** please refer to the `examples/basic.ipynb` for constructing your evaluation pipeline. For a multi-modal evaluation pipeline, please refer to `examples/multimodal.ipynb`.
 2. **test the effects of different prompting techniques:** 
 3. **examine the robustness for prompt attacks**, please refer to `examples/prompt_attack.ipynb` to construct the attacks.
 4. **use DyVal for evaluation:** please refer to `examples/dyval.ipynb` to construct DyVal datasets.