<!DOCTYPE html>
<html lang="en">
<head>
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script src="https://kit.fontawesome.com/f8ddf9854a.js" crossorigin="anonymous"></script>
<meta charset="utf-8">
<meta name="description"
content="A Lightweight Visual Understanding and Reasoning Benchmark for Evaluating LMMs through Code Generation Tasks">
<meta name="keywords" content="Large Multimodal Model, Code Generation, Vision Language Model, Large Language Model">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title> HumanEval-V: A Lightweight Visual Understanding and Reasoning Benchmark for Evaluating LMMs through Coding Tasks</title>
<link rel="icon" href="static/images/icon.png">
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<script src="https://kit.fontawesome.com/eaf1856e6f.js" crossorigin="anonymous"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title is-bold">
<img src="static/images/icon.png" style="width:1em;vertical-align: middle" alt="Logo"/>
<span style="vertical-align: middle">HumanEval-V</span>
</h1>
<h2 class="subtitle is-3 publication-subtitle">
A Lightweight Visual Understanding and Reasoning Benchmark for Evaluating LMMs through Coding Tasks
</h2>
<div class="is-size-5 publication-authors">
<span class="author-block">Fengji Zhang*<sup style="color:#6fbf73;">†,1</sup></a>,</span>
<span class="author-block">
Linquan Wu*<sup style="color:#6fbf73;">1</sup></a>,
</span>
<span class="author-block">
Bai Huiyu*<sup style="color:#6fbf73;">1</sup></a>,
</span>
<span class="author-block">Guancheng Lin*<sup style="color:#007bff;">2</sup>,</span><br>
<span class="author-block">Xiao Li<sup style="color:#ffac33;">3</sup>,</span>
<span class="author-block">Xiao Yu<sup style="color:#ed4b82;">4</sup>,</span>
<span class="author-block">Yue Wang<sup style="color:#9b51e0;">5</sup>,</span>
<span class="author-block">Bei Chen<sup style="color:#9b51e0;">5</sup>,</span>
<span class="author-block">Jacky Keung<sup style="color:#6fbf73;">1</sup></span>
</div>
<br>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup style="color:#6fbf73;">1</sup>CityU Hong Kong,</span>
<span class="author-block"><sup style="color:#007bff;">2</sup>Wuhan University,</span>
<span class="author-block"><sup style="color:#ffac33;">3</sup>Tsinghua University,</span>
<span class="author-block"><sup style="color:#ed4b82;">4</sup>Zhejiang University,</span>
<span class="author-block"><sup style="color:#9b51e0;">5</sup>Rhymes AI</span>
</div>
<br>
<div class="is-size-5 publication-authors">
<span class="author-block">*Core Contributors</span><br>
<span class="author-block">†Corresponding to:</span>
<span class="author-block"><a href="mailto:[email protected]">[email protected]</a></span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a href="https://arxiv.org/abs/2410.12381" class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>arXiv</span>
</a>
</span> 
<span class="link-block">
<a href="https://github.com/HumanEval-V/HumanEval-V-Benchmark" class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span> 
<span class="link-block">
<a href="#leaderboard" class="external-link button is-normal is-rounded is-dark">
<span class="icon" style="font-size:18px">🏆</span>
<span>Leaderboard</span>
</a>
</span> 
<span class="link-block">
<a href="https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark" class="external-link button is-normal is-rounded is-dark">
<span class="icon" style="font-size:18px">🤗</span>
<span>Dataset</span>
</a>
</span> 
<span class="link-block">
<a href="https://huggingface.co/spaces/HumanEval-V/HumanEval-V-Benchmark-Viewer" class="external-link button is-normal is-rounded is-dark">
<span class="icon" style="font-size:18px">🤗</span>
<span>Dataset Viewer</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="hero teaser">
<div class="container is-max-desktop has-text-centered">
<img src="static/images/introduction_example.png" alt="HumanEval-V Coding Task" width="80%"/>
<p>An example coding task in HumanEval-V. Each task involves completing a <b>Python function</b> based on<br> <b>a single image, the function signature, and problem descriptions</b> provided in the comment block.
</p>
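<div class="content has-text-justified">
<p>For illustration only, the sketch below shows the textual half of such a task. The function name, signature, and docstring are hypothetical (they are not taken from the benchmark); in a real task the comment block refers to the accompanying image, which carries the essential visual context.</p>
<pre><code>from typing import List

def count_shaded_regions(grid: List[List[int]]) -> int:
    """
    The image shows a rectangular grid in which some cells are shaded (1)
    and the rest are blank (0). Count the connected shaded regions
    (4-directional adjacency) and return that number.
    """
    # The LMM must generate the function body from the image and this signature.
    ...
</code></pre>
</div>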
</div>
</section>
<section class="section">
<div class="container">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">🔔News</h2>
<div class="content has-text-justified">
<ul>
<li><b>[2024.11.20]</b> <b>Pixtral-Large-Instruct-2411</b> achieves a <b>new open-weight SOTA</b> on HumanEval-V with <b>11.1</b> <i>pass@1</i> and <b>26.9</b> <i>pass@10</i>!</li>
<li><b>[2024.10.23]</b> <b>Claude 3.5 Sonnet (1022)</b> achieves a <b>new SOTA</b> on HumanEval-V with <b>25.9</b> <i>pass@1</i> and <b>42.6</b> <i>pass@10</i>!</li>
<li><b>[2024.10.23]</b> We have added more SOTA LMMs to the leaderboard, including <a href="https://huggingface.co/mistralai/Pixtral-12B-2409">Pixtral</a>, <a href="https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct">Llama-3.2-Vision</a>, <a href="https://huggingface.co/rhymes-ai/Aria">Aria</a>, and <a href="https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B">Ovis1.6-Gemma2</a>.</li>
<li><b>[2024.10.23]</b> We have updated the <i>Parsing Success Rate</i> metric (the percentage of generated samples that pass <i>Pylint</i> without errors). Many LMMs score poorly on it, revealing weaknesses in generating syntactically well-formed code; a sketch of this check follows the news list.</li>
<li><b>[2024.10.17]</b> Our paper is now accessible at <a href="https://huggingface.co/papers/2410.12381">huggingface.co/papers/2410.12381</a> (<b>#2</b> Paper of the day)</li>
</ul>
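<p>For reference, the snippet below is a minimal sketch of how such a Pylint-based check could be performed on one generated sample. It assumes Pylint is installed and uses a hypothetical file name; the benchmark's actual harness and Pylint configuration may differ.</p>
<pre><code>import subprocess

def passes_pylint(path: str) -> bool:
    """Return True if Pylint reports no fatal or error messages for the file."""
    proc = subprocess.run(["pylint", "--errors-only", path], capture_output=True, text=True)
    # Pylint's exit status is bit-encoded: 1 = fatal, 2 = error, 32 = usage error.
    return (proc.returncode & (1 | 2 | 32)) == 0

print(passes_pylint("generated_solution.py"))  # hypothetical file name
</code></pre>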
</div>
<h2 class="title is-3">Introduction</h2>
<div class="content has-text-justified">
<p>
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of <b>high-level instructions, complex reasoning, and the implementation of functional programs</b> -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains <b>a notable lack of coding benchmarks that rigorously assess LMMs, particularly in tasks that emphasize visual reasoning</b>.
To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation tasks. <font color="#9900FF"><b>HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks</b></font> derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage.
LMMs are required to <font color="#9900FF"><b>complete the code solution based on the provided visual context and a predefined Python function signature</b></font> outlining the task requirements. Every task is equipped with meticulously <font color="#9900FF"><b>handcrafted test cases for execution-based pass@<i>k</i> evaluation</b></font>.
We evaluate 20+ state-of-the-art LMMs using HumanEval-V and uncover significant challenges. Proprietary models such as <b>GPT-4o achieve only 13% pass@1 and 36.4% pass@10</b>, while <b>open-weight models with 70B parameters score below 4% pass@1</b>. Ablation studies further demonstrate the limitations of current LMMs in visual reasoning and coding capabilities.
These results highlight key areas for future research to enhance LMMs' capabilities.
</p>
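<p>As a rough illustration of execution-based scoring (a minimal sketch with an assumed harness shape, not the benchmark's exact code; real evaluation would also require sandboxing and timeouts), a generated completion can be loaded and run against a task's handcrafted test cases as follows, with a sample counting as passing only if every assertion holds.</p>
<pre><code>def run_sample(candidate_source: str, test_source: str, entry_point: str) -> bool:
    """Execute one generated solution against a task's handcrafted tests."""
    namespace = {}
    try:
        exec(candidate_source, namespace)           # define the generated function
        exec(test_source, namespace)                # define check(candidate) with asserts (assumed layout)
        namespace["check"](namespace[entry_point])  # run the handcrafted test cases
        return True
    except Exception:
        return False                                # assertion failure, crash, or syntax error
</code></pre>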
</div>
<h2 class="title is-3">Benchmark Construction</h2>
<div class="container is-max-desktop has-text-centered">
<img src="static/images/construct_pipeline.png" alt="HumanEval-V Construction">
<p>The construction of HumanEval-V follows a <b>collect-adapt-mutate</b> pipeline. After construction, every coding task undergoes rigorous validation to ensure it <b>meets our quality standards</b>.
</p>
</div>
<br>
<h2 class="title is-3">Visual Context Examples</h2>
<div class="container is-max-desktop has-text-centered">
<img src="static/images/concatenated_images.png" alt="HumanEval-V Examples">
<p>HumanEval-V includes visual elements like trees, graphs, matrices, maps, grids, flowcharts, and more. The visual contexts are designed to be <b>indispensable and self-explanatory</b>, embedding rich contextual information and algorithmic patterns.
</p>
</div>
</div>
</div>
</div>
<div class="container">
<!-------------------------------------------------------------------- RESULTS SECTION -------------------------------------------------------------------->
<div class="columns is-centered m-6">
<div class="column is-full has-text-centered content">
<h2 class="title is-3" id="leaderboard">🏆Leaderboard on HumanEval-V🏆</h2>
<div class="model-labels-container">
<span class="leaderboard-label open_source">Open-Weight</span>
<span class="leaderboard-label proprietary">Proprietary</span>
</div>
<br>
<div class="content has-text-centered content">
<p class="test-desc">
The best performance is shown in <b>bold</b>, while the second-best is <u>underlined</u>.
You can sort the table by <i>Pass@1</i> or <i>Pass@10</i> by clicking the corresponding column headers.<br>
</p>
</div>
<div class="leaderboard-container">
<div class="table-wrapper">
<table id="HEV-table">
<thead>
<tr>
<th class="sortable clickable" data-sort="string">Name</th>
<th class="sortable clickable" data-sort="string">Source</th>
<th class="sortable clickable" data-sort="string">Design</th>
<th class="sortable clickable" data-sort="string">Size</th>
<th class="sortable clickable sort-desc" data-sort="number">Pass@1</th>
<th class="sortable clickable" data-sort="number">Pass@10</th>
<th class="sortable clickable" data-sort="number">PSR@1</th>
<th class="sortable clickable" data-sort="number">PSR@10</th>
<th class="sortable clickable" data-sort="date">Date</th>
</tr>
</thead>
<tbody>
<!-- Table body will be populated dynamically -->
</tbody>
</table>
</div>
</div>
<div class="content has-text-left content">
<p class="test-desc">
<ul>
<li><i>Pass@1</i> is computed using greedy decoding, whereas <i>Pass@10</i> is based on 20 generated samples with sampling parameters of <i>t=0.8</i> and <i>p=0.95</i>.</li>
<li><i>PSR</i> represents the percentage of samples that pass Pylint checks without any errors.</li>
<li>The <i>Encoder-Decoder</i> design signifies that the model is trained using a vision encoder and a language model decoder.</li>
</ul>
</p>
</div>
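<div class="content has-text-left">
<p class="test-desc">As referenced in the notes above, the sketch below shows the standard unbiased <i>pass@k</i> estimator introduced with the original HumanEval benchmark, which we assume underlies the reported <i>Pass@10</i> numbers: with <i>n</i> samples per task of which <i>c</i> pass, <i>pass@k</i> = 1 - C(n-c, k)/C(n, k), averaged over all tasks. The example values are illustrative.</p>
<pre><code>from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k draws from n samples passes."""
    if k > n - c:            # fewer than k failing samples, so any k-subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative example: a task where 2 of 20 sampled solutions pass its tests.
print(round(pass_at_k(20, 2, 10), 3))  # 0.763
</code></pre>
</div>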
</div>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title is-3 has-text-centered">BibTeX</h2>
<pre><code>@article{zhang2024humanevalv,
  title={HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks},
  author={Zhang, Fengji and Wu, Linquan and Bai, Huiyu and Lin, Guancheng and Li, Xiao and Yu, Xiao and Wang, Yue and Chen, Bei and Keung, Jacky},
  journal={arXiv preprint arXiv:2410.12381},
  year={2024}
}
</code></pre>
</div>
</section>
<footer class="footer">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is adapted from <a href="https://nerfies.github.io/">Nerfies</a> and <a href="https://mmmu-benchmark.github.io/">MMMU</a>, licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</div>
</footer>
</body>
</html>