<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.4">Jekyll</generator><link href="https://llava-vl.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://llava-vl.github.io/blog/" rel="alternate" type="text/html" /><updated>2024-02-06T23:45:47-06:00</updated><id>https://llava-vl.github.io/blog/feed.xml</id><title type="html">LLaVA</title><subtitle></subtitle><entry><title type="html">LLaVA-NeXT: Improved reasoning, OCR, and world knowledge</title><link href="https://llava-vl.github.io/blog/2024-01-30-llava-next/" rel="alternate" type="text/html" title="LLaVA-NeXT: Improved reasoning, OCR, and world knowledge" /><published>2024-01-30T12:33:38-06:00</published><updated>2024-01-30T12:33:38-06:00</updated><id>https://llava-vl.github.io/blog/llava-next</id><content type="html" xml:base="https://llava-vl.github.io/blog/2024-01-30-llava-next/"><![CDATA[<!-- for mathjax support -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true
},
TeX: {
equationNumbers: { autoNumber: "AMS" }
}
});
</script>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<p>In October 2023, we released <a href="https://arxiv.org/abs/2310.03744">LLaVA-1.5</a> with a simple and efficient design along with great performance on a benchmark suite of 12 datasets. It has since served as the foundation of many comprehensive studies of data, models, and capabilities of large multimodal models (LMMs), and has enabled various new applications.</p>
<p><strong>Today, we are thrilled to present LLaVA-NeXT, with improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.</strong></p>
<p>Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:</p>
<ol>
<li><strong>Increasing the input image resolution</strong> to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, with resolutions up to 672x672, 336x1344, and 1344x336 (see the token-budget sketch after this list).</li>
<li><strong>Better visual reasoning and OCR capability</strong> with an improved visual instruction tuning data mixture.</li>
<li><strong>Better visual conversation for more scenarios</strong>, covering different applications, with better world knowledge and logical reasoning.</li>
<li><strong>Efficient deployment and inference</strong> with <a href="https://github.com/sgl-project/sglang">SGLang</a>.</li>
</ol>
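<p>To put these resolutions in context, the short sketch below enumerates the grid configurations listed in the model card and estimates the resulting number of visual tokens per image. This is a hedged back-of-the-envelope calculation, not the released code: it assumes each 336x336 tile is encoded into 24x24 = 576 patch tokens (CLIP ViT-L/14 at 336px) plus one additional downsampled global view, and it ignores any separator tokens the implementation may add.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged token-budget estimate for the supported AnyRes grid configurations.
BASE = 336                              # base tile side length of the vision encoder
TOKENS_PER_TILE = (BASE // 14) ** 2     # 24 x 24 = 576 patch tokens per tile (assumed)

GRIDS = [(2, 2), (1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1)]   # from the model card

for rows, cols in GRIDS:
    height, width = rows * BASE, cols * BASE
    tiles = rows * cols
    tokens = (tiles + 1) * TOKENS_PER_TILE   # tiles plus one downsampled global view
    print(f"grid {rows}x{cols}: {height}x{width} px, ~{tokens} visual tokens")
</code></pre></div></div>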
<p>Along with performance improvements, <strong>LLaVA-NeXT maintains the minimalist design and data efficiency of LLaVA-1.5</strong>. It reuses the pretrained connector of LLaVA-1.5, and still uses less than 1M visual instruction tuning samples. The largest 34B variant finishes training in ~1 day with 32 A100s. <strong>Code, data, and models will be made publicly available.</strong></p>
<h2 id="open-source-release">Open-Source Release</h2>
<p>We open-source LLaVA-NeXT to facilitate the future development of LMMs in the community. Code, data, and models will be made publicly available.</p>
<ul>
<li><a href="https://llava.hliu.cc" target="_blank">Demo</a></li>
<li><a href="https://github.com/haotian-liu/LLaVA" target="_blank">Code</a> (Training code coming soon)</li>
<li><a href="https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md#llava-v16" target="_blank">Model</a></li>
<li>Data: Coming soon.</li>
</ul>
<h2 id="results">Results</h2>
<div>
<span style="display: inline-block; padding: 4px 10px; margin: 5px; border-radius: 3px; background-color: rgba(249, 242, 248, 1);">Open-Source</span>
<span style="display: inline-block; padding: 4px 10px; margin: 5px; border-radius: 3px; background-color: rgba(117, 209, 215, 0.1);">Proprietary</span>
</div>
<table>
<tr><th>Data (PT)</th><th>Data (IT)</th><th>Model</th><th>MMMU (val)</th><th>Math-Vista</th><th>MMB-ENG</th><th>MMB-CN</th><th>MM-Vet</th><th>LLaVA-Wild</th><th>SEED-IMG</th></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>N/A</td><td>N/A</td><td>GPT-4V</td><td>56.8</td><td>49.9</td><td>75.8</td><td>73.9</td><td>67.6</td><td>-</td><td>71.6</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>N/A</td><td>N/A</td><td>Gemini Ultra</td><td>59.4</td><td>53</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>N/A</td><td>N/A</td><td>Gemini Pro</td><td>47.9</td><td>45.2</td><td>73.6</td><td>74.3</td><td>64.3</td><td>-</td><td>70.7</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>1.4B</td><td>50M</td><td>Qwen-VL-Plus</td><td>45.2</td><td>43.3</td><td>-</td><td>-</td><td>55.7</td><td>-</td><td>65.7</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>1.5B</td><td>5.12M</td><td>CogVLM-30B</td><td>32.1</td><td>-</td><td>-</td><td>-</td><td>56.8</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>125M</td><td>~1M</td><td>Yi-VL-34B</td><td>45.9</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>558K</td><td>665K</td><td>LLaVA-1.5-13B</td><td>36.4</td><td>27.6</td><td>67.8</td><td>63.3</td><td>36.3</td><td>72.5</td><td>68.2</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>558K</td><td>760K</td><td>LLaVA-NeXT-34B</td><td>51.1</td><td>46.5</td><td>79.3</td><td>79</td><td>57.4</td><td>89.6</td><td>75.9</td></tr>
</table>
<details>
<summary><b><em>For more results, please unfold to see expanded tables</em></b></summary>
<i>More benchmarks will be added soon.</i>
<div>
<span style="display: inline-block; padding: 4px 10px;">$^\dagger$specialist</span>
<span style="display: inline-block; padding: 4px 10px; color: gray">Grey: supervised finetuned</span>
<span style="display: inline-block; padding: 4px 10px;">$^*$training <u>image</u> observed</span>
<span style="display: inline-block; padding: 4px 10px;">Normal text: zero-shot</span>
</div>
<table>
<tr><th>Model</th><th>VQAv2</th><th>GQA</th><th>VisWiz</th><th>TextVQA</th><th>ScienceQA</th></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>GPT-4V</td><td>77.2</td><td>-</td><td>-</td><td>78.0</td><td>-</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>Gemini Ultra</td><td>77.8</td><td>-</td><td>-</td><td>82.3</td><td>-</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>Gemini Pro</td><td>71.2</td><td>-</td><td>-</td><td>74.6</td><td>-</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>PALI-X</td><td style="color: gray">86.0$^\dagger$</td><td>-</td><td style="color: gray">70.9$^\dagger$</td><td style="color: gray">71.4$^\dagger$</td><td>-</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>CogVLM-30B</td><td style="color: gray">83.4 (84.7$^\dagger$)</td><td style="color: gray">65.2$^\dagger$</td><td style="color: gray">76.4$^\dagger$</td><td style="color: gray">68.1 (69.3$^\dagger$)</td><td style="color: gray">92.7$^\dagger$</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>LLaVA-1.5-13B</td><td style="color: gray">80</td><td style="color: gray">63.3</td><td>53.6</td><td>61.3$^*$</td><td>71.6</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>LLaVA-NeXT-Vicuna-7B</td><td style="color: gray">81.8</td><td style="color: gray">64.2</td><td>57.6</td><td>64.9</td><td>70.1</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>LLaVA-NeXT-Vicuna-13B</td><td style="color: gray">82.8</td><td style="color: gray">65.4</td><td>60.5</td><td>67.1</td><td>73.6</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>LLaVA-NeXT-Mistral-7B</td><td style="color: gray">82.2</td><td style="color: gray">64.8</td><td>60.0</td><td>65.7</td><td>72.8</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>LLaVA-NeXT-34B</td><td style="color: gray">83.7</td><td style="color: gray">67.1</td><td>63.8</td><td>69.5</td><td>81.8</td></tr>
</table>
<table>
<tr><th>Data (PT)</th><th>Data (IT)</th><th>Model</th><th>MMMU (val)</th><th>MMMU (test)</th><th>MathVista</th><th>MMB-ENG</th><th>MMB-CN</th><th>MM-Vet</th><th>LLaVA-Wild</th><th>SEED-IMG</th><th>MME</th><th>POPE</th></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>N/A</td><td>N/A</td><td>GPT-4V</td><td>56.8</td><td>55.7</td><td>49.9</td><td>75.8</td><td>73.9</td><td>67.6</td><td>-</td><td>71.6</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>N/A</td><td>N/A</td><td>Gemini Ultra</td><td>59.4</td><td>-</td><td>53</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>N/A</td><td>N/A</td><td>Gemini Pro</td><td>47.9</td><td>-</td><td>45.2</td><td>73.6</td><td>74.3</td><td>64.3</td><td>-</td><td>70.7</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(117, 209, 215, 0.1);"><td>1.4B</td><td>50M</td><td>Qwen-VL-Plus</td><td>45.2</td><td>40.8</td><td>43.3</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>1.5B</td><td>5.12M</td><td>CogVLM-30B</td><td>32.1</td><td>30.1</td><td>-</td><td>-</td><td>-</td><td>56.8</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>558K</td><td>665K</td><td>LLaVA-1.5-13B (336$^2$)</td><td>36.4</td><td>33.6</td><td>27.6</td><td>67.8</td><td>63.3</td><td>36.3</td><td>72.5</td><td>68.2</td><td>1531/295</td><td>85.93</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>558K</td><td>760K</td><td>LLaVA-NeXT-Vicuna-7B (672$^2$)</td><td>35.8</td><td>-</td><td>34.6</td><td>67.4</td><td>60.6</td><td>43.9</td><td>81.6</td><td>70.2</td><td>1519/332</td><td>86.53</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>558K</td><td>760K</td><td>LLaVA-NeXT-Mistral-7B (672$^2$)</td><td>35.3</td><td>-</td><td>37.7</td><td>68.7</td><td>61.2</td><td>47.3</td><td>83.2</td><td>72.2</td><td>1498/321</td><td>86.73</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>558K</td><td>760K</td><td>LLaVA-NeXT-13B (672$^2$)</td><td>36.2</td><td>-</td><td>35.3</td><td>70</td><td>64.4</td><td>48.4</td><td>87.3</td><td>71.9</td><td>1575/326</td><td>86.23</td></tr>
<tr style="background-color: rgba(249, 242, 248, 1);"><td>558K</td><td>760K</td><td>LLaVA-NeXT-34B (672$^2$)</td><td>51.1</td><td>44.7</td><td>46.5</td><td>79.3</td><td>79</td><td>57.4</td><td>89.6</td><td>75.9</td><td>1631/397</td><td>87.73</td></tr>
</table>
</details>
<div> </div>
<p>✨ Highlights:</p>
<ul>
<li><strong>SoTA Performance!</strong> LLaVA-NeXT achieves the best performance among open-source LMMs such as <a href="https://github.com/THUDM/CogVLM" target="_blank">CogVLM</a> and <a href="https://huggingface.co/01-ai/Yi-VL-34B" target="_blank">Yi-VL</a>. Compared with commercial models, it catches up to Gemini Pro and outperforms <a href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus" target="_blank">Qwen-VL-Plus</a> on selected benchmarks.</li>
<li><strong>Zero-shot Chinese capability</strong>. LLaVA-NeXT’s Chinese capability is an emerging zero-shot capability (i.e., only English multimodal data is used in training). Its performance on Chinese multimodal scenarios is surprisingly good, e.g., SoTA on MMBench-CN.</li>
<li><strong>Low Training Cost</strong>. LLaVA-NeXT is trained with 32 GPUs for ~1 day, with 1.3M data samples in total. Its compute and training data cost is 100-1000x smaller than that of other methods.</li>
</ul>
<h3 id="qualitative-results">Qualitative Results</h3>
<details>
<summary><b>Mark Zuckerberg talks about Llama-3 and 600K H100s</b></summary>
<div style="display: block; padding: 4px 10px; margin: 5px; border-radius: 3px; background-color: rgba(117, 209, 215, 0.1);">
<h4>User: Explain this to me in detail.</h4>
<img width="90%" src="/blog/assets/images/llava-1-6/zuck_llama3.jpg" />
</div>
<div style="display: block; padding: 4px 10px; margin: 5px; border-radius: 3px; background-color: rgba(249, 242, 248, 1);">
<h4>LLaVA-NeXT-34B</h4>
<p>The image you've provided appears to be a screenshot of a social media post, specifically from Instagram, featuring a man speaking and a text overlay. The man in the image is wearing a black t-shirt and appears to be in the middle of a conversation or presentation. The text overlay is a quote from a post by a user named "zuck," which is presumably a reference to Mark Zuckerberg, the CEO of Meta (formerly known as Facebook).</p>
<p>The text overlay reads:</p>
<p>"Some updates on our AI efforts. Our long term vision is to build general intelligence, open source it responsibly, and make it widely available so everyone can benefit. We're bringing our two major AI research efforts (FAIR and GenAI) closer together to support this. We're currently training our next-gen model Llama 3, and we're building massive compute infrastructure to support our future roadmap, including 35k H100s by the end of this year — and overall almost 600k H100s equivalents of compute if you include other GPUs. Also really excited about our progress building new AI-centric computing devices like Ray Ban Meta smart glasses. Lots more to come."</p>
<p>This text provides insights into Meta's artificial intelligence (AI) initiatives. It mentions that the company's long-term goal is to develop general intelligence and make it accessible to everyone. The post also discusses the integration of Meta's two AI research efforts, FAIR (Facebook AI Research) and GenAI, to support this goal.</p>
<p>The text also mentions the training of a next-generation AI model called Llama 3 and the construction of massive compute infrastructure to support Meta's future plans. It highlights the company's investment in NVIDIA's H100 GPUs, with a total of 35,000 H100s by the end of the year and almost 600,000 H100s equivalents of compute if other GPUs are included.</p>
<p>Lastly, the post expresses excitement about Meta's progress in developing AI-centric computing devices, specifically Ray Ban Meta smart glasses. The post concludes with a statement that there is more to come, suggesting that Meta has additional AI-related announcements or developments in the pipeline.</p>
<p>The image of the man speaking could be from a video or a live stream where he is discussing these AI updates. The context of the image suggests that it is part of a larger conversation or announcement related to Meta's AI strategy and advancements.</p>
</div>
</details>
<details>
<summary><b>When to leave to pickup based on flight info?</b></summary>
<div style="display: block; padding: 4px 10px; margin: 5px; border-radius: 3px; background-color: rgba(117, 209, 215, 0.1);">
<h4>User: I need to pick up my wife. I live in San Jose. When should I leave?</h4>
<img width="90%" src="/blog/assets/images/llava-1-6/ewr_sfo_flight.png" />
</div>
<div style="display: block; padding: 4px 10px; margin: 5px; border-radius: 3px; background-color: rgba(249, 242, 248, 1);">
<h4>LLaVA-NeXT-34B</h4>
<p>Based on the information provided in the image, the flight is scheduled to arrive at 11:51 AM at San Francisco International Airport (SFO). If you live in San Jose, you should consider the travel time between San Jose and San Francisco, which is approximately 45-60 minutes depending on traffic conditions.</p>
<p>To ensure you have enough time to pick up your wife, you should leave San Jose no later than 11:00 AM to account for traffic and any unexpected delays. However, it's always a good idea to leave earlier than necessary to ensure you arrive on time.</p>
</div>
</details>
<p><br /></p>
<h2 id="detailed-technical-improvement">Detailed Technical Improvement</h2>
<p>We detail our findings and the improvements we make over LLaVA-1.5. <em>More implementation details will be released and documented here in the coming days.</em></p>
<h3 id="1-dynamic-high-resolution">(1) Dynamic High Resolution</h3>
<p>We design our model at high resolution with the aim of <strong>preserving its data efficiency</strong>.
When provided with high-resolution images and representations that preserve their details, the model’s capacity to perceive intricate details in an image is significantly improved. This reduces model hallucination, i.e., conjecturing imagined visual content when confronted with low-resolution images. Our ‘AnyRes’ technique is designed to accommodate images of various high resolutions. We employ a grid configuration of $\{2 \times 2, 1 \times \{2,3,4\}, \{2,3,4\} \times 1\}$, balancing performance with operational costs. See our <a href="https://static.hliu.cc/files/llava/improved_llava.pdf">updated LLaVA-1.5 technical report</a> for more details.</p>
<p align="center">
<img width="90%" src="/blog/assets/images/llava-1-6/high_res_arch_v2.png" />
<br />
<em>Illustration of the dynamic high-resolution scheme: a grid configuration of $\{2 \times 2\}$</em>
</p>
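<p>To make the ‘AnyRes’ idea more concrete, here is a minimal sketch of the tiling step. It is not the released implementation (which, among other details, pads rather than stretches the image and handles token packing), but it illustrates how a grid is selected from the allowed configurations and how the image is split into 336x336 tiles plus a low-resolution global view. Function names are illustrative.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch of AnyRes dynamic high resolution (illustrative, not the released code).
from PIL import Image

BASE = 336
GRIDS = [(2, 2), (1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1)]   # (rows, cols)

def select_grid(img_w, img_h):
    """Pick the grid that keeps the most image content while wasting the least padding."""
    def score(grid):
        rows, cols = grid
        gw, gh = cols * BASE, rows * BASE
        scale = min(gw / img_w, gh / img_h)               # fit the image inside the grid
        effective = min(int(img_w * scale) * int(img_h * scale), img_w * img_h)
        wasted = gw * gh - effective                      # area left as padding
        return (effective, -wasted)
    return max(GRIDS, key=score)

def tile_image(image):
    """Split an image into BASE x BASE tiles plus one downsampled global view."""
    rows, cols = select_grid(*image.size)                 # PIL size is (width, height)
    canvas = image.resize((cols * BASE, rows * BASE))     # simple resize; real code pads
    tiles = [
        canvas.crop((c * BASE, r * BASE, (c + 1) * BASE, (r + 1) * BASE))
        for r in range(rows) for c in range(cols)
    ]
    tiles.append(image.resize((BASE, BASE)))              # global view for overall context
    return tiles
</code></pre></div></div>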
<h3 id="2-data-mixture">(2) Data Mixture</h3>
<ul>
<li><strong>High-quality User Instruct Data</strong>. Our definition of high-quality visual instruction-following data hinges on two principal criteria: First, the diversity of task instructions, ensuring they adequately represent a broad spectrum of user intents that are likely to be encountered in real-world scenarios, particularly during the model’s deployment phase. Second, the superiority of responses is critical, with the objective of soliciting favorable user feedback. To achieve this, we consider two data sources: (1) Existing GPT-4V data: <a href="https://huggingface.co/datasets/laion/gpt4v-dataset">LAION-GPT-V</a> and <a href="https://sharegpt4v.github.io/">ShareGPT-4V</a>. (2) To further facilitate better visual conversation for more scenarios, we collect a small 15K visual instruction tuning dataset covering different applications. The instructions and images come from the <a href="https://llava-vl.github.io/">LLaVA demo</a> and are real-world user requests. We carefully filter samples that may have privacy concerns or are potentially harmful, and generate the responses with GPT-4V.</li>
<li><strong>Multimodal Document/Chart Data</strong>. (1) We remove <a href="https://textvqa.org/textcaps/">TextCaps</a> from our training data, as we realized that TextCaps uses the same set of training images as <a href="https://textvqa.org/">TextVQA</a>. This allows us to better understand the zero-shot OCR capability of our model when evaluating on TextVQA during development. To maintain and further improve our model’s OCR capability, we replace TextCaps with DocVQA and SynDog-EN. (2) Motivated by <a href="https://huggingface.co/Qwen/Qwen-VL">Qwen-VL-7B-Chat</a>, we further add ChartQA, DVQA, and AI2D for better chart and diagram understanding. (A hypothetical assembly of the resulting mixture is sketched after this list.)</li>
</ul>
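<p>As a rough illustration of the data mixture described above, the sketch below assembles the named sources into a single shuffled instruction-tuning list. The filenames and per-source handling are hypothetical placeholders; the actual composition and counts follow the (to-be-released) data recipe.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical assembly of the visual instruction tuning mixture (filenames are placeholders).
import json
import random

SOURCES = {
    # high-quality user-instruct data
    "laion_gpt4v":    "laion_gpt4v.json",
    "sharegpt4v":     "sharegpt4v.json",
    "llava_demo_15k": "llava_demo_user_instruct_15k.json",
    # document / chart data (TextCaps removed, as noted above)
    "docvqa":         "docvqa.json",
    "syndog_en":      "syndog_en.json",
    "chartqa":        "chartqa.json",
    "dvqa":           "dvqa.json",
    "ai2d":           "ai2d.json",
}

def build_mixture(base_mixture_path="llava_1_5_mix_without_textcaps.json", seed=0):
    random.seed(seed)
    with open(base_mixture_path) as f:
        mixture = json.load(f)                 # LLaVA-1.5 mixture with TextCaps removed
    for name, path in SOURCES.items():
        with open(path) as f:
            samples = json.load(f)             # list of {"image": ..., "conversations": ...}
        for s in samples:
            s["source"] = name                 # keep provenance for later analysis
        mixture.extend(samples)
    random.shuffle(mixture)
    return mixture
</code></pre></div></div>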
<h3 id="3-scaling-llm-backbone">(3) Scaling LLM backbone</h3>
<p>In addition to Vicuna-1.5 (<a href="https://huggingface.co/lmsys/vicuna-7b-v1.5">7B</a> and <a href="https://huggingface.co/lmsys/vicuna-13b-v1.5">13B</a>), we consider more LLMs, including <a href="https://mistral.ai/news/announcing-mistral-7b/">Mistral-7B</a> and <a href="https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B">Nous-Hermes-2-Yi-34B</a>. These LLMs possess nice properties such as flexible commercial-use terms, strong bilingual support, and larger language model capacity. This allows LLaVA to support a wider spectrum of users and more scenarios in the community. The LLaVA recipe works well with various LLMs and scales up smoothly with the LLM size up to 34B.</p>
<h2 id="model-card">Model Card</h2>
<table>
<tr><th colspan="2">Name</th><th>LLaVA-NeXT-7B</th><th>LLaVA-NeXT-13B</th><th>LLaVA-NeXT-34B</th></tr>
<tr><th rowspan="4">Model Size</th><td>Total</td><td><b>7.06B</b></td><td><b>13.35B</b></td><td><b>34.75B</b></td></tr>
<tr><td>Vision Encoder</td><td>303.5M</td><td>303.5M</td><td>303.5M</td></tr>
<tr><td>Connector</td><td>21M</td><td>31.5M</td><td>58.7M</td></tr>
<tr><td>LLM</td><td>6.74B</td><td>13B</td><td>34.39B</td></tr>
<tr><th colspan="2">Resolution</th><td colspan="3">336 x [(2,2), (1,2), (2,1), (1,3), (3,1), (1,4), (4,1)]</td></tr>
<tr><th>Stage-1</th><th>Training Data</th><td colspan="3">558K</td></tr>
<tr><th></th><th>Trainable Module</th><td colspan="3">Connector</td></tr>
<tr><th>Stage-2</th><th>Training Data</th><td colspan="3">760K</td></tr>
<tr><th></th><th>Trainable Module</th><td colspan="3">Full model</td></tr>
<tr><th colspan="2">Compute (#GPU x #Hours)</th><td>8x20</td><td>16x24</td><td>32x30</td></tr>
<tr><th colspan="2">Training Data (#Samples)</th><td colspan="3">1318K</td></tr>
</table>
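<p>The two training stages in the model card can be read as a standard freeze/unfreeze schedule: stage 1 updates only the connector on the 558K pretraining data, while stage 2 trains the full model on the 760K instruction-tuning data. The sketch below shows this in PyTorch-style pseudocode; module names and learning rates are assumptions for illustration, not the released configuration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch of the two-stage trainable-module schedule (module names are illustrative).
import torch

def set_trainable(model, stage):
    """Stage 1: train the connector only. Stage 2: train the full model."""
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:
        for p in model.mm_projector.parameters():    # vision-language connector
            p.requires_grad = True
    else:
        for p in model.parameters():                 # full model
            p.requires_grad = True

def make_optimizer(model, stage):
    params = [p for p in model.parameters() if p.requires_grad]
    lr = 1e-3 if stage == 1 else 2e-5                # typical LLaVA-style values, assumed
    return torch.optim.AdamW(params, lr=lr)
</code></pre></div></div>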
<h2 id="team">Team</h2>
<ul>
<li><a href="https://hliu.cc/">Haotian Liu</a>: University of Wisconsin-Madison<img width="16" src="/blog/assets/images/logos/wisc.png" /></li>
<li><a href="https://chunyuan.li/">Chunyuan Li</a>: Bytedance/Tiktok<img width="16" src="/blog/assets/images/logos/tiktok.png" /> (<em>Part of the work was done at Microsoft Research</em>)</li>
<li><a href="https://yuheng-li.github.io/">Yuheng Li</a>: University of Wisconsin-Madison<img width="16" src="/blog/assets/images/logos/wisc.png" /></li>
<li><a href="https://brianboli.com/">Bo Li</a>: Nanyang Technological University<img width="16" src="/blog/assets/images/logos/ntu.png" /> (<em>Work collaborated with ByteDance/TikTok</em>)</li>
<li><a href="https://zhangyuanhan-ai.github.io/">Yuanhan Zhang</a>: Nanyang Technological University<img width="16" src="/blog/assets/images/logos/ntu.png" /> (<em>Work collaborated with ByteDance/TikTok</em>)</li>
<li><a href="https://sincerass.github.io/">Sheng Shen</a>: University of California, Berkeley<img width="16" src="/blog/assets/images/logos/berkeley.png" /></li>
<li><a href="https://pages.cs.wisc.edu/~yongjaelee/">Yong Jae Lee</a>: University of Wisconsin-Madison<img width="16" src="/blog/assets/images/logos/wisc.png" /></li>
</ul>
<h2 id="acknowledgement">Acknowledgement</h2>
<ul>
<li>A16Z Open Source AI Grants Program.</li>
<li>We thank Lianmin Zheng, Ying Sheng, and Shiyi Cao for the integration of LLaVA into SGLang.</li>
<li>This work was supported in part by NSF CAREER IIS2150012, Microsoft Accelerate Foundation Models Research, and Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration; and No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).</li>
</ul>
<h2 id="citation">Citation</h2>
<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">liu2024llavanext</span><span class="p">,</span>
<span class="na">title</span><span class="p">=</span><span class="s">{LLaVA-NeXT: Improved reasoning, OCR, and world knowledge}</span><span class="p">,</span>
<span class="na">url</span><span class="p">=</span><span class="s">{https://llava-vl.github.io/blog/2024-01-30-llava-next/}</span><span class="p">,</span>
<span class="na">author</span><span class="p">=</span><span class="s">{Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae}</span><span class="p">,</span>
<span class="na">month</span><span class="p">=</span><span class="s">{January}</span><span class="p">,</span>
<span class="na">year</span><span class="p">=</span><span class="s">{2024}</span>
<span class="p">}</span>
<span class="nc">@misc</span><span class="p">{</span><span class="nl">liu2023improvedllava</span><span class="p">,</span>
<span class="na">title</span><span class="p">=</span><span class="s">{Improved Baselines with Visual Instruction Tuning}</span><span class="p">,</span>
<span class="na">author</span><span class="p">=</span><span class="s">{Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae}</span><span class="p">,</span>
<span class="na">publisher</span><span class="p">=</span><span class="s">{arXiv:2310.03744}</span><span class="p">,</span>
<span class="na">year</span><span class="p">=</span><span class="s">{2023}</span><span class="p">,</span>
<span class="p">}</span>
<span class="nc">@misc</span><span class="p">{</span><span class="nl">liu2023llava</span><span class="p">,</span>
<span class="na">title</span><span class="p">=</span><span class="s">{Visual Instruction Tuning}</span><span class="p">,</span>
<span class="na">author</span><span class="p">=</span><span class="s">{Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae}</span><span class="p">,</span>
<span class="na">publisher</span><span class="p">=</span><span class="s">{NeurIPS}</span><span class="p">,</span>
<span class="na">year</span><span class="p">=</span><span class="s">{2023}</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, Yong Jae Lee</name></author><category term="announcement" /><summary type="html"><![CDATA[LLaVA team presents LLaVA-NeXT, with improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.]]></summary></entry></feed>