ChartMuseum Leaderboard

NeurIPS 2025
ChartMuseum Overview

ChartMuseum is a chart question answering benchmark designed to evaluate the reasoning capabilities of large vision-language models (LVLMs) on real-world chart images. We categorize the questions into four types:

  • Textual reasoning questions can be solved almost exclusively with textual reasoning.
  • Visual reasoning questions are most easily answered using visual aspects of the chart.
  • Text/Visual reasoning questions can be answered by either primarily text or primarily visual reasoning.
  • Synthesis reasoning questions require both textual and visual reasoning.

Human overall accuracy on ChartMuseum is 93%, with 98.2% on the visual reasoning questions. Examples from ChartMuseum are available here.
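To make the four-way categorization concrete, here is a minimal sketch of how per-category and overall accuracy could be scored on ChartMuseum-style records. The record schema (`reasoning_type`, `answer` fields) and the exact-match comparison are illustrative assumptions, not the benchmark's actual field names or official metric.

```python
# Hedged sketch: per-category accuracy on ChartMuseum-style QA records.
# NOTE: the field names "reasoning_type" and "answer" are assumptions
# made for illustration; the real benchmark schema may differ.
from collections import defaultdict

def per_category_accuracy(records, predictions):
    """Return accuracy per reasoning category, plus an 'overall' entry."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        cat = rec["reasoning_type"]
        total[cat] += 1
        total["overall"] += 1
        # Simple case-insensitive exact match as a stand-in metric.
        if pred.strip().lower() == rec["answer"].strip().lower():
            correct[cat] += 1
            correct["overall"] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy example with two hypothetical records.
records = [
    {"reasoning_type": "visual", "answer": "2019"},
    {"reasoning_type": "textual", "answer": "Europe"},
]
preds = ["2019", "Asia"]
print(per_category_accuracy(records, preds))
# {'visual': 1.0, 'overall': 0.5, 'textual': 0.0}
```

In practice, the leaderboard numbers below are per-category accuracies of this general shape, with the overall score computed across all questions rather than averaged over categories.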

Model Comparison

Compare model performance across different metrics.

| Model | Size | Visual | Synthesis | Visual/Text | Text | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5-mini (high) | - | 52.6 | 62.4 | 73.5 | 89.4 | 63.3 |
| Gemini-2.5-Pro | - | 53.3 | 64.7 | 70.1 | 87.8 | 63.0 |
| GPT-5 (high) | - | 53.7 | 64.7 | 68.4 | 88.6 | 62.9 |
| o4-mini (high) | - | 51.2 | 66.2 | 68.4 | 86.2 | 61.5 |
| o3 (high) | - | 50.4 | 63.2 | 69.7 | 85.4 | 60.9 |
| Claude-3.7-Sonnet | - | 50.6 | 55.6 | 69.2 | 88.6 | 60.3 |
| Claude-4.1-Opus | - | 50.4 | 54.1 | 66.2 | 87.0 | 59.1 |
| Claude-4-Sonnet | - | 41.0 | 52.6 | 62.4 | 82.1 | 52.6 |
| GPT-4.1 | - | 37.1 | 53.4 | 54.3 | 78.9 | 48.4 |
| Qwen2.5-VL-72B | 72B | 30.4 | 35.3 | 42.3 | 68.3 | 38.5 |
| Bespoke-MiniChart-7B | 7B | 26.3 | 32.3 | 41.0 | 54.5 | 34.0 |
| Qwen2.5-VL-7B | 7B | 19.4 | 24.8 | 36.3 | 41.5 | 26.8 |