
ChartMuseum Leaderboard

ChartMuseum Overview

ChartMuseum is a chart question answering benchmark designed to evaluate reasoning capabilities of large vision-language models (LVLMs) over real-world chart images. We categorize the questions into four types:

  • Textual reasoning questions can be solved almost exclusively with textual reasoning.
  • Visual reasoning questions are most easily answered through visual aspects of the chart.
  • Text/Visual reasoning questions can be answered with either primarily textual or primarily visual reasoning.
  • Synthesis reasoning questions require both textual and visual reasoning.

Human overall accuracy on ChartMuseum is 93%, with 98.2% on the visual reasoning questions. Examples from ChartMuseum are available here.

Model Comparison

Model accuracy (%) on each question category; rows are sorted by overall accuracy.

The table below shows 8 of the 21 models available on the leaderboard.
| Model | Size | Visual | Synthesis | Text/Visual | Text | Overall |
|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | - | 53.3 | 64.7 | 70.1 | 87.8 | 63.0 |
| o4-mini (high) | - | 51.2 | 66.2 | 68.4 | 86.2 | 61.5 |
| o3 (high) | - | 50.4 | 63.2 | 69.7 | 85.4 | 60.9 |
| Claude-3.7-Sonnet | - | 50.6 | 55.6 | 69.2 | 88.6 | 60.3 |
| GPT-4.1 | - | 37.1 | 53.4 | 54.3 | 78.9 | 48.4 |
| Qwen2.5-VL-72B | 72B | 30.4 | 35.3 | 42.3 | 68.3 | 38.5 |
| Bespoke-MiniChart-7B | 7B | 26.3 | 32.3 | 41.0 | 54.5 | 34.0 |
| Qwen2.5-VL-7B | 7B | 19.4 | 24.8 | 36.3 | 41.5 | 26.8 |
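The benchmark's central contrast, that models score much lower on visual reasoning than on textual reasoning, can be checked directly from the numbers above. A minimal sketch in plain Python (the data is transcribed from the table; the `visual_text_gap` metric is our own illustration, not a statistic defined by the benchmark):

```python
# Leaderboard scores transcribed from the table above:
# (model, visual, synthesis, text_visual, text, overall), all accuracy in %.
SCORES = [
    ("Gemini-2.5-Pro",       53.3, 64.7, 70.1, 87.8, 63.0),
    ("o4-mini (high)",       51.2, 66.2, 68.4, 86.2, 61.5),
    ("o3 (high)",            50.4, 63.2, 69.7, 85.4, 60.9),
    ("Claude-3.7-Sonnet",    50.6, 55.6, 69.2, 88.6, 60.3),
    ("GPT-4.1",              37.1, 53.4, 54.3, 78.9, 48.4),
    ("Qwen2.5-VL-72B",       30.4, 35.3, 42.3, 68.3, 38.5),
    ("Bespoke-MiniChart-7B", 26.3, 32.3, 41.0, 54.5, 34.0),
    ("Qwen2.5-VL-7B",        19.4, 24.8, 36.3, 41.5, 26.8),
]

def visual_text_gap(row):
    """Text-minus-visual accuracy gap: how far visual reasoning trails textual."""
    _, visual, _, _, text, _ = row
    return text - visual

# Rank models by how much their visual accuracy lags their textual accuracy.
for row in sorted(SCORES, key=visual_text_gap, reverse=True):
    print(f"{row[0]:<22} gap = {visual_text_gap(row):.1f}")
```

Every model shows a double-digit gap, consistent with the overview's point that visual reasoning is the hardest category for current LVLMs even though humans reach 98.2% on it.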