ChartMuseum Leaderboard

NeurIPS 2025
ChartMuseum Overview

ChartMuseum is a chart question answering benchmark designed to evaluate the reasoning capabilities of large vision-language models (LVLMs) on real-world chart images. We categorize the questions into four types:

  • Textual reasoning questions can be solved almost exclusively with textual reasoning.
  • Visual reasoning questions are most easily answered using visual aspects of the chart.
  • Text/Visual reasoning questions can be answered by either primarily text or primarily visual reasoning.
  • Synthesis reasoning questions require both textual and visual reasoning.

Human overall accuracy on ChartMuseum is 93%, with 98.2% on the visual reasoning questions. Examples from ChartMuseum are available here.
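To make the four-way categorization concrete, here is a minimal sketch of how per-category and overall accuracy could be scored on ChartMuseum-style records. The record schema (`reasoning_type`, `answer` fields) and the exact-match comparison are illustrative assumptions, not the benchmark's actual field names or official metric.

```python
# Hedged sketch: per-category accuracy on ChartMuseum-style QA records.
# NOTE: the field names "reasoning_type" and "answer" are assumptions
# made for illustration; the real benchmark schema may differ.
from collections import defaultdict

def per_category_accuracy(records, predictions):
    """Return accuracy per reasoning category, plus an 'overall' entry."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        cat = rec["reasoning_type"]
        total[cat] += 1
        total["overall"] += 1
        # Simple case-insensitive exact match as a stand-in metric.
        if pred.strip().lower() == rec["answer"].strip().lower():
            correct[cat] += 1
            correct["overall"] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy example with two hypothetical records.
records = [
    {"reasoning_type": "visual", "answer": "2019"},
    {"reasoning_type": "textual", "answer": "Europe"},
]
preds = ["2019", "Asia"]
print(per_category_accuracy(records, preds))
# {'visual': 1.0, 'overall': 0.5, 'textual': 0.0}
```

In practice, the leaderboard numbers below are per-category accuracies of this general shape, with the overall score computed across all questions rather than averaged over categories.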

Model Comparison

Compare model performance across different metrics.

| Model | Size | Visual | Synthesis | Visual/Text | Text | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5-mini (high) | - | 52.6 | 62.4 | 73.5 | 89.4 | 63.3 |
| Gemini-2.5-Pro | - | 53.3 | 64.7 | 70.1 | 87.8 | 63.0 |
| GPT-5 (high) | - | 53.7 | 64.7 | 68.4 | 88.6 | 62.9 |
| o4-mini (high) | - | 51.2 | 66.2 | 68.4 | 86.2 | 61.5 |
| o3 (high) | - | 50.4 | 63.2 | 69.7 | 85.4 | 60.9 |
| Claude-3.7-Sonnet | - | 50.6 | 55.6 | 69.2 | 88.6 | 60.3 |
| Claude-4.1-Opus | - | 50.4 | 54.1 | 66.2 | 87.0 | 59.1 |
| Claude-4-Sonnet | - | 41.0 | 52.6 | 62.4 | 82.1 | 52.6 |
| GPT-4.1 | - | 37.1 | 53.4 | 54.3 | 78.9 | 48.4 |
| Qwen2.5-VL-72B | 72B | 30.4 | 35.3 | 42.3 | 68.3 | 38.5 |
| Bespoke-MiniChart-7B | 7B | 26.3 | 32.3 | 41.0 | 54.5 | 34.0 |
| Qwen2.5-VL-7B | 7B | 19.4 | 24.8 | 36.3 | 41.5 | 26.8 |