ChartMuseum is a chart question answering benchmark designed to evaluate the reasoning capabilities of large vision-language models (LVLMs) over real-world chart images. We categorize the questions into four types by the reasoning they require: visual reasoning, synthesis, combined visual/text reasoning, and text reasoning.
Human overall accuracy on ChartMuseum is 93%, with 98.2% on the visual reasoning questions. Examples from ChartMuseum are available here.
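One way to explore the benchmark programmatically is sketched below: load the dataset and count questions per reasoning category. This is a minimal sketch assuming the benchmark is published on the Hugging Face Hub; the dataset id `lytang/ChartMuseum`, the split name, and the `reasoning_type` field name are assumptions, not confirmed by this page.

```python
# Minimal sketch: load ChartMuseum and count questions per reasoning type.
# The dataset id "lytang/ChartMuseum", the split name, and the field name
# "reasoning_type" are assumptions, not confirmed by this page.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("lytang/ChartMuseum", split="test")

counts = Counter(example["reasoning_type"] for example in dataset)
for category, n in counts.most_common():
    print(f"{category}: {n} questions")
```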
The table below compares model performance across the four question types; all numbers are accuracy (%).
| Model | Size | Visual | Synthesis | Visual/Text | Text | Overall |
|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | - | 53.3 | 64.7 | 70.1 | 87.8 | 63.0 |
| o4-mini (high) | - | 51.2 | 66.2 | 68.4 | 86.2 | 61.5 |
| o3 (high) | - | 50.4 | 63.2 | 69.7 | 85.4 | 60.9 |
| Claude-3.7-Sonnet | - | 50.6 | 55.6 | 69.2 | 88.6 | 60.3 |
| GPT-4.1 | - | 37.1 | 53.4 | 54.3 | 78.9 | 48.4 |
| Qwen2.5-VL-72B | 72B | 30.4 | 35.3 | 42.3 | 68.3 | 38.5 |
| Bespoke-MiniChart-7B | 7B | 26.3 | 32.3 | 41.0 | 54.5 | 34.0 |
| Qwen2.5-VL-7B | 7B | 19.4 | 24.8 | 36.3 | 41.5 | 26.8 |
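For reference, here is a sketch of how the per-category and overall numbers above could be computed from a model's graded predictions. The `predictions` layout and field names are hypothetical, not the benchmark's official evaluation code; the one structural fact it encodes is that Overall is an accuracy over all questions rather than an unweighted mean of the four columns (for example, averaging Gemini-2.5-Pro's four category scores gives about 69, not its listed 63.0, so categories must contribute in proportion to their question counts).

```python
from collections import defaultdict


def score(predictions: list[dict]) -> dict[str, float]:
    """Compute per-category and overall accuracy (%).

    Each prediction is assumed to look like
    {"reasoning_type": "visual", "correct": True}; this layout is
    hypothetical, not ChartMuseum's official evaluation format.
    """
    per_category: dict[str, list[bool]] = defaultdict(list)
    for p in predictions:
        per_category[p["reasoning_type"]].append(p["correct"])
    results = {
        cat: 100 * sum(flags) / len(flags)
        for cat, flags in per_category.items()
    }
    # Overall is computed over all questions, so categories with more
    # questions weigh more than a plain mean of the four column scores.
    results["overall"] = 100 * sum(p["correct"] for p in predictions) / len(predictions)
    return results
```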