Model + Technique | MAE ↓ | QWK ↑ | Acc ↑ | Time ↓ |
---|---|---|---|---|
**Proprietary Models** | | | | |
Gemini-2.5-Flash (Baseline) | 1.91 | 0.49 | 0.15 | 38.29 |
Baseline + CP | 1.23 | 0.89 | 0.33 | 45.88 |
Baseline + CP + CoT | 1.20 | 0.87 | 0.30 | 96.67 |
Baseline + CP + CoT + ICL | 1.20 | 0.88 | 0.32 | 107.32 |
Gemini-2.5-Pro (Baseline) | 2.00 | 0.49 | 0.15 | 40.13 |
Baseline + CP | 1.25 | 0.88 | 0.33 | 45.24 |
Baseline + CP + CoT | 1.15 | 0.88 | 0.31 | 95.43 |
Baseline + CP + CoT + ICL | 1.22 | 0.86 | 0.30 | 106.55 |
Qwen-VL-Plus (Baseline) | 0.63 | 0.55 | 0.13 | 8.81 |
Baseline + CP | 0.58 | 0.62 | 0.17 | 35.81 |
Baseline + CP + CoT | 0.62 | 0.59 | 0.16 | 37.40 |
Baseline + CP + CoT + ICL | 0.62 | 0.69 | 0.22 | 48.22 |
Qwen-VL-Max (Baseline) | 1.26 | 0.04 | 0.22 | 13.54 |
Baseline + CP | 0.96 | 0.05 | 0.16 | 14.20 |
Baseline + CP + CoT | 1.20 | -0.05 | 0.12 | 26.54 |
Baseline + CP + CoT + ICL | 1.02 | 0.07 | 0.18 | 22.18 |
GPT-5-Mini (Baseline) | 1.16 | 0.14 | 0.30 | 27.67 |
Baseline + CP | 1.23 | 0.07 | 0.25 | 40.49 |
Baseline + CP + CoT | 1.14 | 0.06 | 0.25 | 47.65 |
Baseline + CP + CoT + ICL | 1.24 | 0.05 | 0.25 | 53.34 |
**Open-Source Models** | | | | |
InternVL3-8B (Baseline) | 0.62 | 0.56 | 0.14 | 6.55 |
Baseline + CP | 0.55 | 0.70 | 0.20 | 7.15 |
Baseline + CP + CoT | 0.54 | 0.70 | 0.21 | 11.88 |
Baseline + CP + CoT + ICL | 0.33 | 0.72 | 0.19 | 11.68 |
Qwen2.5-VL-7B (Baseline) | 1.87 | 0.51 | 0.14 | 11.98 |
Baseline + CP | 1.88 | 0.46 | 0.12 | 11.60 |
Baseline + CP + CoT | 1.76 | 0.58 | 0.18 | 16.66 |
Baseline + CP + CoT + ICL | 1.53 | 0.68 | 0.22 | 19.64 |
Performance comparison of prompting strategies. Arrows indicate whether higher (↑) or lower (↓) values are better. MAE: Mean Absolute Error, QWK: Quadratic Weighted Kappa, Acc: Accuracy, CP: Contextual Prompting, CoT: Chain-of-Thought, ICL: In-Context Learning.
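For reference, MAE and QWK can be computed as below. This is a minimal sketch using the standard definitions (QWK weights disagreements by squared distance between classes, normalized against chance agreement); the function names and the evaluation setup are illustrative, not taken from the paper's code.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error between integer labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic Weighted Kappa: 1 - (weighted observed / weighted expected)."""
    n = len(y_true)
    # Observed confusion matrix O[i][j]: true class i, predicted class j.
    O = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    # Marginal histograms for the chance-agreement (expected) matrix.
    hist_true = [sum(row) for row in O]
    hist_pred = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic penalty
            num += w * O[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    return 1.0 - num / den
```

Perfect agreement yields QWK = 1.0; chance-level agreement yields 0; values below 0 (as in some Qwen-VL-Max rows) indicate worse-than-chance ordinal agreement.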