AdaptVis is a novel decoding method to enhance spatial reasoning in Vision Language Models (VLMs).
Spatial reasoning is a critical challenge for VLMs, with simple tasks like recognizing “under” or “behind” relationships often proving difficult. Understanding these challenges requires examining how image and text tokens interact within the model. Our analysis reveals that errors frequently arise from attention being directed towards irrelevant regions in the image. Furthermore, VLMs exhibit different attention patterns when handling familiar versus unfamiliar spatial relationships. To address this, we propose AdaptVis, which leverages inference-time confidence scores to refine attention distribution. When the model is confident, AdaptVis sharpens attention on relevant regions; when confidence is low, it broadens the attention to capture wider context.
AdaptVis achieves significant improvements, with up to a 50-point gain on spatial reasoning benchmarks like WhatsUp and VSR, all with negligible computational overhead.
The framework of AdaptVis. We adaptively adjust the temperature of the attention logits over the image tokens (see the sketch below):
1) For generations with high confidence, we trust the attention pattern and sharpen the attention distribution.
2) For generations with low confidence, we smooth the attention distribution to widen the context window so that attention can settle on the correct objects.
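As a concrete illustration, here is a minimal PyTorch sketch of this kind of confidence-conditioned scaling of attention logits over image tokens. The function and parameter names (`scale_image_attention`, `alpha_sharpen`, `alpha_smooth`, `confidence_threshold`) and their default values are placeholders, not the released AdaptVis API.

```python
import torch

def scale_image_attention(attn_logits: torch.Tensor,
                          image_token_mask: torch.Tensor,
                          confidence: float,
                          confidence_threshold: float = 0.6,
                          alpha_sharpen: float = 1.5,
                          alpha_smooth: float = 0.5) -> torch.Tensor:
    """Rescale pre-softmax attention scores over image tokens, then renormalize.

    attn_logits:      [batch, heads, query_len, key_len] pre-softmax scores
    image_token_mask: [key_len] boolean mask marking image-patch positions
    confidence:       generation confidence, e.g. probability of the answer token
    """
    # alpha > 1 sharpens the resulting distribution; alpha < 1 flattens it.
    alpha = alpha_sharpen if confidence >= confidence_threshold else alpha_smooth
    scaled = attn_logits.clone()
    scaled[..., image_token_mask] = attn_logits[..., image_token_mask] * alpha
    return torch.softmax(scaled, dim=-1)

# Example: one decoding step with 32 heads and 650 key positions,
# of which positions 40..615 are (hypothetically) the 576 image patches.
logits = torch.randn(1, 32, 1, 650)
image_mask = torch.zeros(650, dtype=torch.bool)
image_mask[40:616] = True
attn_weights = scale_image_attention(logits, image_mask, confidence=0.9)  # sharpened
```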
Attention scores on the image patches before and after our intervention at layer 17, for images from the real-image datasets COCO and VG. We apply a “sharpening” intervention to reinforce the original attention pattern: the highlighted areas remain largely consistent, and our method strengthens the existing focus rather than significantly altering it.
Attention scores on the image patches before and after our intervention at layer 17, for images from the synthetic datasets Controlled_A and Controlled_B. We apply a “smoothing” intervention to widen the model’s focused area; the figure shows that the model’s focus shifts substantially after the intervention.
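To reproduce attention maps like the ones in these figures, one can request per-layer attentions from a HuggingFace LLaVA checkpoint and keep only the scores assigned to image-patch positions. The sketch below is an assumption-laden example, not code from this repository: the prompt and image path are placeholders, and it assumes a `transformers` version whose processor expands `<image>` into one placeholder token per patch so that attention indices line up with `input_ids`.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
# Eager attention so that attention weights are actually returned.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="eager"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
prompt = "USER: <image>\nIs the mug under the table? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

layer = 17  # the layer visualized in the figures above
# out.attentions[layer]: [batch, heads, seq_len, seq_len]; average over heads
# and take the last text token's attention to every key position.
attn = out.attentions[layer].mean(dim=1)[0, -1]

# Keep only the positions that correspond to image patches.
image_positions = inputs["input_ids"][0] == model.config.image_token_index
patch_attn = attn[image_positions]                    # 576 scores for LLaVA-1.5
patch_map = patch_attn.reshape(24, 24).float().cpu()  # 24x24 patch grid
```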
Blue rows show the baselines (greedy decoding); yellow rows show ScalingVis and pink rows show AdaptVis.
Green and red arrows mark the performance gain or drop relative to greedy decoding.
In the table below, Controlled_A through VG_two are What's Up subsets; Exact Match and F1 Score are measured on VSR.

| Model | Controlled_A Acc | Controlled_B Acc | COCO_one | COCO_two | VG_one | VG_two | VSR Exact Match | VSR F1 Score |
|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 60.3 | 73.1 | 53.0 | 58.2 | 35.9 | 40.8 | 62.4 | 51.3 |
| +VCD | 61.5 ↑1.2 | 73.4 ↑0.3 | 53.3 ↑0.3 | 58.2 | 35.8 ↓0.1 | 42.5 ↑1.7 | 62.4 | 50.6 ↓0.7 |
| +DoLa | 61.2 ↑0.9 | 73.4 ↑0.3 | 53.7 ↑0.7 | 57.5 ↓0.7 | 36.2 ↑0.3 | 42.1 ↑1.3 | 62.8 ↑0.4 | 53.2 ↑1.9 |
| +ScalingVis | 64.5 ↑4.2 | 75.2 ↑2.1 | 53.6 ↑0.6 | 59.4 ↑1.2 | 42.7 ↑6.8 | 48.1 ↑7.3 | 64.9 ↑2.5 | 62.5 ↑11.2 |
| +AdaptVis | 84.9 ↑24.6 | 83.8 ↑10.7 | 53.6 ↑0.6 | 59.9 ↑1.7 | 42.7 ↑6.8 | 48.1 ↑7.3 | 65.0 ↑2.6 | 62.5 ↑11.2 |
| LLaVA-1.6 | 48.2 | 63.0 | 59.7 | 41.8 | 31.6 | 7.3 | 58.8 | 29.4 |
| +VCD | 61.8 ↑13.6 | 65.4 ↑2.4 | 60.6 ↑0.9 | 44.9 ↑3.1 | 33.8 ↑2.2 | 11.6 ↑4.3 | 58.8 | 29.4 |
| +DoLa | 48.2 | 62.7 ↓0.3 | 59.7 | 41.5 ↓0.3 | 31.5 ↓0.1 | 7.3 | 59.3 ↑0.5 | 31.2 ↑1.8 |
| +ScalingVis | 97.0 ↑48.8 | 73.4 ↑10.4 | 63.1 ↑3.4 | 47.7 ↑5.9 | 38.2 ↑6.6 | 14.6 ↑7.3 | 59.1 ↑0.3 | 30.6 ↑1.2 |
| +AdaptVis | 98.2 ↑50.0 | 73.4 ↑10.4 | 63.1 ↑3.4 | 47.7 ↑5.9 | 35.2 ↑3.6 | 17.2 ↑9.9 | 62.7 ↑3.9 | 39.3 ↑9.9 |
```bibtex
@misc{chen2025spatialreasoninghardvlms,
  title={Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas},
  author={Shiqi Chen and Tongyao Zhu and Ruochen Zhou and Jinghan Zhang and Siyang Gao and Juan Carlos Niebles and Mor Geva and Junxian He and Jiajun Wu and Manling Li},
  year={2025},
  eprint={2503.01773},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.01773},
}
```