AdaptVis is a novel decoding method to enhance spatial reasoning in Vision Language Models (VLMs).
Spatial reasoning is a critical challenge for VLMs, with simple tasks like recognizing “under” or “behind” relationships often proving difficult. Understanding these challenges requires examining how image and text tokens interact within the model. Our analysis reveals that errors frequently arise from attention being directed towards irrelevant regions in the image. Furthermore, VLMs exhibit different attention patterns when handling familiar versus unfamiliar spatial relationships. To address this, we propose AdaptVis, which leverages inference-time confidence scores to refine attention distribution. When the model is confident, AdaptVis sharpens attention on relevant regions; when confidence is low, it broadens the attention to capture wider context.
AdaptVis achieves significant improvements, with up to a 50-point gain on spatial reasoning benchmarks like WhatsUp and VSR, all with negligible computational overhead.
The framework of AdaptVis. We adaptively adjust the temperature of the attention logits over the image tokens (see the sketch below):
1) For generations with high confidence, we trust the attention pattern and sharpen the attention distribution.
2) For generations with low confidence, we smooth the attention distribution to widen the context window so that attention can settle on the correct objects.
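As a concrete illustration, here is a minimal PyTorch sketch of this kind of confidence-conditioned scaling of attention logits over image tokens. The function and parameter names (`scale_image_attention`, `alpha_sharpen`, `alpha_smooth`, `confidence_threshold`) and their default values are placeholders, not the released AdaptVis API.

```python
import torch

def scale_image_attention(attn_logits: torch.Tensor,
                          image_token_mask: torch.Tensor,
                          confidence: float,
                          confidence_threshold: float = 0.6,
                          alpha_sharpen: float = 1.5,
                          alpha_smooth: float = 0.5) -> torch.Tensor:
    """Rescale pre-softmax attention scores over image tokens, then renormalize.

    attn_logits:      [batch, heads, query_len, key_len] pre-softmax scores
    image_token_mask: [key_len] boolean mask marking image-patch positions
    confidence:       generation confidence, e.g. probability of the answer token
    """
    # alpha > 1 sharpens the resulting distribution; alpha < 1 flattens it.
    alpha = alpha_sharpen if confidence >= confidence_threshold else alpha_smooth
    scaled = attn_logits.clone()
    scaled[..., image_token_mask] = attn_logits[..., image_token_mask] * alpha
    return torch.softmax(scaled, dim=-1)

# Example: one decoding step with 32 heads and 650 key positions,
# of which positions 40..615 are (hypothetically) the 576 image patches.
logits = torch.randn(1, 32, 1, 650)
image_mask = torch.zeros(650, dtype=torch.bool)
image_mask[40:616] = True
attn_weights = scale_image_attention(logits, image_mask, confidence=0.9)  # sharpened
```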
Attention scores on the image patches before and after our intervention at layer 17, for images from the real-image datasets COCO and VG. We apply a “sharpening” intervention to reinforce the original attention pattern: the highlighted areas remain largely consistent, and our method strengthens the existing focus rather than significantly altering it.
Attention scores on the image patches before and after our intervention at layer 17, for images from the synthetic datasets Controlled_A and Controlled_B. We apply a “smoothing” intervention to widen the model’s focused area; the figure shows that the model’s focus shifts substantially after the intervention.
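To reproduce attention maps like the ones in these figures, one can request per-layer attentions from a HuggingFace LLaVA checkpoint and keep only the scores assigned to image-patch positions. The sketch below is an assumption-laden example, not code from this repository: the prompt and image path are placeholders, and it assumes a `transformers` version whose processor expands `<image>` into one placeholder token per patch so that attention indices line up with `input_ids`.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
# Eager attention so that attention weights are actually returned.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", attn_implementation="eager"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
prompt = "USER: <image>\nIs the mug under the table? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

layer = 17  # the layer visualized in the figures above
# out.attentions[layer]: [batch, heads, seq_len, seq_len]; average over heads
# and take the last text token's attention to every key position.
attn = out.attentions[layer].mean(dim=1)[0, -1]

# Keep only the positions that correspond to image patches.
image_positions = inputs["input_ids"][0] == model.config.image_token_index
patch_attn = attn[image_positions]                    # 576 scores for LLaVA-1.5
patch_map = patch_attn.reshape(24, 24).float().cpu()  # 24x24 patch grid
```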
Blue rows show the baselines (greedy decoding); yellow rows show ScalingVis and pink rows show AdaptVis.
Green and red arrows mark the performance gain or drop relative to greedy decoding.
In the table below, Controlled_A through VG_two are What's Up subsets; Exact Match and F1 Score are measured on VSR.

| Model | Controlled_A Acc | Controlled_B Acc | COCO_one | COCO_two | VG_one | VG_two | VSR Exact Match | VSR F1 Score |
|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | 60.3 | 73.1 | 53.0 | 58.2 | 35.9 | 40.8 | 62.4 | 51.3 |
| +VCD | 61.5 ↑1.2 | 73.4 ↑0.3 | 53.3 ↑0.3 | 58.2 | 35.8 ↓0.1 | 42.5 ↑1.7 | 62.4 | 50.6 ↓0.7 |
| +DoLa | 61.2 ↑0.9 | 73.4 ↑0.3 | 53.7 ↑0.7 | 57.5 ↓0.7 | 36.2 ↑0.3 | 42.1 ↑1.3 | 62.8 ↑0.4 | 53.2 ↑1.9 |
| +ScalingVis | 64.5 ↑4.2 | 75.2 ↑2.1 | 53.6 ↑0.6 | 59.4 ↑1.2 | 42.7 ↑6.8 | 48.1 ↑7.3 | 64.9 ↑2.5 | 62.5 ↑11.2 |
| +AdaptVis | 84.9 ↑24.6 | 83.8 ↑10.7 | 53.6 ↑0.6 | 59.9 ↑1.7 | 42.7 ↑6.8 | 48.1 ↑7.3 | 65.0 ↑2.6 | 62.5 ↑11.2 |
| LLaVA-1.6 | 48.2 | 63.0 | 59.7 | 41.8 | 31.6 | 7.3 | 58.8 | 29.4 |
| +VCD | 61.8 ↑13.6 | 65.4 ↑2.4 | 60.6 ↑0.9 | 44.9 ↑3.1 | 33.8 ↑2.2 | 11.6 ↑4.3 | 58.8 | 29.4 |
| +DoLa | 48.2 | 62.7 ↓0.3 | 59.7 | 41.5 ↓0.3 | 31.5 ↓0.1 | 7.3 | 59.3 ↑0.5 | 31.2 ↑1.8 |
| +ScalingVis | 97.0 ↑48.8 | 73.4 ↑10.4 | 63.1 ↑3.4 | 47.7 ↑5.9 | 38.2 ↑6.6 | 14.6 ↑7.3 | 59.1 ↑0.3 | 30.6 ↑1.2 |
| +AdaptVis | 98.2 ↑50.0 | 73.4 ↑10.4 | 63.1 ↑3.4 | 47.7 ↑5.9 | 35.2 ↑3.6 | 17.2 ↑9.9 | 62.7 ↑3.9 | 39.3 ↑9.9 |
```bibtex
@misc{chen2025spatialreasoninghardvlms,
  title={Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas},
  author={Shiqi Chen and Tongyao Zhu and Ruochen Zhou and Jinghan Zhang and Siyang Gao and Juan Carlos Niebles and Mor Geva and Junxian He and Jiajun Wu and Manling Li},
  year={2025},
  eprint={2503.01773},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.01773},
}
```