AdaptVis: Spatial Understanding in Vision-Language Models Requires Adaptive Attention

1City University of Hong Kong 2Stanford University 3Northwestern University 4Hong Kong University of Science and Technology 5National University of Singapore 6Tel Aviv University 7Salesforce Research

Abstract

AdaptVis is a novel decoding method that enhances spatial reasoning in Vision-Language Models (VLMs).

Spatial reasoning remains a critical challenge for VLMs: even simple relations such as “under” or “behind” often prove difficult. Understanding why requires examining how image and text tokens interact inside the model. Our analysis shows that errors frequently arise when attention is directed to irrelevant regions of the image, and that VLMs exhibit different attention patterns for familiar versus unfamiliar spatial relationships. To address this, we propose AdaptVis, which uses an inference-time confidence score to adjust the attention distribution over image tokens. When the model is confident, AdaptVis sharpens attention on the relevant regions; when confidence is low, it smooths the attention to capture wider context.

AdaptVis achieves significant improvements, with up to a 50-point gain on spatial reasoning benchmarks like WhatsUp and VSR, all with negligible computational overhead.
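For context, the confidence signal can be read directly from the decoder's output distribution. The snippet below is a minimal, hypothetical sketch (not the released implementation) that scores a generation by the average probability the model assigns to the answer tokens it produced; a threshold on this score then decides whether to sharpen or smooth the attention, as described in the framework section.

import torch
import torch.nn.functional as F

def generation_confidence(answer_logits: torch.Tensor,
                          answer_ids: torch.Tensor) -> float:
    """Hypothetical confidence proxy: the mean probability the model
    assigns to the answer tokens it actually generated.

    answer_logits: (num_answer_tokens, vocab_size) logits at the answer positions
    answer_ids:    (num_answer_tokens,) ids of the generated answer tokens
    """
    probs = F.softmax(answer_logits, dim=-1)                       # per-step distributions
    picked = probs[torch.arange(answer_ids.numel()), answer_ids]   # prob of each chosen token
    return picked.mean().item()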

AdaptVis Framework

The framework of AdaptVis: we adaptively scale the temperature of the attention logits of the image tokens.

1) For generations with high confidence, we trust the attention pattern and sharpen the attention distribution.

2) For generations with low confidence, we smooth the attention distribution to broaden the context window so the model can settle on the correct objects (see the sketch below).
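A minimal sketch of the intervention itself, assuming access to a layer's pre-softmax attention logits and a mask marking which key positions are image tokens; the coefficient values and the threshold are illustrative placeholders, not the paper's tuned settings.

import torch

def adaptvis_attention(attn_logits: torch.Tensor,
                       image_token_mask: torch.Tensor,
                       confidence: float,
                       threshold: float = 0.6,       # illustrative confidence threshold
                       alpha_sharpen: float = 1.5,   # > 1: sharpen attention on image tokens
                       alpha_smooth: float = 0.5):   # < 1: flatten it to widen the focus
    """Scale the attention logits of image-token keys before the softmax.

    attn_logits:      (..., num_keys) pre-softmax attention scores
    image_token_mask: (num_keys,) True where the key is an image patch token
    """
    alpha = alpha_sharpen if confidence >= threshold else alpha_smooth
    scaled = attn_logits.clone()
    scaled[..., image_token_mask] = scaled[..., image_token_mask] * alpha
    return torch.softmax(scaled, dim=-1)

Multiplying the image-token logits by a coefficient above 1 makes the post-softmax distribution peakier (sharpening), while a coefficient below 1 flattens it (smoothing); a coefficient of 1 recovers ordinary greedy decoding.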


Examples: See how the attention pattern changes after our intervention

Examples of high-confidence samples


Attention scores over the image patches before and after our intervention at the 17th layer, for images from the real-image datasets COCO and VG. Here we apply the “sharpening” intervention to reinforce the original attention pattern: the highlighted areas remain largely consistent, and our method strengthens the existing focus rather than significantly altering it.
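Heat maps like these can be approximated by reading the head-averaged attention from a mid layer onto the image patch tokens. The sketch below uses Hugging Face-style model outputs; the layer index mirrors the caption, and the patch-token index range is an assumption about a LLaVA-style input layout, so both should be treated as placeholders.

import torch

@torch.no_grad()
def patch_attention_map(model, inputs, layer: int = 17,
                        image_token_range: tuple = (35, 611)):
    """Average attention from the final text token to each image patch token
    at one decoder layer (layer index and patch range are illustrative)."""
    outputs = model(**inputs, output_attentions=True)
    attn = outputs.attentions[layer]          # (batch, heads, seq, seq)
    start, end = image_token_range
    # attention of the last query position onto the image patch keys, averaged over heads
    scores = attn[0, :, -1, start:end].mean(dim=0)
    side = int(scores.numel() ** 0.5)         # e.g. a 24x24 patch grid
    return scores[: side * side].reshape(side, side)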



Examples of low-confidence samples


Attention scores over the image patches before and after our intervention at the 17th layer, for images from the synthetic datasets Controlled_A and Controlled_B. Here we apply the “smoothing” intervention to broaden the model’s focus area: the figure shows that the region the model attends to shifts substantially after our intervention.



Our experimental results

In the table below, the first row of each model block is the greedy-decoding baseline; the rows prefixed with + apply a decoding method on top of it (VCD, Dola, and our ScalingVis and AdaptVis).

Arrows (↑/↓) mark the performance gain or drop relative to greedy decoding.

The first six data columns are What's Up subsets; Exact Match and F1 Score are on VSR.

Model | Controlled_A Acc | Controlled_B Acc | COCO_one | COCO_two | VQ_one | VQ_two | Exact Match | F1 Score
LLaVA-1.5 | 60.3 | 73.1 | 53.0 | 58.2 | 35.9 | 40.8 | 62.4 | 51.3
+VCD | 61.5 ↑1.2 | 73.4 ↑0.3 | 53.3 ↑0.3 | 58.2 | 35.8 ↓0.1 | 42.5 ↑1.7 | 62.4 | 50.6 ↓0.7
+Dola | 61.2 ↑0.9 | 73.4 ↑0.3 | 53.7 ↑0.7 | 57.5 ↓0.7 | 36.2 ↑0.3 | 42.1 ↑1.3 | 62.8 ↑0.4 | 53.2 ↑1.9
+ScalingVis | 64.5 ↑4.2 | 75.2 ↑2.1 | 53.6 ↑0.6 | 59.4 ↑1.2 | 42.7 ↑6.8 | 48.1 ↑7.3 | 64.9 ↑2.5 | 62.5 ↑11.2
+AdaptVis | 84.9 ↑24.6 | 83.8 ↑10.7 | 53.6 ↑0.6 | 59.9 ↑1.7 | 42.7 ↑6.8 | 48.1 ↑7.3 | 65.0 ↑2.6 | 62.5 ↑11.2
LLaVA-1.6 | 48.2 | 63.0 | 59.7 | 41.8 | 31.6 | 7.3 | 58.8 | 29.4
+VCD | 61.8 ↑13.6 | 65.4 ↑2.4 | 60.6 ↑0.9 | 44.9 ↑3.1 | 33.8 ↑2.2 | 11.6 ↑4.3 | 58.8 | 29.4
+Dola | 48.2 | 62.7 ↓0.3 | 59.7 | 41.5 ↓0.3 | 31.5 ↓0.1 | 7.3 | 59.3 ↑0.5 | 31.2 ↑1.8
+ScalingVis | 97.0 ↑48.8 | 73.4 ↑10.4 | 63.1 ↑3.4 | 47.7 ↑5.9 | 38.2 ↑6.6 | 14.6 ↑7.3 | 59.1 ↑0.3 | 30.6 ↑1.2
+AdaptVis | 98.2 ↑50.0 | 73.4 ↑10.4 | 63.1 ↑3.4 | 47.7 ↑5.9 | 35.2 ↑3.6 | 17.2 ↑9.9 | 62.7 ↑3.9 | 39.3 ↑9.9

BibTeX

@misc{chen2025spatialreasoninghardvlms,
      title={Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas}, 
      author={Shiqi Chen and Tongyao Zhu and Ruochen Zhou and Jinghan Zhang and Siyang Gao and Juan Carlos Niebles and Mor Geva and Junxian He and Jiajun Wu and Manling Li},
      year={2025},
      eprint={2503.01773},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.01773}, 
}