arXiv:2511.19526

Perceptual Taxonomy: Evaluating Hierarchical Scene Reasoning in VLMs

A structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties to support goal-directed reasoning.

Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng, Tiezheng Zhang, Tao Wang, Wufei Ma, Siyi Chen, Yu-Cheng Chou, Prakhar Kaushik, Alan Yuille
28K+ Questions · 5,802 Images · 84 Attributes · 3,173 Object Classes
Taxonomy overview: scene context → object recognition → material · affordance · function · physical properties

We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties, such as material, affordance, and function, to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability.

To address this gap, we introduce PercepTax, a benchmark for physically grounded visual reasoning. We annotate 3,173 objects with four property families covering 84 fine-grained attributes. Based on these annotations, we construct a multiple-choice question benchmark with 5,802 images spanning both synthetic and real domains, including 28,033 template-based questions and 50 expert-crafted questions.

Experimental results reveal that leading VLMs perform well on recognition tasks but drop by 10–20% on property-driven questions, especially those requiring multi-step reasoning over structured attributes.

Four Pillars of Perceptual Understanding

🧱 Material
The object's visual and physical substance (e.g., paper, metal, wood), determining rigidity, weight, texture, and durability.
17 types annotated

⚖️ Physical Properties
Intrinsic physical characteristics governing behavior under external forces: rigidity, fragility, elasticity, mass, stability.
18 properties defined

🤲 Affordance
Actions or interactions the object enables based on shape and components: graspable, supportable, containable.
26 affordances mapped

🎯 Function
The object's intended or contextual role: seating furniture, storage container, display signage, and more.
23 functions categorized
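Concretely, an annotated object can be pictured as one record carrying all four property families. The sketch below is illustrative only: the class and field names (ObjectAnnotation and friends) are assumptions, not the released data format.

from dataclasses import dataclass, field

# Illustrative annotation record; names are assumptions, not the released format.
@dataclass
class ObjectAnnotation:
    object_class: str                                         # one of 3,173 object classes
    materials: list[str] = field(default_factory=list)        # drawn from 17 material types
    physical_props: list[str] = field(default_factory=list)   # drawn from 18 physical properties
    affordances: list[str] = field(default_factory=list)      # drawn from 26 affordances
    functions: list[str] = field(default_factory=list)        # drawn from 23 functions

notebook = ObjectAnnotation(
    object_class="notebook",
    materials=["paper"],
    physical_props=["rigid", "lightweight"],
    affordances=["graspable"],
    functions=["writing surface"],
)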

Four Question Types for Complete Evaluation

01 Object Description
02 Spatial Reasoning
03 Attribute Matching
04 Taxonomy Reasoning

From Recognition to Reasoning

Our benchmark systematically evaluates the full spectrum of perceptual taxonomy, from basic object identification to complex multi-step reasoning over physical properties. A sketch of how template-based questions of this kind can be generated appears after the list below.

  • Object Description: Match descriptions to color-coded bounding boxes based on material, properties, affordances, or function
  • Spatial Reasoning: Identify left/right, above/below, front/behind relations between objects
  • Attribute Matching: Associate specific physical or functional attributes with correct objects
  • Taxonomy Reasoning: Multi-step inference combining affordances, materials, and functional compatibility
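As a minimal sketch of template-based generation (the template wording, box format, and option layout here are hypothetical, not the benchmark's actual pipeline), a left/right spatial question could be produced like this:

import random

# Hypothetical template-based generation of a spatial-reasoning question
# from two bounding-box annotations; illustrative only.
def spatial_question(name_a, box_a, name_b, box_b):
    """Boxes are (x_min, y_min, x_max, y_max) in image coordinates."""
    center_a = (box_a[0] + box_a[2]) / 2
    center_b = (box_b[0] + box_b[2]) / 2
    answer = "left" if center_a < center_b else "right"
    options = ["left", "right"]
    random.shuffle(options)
    return {
        "question": f"Is the {name_a} to the left or the right of the {name_b}?",
        "options": options,
        "answer": answer,
    }

q = spatial_question("monitor", (40, 60, 220, 200), "bin", (300, 310, 380, 420))
print(q["question"], "->", q["answer"])  # -> left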

Taxonomy Reasoning in Action

🏢 Office Scene with Multiple Objects
(monitor, bin, notebook, vending machine)

Question

Which object can be used as a shield? Imagine a sudden impact toward you: which item in this scene could you hold to defend yourself?

Taxonomy Reasoning

A good improvised shield should be broad and flat enough to cover the torso or head, rigid and durable enough to withstand impact, and light enough to move quickly.

The notebook has a flat, rigid surface and is light enough to hold. The vending machine also has a large rigid surface but is far too heavy to move. Therefore, the notebook can be repurposed as a shield.

Answer: Notebook (Green Box)
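This chain of inference can be mimicked as a simple filter over annotated property tags. The sketch below uses hypothetical tag names and is not the paper's evaluation code:

# Illustrative sketch: select an improvised shield by filtering annotated
# properties, mirroring the reasoning chain above. Tags are assumptions.
scene = {
    "monitor":         {"broad_flat", "rigid", "fragile"},
    "bin":             {"graspable", "containable"},
    "notebook":        {"broad_flat", "rigid", "portable", "graspable"},
    "vending machine": {"broad_flat", "rigid", "heavy"},
}

required = {"broad_flat", "rigid"}   # must cover the body and withstand impact
forbidden = {"heavy", "fragile"}     # must be movable and durable

candidates = [
    name for name, props in scene.items()
    if required <= props and not (forbidden & props)
]
print(candidates)  # ['notebook']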

Revealing the Reasoning Gap

Model             | Type   | Spatial | Obj. Desc. | Attr. Match | Tax. Reason. | Overall
Gemini 2.5 Pro    | Closed | 73.32%  | 92.16%     | 81.01%      | 63.40%       | 77.97%
GPT-5             | Closed | 74.58%  | 88.63%     | 78.42%      | 59.18%       | 70.70%
Claude Sonnet 4.5 | Closed | 48.80%  | 65.21%     | 52.32%      | 40.26%       | 51.65%
Qwen3-VL-32B      | Open   | 74.30%  | 84.31%     | 74.19%      | 48.78%       | 64.07%
InternVL3.5-30B   | Open   | 72.16%  | 83.13%     | 67.12%      | 37.97%       | 56.15%
Qwen3-VL-8B       | Open   | 73.02%  | 85.10%     | 69.44%      | 45.26%       | 60.04%

Recognition vs. Reasoning Gap

Even GPT-5 drops from 88.63% on object description to 59.18% on taxonomy reasoning, a decline of more than 29 points that reveals weak property-level understanding.
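For reference, a gap like this can be computed from per-question results in a few lines; the record fields below are assumed, not the benchmark's released format:

from collections import defaultdict

# Hypothetical per-question records; "category" and "correct" fields are assumed.
results = [
    {"category": "object_description", "correct": True},
    {"category": "taxonomy_reasoning", "correct": False},
    {"category": "taxonomy_reasoning", "correct": True},
    # ... one record per benchmark question
]

totals, hits = defaultdict(int), defaultdict(int)
for r in results:
    totals[r["category"]] += 1
    hits[r["category"]] += r["correct"]

acc = {c: 100 * hits[c] / totals[c] for c in totals}
gap = acc["object_description"] - acc["taxonomy_reasoning"]
print({c: f"{a:.2f}%" for c, a in acc.items()}, f"gap = {gap:.2f} pts")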

Sim-to-Real Transfer

PercepTax ICL improves Gemini from 14 to 34 correctly answered expert-crafted questions (out of 50), demonstrating that hierarchical reasoning learned from simulation transfers to real-world scenarios.
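A minimal sketch of what such an in-context-learning setup could look like, assuming a plain-text prompt that prepends solved synthetic exemplars to the real-world query (the exemplar fields and wording are assumptions, not the paper's exact protocol):

# Illustrative ICL prompt assembly; format is an assumption, not the paper's protocol.
def build_icl_prompt(exemplars, query):
    parts = []
    for ex in exemplars:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    parts.append(f"Question: {query}\nReasoning:")
    return "\n".join(parts)

exemplars = [{
    "question": "Which object could serve as an improvised shield?",
    "reasoning": "A shield must be broad, flat, rigid, and light enough to hold.",
    "answer": "notebook",
}]
print(build_icl_prompt(exemplars, "Which object here could prop open a door?"))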

Open-Source Progress

Qwen3-VL-32B achieves 74.30% on spatial reasoning, surpassing Gemini 2.5 Pro (73.32%) and Claude Sonnet 4.5 (48.80%) and nearly matching GPT-5 (74.58%), showing strong open-source competitiveness.

Cite Our Work

BibTeX
@article{lee2025perceptax,
  title={Perceptual Taxonomy: Evaluating and Guiding 
         Hierarchical Scene Reasoning in Vision-Language Models},
  author={Lee, Jonathan and Wang, Xingrui and Peng, Jiawei and 
          Ye, Luoxin and Zheng, Zehan and Zhang, Tiezheng and 
          Wang, Tao and Ma, Wufei and Chen, Siyi and 
          Chou, Yu-Cheng and Kaushik, Prakhar and Yuille, Alan},
  journal={arXiv preprint arXiv:2511.19526},
  year={2025}
}