A structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties to support goal-directed reasoning.
Abstract
We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties, such as material, affordance, and function, to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability.
To address this gap, we introduce PercepTax, a benchmark for physically grounded visual reasoning. We annotate 3,173 objects with four property families covering 84 fine-grained attributes. Based on these annotations, we construct a multiple-choice question benchmark with 5,802 images spanning both synthetic and real domains, including 28,033 template-based questions and 50 expert-crafted questions.
Experimental results reveal that leading VLMs perform well on recognition tasks but drop by 10-20% on property-driven questions, especially those requiring multi-step reasoning over structured attributes.
Property Taxonomy
- Material (17 types annotated): The object's visual and physical substance (e.g., paper, metal, wood), determining rigidity, weight, texture, and durability.
- Physical Properties (18 properties defined): Intrinsic physical characteristics governing behavior under external forces: rigidity, fragility, elasticity, mass, stability.
- Affordance (26 affordances mapped): Actions or interactions the object enables based on shape and components: graspable, supportable, containable.
- Function (23 functions categorized): The object's intended or contextual role: seating furniture, storage container, display signage, and more.
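Taken together, an object's annotation combines all four families. Below is a minimal Python sketch of how such an annotation could be organized; the class and field names are our own illustrative assumptions, not the released annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch only: this class and its field names are assumptions
# made for exposition, not the released PercepTax annotation schema.
@dataclass
class ObjectAnnotation:
    object_id: str
    category: str                                           # e.g., "notebook"
    material: str                                           # one of the 17 material types, e.g., "paper"
    physical: Dict[str, str] = field(default_factory=dict)  # e.g., {"rigidity": "semi-rigid", "mass": "light"}
    affordances: List[str] = field(default_factory=list)    # subset of the 26 affordances
    functions: List[str] = field(default_factory=list)      # subset of the 23 functions

# A plausible annotation for a notebook lying on a desk.
notebook = ObjectAnnotation(
    object_id="scene_0042_obj_07",
    category="notebook",
    material="paper",
    physical={"rigidity": "semi-rigid", "fragility": "low", "mass": "light"},
    affordances=["graspable", "supportable"],
    functions=["writing surface"],
)
```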
Benchmark Design
Our benchmark systematically evaluates the full spectrum of perceptual taxonomy, from basic object identification to complex multi-step reasoning over physical properties.
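As a concrete illustration, a single template-based item might look like the following; the keys, file path, and category labels are assumptions chosen to mirror the evaluation categories reported below, not the benchmark's actual file format.

```python
# Hypothetical representation of one multiple-choice item; the keys and
# labels are illustrative assumptions, not the benchmark's released format.
question_item = {
    "image": "scenes/real/office_0132.jpg",
    "category": "taxonomy_reasoning",  # also: spatial, object_description, attribute_matching
    "question": "Which object in the scene could be repurposed as a shield?",
    "choices": ["notebook", "vending machine", "coffee mug", "poster"],
    "answer": "notebook",
    "source": "template",              # "template" (28,033 items) or "expert" (50 items)
}
```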
Example
Question
Which object can be used as a shield? Imagine a sudden impact toward you; what item in this scene could you hold to defend yourself?
Taxonomy Reasoning
A good improvised shield should be broad and flat enough to cover the torso or head, rigid and durable enough to withstand impact, and light enough to move quickly.
The notebook has a flat, rigid surface and is portable. The vending machine has a large rigid surface but is far too heavy to move. Therefore, the notebook could be repurposed as a shield.
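The same reasoning can be read as a filter over property annotations: check surface shape, rigidity, and mass before answering. The toy sketch below shows this pattern; the field names, values, and thresholds are illustrative assumptions, not code from the benchmark.

```python
# Toy filter mirroring the reasoning above. Property names and thresholds
# are illustrative, not taken from the benchmark's annotations.
scene = [
    {"name": "notebook", "rigidity": "semi-rigid", "mass": "light", "flat_surface": True},
    {"name": "vending machine", "rigidity": "rigid", "mass": "very heavy", "flat_surface": True},
]

def shield_candidates(objects):
    """Keep objects that are flat, rigid enough to block an impact,
    and light enough to pick up and move quickly."""
    return [
        o["name"]
        for o in objects
        if o["flat_surface"]
        and o["rigidity"] in {"rigid", "semi-rigid"}
        and o["mass"] in {"light", "medium"}
    ]

print(shield_candidates(scene))  # ['notebook']
```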
Experimental Results
| Model | Type | Spatial Reasoning | Object Description | Attribute Matching | Taxonomy Reasoning | Overall |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | Closed | 73.32% | 92.16% | 81.01% | 63.40% | 77.97% |
| GPT-5 | Closed | 74.58% | 88.63% | 78.42% | 59.18% | 70.70% |
| Claude Sonnet 4.5 | Closed | 48.80% | 65.21% | 52.32% | 40.26% | 51.65% |
| Qwen3-VL-32B | Open | 74.30% | 84.31% | 74.19% | 48.78% | 64.07% |
| InternVL3.5-30B | Open | 72.16% | 83.13% | 67.12% | 37.97% | 56.15% |
| Qwen3-VL-8B | Open | 73.02% | 85.10% | 69.44% | 45.26% | 60.04% |
Even GPT-5 drops from 88.63% on object description to 59.18% on taxonomy reasoning—a 29+ point decline revealing weak property-level understanding.
In-context learning with PercepTax exemplars (PercepTax ICL) improves Gemini from 14 to 34 correct answers on the 50 expert-crafted questions, demonstrating that hierarchical reasoning learned from simulation transfers to real-world scenarios.
Qwen3-VL-32B achieves 74.30% on spatial reasoning, nearly matching GPT-5 and surpassing the other closed-source models, showing strong open-source competitiveness.
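The PercepTax ICL setup mentioned above can be approximated by prepending a few worked taxonomy-reasoning exemplars to each query. The sketch below shows one way to assemble such a prompt; it is an illustrative assumption, not the exact prompt or API used in the experiments.

```python
# Generic sketch of the in-context setup: worked taxonomy-reasoning exemplars
# (drawn from synthetic PercepTax scenes, each paired with its image in
# practice) are prepended before the real-world expert question. The text and
# structure are illustrative assumptions, not the paper's exact prompt.
exemplars = [
    {
        "question": "Which object could serve as an improvised shield?",
        "reasoning": "A shield needs a broad flat surface, enough rigidity to "
                     "withstand impact, and low enough mass to hold. The notebook "
                     "satisfies all three; the vending machine is far too heavy.",
        "answer": "notebook",
    },
]

def build_icl_prompt(exemplars, query):
    parts = [
        f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nA: {ex['answer']}"
        for ex in exemplars
    ]
    parts.append(f"Q: {query}\nReasoning:")
    return "\n\n".join(parts)

print(build_icl_prompt(exemplars, "Which object here could prop a door open?"))
```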
Citation
@article{lee2025perceptax,
title={Perceptual Taxonomy: Evaluating and Guiding
Hierarchical Scene Reasoning in Vision-Language Models},
author={Lee, Jonathan and Wang, Xingrui and Peng, Jiawei and
Ye, Luoxin and Zheng, Zehan and Zhang, Tiezheng and
Wang, Tao and Ma, Wufei and Chen, Siyi and
Chou, Yu-Cheng and Kaushik, Prakhar and Yuille, Alan},
journal={arXiv preprint arXiv:2511.19526},
year={2025}
}