SceneTeract

Agentic Functional Affordances and VLM Grounding in 3D Scenes

Léopold Maillard^1,2,†,*, Francis Engelmann^3,4, Tom Durand², Boxiao Pan³, Yang You³, Or Litany^5,6, Leonidas Guibas³, Maks Ovsjanikov¹

¹École Polytechnique ²Dassault Systèmes ³Stanford University ⁴USI Lugano ⁵Technion ⁶NVIDIA

arXiv 2026
^†Work done during a research visit to Stanford University
^*Corresponding author: leopold.maillard@polytechnique.edu

arXiv Paper Code Dataset

SceneTeract Overview

SceneTeract is a verification engine that, given a 3D scene, an embodied agent profile, and a target activity, decomposes the task into atomic actions and validates each step with explicit geometric and physical checks, producing finegrained and actionable feasibility diagnostics.

1. Agent-Centric Input

Affordance is not an intrinsic property of a scene, but a relational one. Along with the 3D scene representation and an activity description, SceneTeract takes as input an embodied Agent Profile encompassing distinct physical capabilities like mobility factors and reach distances.

2. Activity Decomposition and Planning

Complex indoor activities can be decomposed into sequences of simple, atomic agent-object interactions. Leveraging the semantic reasoning capabilities of VLMs, we plan these activities using a fixed, closed-vocabulary library of atomic actions.

3. Geometric Grounding

The success of any atomic action depends on satisfying a finite set of physical and spatial constraints. We verify these properties (e.g., navigation, reachability, clearance) using explicit 3D geometric tools, ensuring that agent-specific limitations are respected.

4. Unified Deployment across Evaluation and Training

We apply SceneTeract as a single verification engine for scene auditing, VLM functional judgment benchmarking, and reinforcement-learning reward supervision, showing that its granular reports support both failure diagnosis and downstream model improvement.

Abstract

Embodied AI depends on interactive 3D environments that support meaningful activities for diverse users, yet assessing their functional affordances remains a core challenge. We introduce SceneTeract, a framework that verifies 3D scene functionality under agent-specific constraints. Our core contribution is a grounded verification engine that couples high-level semantic reasoning with low-level geometric checks. SceneTeract decomposes complex activities into sequences of atomic actions and validates each step against accessibility requirements (e.g., reachability, clearance, and navigability) conditioned on an embodied agent profile, using explicit physical and geometric simulations. We deploy SceneTeract to perform an in-depth evaluation of (i) synthetic indoor environments, uncovering frequent functional failures that prevent basic interactions, and (ii) the ability of frontier Vision-Language Models (VLMs) to reason about and predict functional affordances, revealing systematic mismatches between semantic confidence and physical feasibility even for the strongest current models. Finally, we leverage SceneTeract as a reward engine for VLM post-training, enabling scalable distillation of geometric constraints into reasoning models. We release the SceneTeract verification suite and data to bridge perception and physical reality in embodied 3D scene understanding.

Method

Step 1

Input Space

Defining the agent properties, multi-object scene, and high-level task.

Step 2

Semantic Planner

VLM-driven decomposition into a sequence of atomic actions.

Step 3

Geometric Grounding

Validating actions against physical and spatial properties.

Step 4

Diagnostic Report

Pinpointing explicit, actionable failure points within the plan.

Input Triplet \( (\mathcal{A}, \mathcal{S}, \mathcal{T}) \)

The verification context is defined as a scene-task-agent input triplet.

Agent Profile \( \mathcal{A} \)

Agent parameterization defining its navigation and manipulation properties.

Multi-object 3D Scene \( \mathcal{S} \)

Rendering \( \Pi_{\text{vis}}(\cdot) \)

Structured input \( \Pi_{\text{txt}}(\cdot) \)

{
  "scene_id": "LivingRoom-1097",
  "objects": [
    {
      "id": 0,
      "category": "multi_seat_sofa",
      "position": [0.74, 0.53, 0.08]
    }, ...
  ]
}

Task \( \mathcal{T} \)

An open-ended, complex indoor activity in natural language.

"Grab something to eat in the cabinet and relax watching TV."

From this input triplet, a VLM plans the task using a predefined set of primitive actions.

Semantic Planning \( \Phi \)

A VLM translates open-ended tasks into executable sequences.

> System: You are an expert in task planning for embodied agents in 3D environments. Your task is to decompose the user's high-level activity into a sequence of atomic actions from the provided library.

Atomic Action Library \( \mathbb{A} \)

Each atomic action is paired with a target scene object to form an interaction tuple \( (a, o) \). They are categorized into four functional families.

Mobility \( \mathbb{A}_{\text{m}} \)

NavigateTo SitOn LieOn

Contact \( \mathbb{A}_{\text{c}} \)

Toggle PickUpFrom ReleaseOn

Handling \( \mathbb{A}_{\text{h}} \)

Open Close PutIn TakeOutOf

Perception \( \mathbb{A}_{\text{p}} \)

LookAt

Generated Plan \( \pi \) A VLM \( \Phi \) decomposes the activity from the multimodal input context, yielding a multi-step action plan.

Each action is then mapped to explicit physical properties to verify geometric feasibility.

Geometric Grounding \( \Psi \)

Mapping semantic actions to explicit physical, agent-aware verifications.

Based on their family membership, actions are mapped to a sequence of geometric checks that are validated against the 3D scene and the agent's physical constraints.

Action Family	\( \mathcal{P}_{\text{nav}} \)	\( \mathcal{P}_{\text{reach}} \)	\( \mathcal{P}_{\text{inter}} \)	\( \mathcal{P}_{\text{clear}} \)	\( \mathcal{P}_{\text{vis}} \)
Mobility \( \mathbb{A}_{\text{m}} \)	1	—	—	—	—
Contact \( \mathbb{A}_{\text{c}} \)	1	2	—	—	—
Handling \( \mathbb{A}_{\text{h}} \)	1	2	3	4	—
Perception \( \mathbb{A}_{\text{p}} \)	—	—	—	—	1

Select a geometric property column header to view its implementation details.

\(\mathcal{P}_{\text{nav}}\) : `isNavigableTo`

Agent-Specific Navigation Map: First, the 3D scene geometry is projected onto a 2D occupancy grid. This grid is then morphologically eroded based on the agent's specific clearance width \( w_{\text{clear}} \), segmenting the scene into distinct navigation regions.

Zone Resolution & Connectivity: A VLM resolves the correct semantic interaction zone(s) around the target object (e.g. the front of the couch). The system then verifies 2D path connectivity from the agent to this zone.

Largest connected area

Isolated connected area

Agent collision area

(SitOn, couch)

Candidate zones

Resolved navigable area

Agent collision area

\(\mathcal{P}_{\text{reach}}\) : `isReachable`

Computes the minimum Euclidean distance from the agent's connected navigable floor region to the target mesh, vertically translated by \( h_{\text{arm}} \), and validated against the maximum reach radius \( r_{\text{arm}} \).

(PickUpFrom, coffee_table)

Object reach area

Out-of-reach area

Inaccessible area

\(\mathcal{P}_{\text{inter}}\) : `isInteractable`

Functional Part Identification: A VLM (Molmo) is prompted with orthographic views to identify interaction points on the functional part of the object (e.g. handles) given the intended action. These points serve as prompts for SAM to extract a precise 2D mask.

3D Volume & Proximity: Using depth maps, masked pixels are deprojected into world-space coordinates across all views to form a unified 3D point cloud representing the interactable volume. We finally verify if this specific volume is within reach of the agent.

(Open, tv_stand)

Left to right: Render view, predicted VLM interaction point, extracted SAM mask.

(Open, tv_stand)

Interactable volume

\(\mathcal{P}_{\text{clear}}\) : `hasClearance`

Verifies if there is sufficient kinematic space in front of the target object to allow for articulation (e.g., door swing). A 3D interaction volume \(\mathcal{V}_{\text{clear}}\) is generated facing the interaction side and tested for collisions against the scene geometry.

(Open, tv_stand)

Collision-free clearance box

Colliding clearance box

\(\mathcal{P}_{\text{vis}}\) : `isVisible`

Determines if the target object is in the agent's line of sight, accounting for occlusion. \(\mathcal{P}_{\text{vis}}\) casts multiple rays from the agent's posture-adjusted eye position \( e_y \) to keypoints on the target's bounding box and computes a visibility ratio.

(LookAt, tv_stand)

Collision-free agent-to-object rays

Colliding agent-to-object rays

Verification results are finally aggregated into an actionable, fine-grained diagnostic report.

Diagnostic Report \( \mathcal{R} \)

SceneTeract provides grounded, granular, and actionable insights about the validity of the plan.

Overall Feasibility: FAILED

Step 1 (NavigateTo, cabinet_1) PASS

\(\mathcal{P}_{\text{nav}}\)

A collision-free path was found to an interaction zone.

Step 2 (Open, cabinet_1) FAIL

\(\mathcal{P}_{\text{nav}}\)

A collision-free path was found to an interaction zone.

\(\mathcal{P}_{\text{reach}}\)

Object is reachable. Required distance: 0.35m, Agent's reach: 0.40m.

\(\mathcal{P}_{\text{inter}}\)

Interactable volume is unreachable. Required distance: 0.52m, Agent's reach: 0.40m.

\(\mathcal{P}_{\text{clear}}\)

Found 1 collision-free interaction zones.

Step 3 (NavigateTo, sofa_1) PASS

\(\mathcal{P}_{\text{nav}}\)

A collision-free path was found to an interaction zone.

Step 4 (SitOn, sofa_1) PASS

\(\mathcal{P}_{\text{nav}}\)

A collision-free path was found to an interaction zone.

Actionable Insight

The plan fails because the Child agent cannot open the cabinet_1. Although the cabinet is physically reachable, its functional handle is located at a distance of 0.52m, which physically exceeds the Child's maximum reach limit of 0.40m.

Downstream Applications

3D Scene Auditing

Identify architectural failure modes preventing basic accessibility across different user profiles.

VLM Benchmarking

Quantify the gap between modern VLMs' semantic confidence and physical reasoning abilities.

GRPO Post-Training

Use binary checks as a scalable reward signal to distill geometry constraints into language models.

Experiments

We deploy the SceneTeract verifier across three distinct applications to highlight the functional gaps in modern 3D synthetic scenes and VLMs, and demonstrate how geometric grounding can improve reasoning models.

3D Scene Auditing
VLM Benchmarking
GRPO Post-Training

Auditing Synthetic Environments

We benchmark the readiness of modern 3D environments for embodied interaction by evaluating complex activities across 3,396 scene-task-agent configurations from the 3D-FRONT dataset. As shown below, synthetic environments exhibit broad functional feasibility gaps across the three user profiles, indicating that visually plausible arrangements often fail to support basic everyday actions, especially for mobility-constrained agents where restrictive spatial layouts act as a primary bottleneck.

Overall & Action Success Rates

Metric	Adult	Child	Wheelchair User
Task Success	59.0%	66.0%	42.5%
is_Navigable_To	84.7%	89.2%	73.0%
is_Reachable	91.9%	93.7%	80.7%
is_Interactable	75.6%	73.4%	64.1%
is_Visible	98.7%	97.8%	97.7%
has_Clearance	67.3%	66.3%	66.6%

Atomic Action Success Rates

Quantifying VLM Spatial Reasoning

We evaluate frontier and open-weight VLMs on their ability to correctly predict physical affordances. We compare native task evaluation (Direct) against our granular, step-by-step action-level verification (Decomposed). Click on any metric header below to see its detailed description.

Select a metric column header to view its exact definition and role.

Action Accuracy

Measures a model's spatial perception and judgment at the most granular, atomic level. It asks: can the model correctly determine if a single, isolated action (like opening a specific drawer) is feasible given the agent's profile?

Task Accuracy

Measures the capacity to evaluate complete, multi-step human activities. In the decomposed setting, task success is defined only if the model predicts all constituent actions in the sequence to be feasible.

False Positive Rate (FP)

Quantifies physical hallucinations: how frequently a model predicts an impossible task is possible. A lower FP rate indicates a model is better at recognizing geometric and physical bottlenecks (like obstructed paths or out-of-reach objects).

Matthews Correlation Coefficient (MCC)

Because our dataset contain natural class imbalances, MCC provides a robust, unified measure of classification quality. A higher score signifies better overall discrimination.

Inclusivity Gap (InGap)

The maximum difference in task accuracy across the three agent profiles (Adult, Child, and Wheelchair user). A lower InGap indicates a fairer model that successfully internalizes the unique physical constraints of specific, diverse embodiments.

Horizon Stability Index (HSI)

Measures resilience to compounding task complexity. Defined as the ratio of task accuracy on long-horizon tasks (3+ action steps) versus short-horizon tasks (1–2 steps). A score closer to 100% indicates strong invariance to task length.

Consistency (Cons.)

The percentage of tasks where a model's direct holistic prediction logically matches its decomposed step-by-step conclusion. This reveals a model's native capacity to break down complex activities internally without contradicting itself upon closer inspection.

Hover over the highlighted colored cells to reveal key benchmark insights.

Model Configuration		Action	Task			Reliability
Model Configuration		Acc ↑	Acc ↑	FP ↓	MCC ↑	InGap ↓	HSI ↑	Cons. ↑
Gemini-3-Flash-Preview	Direct	—	62.1	35.7	0.144	16.5	79.2	58.5
Gemini-3-Flash-Preview	Decomposed	77.1	69.9	12.0	0.390	7.5	94.7	58.5
Gemini-3.1-Pro-Preview	Direct	—	61.7	22.4	0.182	4.5	89.3	53.3
Gemini-3.1-Pro-Preview	Decomposed	72.1	61.6	6.0	0.321	6.9	112.7	53.3
Claude-Sonnet-4-6	Direct	—	62.8	21.9	0.208	12.8	97.9	81.4
Claude-Sonnet-4-6	Decomposed	73.1	62.3	20.6	0.201	19.4	100.4	81.4
Qwen3-VL-8B-Instruct	Direct	—	61.5	32.9	0.130	17.6	82.7	69.5
Qwen3-VL-8B-Instruct	Decomposed	67.5	62.3	20.1	0.204	6.6	99.2	69.5
Gemma3-12B-Instruct	Direct	—	59.3	40.2	-0.055	22.5	69.4	93.7
Gemma3-12B-Instruct	Decomposed	75.0	61.7	36.1	0.116	17.2	78.2	93.7
Ministral3-3B-Instruct	Direct	—	54.0	33.3	-0.049	15.0	80.0	21.7
Ministral3-3B-Instruct	Decomposed	34.5	41.7	0.7	0.069	21.6	144.7	21.7
Gemma3-4B-Instruct	Direct	—	59.8	40.2	0.000	22.5	72.3	97.9
Gemma3-4B-Instruct	Decomposed	74.8	60.1	38.8	0.010	18.9	71.9	97.9
Qwen3-VL-4B-Instruct	Direct	—	60.8	32.6	0.111	20.3	87.3	64.6
Qwen3-VL-4B-Instruct	Decomposed	70.4	61.4	22.8	0.172	13.2	97.6	64.6
↳ with GRPO (ours)	Direct	—	61.5	32.9	0.130	17.2	83.8	62.3
↳ with GRPO (ours)	Decomposed	75.3	69.2	14.1	0.364	9.7	98.2	62.3

Insight Title

Insight text goes here.

Supervised Fine-Tuning (SFT) for spatial reasoning is often prone to superficial alignment and catastrophic forgetting. Instead, we formulate spatial alignment as a reinforcement learning problem using Group Relative Policy Optimization (GRPO). By using SceneTeract's deterministic grounding engine as a scalable, automated reward signal, we can distill geometric constraints directly into the reasoning paths of a Vision-Language Model.

Group Advantage Formulation

For a given atomic action step, the VLM is prompted to sample a group of \( G \) independent Chain-of-Thought (CoT) reasoning paths, each concluding with a predicted action feasibility judgment \( \hat{v}_i \). SceneTeract assigns a sparse reward to each completion by directly comparing it against the grounded verifier label \( v \):

\( r_i = \mathbb{1}[\hat{v}_i = v] \)

The model policy is then updated by maximizing the group-relative advantage:

\( A_i = \frac{r_i - \mu_r}{\sigma_r + \epsilon} \)

where \( \mu_r \) and \( \sigma_r \) are the mean and standard deviation of the rewards within the sampled group.

Why this works

"Can the agent sit on the couch?"

Yes

                                  A < 0

                                  A > 0

Yes

                                  A < 0

                                  A > 0

                                  A > 0

Yes

                                  A < 0

By contrasting successful reasoning paths against failed ones for the exact same query, the model internalizes true notions of 3D functional awareness instead of superficially memorizing spatial layouts.

We design a composite reward function to guide the VLM towards accurate, physically-grounded predictions. Below we detail the reward components and present a learned reasoning trace that highlights the model's acquired spatial intuition.

Reward Formulation

Format Reward +0.5

Enforces strictly structured Chain-of-Thought reasoning. The model must encapsulate its logic within <think>...</think> and its final boolean verdict within <answer>...</answer> tags.

Asymmetric Correctness +2.0 / +3.0

Matches the model's final prediction against the SceneTeract geometric verifier. To counter dataset class imbalances, a correct False prediction (identifying a physical failure) yields a higher reward than a correct True prediction.

Spatial Reasoning +1.0

A targeted dense reward triggered if the model's reasoning trace explicitly mentions keywords tied to the specific physical failure identified by the engine (e.g., "reach", "too far", "clearance", "narrow").

Learned Reasoning Trace

Qwen3-VL-4B (GRPO Checkpoint)

                          Agent: Wheelchair User

                          Goal: sit_on the dining_chair
                        
<think>

                          The agent is seated in a manual wheelchair. To sit on the dining chair, the agent must first
                          navigate to it, which requires a clear path and sufficient space for the wheelchair to
                          approach and stop. The dining chair is positioned at the head of the dining table on the right
                          side of the room.


                          There are no objects obstructing the direct path to it. However, the chair is located in a
                          corner or tight space next to the table, and there is a dining table with multiple chairs
                          surrounding it, which may reduce clearance. The
                            wheelchair's width (typically around 60-70 cm) must fit into the space around the chair,
                            which appears to be tight with no apparent clearance or buffer space.


                          Additionally, the red bounding box indicates the chair is selected, but visually, the space
                          around it seems too narrow to allow the wheelchair to maneuver in and sit without hitting
                          adjacent furniture. Hence, the agent cannot successfully perform the action "sit_on" on the
                          "dining_chair" due to insufficient clearance and navigational constraints.
                        
</think>

                          <answer>False</answer>
                        

Observation: Even though specific numerical dimensions were not provided in the prompt, the model independently derives a sensible clearance width for a wheelchair (~65cm), matching the parameters used in the SceneTeract verification engine.

Interactive 3D Verification

Experience how SceneTeract decomposes and grounds the activity within the 3D scene. Click Play Trace to visualize the step-by-step verification.

BibTeX

@article{maillard2026sceneteract,
  title={{SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes}},
  author={Maillard, L{\'e}opold and Engelmann, Francis and Durand, Tom and Pan, Boxiao and You, Yang and Litany, Or and Guibas, Leonidas and Ovsjanikov, Maks},
  year={2026},
  url={https://sceneteract.github.io}
}

More Works from the Authors

DeBaRA: Denoising-Based 3D Room Arrangement Generation

LACONIC: A 3D Layout Adapter for Controllable Image Creation

SceneTeract

Agentic Functional Affordances and VLM Grounding in 3D Scenes

SceneTeract Overview

1. Agent-Centric Input

2. Activity Decomposition and Planning

3. Geometric Grounding

4. Unified Deployment across Evaluation and Training

Abstract

Method

Input Space

Semantic Planner

Geometric Grounding

Diagnostic Report

Input Triplet \( (\mathcal{A}, \mathcal{S}, \mathcal{T}) \)

Semantic Planning \( \Phi \)

Atomic Action Library \( \mathbb{A} \)

Mobility \( \mathbb{A}_{\text{m}} \)

Contact \( \mathbb{A}_{\text{c}} \)

Handling \( \mathbb{A}_{\text{h}} \)

Perception \( \mathbb{A}_{\text{p}} \)

Geometric Grounding \( \Psi \)

\(\mathcal{P}_{\text{nav}}\) : isNavigableTo

\(\mathcal{P}_{\text{reach}}\) : isReachable

\(\mathcal{P}_{\text{inter}}\) : isInteractable

\(\mathcal{P}_{\text{clear}}\) : hasClearance

\(\mathcal{P}_{\text{vis}}\) : isVisible

Diagnostic Report \( \mathcal{R} \)

Experiments

Auditing Synthetic Environments

Overall & Action Success Rates

Atomic Action Success Rates

Quantifying VLM Spatial Reasoning

Action Accuracy

Task Accuracy

False Positive Rate (FP)

Matthews Correlation Coefficient (MCC)

Inclusivity Gap (InGap)

Horizon Stability Index (HSI)

Consistency (Cons.)

Insight Title

Improving VLMs via Geometric Rewards

Group Advantage Formulation

Reward Formulation

Learned Reasoning Trace

Interactive 3D Verification

BibTeX

\(\mathcal{P}_{\text{nav}}\) : `isNavigableTo`

\(\mathcal{P}_{\text{reach}}\) : `isReachable`

\(\mathcal{P}_{\text{inter}}\) : `isInteractable`

\(\mathcal{P}_{\text{clear}}\) : `hasClearance`

\(\mathcal{P}_{\text{vis}}\) : `isVisible`