Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction

Feed-forward 3D reconstruction from inconsistent, distractor-heavy image collections.

Tianbo Pan¹, Xingyi Yang², Shizun Wang¹, Xinchao Wang¹

¹National University of Singapore
²Hong Kong Polytechnic University

Abstract

Current end-to-end multi-view 3D reconstruction methods achieve impressive results, but they rely on a restrictive static-world assumption: input collections are expected to be distractor-free and geometrically consistent across views. This assumption breaks in real captures, where transient objects, occluders, and inconsistent viewpoints are common.

We propose Visual Geometry Transformer in the Wild (VGTW), a feed-forward framework for robust reconstruction from inconsistent image sets. Our core idea is to explicitly isolate distractor-contaminated regions while preserving the stable scene content shared across views. To do this, we introduce a Distractor-aware Training (DAT) strategy with distractor suppression, cross-view consistency, and an auxiliary mask-prediction head trained from pixel-level distractor annotations.
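The page does not spell out the DAT objective, so the following is only a minimal sketch of how two of its ingredients could be combined: a geometry term that ignores annotated distractor pixels (distractor suppression) and a binary cross-entropy term for the auxiliary mask-prediction head. The function names, the flat per-pixel lists, the `reference_depth` input, and the `mask_weight` value are all assumptions made for illustration; the cross-view consistency term is omitted.

```python
import math

def masked_geometry_loss(pred_depth, reference_depth, distractor_mask):
    """Mean absolute depth error over pixels NOT annotated as distractors.

    Hypothetical helper: 0 in distractor_mask means static scene, 1 means distractor.
    """
    total, count = 0.0, 0
    for p, r, m in zip(pred_depth, reference_depth, distractor_mask):
        if m == 0:
            total += abs(p - r)
            count += 1
    return total / count if count else 0.0

def mask_bce_loss(mask_prob, distractor_mask, eps=1e-7):
    """Binary cross-entropy between predicted distractor probability and annotation."""
    total = 0.0
    for q, m in zip(mask_prob, distractor_mask):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(m * math.log(q) + (1 - m) * math.log(1.0 - q))
    return total / len(distractor_mask)

def dat_loss(pred_depth, reference_depth, mask_prob, distractor_mask, mask_weight=0.5):
    """Assumed combination: masked geometry term plus weighted mask-head term."""
    geometry = masked_geometry_loss(pred_depth, reference_depth, distractor_mask)
    mask_term = mask_bce_loss(mask_prob, distractor_mask)
    return geometry + mask_weight * mask_term

# Three pixels; the third is a distractor and does not affect the geometry term.
loss = dat_loss(
    pred_depth=[1.0, 2.0, 5.0],
    reference_depth=[1.0, 2.5, 1.0],
    mask_prob=[0.5, 0.5, 0.5],
    distractor_mask=[0, 0, 1],
)
```

In this toy call the large depth error on the third pixel is excluded from the geometry term because that pixel is annotated as a distractor, which is the behavior the suppression idea is meant to capture.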

The resulting model directly predicts clean point maps and depth without extra 3D supervision, and it generalizes well to challenging in-the-wild scenes with heavy occlusions.

Method Overview

VGTW builds on a feed-forward geometry prediction pipeline and adds a distractor-aware training (DAT) strategy tailored to inconsistent image collections. Image tokens interact through frame-wise and global attention, while DAT discourages distractor leakage into the predicted geometry and preserves the geometry that is consistent across views. In the experiments, VGTW is applied on top of two backbones, VGGT and π³.
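To make the distinction between the two attention types concrete, the sketch below builds the boolean attention pattern each one implies: frame-wise attention lets a token attend only to tokens from the same image, while global attention lets it attend to tokens from every image. The function name `attention_mask` and the token layout (frames stored contiguously) are assumptions for illustration; how the two layer types are ordered or interleaved in the actual model is not stated here.

```python
def attention_mask(num_frames, tokens_per_frame, mode):
    """Return mask[i][j] == True when token i may attend to token j.

    Hypothetical layout: token index i belongs to frame i // tokens_per_frame.
    """
    n = num_frames * tokens_per_frame
    if mode == "global":
        # Every token sees every token from every frame.
        return [[True] * n for _ in range(n)]
    if mode == "frame":
        # A token sees only tokens that come from its own frame.
        return [
            [i // tokens_per_frame == j // tokens_per_frame for j in range(n)]
            for i in range(n)
        ]
    raise ValueError(f"unknown mode: {mode!r}")

# Two frames with two tokens each.
frame_mask = attention_mask(2, 2, "frame")
global_mask = attention_mask(2, 2, "global")
```

With two frames of two tokens each, the frame-wise pattern is block-diagonal (token 0 can attend to token 1 but not to token 2), whereas the global pattern is fully dense.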

Experiments

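Quantitative comparison at low, medium, and high occlusion levels. Acc = accuracy, Comp = completeness, NC = normal consistency; ↓ means lower is better, ↑ means higher is better.

Method	Processing	Low Occ. Acc ↓	Low Occ. Comp ↓	Low Occ. NC ↑	Med. Occ. Acc ↓	Med. Occ. Comp ↓	Med. Occ. NC ↑	High Occ. Acc ↓	High Occ. Comp ↓	High Occ. NC ↑	Overall Acc ↓	Overall Comp ↓	Overall NC ↑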
DUSt3R	Pair-wise	0.043	0.035	0.682	0.044	0.102	0.763	0.026	0.102	0.798	0.037	0.080	0.747
MASt3R	Pair-wise	0.044	0.120	0.636	0.064	0.136	0.691	0.028	0.216	0.751	0.045	0.157	0.692
Fast3R	Multi-image	0.037	0.049	0.639	0.057	0.062	0.681	0.030	0.096	0.720	0.041	0.069	0.680
VGGT	Multi-image	0.040	0.057	0.663	0.051	0.120	0.632	0.032	0.262	0.625	0.041	0.146	0.640
VGTW(VGGT)	Multi-image	0.033	0.029	0.695	0.041	0.095	0.724	0.025	0.228	0.693	0.033	0.117	0.704
π³	Multi-image	0.034	0.051	0.699	0.084	0.096	0.676	0.036	0.078	0.753	0.051	0.074	0.709
VGTW(π³)	Multi-image	0.028	0.033	0.625	0.035	0.073	0.715	0.019	0.076	0.734	0.027	0.060	0.692
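The page does not define the Acc and Comp columns. A common convention for point-cloud evaluation, assumed here, is that accuracy is the average distance from each predicted point to its nearest ground-truth point, and completeness is the same measure in the opposite direction. The sketch below implements that assumed convention with a brute-force nearest-neighbor search; whether the reported numbers use a mean or a median, and how the clouds are aligned beforehand, is not specified, and normal consistency is left out.

```python
import math

def mean_nearest_distance(source, target):
    """Average Euclidean distance from each source point to its nearest target point."""
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)

def accuracy(pred_points, gt_points):
    """Assumed definition: how close predicted points lie to the ground truth (lower is better)."""
    return mean_nearest_distance(pred_points, gt_points)

def completeness(pred_points, gt_points):
    """Assumed definition: how well the prediction covers the ground truth (lower is better)."""
    return mean_nearest_distance(gt_points, pred_points)

# Toy clouds: the prediction is exact where it exists but misses one ground-truth point.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
acc = accuracy(pred, gt)
comp = completeness(pred, gt)
```

The toy example shows why both directions are reported: the prediction has perfect accuracy, yet its completeness is penalized for the ground-truth point it fails to cover.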

Qualitative Comparison

Compare reconstructed geometry from VGTW, VGGT, and Easi3R across representative scenes.

BibTeX

@misc{pan2026vgtw,
  title={Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction},
  author={Pan, Tianbo and Yang, Xingyi and Wang, Shizun and Wang, Xinchao},
  year={2026},
  note={Project page manuscript}
}
