Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction

Feed-forward 3D reconstruction from inconsistent, distractor-heavy image collections.

Tianbo Pan¹, Xingyi Yang², Shizun Wang¹, Xinchao Wang¹
¹ National University of Singapore
² Hong Kong Polytechnic University
Figure (teaser): VGTW suppresses transient distractors while preserving geometry that stays consistent across views.

Abstract

Current end-to-end multi-view 3D reconstruction methods achieve impressive results, but they rely on a restrictive static-world assumption: input collections are expected to be distractor-free and geometrically consistent across views. This assumption breaks in real captures, where transient objects, occluders, and inconsistent viewpoints frequently appear.

We propose Visual Geometry Transformer in the Wild (VGTW), a feed-forward framework for robust reconstruction from inconsistent image sets. Our core idea is to explicitly isolate distractor-contaminated regions while preserving the stable scene content shared across views. To do this, we introduce a Distractor-aware Training (DAT) strategy with distractor suppression, cross-view consistency, and an auxiliary mask-prediction head trained from pixel-level distractor annotations.

The resulting model directly predicts clean point maps and depth without extra 3D supervision, and it generalizes well to challenging in-the-wild scenes with heavy occlusions.

Method Overview

VGTW builds on existing feed-forward geometry prediction backbones (VGGT and π3 in the experiments below) and adds a distractor-aware training strategy tailored to inconsistent image collections. Image tokens interact through frame-wise and global attention, while DAT explicitly discourages distractor leakage and preserves consistent geometry.
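To make the frame-wise versus global attention distinction concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the model's actual layers: the attention is single-head with identity projections, and the function names are hypothetical. The only point it shows is how the token tensor is shaped for each kind of attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Illustrative simplification: a real block would use learned projections,
    multiple heads, normalization, and an MLP.
    tokens: (batch, length, dim)
    """
    dim = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(dim)
    return softmax(scores, axis=-1) @ tokens

def frame_attention(tokens):
    """Each frame attends only to its own tokens.

    tokens: (num_frames, tokens_per_frame, dim); frames act as the batch axis.
    """
    return tokens + self_attention(tokens)

def global_attention(tokens):
    """All tokens from all frames attend to each other."""
    f, n, d = tokens.shape
    flat = tokens.reshape(1, f * n, d)  # one long sequence across frames
    flat = flat + self_attention(flat)
    return flat.reshape(f, n, d)

def alternating_attention_block(tokens):
    # Hypothetical ordering: frame-wise first, then global.
    return global_attention(frame_attention(tokens))
```

In this sketch, changing the tokens of one frame leaves the frame-wise output of every other frame untouched, while the global output changes for all frames. That cross-frame path is where information about view consistency (or inconsistency) can propagate.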

Figure: VGTW framework overview.
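The page does not give the DAT loss formulas, so the following is only a rough sketch of how two of its ingredients could be combined: a geometry term that ignores pixels annotated as distractors, and a binary cross-entropy term for the auxiliary mask head. Every name here (`ref_points`, `mask_weight`, `dat_loss_sketch`) is hypothetical, the source of the reference geometry is left unspecified, and the cross-view consistency term is omitted because its form is not described.

```python
import numpy as np

def masked_geometry_loss(pred_points, ref_points, distractor_mask):
    """Mean L2 point error over pixels NOT annotated as distractors.

    pred_points, ref_points: (H, W, 3) point maps.
    distractor_mask: (H, W), 1.0 where a pixel belongs to a distractor.
    ref_points is an assumed reference signal; the page does not say what
    the model is regressed against.
    """
    keep = 1.0 - distractor_mask
    err = np.linalg.norm(pred_points - ref_points, axis=-1)
    return float((err * keep).sum() / max(keep.sum(), 1.0))

def mask_bce_loss(mask_logits, distractor_mask, eps=1e-7):
    """Binary cross-entropy between mask-head logits and pixel annotations."""
    prob = 1.0 / (1.0 + np.exp(-mask_logits))
    prob = np.clip(prob, eps, 1.0 - eps)
    bce = -(distractor_mask * np.log(prob)
            + (1.0 - distractor_mask) * np.log(1.0 - prob))
    return float(bce.mean())

def dat_loss_sketch(pred_points, ref_points, mask_logits, distractor_mask,
                    mask_weight=1.0):
    """Assumed combination: masked geometry term + weighted mask term."""
    geo = masked_geometry_loss(pred_points, ref_points, distractor_mask)
    seg = mask_bce_loss(mask_logits, distractor_mask)
    return geo + mask_weight * seg
```

With this formulation, errors inside annotated distractor regions contribute nothing to the geometry term, so the model is not pushed to explain transient content, while the mask head is still trained to localize it.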

Experiments

Method      Processing    Low Occlusion           Medium Occlusion        High Occlusion          Overall Average
                          Acc ↓   Comp ↓  NC ↑    Acc ↓   Comp ↓  NC ↑    Acc ↓   Comp ↓  NC ↑    Acc ↓   Comp ↓  NC ↑
DUSt3R      Pair-wise     0.043   0.035   0.682   0.044   0.102   0.763   0.026   0.102   0.798   0.037   0.080   0.747
MaSt3R      Pair-wise     0.044   0.120   0.636   0.064   0.136   0.691   0.028   0.216   0.751   0.045   0.157   0.692
Fast3R      Multi-image   0.037   0.049   0.639   0.057   0.062   0.681   0.030   0.096   0.720   0.041   0.069   0.680
VGGT        Multi-image   0.040   0.057   0.663   0.051   0.120   0.632   0.032   0.262   0.625   0.041   0.146   0.640
VGTW(VGGT)  Multi-image   0.033   0.029   0.695   0.041   0.095   0.724   0.025   0.228   0.693   0.033   0.117   0.704
π3          Multi-image   0.034   0.051   0.699   0.084   0.096   0.676   0.036   0.078   0.753   0.051   0.074   0.709
VGTW(π3)    Multi-image   0.028   0.033   0.625   0.035   0.073   0.715   0.019   0.076   0.734   0.027   0.060   0.692
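The page does not define Acc, Comp, and NC. The sketch below assumes the definitions commonly used for point-cloud evaluation: accuracy as the mean distance from each predicted point to its nearest ground-truth point, completeness as the mean distance from each ground-truth point to its nearest predicted point, and normal consistency as the mean absolute cosine between matched normals. Alignment, units, and mean versus median are not stated in the source, so treat those choices as assumptions.

```python
import numpy as np

def nearest_neighbor(src, dst):
    """For each point in src, index and distance of the closest point in dst.

    Brute force, O(len(src) * len(dst)); fine for a small illustration.
    """
    dists = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(src)), idx]

def reconstruction_metrics(pred_pts, pred_normals, gt_pts, gt_normals):
    """Return (accuracy, completeness, normal consistency).

    Points are (N, 3); normals are unit vectors of the same shape.
    Lower is better for accuracy and completeness, higher for consistency.
    """
    idx_pg, dist_pg = nearest_neighbor(pred_pts, gt_pts)  # prediction -> GT
    idx_gp, dist_gp = nearest_neighbor(gt_pts, pred_pts)  # GT -> prediction
    acc = dist_pg.mean()
    comp = dist_gp.mean()
    # Absolute dot product ignores the sign ambiguity of normals.
    nc_pg = np.abs((pred_normals * gt_normals[idx_pg]).sum(axis=-1)).mean()
    nc_gp = np.abs((gt_normals * pred_normals[idx_gp]).sum(axis=-1)).mean()
    return float(acc), float(comp), float(0.5 * (nc_pg + nc_gp))
```

Under these definitions, a reconstruction that is precise but misses part of the scene scores well on accuracy and poorly on completeness, which is one way to read rows where the two columns move in different directions.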

Qualitative Comparison

Compare reconstructed geometry from VGTW, VGGT, and Easi3R across representative scenes.


BibTeX

@misc{pan2026vgtw,
  title={Visual Geometry Transformer in the Wild: Distractor-Free 3D Reconstruction},
  author={Pan, Tianbo and Yang, Xingyi and Wang, Shizun and Wang, Xinchao},
  year={2026},
  note={Project page manuscript}
}