We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. We introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, 2× faster than the state of the art (SOTA).
A pretrained dense matcher supplies cross-view correspondences. We triangulate matched pixels to initialize a compact Gaussian scene with accurate geometry; the same correspondences also enable cross-view grouping for context.
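As a rough illustration of the triangulation step, the sketch below recovers one 3D point from a pair of matched pixels that have already been back-projected to viewing rays. It uses the midpoint method as a simple stand-in; the triangulation ProFuse actually uses is not specified here, and the function name and the ray-based inputs are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _sub(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return tuple(x - y for x, y in zip(a, b))


def _point_on_ray(origin: Sequence[float], direction: Sequence[float], t: float) -> Vec3:
    return tuple(o + t * d for o, d in zip(origin, direction))


def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3,
                         eps: float = 1e-9) -> Vec3:
    """Midpoint triangulation (illustrative, not the authors' implementation).

    c1, c2: camera centers; d1, d2: viewing-ray directions through the
    matched pixels. Returns the midpoint of the shortest segment that
    connects the two rays c1 + t1*d1 and c2 + t2*d2.
    """
    w = _sub(c1, c2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; cannot triangulate")
    # Closed-form least-squares solution for the two ray parameters.
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = _point_on_ray(c1, d1, t1)
    p2 = _point_on_ray(c2, d2, t2)
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))


# Two cameras one unit apart, both looking at the same scene point.
point = triangulate_midpoint((0.0, 0.0, 0.0), (0.5, 0.2, 2.0),
                             (1.0, 0.0, 0.0), (-0.5, 0.2, 2.0))
```

With noisy correspondences the two rays do not intersect exactly, which is why the midpoint of the closest-approach segment is returned rather than an intersection.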
Warped masks are clustered across views into 3D Context Proposals. Each proposal aggregates a global feature, which is fused onto Gaussians during direct registration to keep per-primitive language coherence across views.
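The aggregation and fusion described above can be sketched as follows. This is a minimal example under stated assumptions: the weights (for instance, mask area or confidence), the linear blend, and the `alpha` parameter are hypothetical choices for illustration, not the weighting or fusion rule ProFuse actually uses.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def aggregate_global_feature(embeddings: Sequence[Sequence[float]],
                             weights: Sequence[float]) -> List[float]:
    """Weighted mean of a proposal's member embeddings, L2-normalized.

    `weights` is an assumed per-member score (e.g. mask area); the source
    only states that the aggregation is weighted.
    """
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    mean = [sum(w * e[i] for e, w in zip(embeddings, weights)) / total
            for i in range(dim)]
    return l2_normalize(mean)


def fuse_onto_gaussian(local_feature: Sequence[float],
                       global_feature: Sequence[float],
                       alpha: float = 0.5) -> List[float]:
    """Blend a Gaussian's own feature with its proposal's global feature.

    alpha = 0 keeps the local feature, alpha = 1 replaces it with the
    global one. The linear blend is an illustrative assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    blended = [(1.0 - alpha) * l + alpha * g
               for l, g in zip(local_feature, global_feature)]
    return l2_normalize(blended)


# Two member embeddings of one proposal; the first view is weighted 3x.
global_feat = aggregate_global_feature([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
fused = fuse_onto_gaussian([0.0, 1.0], global_feat, alpha=0.5)
```

Normalizing after both steps keeps the features on the unit sphere, so cosine similarity against a text embedding remains directly comparable across Gaussians.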
ProFuse performs strongly on the 3D Object Selection task. Our method isolates the queried object with far fewer background activations and ray-like spillovers, yielding more semantically precise selections.
For point-level understanding, we achieve cleaner boundaries at furniture edges and fixtures, with fewer mixed colors at object–wall contacts.
@misc{chiou2026profuseefficientcrossviewcontext,
title={ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting},
author={Yen-Jen Chiou and Wei-Tse Cheng and Yuan-Fu Yang},
year={2026},
eprint={2601.04754},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.04754},
}