ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting

Official project page of ProFuse.

Overview figure

Abstract

We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. We introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, 2× faster than the prior state of the art (SOTA).


3D Visualizations

Efficient Pre-registration via dense correspondences

Pre-registration figure

A pretrained dense matcher supplies cross-view correspondences. We triangulate matched pixels to initialize a compact Gaussian scene with accurate geometry, while simultaneously enabling cross-view grouping for context.
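To make the triangulation step concrete, here is a minimal sketch of how one matched pixel pair could be lifted to a 3D point that seeds a Gaussian center. It is not the ProFuse implementation. It assumes a pinhole camera with known intrinsics, two cameras with identity rotation and known centers, and the simple midpoint method. The names `pixel_to_ray` and `triangulate_midpoint` are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def pixel_to_ray(pixel, focal, principal):
    """Back-project a pixel to a unit ray direction (pinhole model, identity rotation assumed)."""
    u, v = pixel
    d = ((u - principal[0]) / focal, (v - principal[1]) / focal, 1.0)
    norm = math.sqrt(dot(d, d))
    return tuple(c / norm for c in d)


def triangulate_midpoint(o1, d1, o2, d2, eps=1e-12):
    """Return the midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2.

    Returns None when the rays are (nearly) parallel and no stable point exists.
    """
    w = tuple(p - q for p, q in zip(o1, o2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))


# Hypothetical setup: two cameras one unit apart along x, same intrinsics.
focal, principal = 500.0, (320.0, 240.0)
cam1, cam2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)

# A matched pixel pair, as a dense matcher might report it.
ray1 = pixel_to_ray((382.5, 265.0), focal, principal)
ray2 = pixel_to_ray((257.5, 265.0), focal, principal)

point = triangulate_midpoint(cam1, ray1, cam2, ray2)
```

With real data, each ray direction would also be rotated into world coordinates by the camera pose, and matches with a large gap between the two rays would typically be discarded before initializing Gaussians.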


Global feature fusion with 3D Context Proposals

Global feature fusion figure

Warped masks are clustered across views into 3D Context Proposals. Each proposal aggregates a global feature, which is fused onto Gaussians during direct registration to keep per-primitive language coherence across views.
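The sketch below illustrates the three ideas in this paragraph under simplifying assumptions. It is not the authors' code. Each warped mask is represented as a set of shared correspondence IDs, masks are grouped with union-find when their IoU passes a threshold, and the blend factor `alpha` is a made-up fusion rule. The clustering criterion, the choice of weights, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cluster_masks(masks, iou_threshold=0.5):
    """Group masks (sets of shared IDs) whose IoU meets the threshold, via union-find."""
    parent = list(range(len(masks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            union = len(masks[i] | masks[j])
            if union and len(masks[i] & masks[j]) / union >= iou_threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(masks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def aggregate_global_feature(embeddings, weights):
    """Weighted mean of member embeddings, normalized to unit length."""
    total = sum(weights)
    dim = len(embeddings[0])
    mean = [sum(w * e[k] for e, w in zip(embeddings, weights)) / total for k in range(dim)]
    return l2_normalize(mean)


def fuse_onto_gaussian(local_feature, global_feature, alpha=0.5):
    """Blend a per-Gaussian feature with its proposal's global feature (assumed rule)."""
    mixed = [(1 - alpha) * l + alpha * g for l, g in zip(local_feature, global_feature)]
    return l2_normalize(mixed)


# Toy data: three masks from different views, two of which cover the same object.
masks = [{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11}]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

proposals = cluster_masks(masks)
first = proposals[0]
# Assumed weighting: each member is weighted by its mask size.
global_feature = aggregate_global_feature(
    [embeddings[i] for i in first], [len(masks[i]) for i in first]
)
fused = fuse_onto_gaussian([1.0, 0.0], global_feature, alpha=0.5)
```

Because every Gaussian covered by a proposal is pulled toward the same global feature, primitives belonging to one object end up with similar language embeddings regardless of which view they were registered from.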


Results

Evaluation figure 1
3D object selection table

ProFuse performs strongly on the 3D object selection task. Our method isolates the queried object with far fewer background activations and ray-like spillovers, yielding more semantically precise selections.

Evaluation figure 2

For point-level understanding, ProFuse produces cleaner boundaries at furniture edges and fixtures, with fewer mixed colors at object–wall contacts.


Citation

@misc{chiou2026profuseefficientcrossviewcontext,
      title={ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting}, 
      author={Yen-Jen Chiou and Wei-Tse Cheng and Yuan-Fu Yang},
      year={2026},
      eprint={2601.04754},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.04754}, 
}