Fase3D: Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

Fondazione Bruno Kessler, Italy; JKU Linz, Austria; CSIRO, Australia
CVPR 2026
Latest News
  • Feb 2026 Fase3D is accepted to CVPR 2026! 🎉
  • Feb 2026 Preprint is available on arXiv.
  • Upcoming Code and model weights will be released soon.
Fase3D teaser

Most 3D large multimodal models (LMMs) rely on computationally heavy scene encoders to extract geometric features. Fase3D eliminates this redundancy. By employing a lightweight Fourier-based tokenizer and Fourier-augmented LoRA adapters, Fase3D processes raw point clouds directly. This allows us to bypass visual encoders entirely, infusing global frequency-aware context into the LLM with negligible overhead while maintaining state-of-the-art performance.

Abstract

Large Multimodal Models (LMMs) processing 3D data typically rely on heavy, pre-trained visual encoders. While recent 2D LMMs eliminate such encoders for efficiency, extending this to 3D remains challenging due to the unordered and large-scale nature of point clouds. We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM.

Fase3D leverages a novel tokenizer combining point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables scalable global context modeling and graph-based token merging. Coupled with lightweight Fourier-augmented LoRA adapters, Fase3D injects global frequency-aware interactions into the LLM at negligible computational cost.

Key Innovations

Structured Superpoints

We group massive 3D point clouds into compact superpoint tokens. This drastically reduces sequence length while preserving crucial geometric details.
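Since the code is not yet released, the exact grouping rule is not public. As a minimal sketch, the idea of collapsing a massive point cloud into far fewer superpoint tokens can be illustrated with a simple voxel-based scatter-mean (the function name and voxel-grid choice are assumptions, not the paper's method):

```python
import numpy as np

def group_superpoints(points, voxel_size=0.5):
    """Illustrative superpoint grouping: average all points that fall
    into the same voxel cell, shrinking N raw points to M tokens."""
    # Quantize coordinates to integer voxel indices.
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Map every point to the index of its (unique) voxel.
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    m = int(inverse.max()) + 1
    # Scatter-mean: sum the points per voxel, then divide by the count.
    sums = np.zeros((m, 3))
    np.add.at(sums, inverse, points)
    counts = np.bincount(inverse, minlength=m).astype(np.float64)
    return sums / counts[:, None]

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 4.0, size=(10000, 3))  # a dense synthetic scene
sp = group_superpoints(pts, voxel_size=0.5)
print(pts.shape[0], "->", sp.shape[0])  # sequence length drops sharply
```

In a real tokenizer the superpoints would carry learned features, not just centroids, but the sequence-length reduction works the same way.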

FFT & Serialization

Unordered points are serialized via a space-filling curve. We then use the Fast Fourier Transform (FFT) to model global context efficiently, avoiding quadratic self-attention costs.
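The two ingredients above can be sketched in a few lines. This toy version uses a Z-order (Morton) curve as the space-filling curve and FNet-style real-part FFT mixing as the attention substitute; the paper may use a different curve or filter, so treat both choices as assumptions:

```python
import numpy as np

def morton_order(points, bits=10):
    """Serialize unordered 3D points along a Z-order (Morton) curve:
    quantize each axis, interleave the bits, sort by the code."""
    mins, maxs = points.min(0), points.max(0)
    grid = ((points - mins) / (maxs - mins + 1e-9) * (2**bits - 1)).astype(np.int64)
    codes = np.zeros(len(points), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            codes |= ((grid[:, axis] >> b) & 1) << (3 * b + axis)
    return np.argsort(codes, kind="stable")

def fft_token_mixing(tokens):
    """FNet-style global mixing: an FFT along the serialized token axis
    couples every token with every other at O(M log M) cost, instead of
    the O(M^2) cost of self-attention."""
    return np.fft.fft(tokens, axis=0).real

rng = np.random.default_rng(0)
pts = rng.uniform(size=(256, 3))
feats = rng.normal(size=(256, 32))
order = morton_order(pts)           # unordered -> 1D sequence
mixed = fft_token_mixing(feats[order])
print(mixed.shape)                  # (256, 32), globally mixed
```

Serialization matters because the FFT assumes a meaningful 1D ordering: the space-filling curve keeps spatially close points adjacent in the sequence.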

Fourier-Augmented LoRA

Lightweight adapters inject global frequency-aware interactions directly into the pre-trained LLM, enhancing vision context at negligible computational cost.
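The adapter design is not yet public, so the following is only a plausible sketch: a frozen linear layer, a standard low-rank LoRA update, and an extra learnable spectral filter applied over the token dimension in the Fourier domain. All names (`fourier_lora`, the filter `g`) are hypothetical:

```python
import numpy as np

def fourier_lora(x, W, A, B, g):
    """Hypothetical Fourier-augmented LoRA layer.
    x: (T, d) token states, W: frozen (d, d) weight,
    A: (r, d), B: (d, r) low-rank LoRA factors with r << d,
    g: (T,) learnable spectral filter over token frequencies."""
    base = x @ W.T                      # frozen pre-trained path
    lora = x @ A.T @ B.T                # cheap low-rank update
    # Global frequency-aware branch: filter the token sequence in the
    # Fourier domain, then transform back to the token domain.
    freq = np.fft.fft(x, axis=0)
    glob = np.fft.ifft(freq * g[:, None], axis=0).real
    return base + lora + glob

rng = np.random.default_rng(1)
T, d, r = 16, 64, 4
x = rng.normal(size=(T, d))
W = rng.normal(size=(d, d)) / np.sqrt(d)
A = rng.normal(size=(r, d))
B = np.zeros((d, r))    # standard LoRA init: B = 0, so the update starts at zero
g = np.ones(T)          # all-pass filter: the global branch returns x itself
y = fourier_lora(x, W, A, B, g)
print(np.allclose(y, x @ W.T + x))  # True at initialization
```

At initialization the layer reduces to the frozen path plus a residual, so training can depart smoothly from the pre-trained LLM while `g`, `A`, and `B` learn the frequency-aware correction.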

Architecture

Detailed architecture of Fase3D

Pipeline Overview: A lightweight tokenizer produces M superpoint tokens, which are refined by an FFT-based context enhancer. A graph is then constructed over the tokens, and a token-merging block compresses them into T compact 3D tokens (where T < M). Finally, an LLM equipped with an FFT-based global filter processes these tokens alongside the textual prompts.
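The M-to-T compression step can be illustrated with a toy greedy merger: repeatedly find the two closest tokens on the spatial graph and average them until T remain. This agglomerative rule is an illustrative stand-in, not the paper's (unreleased) merging block:

```python
import numpy as np

def merge_tokens(tokens, positions, keep):
    """Toy graph-based token merging: greedily average the closest
    pair of tokens (by 3D position) until only `keep` remain."""
    merged = [t for t in tokens]
    pos = [p for p in positions]
    while len(merged) > keep:
        P = np.stack(pos)
        # Pairwise squared distances; mask the diagonal out.
        d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        i, j = np.unravel_index(int(d2.argmin()), d2.shape)
        a, b = min(i, j), max(i, j)
        # Merge b into a: average both features and positions.
        merged[a] = (merged[a] + merged[b]) / 2
        pos[a] = (pos[a] + pos[b]) / 2
        merged.pop(b)
        pos.pop(b)
    return np.stack(merged)

rng = np.random.default_rng(2)
tok = rng.normal(size=(64, 32))      # M = 64 superpoint features
xyz = rng.uniform(size=(64, 3))      # their 3D positions
out = merge_tokens(tok, xyz, keep=16)
print(tok.shape[0], "->", out.shape[0])  # 64 -> 16 compact 3D tokens
```

Whatever the exact rule, the payoff is the same: the LLM sees T tokens instead of M, so its attention cost shrinks quadratically with the compression ratio.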

Why it works: Unlike traditional pipelines requiring heavy encoders like PointNet++ or Sparse Convolutions, Fase3D directly processes geometric features. By serializing spatial coordinates and passing them through a continuous Fourier transform layer, we dynamically fuse geometry with textual inputs. This design guarantees both permutation invariance and linear scaling with respect to the number of input points.

Qualitative Results

Fase3D demonstrates strong capabilities in 3D Scene QA and Dense Captioning.

Question Answering

3D Scene QA Performance

Dense Captioning

3D Scene Dense Captioning Performance

Citation

@inproceedings{mei2026fase3d,
    title={Efficient Encoder-Free Fourier-based 3D Large Multimodal Model},
    author={Mei, Guofeng and Lin, Wei and Riz, Luigi and Wu, Yujiao and Wang, Yiming and Poiesi, Fabio},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2026}
}