Most 3D large multimodal models (LMMs) rely on computationally heavy scene encoders to extract geometric features. Fase3D removes this dependency. Using a lightweight Fourier-based tokenizer and Fourier-augmented LoRA adapters, Fase3D processes raw point clouds directly. This allows us to bypass visual encoders entirely and to infuse global frequency-aware context into the LLM with negligible overhead, while maintaining state-of-the-art performance.
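To make the idea of encoder-free tokenization concrete, the following is a minimal sketch of how raw point coordinates could be turned into tokens with Fourier features. It is not Fase3D's tokenizer, whose details are not given here: the sinusoidal encoding, the contiguous grouping, the mean pooling, and the names `fourier_features`, `tokenize_point_cloud`, `num_tokens`, and `num_freqs` are all assumptions chosen for illustration.

```python
import math
import random


def fourier_features(point, freqs):
    """Map an (x, y, z) point to sin/cos features at each frequency (assumed encoding)."""
    feats = []
    for coord in point:
        for f in freqs:
            feats.append(math.sin(f * coord))
            feats.append(math.cos(f * coord))
    return feats


def tokenize_point_cloud(points, num_tokens, num_freqs=4):
    """Group points into contiguous chunks and mean-pool their Fourier features.

    Hypothetical stand-in for a learned tokenizer: no neural encoder is involved,
    each token is just the average feature vector of one chunk of points.
    """
    # Frequencies pi, 2*pi, 4*pi, ... (an illustrative choice, not from the paper).
    freqs = [(2.0 ** k) * math.pi for k in range(num_freqs)]
    group_size = math.ceil(len(points) / num_tokens)
    tokens = []
    for start in range(0, len(points), group_size):
        group = points[start:start + group_size]
        feats = [fourier_features(p, freqs) for p in group]
        dim = len(feats[0])
        tokens.append([sum(f[i] for f in feats) / len(feats) for i in range(dim)])
    return tokens


# Synthetic point cloud in the unit cube, used only to exercise the sketch.
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(1024)]
tokens = tokenize_point_cloud(cloud, num_tokens=16)
```

In an actual LMM these token vectors would still need a projection into the language model's embedding space; that step, and the Fourier-augmented LoRA adapters, are outside the scope of this sketch.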
Efficient Encoder-Free Fourier-based 3D Large Multimodal Model