SceneDreamer is an unconditional generative model for unbounded 3D scenes that synthesizes large-scale 3D landscapes from random noise. The framework is learned from in-the-wild 2D image collections without any 3D annotations.
At the core of the framework is a principled learning paradigm comprising an efficient and expressive 3D scene representation, a generative scene parameterization, and an effective renderer that leverages knowledge from 2D images.
SceneDreamer employs an efficient bird's-eye-view (BEV) representation, generated from simplex noise, that consists of a height field and a semantic field.
The height field encodes the surface elevation of the 3D scene, while the semantic field provides detailed per-location scene semantics. The BEV representation makes it possible to represent a 3D scene with quadratic (rather than cubic) complexity in scene size, to disentangle geometry from semantics, and to train efficiently.
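As a rough illustration, the sketch below builds such a BEV pair: a height field sampled from multi-octave simplex noise and a semantic field derived from it by elevation thresholds. The `noise` package, the grid size, and the specific label rules are assumptions for illustration, not SceneDreamer's actual procedural generator.

```python
# A minimal sketch of a BEV scene representation: a height field from 2D
# simplex noise and a semantic field derived by thresholding elevation.
# The label bands below are illustrative assumptions.
import numpy as np
from noise import snoise2  # pip install noise

def make_bev(size=512, scale=0.005, seed=0):
    xs, ys = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    # Height field: multi-octave simplex noise, normalized to [0, 1].
    height = np.vectorize(
        lambda x, y: snoise2(x * scale, y * scale, octaves=6, base=seed)
    )(xs, ys)
    height = (height - height.min()) / (height.max() - height.min())

    # Semantic field: per-cell labels from elevation bands (illustrative).
    semantic = np.zeros((size, size), dtype=np.int64)
    semantic[height > 0.3] = 1  # e.g. grassland
    semantic[height > 0.6] = 2  # e.g. rock
    semantic[height > 0.8] = 3  # e.g. snow
    return height, semantic  # two (size, size) maps: O(n^2) storage
```

Note that both fields are plain 2D maps, which is where the quadratic (rather than cubic voxel-grid) storage cost comes from.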
SceneDreamer further proposes a novel generative neural hash grid that parameterizes the latent space given 3D positions and scene semantics, aiming to encode features that generalize across scenes while keeping the content of each scene consistent.
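A minimal PyTorch sketch of this idea is shown below: features are fetched from a learnable hash table indexed by a spatial hash of the 3D position (following the Instant-NGP construction) and combined with a semantic embedding. Conditioning by concatenation, the table size, and the grid resolution are assumptions here, not necessarily how SceneDreamer's generative hash grid modulates features.

```python
# A sketch of a hash-grid feature lookup conditioned on scene semantics.
# The XOR spatial hash and prime constants follow Instant-NGP; the
# semantic conditioning via concatenation is an illustrative assumption.
import torch
import torch.nn as nn

class SemanticHashGrid(nn.Module):
    def __init__(self, table_size=2**19, feat_dim=8, num_semantics=16, sem_dim=8):
        super().__init__()
        self.table_size = table_size
        self.table = nn.Parameter(torch.randn(table_size, feat_dim) * 1e-2)
        self.sem_embed = nn.Embedding(num_semantics, sem_dim)
        # Large primes for the spatial hash (as in Instant-NGP).
        self.register_buffer("primes", torch.tensor([1, 2654435761, 805459861]))

    def forward(self, xyz, sem_label):
        # xyz: (N, 3) positions; sem_label: (N,) integer semantic labels.
        coords = (xyz * 128).long()  # quantize to an illustrative grid
        cp = coords * self.primes    # (N, 3) per-axis hashed coordinates
        h = (cp[..., 0] ^ cp[..., 1] ^ cp[..., 2]) % self.table_size
        feat = self.table[h]                 # (N, feat_dim) hashed features
        sem = self.sem_embed(sem_label)      # (N, sem_dim) semantic embedding
        return torch.cat([feat, sem], dim=-1)
```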
Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, produces photorealistic images. SceneDreamer generates vivid and diverse unbounded 3D worlds and outperforms state-of-the-art methods on this task.
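The compositing step at the heart of such a renderer can be sketched as standard volume rendering: per-sample opacities are accumulated into ray transmittance and used to weight colors along each ray. The adversarial training loop against real 2D photos is omitted here, and the densities and colors stand in for the outputs of a network queried at the hashed features above.

```python
# A minimal sketch of volumetric compositing along camera rays.
import torch

def composite(sigmas, colors, deltas):
    # sigmas: (N_rays, N_samples) densities
    # colors: (N_rays, N_samples, 3) per-sample RGB
    # deltas: (N_rays, N_samples) distances between samples
    alpha = 1.0 - torch.exp(-sigmas * deltas)            # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)   # transmittance
    trans = torch.cat(
        [torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1
    )                                                    # shift: T_i = prod_{j<i}(1 - alpha_j)
    weights = alpha * trans                              # (N_rays, N_samples)
    rgb = (weights[..., None] * colors).sum(dim=-2)      # (N_rays, 3)
    return rgb, weights
```

In adversarial training, images rendered this way would be passed to a discriminator alongside real in-the-wild photos, which is what lets the renderer learn photorealism without 3D supervision.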