Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

1Department of Automation, Tsinghua University, China       2GigaAI
*Equal Contribution, Project Leader
MoRe4D Teaser

MoRe4D generates interactive, dynamic 4D scenes from a single static image. Unlike previous paradigms that decouple generation and reconstruction (leading to geometric inconsistencies), we tightly couple geometric modeling and motion generation, achieving consistent 4D motion and geometry.

Abstract

Generating interactive, dynamic 4D scenes from a single static image remains a core challenge. Most existing methods decouple geometry from motion (either generate-then-reconstruct or reconstruct-then-generate), causing spatiotemporal inconsistencies and poor generalization.

To overcome these limitations, we extend the reconstruct-then-generate framework to jointly couple Motion generation with geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Building on this dataset, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) that jointly generates geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module in 4D-STraG that integrate geometry and dynamics effectively. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from the 4D point-track representation. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image.

Methodology: Unified 4D Synthesis

MoRe4D Pipeline

Our framework consists of two core components designed to ensure both geometric stability and dynamic realism:

  • 4D Scene Trajectory Generator (4D-STraG): A joint diffusion model that simultaneously reconstructs and generates spatiotemporal point trajectories. It uses Depth-Guided Motion Normalization to ensure scale invariance and a Motion Perception Module (MPM) to inject rich motion priors from the input image.
  • 4D View Synthesis Module (4D-ViSM): Leveraging the dense 4D point cloud representation, this module synthesizes high-fidelity novel view videos, filling in dis-occluded regions coherently using generative priors.
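The depth-guided normalization idea above can be illustrated with a minimal sketch. This is a hypothetical reading, not the paper's exact formulation: it assumes trajectories are divided by each point's first-frame depth, so that motion magnitudes become comparable across near and far regions and invariant to global scene scale.

```python
import numpy as np

def depth_guided_normalize(tracks, depth0, eps=1e-6):
    """Hypothetical sketch of depth-guided motion normalization.

    Dividing each trajectory by its point's initial depth removes the
    global scene scale, so a point moving 0.1 m at depth 1 m and a point
    moving 1 m at depth 10 m produce the same normalized motion.

    tracks: (T, N, 3) array of 3D point positions over T frames.
    depth0: (N,) per-point depth in the first frame.
    """
    scale = depth0[None, :, None] + eps  # (1, N, 1), broadcast over frames/axes
    return tracks / scale

# Scale invariance check: scaling the whole scene (points and depths)
# by any factor leaves the normalized trajectories unchanged.
T, N = 8, 16
rng = np.random.default_rng(0)
tracks = rng.normal(size=(T, N, 3)) + 5.0  # points well in front of the camera
depth0 = tracks[0, :, 2]                   # first-frame z as depth

a = depth_guided_normalize(tracks, depth0)
b = depth_guided_normalize(2.5 * tracks, 2.5 * depth0)
assert np.allclose(a, b)
```

The invariance shown in the final assertion is the property the paper attributes to this strategy; the actual normalization inside 4D-STraG may differ in detail.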

TrajScene-60K Dataset

TrajScene-60K Curation

To address the data scarcity for 4D generation, we present TrajScene-60K, a large-scale dataset containing:

  • 60,000 High-Quality Samples: Curated from WebVid-10M using VLM-based filtering (CogVLM2 & DeepSeek-V3) to ensure meaningful, countable, and self-initiated motion.
  • Dense Annotations: Includes dense 4D point trajectories, per-frame depth maps, and occlusion masks extracted via DELTA tracking and Gaussian Splatting rendering.
  • Rich Semantics: Paired with high-quality captions describing both scene content and dynamic behavior.
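The VLM-based filtering step can be sketched as a simple predicate over per-clip judgments. Everything here is illustrative: `ClipJudgment` and `keep_clip` are hypothetical names standing in for whatever structured output the CogVLM2 / DeepSeek-V3 pipeline actually produces; only the three criteria (meaningful, countable, self-initiated motion) come from the dataset description above.

```python
from dataclasses import dataclass

@dataclass
class ClipJudgment:
    """Hypothetical VLM verdict for one candidate WebVid clip."""
    has_motion: bool          # scene contains meaningful motion
    countable_subjects: bool  # a small, countable set of moving subjects
    self_initiated: bool      # motion is object-driven, not camera-only

def keep_clip(j: ClipJudgment) -> bool:
    # A clip survives curation only if it passes every criterion
    # listed for TrajScene-60K.
    return j.has_motion and j.countable_subjects and j.self_initiated

judgments = [
    ClipJudgment(True, True, True),    # kept
    ClipJudgment(True, True, False),   # camera-only motion: dropped
    ClipJudgment(False, True, True),   # static scene: dropped
]
kept = [j for j in judgments if keep_clip(j)]
assert len(kept) == 1
```

Clips that pass this filter would then receive the dense annotations described above (DELTA point tracks, per-frame depth, occlusion masks).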

Generated Samples

Input

4D Point Trajectories Generated by 4D-STraG

Multi-View Videos Generated by 4D-ViSM

Prompt: A brown bear walks across rocky terrain.

Bear Input

Prompt: A camel walks along a path in a sunny zoo enclosure.

Camel Input

Prompt: A grey rhino strolls peacefully through the dappled sunlight.

Rhino Input

Qualitative Results

Multi-View & Trajectory Generation

Multi-view Generation

Our model generates consistent 4D point clouds (Top) and renders high-quality videos under arbitrary camera trajectories (Bottom).

Comparison with State-of-the-Art Single-Image-to-4D Methods

Comparison with SOTA

Visual comparison with 4Real, DimensionX, Gen3C, and Free4D. MoRe4D produces more diverse motion and preserves structural consistency better than decoupled approaches.

Quantitative Analysis

We perform a comprehensive quantitative evaluation using VBench. Following the protocol in Free4D, comparisons are structured into three groups based on model availability and trajectory complexity. Our method consistently achieves superior performance, particularly in dynamic degree and visual quality metrics.

Comparison with SOTA

BibTeX

@article{zhang2025more4d,
  title={Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image},
  author={Zhang, Yanran and Wang, Ziyi and Zheng, Wenzhao and Zhu, Zheng and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint},
  year={2025}
}