3D Cinemagraphy from a Single Image

CVPR 2023


Xingyi Li1,3, Zhiguo Cao1, Huiqiang Sun1, Jianming Zhang2, Ke Xian3*, Guosheng Lin3

1Huazhong University of Science and Technology    2Adobe Research    3S-Lab, Nanyang Technological University

Abstract


We present 3D Cinemagraphy, a new technique that marries 2D image animation with 3D photography. Given a single still image as input, our goal is to generate a video that contains both visual content animation and camera motion. We empirically find that naively combining existing 2D image animation and 3D photography methods leads to obvious artifacts or inconsistent animation. Our key insight is that representing and animating the scene in 3D space offers a natural solution to this task. To this end, we first convert the input image into feature-based layered depth images using predicted depth values, followed by unprojecting them to a feature point cloud. To animate the scene, we perform motion estimation and lift the 2D motion into the 3D scene flow. Finally, to resolve the problem of hole emergence as points move forward, we propose to bidirectionally displace the point cloud as per the scene flow and synthesize novel views by separately projecting them into target image planes and blending the results. Extensive experiments demonstrate the effectiveness of our method. A user study is also conducted to validate the compelling rendering results of our method.


Overview video



Method


Given a single still image as input, we first predict a dense depth map. To represent the scene in 3D space, we separate the input image into several layers according to depth discontinuities and apply context-aware inpainting, yielding layered depth images (LDIs). We then use a 2D feature extractor to encode a 2D feature map for each inpainted LDI color layer, resulting in feature LDIs. Subsequently, we lift the feature LDIs into 3D space using the corresponding depth values to obtain a feature point cloud. To animate the scene, we estimate a 2D motion field from the input image and apply Euler integration to generate forward and backward displacement fields. We then augment the displacement fields with the estimated depth values to obtain 3D scene flow fields. Next, we bidirectionally displace the feature point cloud as per the scene flow and separately project the two displaced point clouds into target image planes to obtain feature maps. Finally, we blend these feature maps together and pass the result through our image decoder to synthesize a novel view at time t.
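
To make the animation stage more concrete, the sketch below illustrates the idea in NumPy-style Python. It is a minimal, hypothetical sketch rather than the released implementation: the function names (euler_integrate, lift_to_scene_flow, render_frame, project), the nearest-pixel sampling inside Euler integration, the constant-depth assumption when lifting 2D displacements to scene flow, and the simple linear blending weights (1 - t/N) and t/N are all simplifying assumptions made for illustration.

# Minimal sketch (NumPy) of the animation stage described above.
# All names and simplifications here are illustrative assumptions,
# not the released implementation.
import numpy as np

def euler_integrate(motion, n_steps):
    """Accumulate a per-pixel 2D motion field (H, W, 2) over n_steps Euler steps.

    Returns displacement fields for t = 0 .. n_steps. Under an Eulerian
    assumption, the motion field itself is treated as constant over time.
    """
    H, W, _ = motion.shape
    grid = np.stack(np.meshgrid(np.arange(W), np.arange(H)), axis=-1).astype(np.float32)
    disps = [np.zeros_like(motion)]
    pos = grid.copy()
    for _ in range(n_steps):
        # Sample the motion field at the current (rounded) positions, then step forward.
        xi = np.clip(np.round(pos[..., 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pos[..., 1]).astype(int), 0, H - 1)
        pos = pos + motion[yi, xi]
        disps.append(pos - grid)
    return disps  # disps[t] has shape (H, W, 2)

def lift_to_scene_flow(disp_2d, depth, K):
    """Lift a 2D displacement field to 3D scene flow using the depth map and
    camera intrinsics K (assumes depth stays constant as points move)."""
    H, W, _ = disp_2d.shape
    grid = np.stack(np.meshgrid(np.arange(W), np.arange(H)), axis=-1).astype(np.float32)
    def unproject(px):
        x = (px[..., 0] - K[0, 2]) / K[0, 0] * depth
        y = (px[..., 1] - K[1, 2]) / K[1, 1] * depth
        return np.stack([x, y, depth], axis=-1)
    return unproject(grid + disp_2d) - unproject(grid)  # (H, W, 3) scene flow

def render_frame(points, feats, flow_fwd_t, flow_bwd_t, t, n_steps, project):
    """Bidirectionally displace the feature point cloud and blend the two projections.

    points: (N, 3) 3D points, feats: (N, C) per-point features,
    flow_fwd_t / flow_bwd_t: (N, 3) scene flow integrated forward to time t
    and backward over the remaining n_steps - t steps,
    project: callable that splats (points, feats) to an image-plane feature map.
    """
    fwd = project(points + flow_fwd_t, feats)  # points pushed forward along the flow
    bwd = project(points + flow_bwd_t, feats)  # points pushed backward along the flow
    w = 1.0 - t / n_steps
    # Blending the two projections fills holes that either direction alone would
    # leave open, and the linear weights (an assumption here) keep the first and
    # last frames consistent with the input image.
    return w * fwd + (1.0 - w) * bwd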


Real-World Photos



Computer-Generated Imagery



Paintings



Synthetic Images Generated by Stable Diffusion



Historical Photos



Controllable Animation



More Examples



Citation


@InProceedings{li2023_3dcinemagraphy,
    author    = {Li, Xingyi and Cao, Zhiguo and Sun, Huiqiang and Zhang, Jianming and Xian, Ke and Lin, Guosheng},
    title     = {3D Cinemagraphy From a Single Image},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {4595-4605}
}