*Work done during an internship at FAIR.
1 VGG (University of Oxford)
2 Facebook AI Research (FAIR)
3 Facebook
4 University of Michigan
End-to-end view synthesis: Given a single RGB image (red), SynSin generates images of the scene at new viewpoints (blue). SynSin predicts a 3D point cloud, which is projected onto new views using our differentiable renderer; the rendered point cloud is passed to a GAN to synthesise the output image. SynSin is trained end-to-end, without 3D supervision.
Abstract
We propose a method for single-image view synthesis, allowing for the generation of new views of a scene from a single input image. This is challenging, as it requires comprehensively understanding the 3D scene from a single image. As a result, current methods typically use multiple images, train on ground-truth depth, or are limited to synthetic data. We propose a novel end-to-end model for this task; it is trained on real images without any ground-truth 3D information. To this end, we introduce a novel differentiable point cloud renderer that is used to transform a latent 3D point cloud of features into the target view. The projected features are decoded by our refinement network to inpaint missing regions and generate a realistic output image. The 3D component within our generative model allows for interpretable manipulation of the latent feature space at test time, e.g. we can animate trajectories from a single image. Unlike prior work, we can generate high-resolution images and generalise to other input resolutions. We outperform baselines and prior work on the Matterport, Replica, and RealEstate10K datasets.
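To make the pipeline concrete, the sketch below walks through the forward pass in plain PyTorch: per-pixel features and a depth map are predicted from the input image, lifted to a 3D point cloud, rigidly transformed into the target view, splatted onto the target image plane, and refined into an RGB output. The module definitions, the nearest-pixel splat, and the camera conventions are simplified assumptions for illustration only; SynSin itself uses a soft, differentiable point-cloud renderer and a GAN-based refinement network.

# Minimal illustrative sketch (plain PyTorch) of the pipeline described above.
# All modules and the naive splat are hypothetical stand-ins, not the paper's code.
import torch
import torch.nn as nn


class SynSinSketch(nn.Module):
    """Encode features + depth, lift to a point cloud, project to the target
    view, and refine the rendered features into an RGB image."""

    def __init__(self, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(  # per-pixel feature extractor (stand-in)
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1))
        self.depth = nn.Sequential(    # depth predictor (stand-in)
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus())
        self.refine = nn.Sequential(   # refinement/decoder network (stand-in)
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 3, 3, padding=1), nn.Tanh())

    def forward(self, img, K, K_inv, R, t):
        B, _, H, W = img.shape
        feats = self.encoder(img)                        # (B, C, H, W)
        depth = self.depth(img) + 0.1                    # (B, 1, H, W), positive
        pts, feat_flat = self.lift(feats, depth, K_inv)  # (B, 3, HW), (B, C, HW)
        pts_tgt = R @ pts + t.view(3, 1)                 # rigid transform to target view
        rendered = self.splat(pts_tgt, feat_flat, K, H, W)
        return self.refine(rendered)

    @staticmethod
    def lift(feats, depth, K_inv):
        """Back-project per-pixel features into a 3D point cloud."""
        B, C, H, W = feats.shape
        ys, xs = torch.meshgrid(torch.arange(H, dtype=feats.dtype),
                                torch.arange(W, dtype=feats.dtype), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)  # (3, HW)
        pts = (K_inv @ pix).unsqueeze(0) * depth.reshape(B, 1, -1)       # (B, 3, HW)
        return pts, feats.reshape(B, C, -1)

    @staticmethod
    def splat(pts, feats, K, H, W):
        """Naive nearest-pixel splat; the paper instead uses a soft,
        differentiable rasteriser so gradients also flow to the depth."""
        B, C, _ = feats.shape
        proj = K @ pts                                    # (B, 3, N)
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)   # (B, 2, N)
        u = uv[:, 0].round().long().clamp(0, W - 1)
        v = uv[:, 1].round().long().clamp(0, H - 1)
        idx = (v * W + u).unsqueeze(1).expand(-1, C, -1)  # (B, C, N)
        out = feats.new_zeros(B, C, H * W)
        out.scatter_(2, idx, feats)                       # one point per pixel, no z-buffer
        return out.reshape(B, C, H, W)


# Shape check with dummy data (identity pose, toy intrinsics):
img = torch.rand(1, 3, 64, 64)
K = torch.eye(3)
K[0, 2], K[1, 2] = 32.0, 32.0
out = SynSinSketch()(img, K, torch.inverse(K), torch.eye(3), torch.zeros(3))
print(out.shape)  # torch.Size([1, 3, 64, 64])

In training, the output of such a pipeline would be compared against the ground-truth target view with photometric and adversarial losses; because the renderer is differentiable, those losses also supervise the predicted depth without any 3D ground truth.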
Resources
Results
Sample generated trajectories: given the initial frame of each video, SynSin generates the subsequent images.
Citation
@misc{Wiles19,
  author        = {Olivia Wiles and Georgia Gkioxari and Richard Szeliski and Justin Johnson},
  title         = {SynSin: End-to-end View Synthesis from a Single Image},
  year          = {2019},
  eprint        = {1912.08804},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}