TexPainter: Generative Mesh Texturing with Multi-view Consistency

Hongkun Zhang, Zherong Pan, Congyi Zhang, Lifeng Zhu*, Xifeng Gao
SIGGRAPH 2024

Our method consistently generates high-quality texture images with multi-view consistency; a selection is shown with two views per model, along with the corresponding text prompt below each.

Abstract

The recent success of pre-trained diffusion models unlocks the possibility of automatically generating textures for arbitrary 3D meshes in the wild. However, these models are trained in screen space, and converting their outputs into a multi-view-consistent texture image poses a major obstacle to output quality. In this paper, we propose a novel method to enforce multi-view consistency. Our method is based on the observation that the latent space in a pre-trained diffusion model is noised separately for each camera view, making it difficult to achieve multi-view consistency by directly manipulating the latent codes. Based on the celebrated Denoising Diffusion Implicit Models (DDIM) scheme, we propose an optimization-based color fusion to enforce consistency and indirectly modify the latent codes by gradient back-propagation. Our method further relaxes the assumption of sequential dependency among the camera views. Evaluating on a series of general 3D models, we find that our simple approach improves the consistency and overall quality of the generated textures compared to competing state-of-the-art methods.

TexPainter method

TexPainter Pipeline: Our modified multi-DDIM procedure that enforces multi-view consistency. Each view runs a separate denoising procedure using the DDIM scheme. At each denoising step, DDIM predicts a latent code \(\hat{z}_{0,t}^i\) for the \(i\)th view at the 0th timestep. These \(\hat{z}_{0,t}^i\) are decoded to color space, yielding \(\hat{x}_{0,t}^i\). We then blend these views into a common color-space texture image by weighted averaging. Next, we perform an optimization that updates \(\hat{z}_{0,t}^i\) into \(\bar{z}_{0,t}^i\) for all views, such that their decoded images match the corresponding views rendered with the blended texture image. The updated latent codes are then plugged back into DDIM to predict the next noise level.
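To make the fusion-and-optimization step more concrete, below is a minimal PyTorch-style sketch of one such step; it is not the authors' implementation. The helpers `decode` (latent-to-image decoder), `project_to_texture` (screen-to-texture back-projection), `render_with_texture` (rendering a view from the texture), and the per-view blending `weights` are hypothetical placeholders passed in as callables/tensors.

```python
# Minimal sketch of one color-fusion + latent-optimization step (assumptions:
# `decode`, `project_to_texture`, and `render_with_texture` are hypothetical
# stand-ins for the VAE decoder, the screen-to-texture projection, and the
# texture-space renderer; `weights` are per-view blending weights in texture
# space, already normalized to sum to 1 per texel).
import torch

def fuse_and_update_latents(z0_hats, decode, project_to_texture,
                            render_with_texture, weights,
                            n_iters=20, lr=0.1):
    """Blend the per-view predictions into one texture, then update each
    view's predicted clean latent so its decoded image matches the view
    rendered from that blended texture."""
    # 1. Decode each view's predicted clean latent z_hat_{0,t}^i to color space.
    x0_hats = [decode(z) for z in z0_hats]

    # 2. Back-project every decoded view into texture space and blend them
    #    into a common color-space texture by weighted averaging.
    texture = sum(w * project_to_texture(x, view=i)
                  for i, (x, w) in enumerate(zip(x0_hats, weights)))

    # 3. For each view, optimize the latent so that its decoded image matches
    #    the corresponding view rendered from the blended texture.
    z0_bars = []
    for i, z in enumerate(z0_hats):
        target = render_with_texture(texture, view=i).detach()
        z_bar = z.clone().requires_grad_(True)
        opt = torch.optim.Adam([z_bar], lr=lr)
        for _ in range(n_iters):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(decode(z_bar), target)
            loss.backward()
            opt.step()
        z0_bars.append(z_bar.detach())

    # The updated latents z_bar_{0,t}^i are then fed back into the DDIM update
    # to predict the next (lower) noise level for every view.
    return z0_bars
```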

More results