Text-to-Image Synthesis Based on VQ-Diffusion
Dong Chen (陈栋), Senior Research Manager, Microsoft Research Asia

Text-to-image
Example prompt: "a grey and white cat wearing a tie." (image generated by DALL·E 2)
Quality measurement: 1) diversity, 2) realism, 3) matching degree

NUWA
NUWA is a unified multimodal pre-trained model that can generate new or manipulate existing visual data (i.e., images and videos) for 8 visual synthesis tasks:
Text-To-Image (T2I)
Sketch-To-Image (S2I)
Image Completion (I2I)
Image Manipulation (TI2I)
Text-To-Video (T2V)
Sketch-To-Video (S2V)
Video Prediction (V2V)
Video Manipulation (TV2V)
[Figure: one example per task. Text prompts shown: "A dog with goggles staring at the camera."; "A person is preparing some art."; "a horse is running on the grassland"; "The car is reversing". Sketch region labels shown: grass, water, house, sky, tree, flower, cup, wall, vase, door, table.]

Text-to-image is a hot research field
DALL·E, GLIDE, DALL·E 2, Imagen, Parti, NUWA, VQ-Diffusion, CogView

GAN-based text-to-image models
2014: GAN
2016: GAN-INT-CLS
2017: StackGAN
2018: AttnGAN
2019: MirrorGAN, DM-GAN
2020: DF-GAN, CPGAN
2021: DAE-GAN, XMC-GAN
[1] Reed, Scott, et al. Generative adversarial text to image synthesis. ICML, 2016.

Limitations of GAN-based methods
- Produce good results for single-domain images, e.g., birds, flowers
- Cannot handle complex scenes
[Figure: sample outputs from AttnGAN, DM-GAN, and DF-GAN. *Image is from DF-GAN.]

Auto-regressive Model
[Figure: text input and decoder, with numeric token indices]
2017: Auto-regressive Transformers
2021/02: DALL·E (OpenAI)
2021/05: CogView (Tsinghua)
2021/11: NUWA (MSRA)
2022/06: Parti (Google)

Denoising Diffusion Model
[Figure: forward (diffusion) process and reverse process]
2021/05: Diffusion models beat GAN (OpenAI)
2021/11: VQ-Diffusion (MSRA)
2021/12: GLIDE (OpenAI)
2022/04: DALL·E 2 (OpenAI)
2022/05: Imagen (Google)

Auto-regressive vs. Denoising Diffusion Model
Methods
Auto-regressive Model (AR): 2021/02 DALL·E (OpenAI); 2021/05 CogView (Tsinghua); 2021/11 NUWA (MSRA); 2022/06 Parti (Google)
Denoising Diffusion Model (Diffusion): 2021/11 VQ-Diffusion (MSRA); 2021/12 GLIDE (OpenAI); 2022/04 DALL·E 2 (OpenAI); 2022/05 Imagen (Google)
Pros
Fast training; Better quality; Fas