Slide 1:
Li Yuan, School of ECE, Peking University Shenzhen Graduate School
Generation does not necessarily imply understanding: can diffusion models realize a visual world model?
"What I cannot create, I do not understand." (Richard Feynman)
"What I can generate, I do understand"? "What I can understand, I do generate"?
Does diffusion-based generation truly understand?

Slide 2: Failure cases of diffusion-based video generation
Prompt: "Step-printing scene of a person running, cinematic film shot in 35mm." → The person on the treadmill runs backward, which is illogical.
Prompt: "Five gray wolf pups frolicking and chasing each other around a remote gravel road." → The number of pups fluctuates between five, three, and four.
Prompt: "Glass shattering with red liquid and ice cubes." → The liquid spills out first and the glass shatters afterward, which contradicts physical fact.

Slide 3: Diffusion beats GANs
Dhariwal, Prafulla, and Alexander Nichol. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 34 (2021): 8780-8794.

Slide 4: Development of diffusion-based visual generation, 2020-2024 (academic progress and application progress)
2020: DDPM was proposed in June; DDIM was proposed.
2021: OpenAI proposed DALL-E, based on the Transformer rather than diffusion; THU proposed CogView, a text-to-image model based on the Transformer, following DALL-E; CLIP aligned the text and image spaces and was later widely used for text-to-image generation; the Latent Diffusion Model (LDM) was proposed.
2022: LoRA for diffusion was proposed and quickly adopted for various applications; OpenAI proposed DALL-E 2, based on diffusion; Stability AI open-sourced Stable Diffusion V1 and V2; hit applications such as MidJourney V1-V4 emerged on top of Stable Diffusion; Google proposed V1 of the Video Diffusion Model; Meta proposed the Diffusion Transformer (DiT), replacing the U-Net with a Transformer.
2023: T2I-Adapter (PKU) and ControlNet (Stanford) were proposed for precise text-to-image control; Huawei proposed PixArt, a text-to-image model based on DiT; the video-generation apps Pika V1, Runway Gen1 and Gen2, and Stable Video Diffusion emerged.
2024: OpenAI released Sora, a text-to-video model, but with no API access yet; Shanghai AI Lab proposed Latte, a text-to-video model based on DiT; the Keling model by Kuaishou, Vidu by Shengshu, and the Open Sora plan by PKU followed.

Slide 5: The visual generation and visual understanding routes are completely disjoint.

Slide 6: Generation does not necessarily imply understanding, and understanding does not necessarily enable (visual) generation; the modeling approaches differ: visual generation relies on diffusion models, while visual understanding relies on
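For readers less familiar with the DDPM and DDIM entries in the timeline on slide 4, the sketch below illustrates the diffusion modeling approach that slide 6 contrasts with visual understanding: the closed-form forward noising step and one deterministic DDIM-style reverse step. It is a minimal illustration only; the linear beta schedule is an assumption, and the "noise predictor" is an oracle that returns the true noise, standing in for a trained U-Net or DiT.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)     # cumulative signal fraction ᾱ_t

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))         # toy stand-in for an image
t = 500
eps = rng.standard_normal(x0.shape)      # the noise actually injected

# Forward process q(x_t | x_0): noise x0 in one closed-form step.
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Reverse (DDIM, deterministic): predict the noise, reconstruct x0,
# then step to t-1. Here eps_pred is an oracle, not a trained network.
eps_pred = eps
x0_hat = (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_bars[t])
x_prev = (np.sqrt(alpha_bars[t - 1]) * x0_hat
          + np.sqrt(1.0 - alpha_bars[t - 1]) * eps_pred)

print(np.allclose(x0_hat, x0))  # → True: an oracle predictor recovers x0 exactly
```

With a real network, eps_pred only approximates the injected noise, so x0_hat is refined over many such reverse steps rather than recovered in one; the generation failures on slide 2 arise because the learned predictor captures statistics of pixels, not the logic or physics of the scene.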