WAN: OPEN AND ADVANCED LARGE-SCALE VIDEO GENERATIVE MODELS

Wan Team, Alibaba Group

ABSTRACT

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel spatio-temporal variational autoencoder (VAE), scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features:

Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority.

Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Meanwhile, Wan is the first model that can generate visual text in both Chinese and English, significantly enhancing its practical value.

Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. It also exhibits superior performance compared to larger open-source models, showcasing remarkable efficiency for text-to-vi