Alibaba has unveiled Wan 2.1, its latest open-source video foundation model, designed to generate high-quality videos with realistic, physically plausible motion. The suite includes Wan2.1-I2V-14B, Wan2.1-T2V-14B, and Wan2.1-T2V-1.3B, supporting text-to-video, image-to-video, and video editing at 480p and 720p resolutions. Notably, the T2V-14B model is the first video generation model able to render both Chinese and English text within its output, while the T2V-1.3B model is optimized for consumer-grade GPUs, requiring only 8.19 GB of VRAM to generate a five-second 480p video in about four minutes on an RTX 4090. Wan 2.1 outperforms OpenAI’s Sora on the VBench Leaderboard, leading across key dimensions such as motion smoothness, spatial relationships, and temporal stability.
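For context, the released checkpoints can be run through the Hugging Face diffusers integration. The following is a minimal sketch, not an official example: it assumes a recent diffusers release that ships WanPipeline and AutoencoderKLWan, and the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint name on the Hugging Face hub; the prompt and output path are illustrative.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Assumed hub checkpoint name (not stated in the article).
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

# Load the Wan VAE in float32 for numerical stability; the transformer in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keeps peak VRAM within consumer-GPU limits

frames = pipe(
    prompt="A cat walking through a snowy forest at dusk",
    height=480,
    width=832,
    num_frames=81,   # roughly five seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "wan_t2v.mp4", fps=16)
```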
These performance gains are driven by a 3D causal variational autoencoder (VAE), a Flow Matching training framework, and a Diffusion Transformer (DiT) backbone, all optimized for faster video reconstruction and memory efficiency. Alibaba’s large-scale data pipeline curated 1.5 billion videos and 10 billion images to train the model, improving its realism and coherence. Tests show Wan 2.1’s VAE reconstructs videos 2.5 times faster than HunyuanVideo on an A800 GPU, with even greater speed gains at higher resolutions. Alongside this release, Alibaba recently introduced QwQ-Max-Preview, an advanced reasoning model in its Qwen family, as part of a broader plan to invest $52 billion in AI and cloud computing over the next three years.
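The Flow Matching idea behind the DiT can be summarized in a few lines: instead of predicting noise, the network learns a velocity field that transports Gaussian noise to data along a simple interpolation path. Below is a generic PyTorch sketch of the common rectified-flow variant of this loss, not Wan's actual training code; `model` is a hypothetical stand-in for the DiT operating on VAE latents.

```python
import torch

def flow_matching_loss(model, x1):
    # x1: clean VAE latents of a video clip, shape (B, C, T, H, W)
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)   # uniform timesteps in [0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)
    xt = (1 - t_) * x0 + t_ * x1                    # linear interpolation between noise and data
    v_target = x1 - x0                              # constant velocity along the straight path
    v_pred = model(xt, t)                           # network predicts the velocity field
    return torch.mean((v_pred - v_target) ** 2)
```

At sampling time, the learned velocity field is integrated from t = 0 to t = 1 with an ODE solver, which is what allows straight-path formulations like this to reach good quality in relatively few steps.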