Driving Toward Super-intelligence with Large-scale AI Infrastructure
Jasmeet Bagga
Software Engineer, Network Infrastructure

Meta is an AI company
[Chart: AI cluster size (number of connected accelerators) by year: 2006, 2024, 2026, 2028, 2030, with "Today" marked]

Since 2022, we've seen a major AI infrastructure change
2022: 6K clusters. Job size: 128-512 GPUs
2023: 16-24K clusters. Job size: 16K GPUs
2024-2025: 100K+ clusters
[Diagram: AI at Meta: Models, Software Infra, Physical Infrastructure]

Personalized super-intelligence requires large-scale infra
Prometheus: 1 GW+ cluster by 2026
Hyperion: 5 GW over the next few years

Driving innovation for AI scale and performance
Data
Software
Compute
Network

AI Cluster Driven Opportunities for Rack Scale System Design
Scale-up domain growing
Increased power envelope
Efficiency and safety
Scale-up domains, today to 2030:
100 accelerators, 70 kW
100 accelerators, 0.5 MW
+/-400 V DC

Path to Bigger Scale-up Domain
Things to consider:
Dimensional drivers: 1200 mm by 1200 mm
Power and cooling needs
Weight considerations
Scale-up network: high bandwidth, low latency, more accelerators

Scale-out, AI Fabrics
Disaggregated Scheduled Fabric (DSF):
Lossless/reliable fabric of switches
Tuned for AI
Provides flexibility and speed across multiple generations and types of accelerators and NICs

Next Step for Scale-out, AI Fabrics
Today: Sharing 2-stage DSF that scales to 18K accelerators! (4x)
DSF dual stage (building, 4x AI zone)
Size of single DSF L2 zone: 18K x 800G GPUs
[Diagram: Next step for DSF scale-out fabric: Zone 1, Zone 2, Zone 3, Zone 4; non-blocking; switch labels "SDSWR3-stage 2 1128"]

Want the details?
1:40 PM: "Evolving FBOSS to support Generative AI Network Workloads"

Models: LLAMA
Software Infra: PyTorch, MAST, Tectonic
Physical Infrastructure: Silicon

Only 1% done. Exciting times ahead!