Super-Compute System Scaling for ML Training
Bill Chang, Rajiv Kurian, Doug Williams, Eric Quinnell

Path to General Autonomy
- Model architecture: vision, path planning, auto-labeling; new model architectures and parameter sizes increasing exponentially
- Training data: video training data with 4D labels; ground-truth generation
- Training infrastructure: training and evaluation pipeline, flexible system architecture, software at scale

Accelerated ML Training System
- Typical system: fixed ratio of compute, I/O, and memory
- Optimized ML training system: ML requirements keep evolving, so the compute/I/O/memory balance must be able to evolve with them
- Disaggregated system architecture: flexible ratio of compute, I/O, and memory (a sizing sketch follows below)
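The fixed-versus-flexible-ratio point can be made concrete with a little sizing arithmetic. The sketch below is illustrative only: the demand numbers are made up, and the supply-side figures loosely echo tile and interface-processor numbers quoted later in the deck; the point is just that a fixed-ratio node forces over-provisioning of whichever resources a job does not need, while disaggregated blocks are sized independently.

```python
# Illustrative sizing sketch only: figures below are placeholders, not Dojo specs.
from math import ceil

def fixed_ratio_nodes(need_flops, need_mem_gb, need_io_gbs,
                      node_flops, node_mem_gb, node_io_gbs):
    # A fixed-ratio node is replicated until every resource fits, so the
    # scarcest resource drags extra copies of the other two along with it.
    return max(ceil(need_flops / node_flops),
               ceil(need_mem_gb / node_mem_gb),
               ceil(need_io_gbs / node_io_gbs))

def disaggregated_blocks(need_flops, need_mem_gb, need_io_gbs,
                         tile_flops, mem_block_gb, io_block_gbs):
    # Compute tiles, memory blocks, and I/O blocks are each sized on their own.
    return {"compute_tiles": ceil(need_flops / tile_flops),
            "memory_blocks": ceil(need_mem_gb / mem_block_gb),
            "io_blocks":     ceil(need_io_gbs / io_block_gbs)}

# A memory-heavy job: modest compute, large parameter/activation footprint.
job = dict(need_flops=2e15, need_mem_gb=1024, need_io_gbs=400)
print(fixed_ratio_nodes(**job, node_flops=1e15, node_mem_gb=64, node_io_gbs=100))    # 16 nodes, 8x the needed compute
print(disaggregated_blocks(**job, tile_flops=9e15, mem_block_gb=32, io_block_gbs=50))
```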
Optimized Compute

Technology-Enabled Scaling: System-on-Wafer Technology
- 25 D1 compute dies + 40 I/O dies
- Compute and I/O dies optimized for efficiency and reach
- Heterogeneous RDL optimized for high-density and high-power layout

Maximize Performance and Yield
- Known-good-die and fault-tolerant designs
- Each tile assembled with fully functional dies
- Harvesting and fully configurable routing for yield

Training Tile: Unit of Scale
- Large compute with optimized I/O
- Fully integrated system module (power/cooling)

Uniform High Bandwidth
- 10 TB/s on-tile bisection bandwidth
- 36 TB/s off-tile aggregate bandwidth
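As a purely conceptual illustration of the harvesting and fully configurable routing idea above, the sketch below steers traffic around a disabled die on a 5x5 mesh (matching the 25 compute dies). The mesh model, BFS routing, and fault handling are assumptions for illustration; the actual D1/tile routing fabric is not described at this level in the talk.

```python
# Conceptual illustration only: "configurable routing for yield" modeled as
# routing around disabled dies on a 5x5 die mesh.
from collections import deque

def route(src, dst, disabled, size=5):
    """Shortest path on a size x size mesh (BFS), avoiding disabled dies."""
    if src in disabled or dst in disabled:
        return None
    prev = {src: None}            # visited set + parent pointers
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:           # walk parent pointers back to src
            path = [node]
            while prev[path[-1]] is not None:
                path.append(prev[path[-1]])
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in disabled and nxt not in prev):
                prev[nxt] = node
                queue.append(nxt)
    return None

# A faulty die at (2, 2) is fenced off; traffic between (0, 2) and (4, 2)
# is routed around it instead of through it.
print(route((0, 2), (4, 2), disabled={(2, 2)}))
```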
Training Tile
- 9 PFLOPS BF16/CFP8
- 11 GB high-speed ECC SRAM
- 36 TB/s aggregate I/O bandwidth

Flexible Building Block
- 9 TB/s tile-to-tile links
- Scale with multiple tiles; no additional power/cooling design needed
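A quick back-of-envelope on the figures just quoted: assuming the 36 TB/s aggregate is spread across four tile edges (consistent with the 9 TB/s tile-to-tile links above), dividing peak BF16/CFP8 compute by off-tile bandwidth gives the arithmetic intensity a workload needs before tile-boundary I/O becomes the limit. The four-edge split is an assumption, not a number stated here.

```python
# Back-of-envelope ratios from the quoted tile figures.
PEAK_FLOPS   = 9e15    # 9 PFLOPS BF16/CFP8 per tile
AGG_IO_BYTES = 36e12   # 36 TB/s off-tile aggregate bandwidth
EDGES        = 4       # assumed: one high-bandwidth link per tile edge

per_edge  = AGG_IO_BYTES / EDGES       # 9e12 B/s, matching the 9 TB/s links
intensity = PEAK_FLOPS / AGG_IO_BYTES  # FLOPs available per off-tile byte

print(f"per-edge bandwidth: {per_edge / 1e12:.0f} TB/s")
print(f"FLOPs per off-tile byte at peak: {intensity:.0f}")   # ~250
```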
Disaggregated Memory: V1 Dojo Interface Processor
- 32 GB high-bandwidth memory
- 800 GB/s total memory bandwidth

900 GB/s TTP Interface
- Tesla Transport Protocol (TTP): full custom protocol
- Provides full DRAM bandwidth to the Training Tile (the 900 GB/s link exceeds the 800 GB/s HBM bandwidth)

50 GB/s TTP over Ethernet (TTPoE)
- Enables extending communication over standard Ethernet
- Native hardware support

32 GB/s Gen4 PCIe Interface
Do
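To make the TTPoE idea concrete, the sketch below shows the general pattern of carrying a custom transport directly in Ethernet frames: a raw frame with its own EtherType, built and sent from the host on Linux. The interface name, EtherType (an IEEE experimental value), MAC addresses, and the software send path are all stand-ins for illustration; the actual TTPoE frame format and its native hardware support are not detailed in this material.

```python
# Illustration only: a custom protocol carried directly over standard Ethernet.
# The EtherType (0x88B5 is reserved for local experimental use), addresses, and
# frame layout are stand-ins, not the actual TTPoE format, which is implemented
# in hardware. Linux-only; raw sockets require root.
import socket
import struct

IFACE     = "eth0"                         # placeholder interface name
ETHERTYPE = 0x88B5                         # experimental EtherType, stand-in only
DST_MAC   = bytes.fromhex("ffffffffffff")  # broadcast, for the demo
SRC_MAC   = bytes.fromhex("020000000001")  # locally administered MAC, stand-in

def send_frame(payload: bytes) -> None:
    """Send one raw Ethernet frame: dst MAC | src MAC | EtherType | payload."""
    frame = DST_MAC + SRC_MAC + struct.pack("!H", ETHERTYPE) + payload
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((IFACE, 0))
        s.send(frame)

if __name__ == "__main__":
    send_frame(b"custom transport payload over plain Ethernet")
```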