
InfiniteHBD: Scale-up Outside the Box.pdf

Uploaded by: 明**** | ID: 1011935 | 2025-12-21 | 18 pages | 2.04 MB

InfiniteHBD: Scale-up Outside the Box
Bobby Lu, Lightelligence

Training Compute is Rapidly Scaling
Source: Jaime Sevilla and Edu Roldán (2024), Training Compute of Frontier AI Models Grows by 4-5x per Year. Published online at epoch.ai. Retrieved from: https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year [online resource]

Multi-dimensional parallelism
- Low communication: Data Parallelism (DP), Pipeline Parallelism (PP), Context Parallelism (CP), Sequence Parallelism (SP)
- Intensive communication: Tensor Parallelism (TP), Expert Parallelism (EP)

xPU to xPU Scale-Up Network
- Low Latency: RTT 1TBps

How does the Datacenter Support LLM Training?
- Switch-centric: fat-tree style of High Bandwidth Domain (HBD)
- Using many high radix switches to provide high bandwidth, perfect uniform, non-blocking any-to-any communication
- Additional computing unit to provide crucial redundancy and serviceability

Major Scale-Up Networks
- UALink, SUE, NVLink
- Source: semianalysis. Retrieved from: https:/ resource
- Source: UALink Specification, UALink_200 Rev 1.0
- Source: Scale Up Ethernet Framework Specification, Scale-Ethernet-RM102

Challenges for Switch-Centric Topology
- Scalability requires high radix switches
- Resource fragmentation
- Switch-level fault explosion radius
- Unusable Bandwidth degradation
- Solution: Disaggregate the aggregator

Transceiver-centric HBD
- Unify connectivity and switching by using OCS Transceiver (OCSTrx)

InfiniteHBD
- OCS Transceiver
- Reconfigurable K-Hop Ring
- HBD-DCN Orchestration
- C. Shou et al., SIGCOMM '25, September 8-11, 2025, Coimbra, Portugal, https://doi.org/10.1145/3718958.3750468

Module spec
- QSFP-DD form factor, 8ch TX + 8ch RX, Linear Drive Silicon Photonics Optics
- Total BW up to 800 Gbps single directio…
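The preview names a "Reconfigurable K-Hop Ring" but does not define it. As a rough illustration only, the sketch below assumes that every node on a ring has links to the nodes 1 to K hops away on either side, and that a ring can be re-formed around failed nodes as long as no gap between neighboring healthy nodes exceeds K hops. The function names `k_hop_ring` and `ring_survives` and this fault model are hypothetical and are not taken from the slides or the cited paper.

```python
def k_hop_ring(n: int, k: int) -> dict[int, set[int]]:
    """Adjacency of n ring nodes, each linked to nodes 1..k hops away (assumed model)."""
    if n < 2 or k < 1:
        raise ValueError("need n >= 2 and k >= 1")
    adj: dict[int, set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for hop in range(1, k + 1):
            j = (i + hop) % n
            if j != i:  # skip self-links when hop is a multiple of n
                adj[i].add(j)
                adj[j].add(i)
    return adj


def ring_survives(n: int, k: int, failed: set[int]) -> bool:
    """True if the healthy nodes can still close a ring using links of at most k hops."""
    healthy = [i for i in range(n) if i not in failed]
    if not healthy:
        return False
    for idx, a in enumerate(healthy):
        b = healthy[(idx + 1) % len(healthy)]
        # Hop distance from a to the next healthy node, going around the ring.
        if (b - a) % n > k:
            return False
    return True


if __name__ == "__main__":
    print(sorted(k_hop_ring(8, 2)[0]))   # neighbors of node 0
    print(ring_survives(8, 2, {3}))      # one failed node is bypassed
    print(ring_survives(8, 2, {3, 4}))   # two adjacent failures exceed k = 2
```

Under this assumed model, a larger K tolerates longer runs of adjacent failures at the cost of more links per node.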

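The main points of the report are summarized below:

- **Training compute for AI models is growing rapidly**: the compute used to train frontier AI models grows by 4-5x per year.
- **How the datacenter supports LLM training**: a switch-centric topology, such as a fat-tree style HBD, provides high-bandwidth, low-latency communication.
- **Major scale-up networks**: UALink, SUE, and NVLink, which support non-blocking any-to-any communication.
- **Challenges and solution**: switch-centric topologies face challenges in scalability, resource fragmentation, and fault explosion radius. The proposed solution is to disaggregate the aggregator.
- **InfiniteHBD**: combines the OCS Transceiver, a reconfigurable K-Hop Ring, and HBD-DCN orchestration to provide high bandwidth, low loss, and fast reconfiguration.
- **Performance evaluation**: InfiniteHBD outperforms NVL-72 and TPUv4 on cost and energy efficiency.
- **Datacenter implementation**: InfiniteHBD is integrated with the Lightelligence LightSphereX OCS transceiver, enabling flexible network reconfiguration and efficient scaling.
- **Conclusion**: InfiniteHBD advances network architecture innovation and supports large-scale AI model training.
- **Key figures**: InfiniteHBD costs 3.2x less than NVL-72 and 1.5x less than TPUv4; its energy use is 1.3x lower than NVL-72 and the same as TPUv4.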
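To make the figures in the summary concrete, here is a minimal sketch of the arithmetic behind them. It assumes that "grows by 4-5x per year" compounds annually and that "N times lower cost" means a cost of 1/N of the baseline. The helper names `compute_growth` and `relative_cost` are illustrative and do not come from the report.

```python
def compute_growth(factor_per_year: float, years: int) -> float:
    """Cumulative multiplier after compounding a yearly growth factor."""
    return factor_per_year ** years


def relative_cost(times_lower: float) -> float:
    """Convert 'N times lower cost' into a fraction of the baseline cost."""
    return 1.0 / times_lower


if __name__ == "__main__":
    # Training compute at 4x and 5x per year, compounded over several years.
    for years in (1, 3, 5):
        low = compute_growth(4.0, years)
        high = compute_growth(5.0, years)
        print(f"{years} yr: {low:,.0f}x to {high:,.0f}x")

    # Cost ratios quoted in the summary, expressed as a share of the baseline.
    print(f"vs NVL-72: {relative_cost(3.2):.0%} of baseline cost")
    print(f"vs TPUv4:  {relative_cost(1.5):.0%} of baseline cost")
```

Read this way, a 3.2x lower cost corresponds to roughly 31% of the NVL-72 cost, and five years of 4-5x annual growth compounds to a factor in the thousands.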