
SONiC's Approach to Scaling AI Clusters.pdf

Uploaded by: 明**** ID: 1011635 2025-12-21 22 pages 1.80 MB

SONiC Scale Up Introduction
Eddie Ruan, Alibaba
Riff Jiang, Microsoft

AI Networks (diagram labels): In Rack, In-Rack Scale-Up Network, Across Racks, Scale-Out Network, Across Racks, Across Data Centers

Scale-Up vs Scale-Out

| | Scale-Up Network | Scale-Out Network |
| --- | --- | --- |
| Number of machines | Multiple interconnected nodes | Independent nodes with distributed resources |
| Examples | NVL72, PCIe | Ethernet based AI clusters |
| Communication Characteristics | Low latency; High bandwidth; Memory load-store-atomics; Smaller transfers | Higher latency; Lower bandwidth; Message transfer; Large size transfers |
| Scalability | Limited by GPU and cluster design | Horizontal Scaling |
| Job scheduling | Almost the same | Almost the same |
| Network Performance | 2us RTT | 20us RTT |
| Workload types | Tightly coupled tasks with high inter-node communication | Loosely coupled tasks (e.g., data parallelism) |
| Model size | Very large models that require significant memory | Models that can be split across nodes |
| Memory Architecture | Shared memory | Distributed memory; No shared global memory |
| Parallelism | Required for PP and TP. | Best for DP |

Why do we need Scale up?
https://arxiv.org/pdf/2505.09343
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
(figure labels: Scale Out, Scale Up)

SONiC Scale Up WG
https://lists.sonicfoundation.dev/g/SONiC-Scale-Up-WG
Microsoft, Alibaba
Invited Tencent and Bytedance to join Scale Up WG
Weekly Meetings Every Tue 6-7pm PST
https://lists.sonicfoundation.dev/g/SONiC-Scale-Up-WG/wiki/39581

Alibaba's Thoughts
Application View
Chip View
Large data packet size
Expect to support 256 GPUs in the cluster with 512 GPUs as a stretch goal.
Large Bandwidth
Low end to end latency
Match HBM access with network bandwidth and packet size
Maintenance
Rack Level
Multi-tenant supports
Enhanced visibility via telemetry
Air cooled Systems
Liquid cooled systems

Microsoft's Thoughts
Application View
Network Protocol
Packet sizes 1K 8k
Depending on

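According to "SONiC Scale Up Introduction", the document centers on the SONiC Scale Up network architecture and covers the following key points:
1. **Scale-Up vs Scale-Out**: A Scale-Up network links multiple interconnected nodes, while a Scale-Out network scales by adding independent nodes with distributed resources.
2. **Communication characteristics**: Scale-Up networks offer low latency and high bandwidth and suit tightly coupled tasks; Scale-Out networks have higher latency and lower bandwidth and suit loosely coupled tasks.
3. **Performance metrics**: Scale-Up network RTT is under 2 us; Scale-Out network RTT is under 20 us.
4. **Use cases**: Scale-Up suits very large models; Scale-Out suits models that can be split across nodes.
5. **Hardware requirements**: Liquid-cooled systems, high bandwidth, low latency, multi-tenant support, and so on.
6. **Protocol stack**: Includes PHY, Link, Ethernet MAC, Adaptation, Transport, Network, Transaction, and other layers.
7. **Architecture documents**: Include design specifications, SAI API definitions, module designs, and so on.
8. **Working group members**: Include Microsoft, Alibaba, Tencent, and ByteDance.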
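The Scale-Up and Scale-Out comparison can be sketched as a small lookup. This is only an illustration, not anything taken from the presentation's code or from SONiC: the names `NetworkTier`, `pick_tier`, and `dependent_round_trips` are hypothetical, and the 2 us and 20 us RTT figures are taken from the slides and treated here as fixed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkTier:
    """Hypothetical record for one side of the Scale-Up vs Scale-Out comparison."""
    name: str
    rtt_us: float          # round-trip time in microseconds, as quoted in the slides
    best_for: frozenset    # parallelism strategies the slides associate with this tier


# Values below come from the comparison; the structure itself is an assumption.
SCALE_UP = NetworkTier("scale-up", 2.0, frozenset({"TP", "PP"}))
SCALE_OUT = NetworkTier("scale-out", 20.0, frozenset({"DP"}))


def pick_tier(parallelism: str) -> NetworkTier:
    """Return the tier the slides associate with a parallelism strategy."""
    key = parallelism.upper()
    for tier in (SCALE_UP, SCALE_OUT):
        if key in tier.best_for:
            return tier
    raise ValueError(f"unknown parallelism strategy: {parallelism!r}")


def dependent_round_trips(tier: NetworkTier, budget_us: float) -> int:
    """Count how many strictly sequential round trips fit in a time budget."""
    return int(budget_us // tier.rtt_us)


if __name__ == "__main__":
    for strategy in ("TP", "PP", "DP"):
        tier = pick_tier(strategy)
        trips = dependent_round_trips(tier, 1000.0)
        print(f"{strategy}: {tier.name}, {trips} sequential round trips per ms")
```

Under these assumptions, a 10x gap in RTT translates directly into a 10x gap in how many dependent exchanges fit in the same time window, which is one way to read why tightly coupled TP and PP traffic is placed on the Scale-Up network.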