
In Network Collective Acceleration for AI Fabrics.pdf

Uploaded by: 明**** | ID: 1011952 | 2025-12-21 | 13 pages | 885.50 KB

In Network Collective Acceleration for AI Fabrics
Surendra Anubolu (Broadcom), Nikhil Shetty (Oracle)

Agenda
- Networking bandwidth and latency needs for AI fabric
- Challenges
- Proposed In Network Collectives for Ethernet fabric
- Share data: Tomahawk Ultra In Network Collective performance
- Call to action: Infrastructure-APIs

AI fabric bandwidth and latency challenges
- Collectives account for 90+% of the bandwidth: All Reduce, All to All, All Gather, Reduce Scatter
- Large model sizes: very high bandwidth
- Inference: low latency completions
- MoE: k of N multicast
- Collectives consume most of the fabric bandwidth
- Tensor Parallel and Expert Parallel have communications that are exposed

AI workload traffic patterns

| Parallelism | Collective | Fabric load |
| Tensor parallel | All Reduce | 50% |
| Expert parallel | All to All, Gather | 1 to 10% |
| Sequence parallel | All Gather | 30 to 40% |
| Data parallel | All Reduce | 5% |
| Pipeline parallel | P2P | 0.2% |

Example collective data transfer usage

Why offload
- High-bandwidth communication is a major component of collectives
- Network switches have one or two orders of magnitude more fabric bandwidth than end points
- Predictable latency + tail latency
- Collectives require very little compute
- 51 Tbps requires only 3 TFlops of BF16 adders
- Some collectives like k of N do not require any compute
Fabric is a natural place to accelerate collectives

In Network Collectives: offload
- GPU: 1000 TFlops, 400 G-7 Tbps
- Switch with INC: 3 TFlops, 50 Tbps
- Switch participates in the collective
- Offloads the collective compute such as all_reduce
- At the start of the job, INC Manager allocates switch compute resources and builds a tree
- Tree can be reused for multiple collectives
- INC Manager can work with load sharing facility to reserve resources
- xCCL, libfabric and MPI plugins

Architecture fo
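The slide's claim that 51 Tbps requires only about 3 TFlops of BF16 adders can be checked with simple arithmetic. The sketch below assumes one addition per arriving 16-bit element and ignores packet headers and protocol overhead; the function name `adds_per_second` is illustrative and does not come from the deck.

```python
# Back-of-the-envelope check: how many BF16 additions per second would a
# switch need in order to reduce data arriving at full line rate?
# Assumption (not from the deck): one add per arriving element, no overhead.
BF16_BITS = 16  # a bfloat16 value is 16 bits wide


def adds_per_second(line_rate_bps: float, bits_per_element: int = BF16_BITS) -> float:
    """Elements arriving per second, assuming one add per arriving element."""
    return line_rate_bps / bits_per_element


switch_rate_bps = 51e12  # 51 Tbps, the figure quoted on the slide
tera_adds = adds_per_second(switch_rate_bps) / 1e12
print(f"roughly {tera_adds:.1f} tera-adds per second")
```

Under these assumptions the result lands close to the 3 TFlops quoted on the slide, which is small next to the 1000 TFlops listed for a GPU.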

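The main points of the report are summarized below:
- **In-network collective acceleration**: Collectives account for more than 90% of the bandwidth in AI fabrics and call for high-bandwidth communication and low-latency completion.
- **Bandwidth and latency challenges**: AI fabrics face bandwidth and latency challenges, especially for large models and inference workloads.
- **In Network Collective performance**: Tomahawk Ultra supports in-network collectives, providing predictable performance and low tail latency, with a 2x improvement in collective completion time.
- **INC architecture**: With In Network Collectives (INC), the switch participates in the collective computation, reducing the compute required at the endpoints.
- **OCI performance challenges**: Oracle Cloud Infrastructure (OCI) faces challenges in collective algorithm execution and data transfer.
- **INC advantages**: INC relieves congestion in the upper tiers of the fabric, lowers completion time, improves tail latency, and frees up compute resources.
- **OCI encourages INC**: OCI encourages implementing INC through open interfaces, with support for configurability, monitoring, and scalable fabrics.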
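To make the idea of a switch participating in the collective concrete, here is a toy simulation of an all_reduce over a reduction tree. It is a minimal sketch and not the Broadcom or OCI implementation: the `Endpoint` and `Switch` classes and the `all_reduce` helper are hypothetical names, and the two-level tree is an assumed topology.

```python
from dataclasses import dataclass, field
from typing import List, Union

Vector = List[float]


@dataclass
class Endpoint:
    """A GPU endpoint holding one local contribution (hypothetical model)."""
    data: Vector
    result: Vector = field(default_factory=list)


@dataclass
class Switch:
    """A switch that sums whatever its children send up (hypothetical model)."""
    children: List[Union["Switch", Endpoint]]

    def reduce_up(self) -> Vector:
        # Endpoints send their data; child switches send their partial sum.
        partials = [
            c.data if isinstance(c, Endpoint) else c.reduce_up()
            for c in self.children
        ]
        return [sum(column) for column in zip(*partials)]

    def broadcast_down(self, total: Vector) -> None:
        # Deliver a copy of the final sum to every endpoint below this switch.
        for c in self.children:
            if isinstance(c, Endpoint):
                c.result = list(total)
            else:
                c.broadcast_down(total)


def all_reduce(root: Switch) -> None:
    """Sum all endpoint vectors in the tree and hand the total back to each."""
    root.broadcast_down(root.reduce_up())


# Assumed topology: four GPUs, two leaf switches, one root switch.
gpus = [Endpoint([1.0, 2.0]), Endpoint([3.0, 4.0]),
        Endpoint([5.0, 6.0]), Endpoint([7.0, 8.0])]
leaf_a, leaf_b = Switch(gpus[:2]), Switch(gpus[2:])
root = Switch([leaf_a, leaf_b])
all_reduce(root)
print(gpus[0].result)  # [16.0, 20.0]
```

In this model each endpoint sends its vector once and receives the total once, while the additions happen inside the tree. That is the sense in which the endpoints' compute requirement is reduced.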