
Building a Source-Routed AI Backend Network with SRv6.pdf

Uploaded by: 明**** | ID: 1011334 | 2025-12-21 | 12 pages | 1.40 MB

Guohan Lu, Partner Software Engineering Manager, Microsoft
With contributions from Ahmed Abdelsalam (Cisco), Rita Hui (Microsoft), and Changrong Wu (Microsoft)

SRv6 uSID for Hyperscalers: SDN use cases

Microsoft Global Cloud Network

SONiC Is Powering Cloud At Scale
In Microsoft:
- 90% of T0 switches running SONiC
- 80% of T1 switches running SONiC
- Started rolling out T2 with SONiC in 2024

Artificial Intelligence in the Cloud: Raising the Bar for Hyperscale Datacenter Networks
- **New traffic pattern**: a small number of large flows; periodic bursts of data sent synchronously.
- **Challenges**: traditional passive ECMP-based load balancing mechanisms suffer from a low-entropy problem. Failures of communications in LLM training are costly:
  - An epoch of training is blocked until the synchronized collective communications of the last epoch finish.
  - GPU hours are expensive resources because of tight supply.
  - If an ongoing training job crashes, all progress since the last checkpoint may be lost.

[Figure: hosts (CPU, NIC, SSD), each with four GPUs, attached to T0R and Leaf switches. AI workloads/applications shown: Machine Learning, Deep Learning, Recommender System, Natural Language Processing, Computer Vision.]

SRv6 for Backend Network
At the backend of a data center, SRv6:
- Provides fine-grained network control based on source routing
- Enables path enumeration for traffic management
- Integrates with AI workload flow scheduling to provide optimal network performance
- Allows the source to quickly reroute upon path failures or congestion

[Figure: Plane 1, Plane 2, Plane 3, Plane 4.]

SRv6 for 2-layer topology using uSID
[Figure: T0 and T1 switches with uSID labels 0xd001, 0xd002, 0xd100, 0xd0b3, and 0xd1e0, and NIC uSID 00a0.]
SRv6 packet from NIC, DstIPv6 layout: 32-bit block, 1st hop, 2nd hop, 3rd hop, 4th hop.
- fcbb:bbbb:d100:d001:d1e0:00a0:
- fcbb:bbbb:d100:d002:d1e0:00a0:
When congestion or failure is detected
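The destination addresses in the uSID figure follow a fixed layout: a 32-bit block followed by one uSID per hop. The sketch below, assuming each uSID is 16 bits wide as the address fields suggest, packs a block and a hop list into an IPv6 address using Python's standard `ipaddress` module. The function name `usid_address` is hypothetical, and reading the second address as the rerouted path is an inference from the slide's trailing caption, not something the preview states outright.

```python
import ipaddress


def usid_address(block: int, usids: list[int]) -> ipaddress.IPv6Address:
    """Pack a 32-bit uSID block and up to six 16-bit uSIDs into one IPv6 address."""
    if not 0 <= block < 1 << 32:
        raise ValueError("block must fit in 32 bits")
    if len(usids) > 6:
        raise ValueError("at most six 16-bit uSIDs fit after a 32-bit block")
    value = block << 96  # block occupies the top 32 bits
    for i, usid in enumerate(usids):
        if not 0 <= usid < 1 << 16:
            raise ValueError("each uSID must fit in 16 bits")
        value |= usid << (80 - 16 * i)  # hop i sits right after the block
    return ipaddress.IPv6Address(value)


BLOCK = 0xFCBBBBBB  # fcbb:bbbb, as in the slide

# Hop lists copied from the two addresses in the slide.
primary = usid_address(BLOCK, [0xD100, 0xD001, 0xD1E0, 0x00A0])
backup = usid_address(BLOCK, [0xD100, 0xD002, 0xD1E0, 0x00A0])

print(primary)
print(backup)
```

The two addresses differ only in the second-hop uSID (0xd001 versus 0xd002), so under these assumptions the sender can select a different second hop purely by rewriting the destination address.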

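The report's main points are summarized as follows:
- **SONiC in Microsoft's cloud network**: more than 90% of T0 switches and 80% of T1 switches run SONiC, and T2 switch deployment began in 2024.
- **New traffic pattern**: a small number of large flows, with periodic synchronized bursts of data.
- **Challenges**: traditional ECMP load balancing suffers from low entropy, communication failures in LLM training are costly, GPU resources are in tight supply, and an interrupted training job can lose its progress.
- **SRv6 applications**: fine-grained network control based on source routing, support for path enumeration, and optimized flow scheduling for AI workloads.
- **SRv6 and uSID**: uSIDs implement path control, and SRv6 is used to maximize path utilization.
- **SONiC and SRv6 configuration**: static SRv6 uSIDs and locators are configured through the SONiC configuration database.
- **Multi-plane networks**: in a multi-plane network, the NIC/DPU takes on multiple switch functions.
- **Community contributions**: contributions to all areas of the SONiC community are encouraged.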
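Path enumeration and source-driven rerouting can be illustrated with a small Python sketch. It assumes a hypothetical 2-layer fabric in which every path is a tuple (source-side uSID, middle-hop uSID, destination-side uSID, NIC uSID), reusing the uSID values that appear in the slide. The helper names `enumerate_paths` and `assign_flows`, the flow names, and the round-robin placement policy are all illustrative assumptions, not the scheduler described in the talk.

```python
from itertools import cycle


def enumerate_paths(src_hop, middle_hops, dst_hop, nic):
    """List every path through the fabric, one per middle-hop switch."""
    return [(src_hop, mid, dst_hop, nic) for mid in middle_hops]


def assign_flows(flows, paths, unhealthy=frozenset()):
    """Spread flows round-robin over paths whose middle hop is healthy."""
    usable = [p for p in paths if p[1] not in unhealthy]
    if not usable:
        raise RuntimeError("no healthy path available")
    rotation = cycle(usable)
    return {flow: next(rotation) for flow in flows}


paths = enumerate_paths(0xD100, [0xD001, 0xD002], 0xD1E0, 0x00A0)
flows = ["flow-a", "flow-b", "flow-c", "flow-d"]  # hypothetical large flows

# Explicit placement: each flow is pinned to a path instead of being hashed.
placement = assign_flows(flows, paths)

# The source reacts to a failure of middle hop 0xd001 by re-placing flows.
after_failure = assign_flows(flows, paths, unhealthy={0xD001})

for flow, path in after_failure.items():
    print(flow, [hex(usid) for usid in path])
```

Because the sender chooses the path explicitly, a handful of large flows can be spread evenly across the available paths, whereas hash-based ECMP may place several of them on the same link.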