
RSR: Rapid and Effective Software Upgrades.pdf

Uploaded by: 明**** · ID: 1011621 · 2025-12-21 · 16 pages · 564.79 KB

Rapid Hitful S/W Upgrades
Akarsh Gupta (Google), Jason Bos (Cisco)

Agenda
- Introduction
- Silicon One Express-Boot
- RSR Types
- SONiC Integration

Problem Statement
- AI/ML workloads are highly sensitive to packet loss in the fabric network.
- Software upgrades in fabric need to prevent (or minimize) packet loss.
- Non-Stop Forwarding is the preferred software upgrade mechanism on fabric switches.
  - Zero dataplane traffic loss
  - Maximizes fabric availability
- NSF has its own challenges:
  - Large engineering effort to ensure backward compatibility and zero packet loss.
  - Cannot handle all software upgrade use-cases.
- Cold reboot: Fallback for NSF upgrades.
  - Pause training jobs, drain racks
  - Exponential increase in training time.
- Need a fallback software upgrade mechanism that minimizes impact on AI/ML workloads.

What is RSR?
- Rapid Switch Reboot: Near NSF reboot with sub-second dataplane downtime.
- Express-boot and fast-fast-boot.
- Dataplane continues to forward traffic while CPU is rebooted.
- Intent reprogrammed is cached in vendor SDK.
- COMMIT operation: SDK cache is written into the ASIC and pipeline is restarted.
- Sub-second traffic loss occurs only during COMMIT operation.

[Figure: RSR sequence. Phase labels: Preparation Phase, Restore Phase, Reboot Phase. Step labels: Disconnect ASIC, CPU Reboot, Connect to ASIC, Reprogram Intent to Cache, External Disconnect, Reset all tables (except ports), DMA cache into ASIC, Stop Pipeline, Start Pipeline. Annotations: Traffic Loss, COMMIT.]

NSF/RSR comparison
[Figure: Warmboot/NSF panel with labels SDK v1, SDK v2, Serialize, Deserialize, Persistent storage. Express boot/RSR panel with labels SDK v1, SDK v2, Memory DMA, Fresh config.]

Silicon One Express-Boot
- Stateless
- Any-to-Any SDK upgrade or downgrade
- 50 ms traffic interruption
- Ports & Protocols stay up
- Allow NOS configuration or logic changes
  - E.g. ACL table redefinition
  - New P4 program may be loaded

[Figure: Express boot/RSR. Labels: SDK v1, SDK v2, Memory DMA, Fresh config.]

Express Boot Sequence
[Figure labels: V1 Datapath, V2 Datapath, Critical Section, Resu (preview text ends here)]
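The RSR flow in the slides (dataplane keeps forwarding while the CPU reboots, intent is reprogrammed into an SDK cache, and traffic is lost only during COMMIT) can be illustrated with a toy model. This is a minimal sketch under stated assumptions: `ToyAsic`, `ToySdk`, `program`, and `commit` are hypothetical names invented for illustration and do not correspond to any vendor SDK, SAI, or SONiC API. Port state and real DMA are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAsic:
    """Hypothetical stand-in for a switch ASIC: live tables plus a pipeline flag."""
    tables: dict = field(default_factory=dict)
    pipeline_running: bool = True

    def forward(self, dst):
        # Traffic is dropped only while the pipeline is stopped.
        if not self.pipeline_running:
            return None
        return self.tables.get(dst)


@dataclass
class ToySdk:
    """Hypothetical stand-in for the SDK instance that comes up after the CPU reboot."""
    asic: ToyAsic
    cache: dict = field(default_factory=dict)

    def program(self, dst, port):
        # Restore phase: intent is written to the SDK cache, not to the ASIC,
        # so the old forwarding state stays live.
        self.cache[dst] = port

    def commit(self):
        # COMMIT: the only window in which this model drops traffic.
        events = []
        self.asic.pipeline_running = False
        events.append("stop pipeline")
        # Stands in for "reset all tables" followed by "DMA cache into ASIC".
        self.asic.tables = dict(self.cache)
        events.append("dma cache into asic")
        self.asic.pipeline_running = True
        events.append("start pipeline")
        return events


if __name__ == "__main__":
    asic = ToyAsic(tables={"10.0.0.0/24": "Ethernet0"})
    sdk = ToySdk(asic)                        # new software after CPU reboot
    sdk.program("10.0.0.0/24", "Ethernet4")   # cached only
    print(asic.forward("10.0.0.0/24"))        # old entry still forwards
    print(sdk.commit())
    print(asic.forward("10.0.0.0/24"))        # new entry after COMMIT
```

The design point the sketch tries to capture is that the new software never reads state back from the old one: it replays fresh intent into a cache and swaps it in during one short critical section.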

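The main points of the report can be summarized as follows:
- **Problem and challenges**: AI/ML workloads are highly sensitive to packet loss in the network, so software upgrades need to prevent or minimize packet loss.
- **Non-Stop Forwarding (NSF)**: NSF is the preferred software upgrade mechanism, but it requires a large engineering effort and cannot handle all upgrade use cases.
- **Rapid Switch Reboot (RSR)**: RSR provides a near-NSF reboot with sub-second dataplane downtime and applies to software upgrades and failure recovery.
- **RSR types**: local RSR and remote RSR, each suited to different scenarios and requirements.
- **Silicon One Express-Boot**: stateless, fast upgrade or downgrade with less than 50 ms of interruption.
- **SONiC support**: SONiC supports local RSR, and work is under way to unify the RSR implementations and improve NSF failure recovery.
- **Call to action for network operators**: unify the RSR implementation in SONiC, improve NSF failure recovery, and encourage collaboration and contributions.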