
Capacity-Aware Adaptive Routing.pdf

Uploaded by: 明**** | ID: 1011830 | 2025-12-21 | 14 pages | 1.31 MB

Capacity-Aware Adaptive Routing
Midhun Somasundaran, Software Engineer, Meta
Yikai Lin, Research Scientist, Meta
OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

Agenda
 - Meta's Backend Topology
 - Problem Statement
 - Solution Overview
 - Results
 - Call to Action

Meta's Non Scheduled Fabric (NSF) Topology
 - Non-blocking
 - Planar architecture
 - Modular units (pods)
 - Multiple links between rack and spine
 - Adaptive routing for load balancing

Reference Network Topology
 - Two jobs running in three pods
 - First pod is shared by both jobs

Congestion Spread from Remote Failures
 - Source rack switch is unaware of remote failures
 - Network-wide congestion
 - Impacts all jobs, not just on failed paths

Capacity-Aware Adaptive Routing
 - Solution: Rack switches reduce traffic towards impacted destinations
 - Moves congestion closer to source
 - Protecting jobs not on failed path

Topology Learning and Capacity Propagation
 - BGP propagates per-hop capacity info
 - Rack switches learn E2E capacity
 - Failed links result in paths being pruned in the data plane.

Propagating Capacity with Reachability
 - Each hop records # of paths (A, B, C) in link bandwidth extended community
 - How do rack switches recover A, B, C values?

Encoding Topology and Capacity
 - We treat lbw_ext_comm (unsigned 32) as a bit vector
 - Encoding_scheme: rack_id(4), spine_id(4), A(8), B(8), C(8)

Data-Plane Solution
 - Egress paths pruned to prevent remote congestion
 - Rack link failures result in pruning paths to destination rack
 - Spine link failures result in pruning paths to all impacted racks

Results
 - Capacity unaware vs. capacity aware
 - 20% performance improvement for jobs on the failure path
 - Jobs outside the failure path remain unaffected
 - Remote rack link failures significantly increase
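
The encoding scheme listed in the slides, rack_id(4), spine_id(4), A(8), B(8), C(8), adds up to exactly 32 bits, so it fits the unsigned 32-bit lbw_ext_comm value. A minimal sketch of packing and unpacking such a bit vector follows. The field order (most significant bits first, in the order the slide lists them) and the function names `encode` and `decode` are assumptions for illustration; the slides do not state the bit order or what each of A, B, C counts.

```python
# Assumed layout: fields packed MSB-first in the order the slide lists them.
# The widths come from the slide; the bit order is an assumption.
FIELDS = (("rack_id", 4), ("spine_id", 4), ("A", 8), ("B", 8), ("C", 8))


def encode(values: dict) -> int:
    """Pack the five fields into one unsigned 32-bit integer."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word: int) -> dict:
    """Recover the five fields from an unsigned 32-bit integer."""
    if not 0 <= word < (1 << 32):
        raise ValueError("value is not an unsigned 32-bit integer")
    fields = {}
    # Peel fields off the low end, so walk the layout in reverse.
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


sample = {"rack_id": 3, "spine_id": 5, "A": 8, "B": 4, "C": 2}
packed = encode(sample)
print(hex(packed))      # 0x35080402
print(decode(packed) == sample)  # True
```

With 4-bit identifiers this layout can distinguish at most 16 racks and 16 spines per value, and each path count is capped at 255.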

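Based on the report, the main points are summarized as follows:
 - Meta's backend topology uses a non-blocking, planar architecture with modular units (pods) and multiple links between rack and spine.
 - The problem is that remote failures cause network-wide congestion, which affects all jobs and not only those on the failed paths.
 - The solution is Capacity-Aware Adaptive Routing: rack switches reduce traffic towards impacted destinations, which moves congestion closer to the source.
 - The core techniques are BGP propagation of per-hop capacity information, rack switches learning end-to-end capacity, and a bit-vector encoding of topology and capacity.
 - The results show that capacity-aware routing improved performance by 20% for jobs on the failure path, while other jobs were unaffected.
 - The call to action is to standardize the capacity propagation scheme, collaborate on mitigating ARS resource exhaustion, and integrate key features to broaden the impact.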
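
To make the data-plane idea concrete (a rack switch sends less traffic towards a destination that has lost capacity by pruning egress paths), here is a minimal sketch. The proportional policy, the function name `paths_to_keep`, and its parameters are all assumptions for illustration; the slides say that paths are pruned but do not say how many.

```python
def paths_to_keep(local_paths: int, healthy_capacity: int, current_capacity: int) -> int:
    """Return how many local egress paths to keep for one destination.

    Hypothetical policy: keep a share of the local paths proportional to the
    end-to-end capacity that is still available, rounded up, and keep at least
    one path while the destination remains reachable.
    """
    if local_paths < 0 or healthy_capacity <= 0:
        raise ValueError("local_paths must be >= 0 and healthy_capacity > 0")
    # Clamp the learned capacity into the valid range [0, healthy_capacity].
    current = min(max(current_capacity, 0), healthy_capacity)
    if current == 0 or local_paths == 0:
        return 0  # destination unreachable, or no local paths to use
    # Integer ceiling of local_paths * current / healthy_capacity.
    return (local_paths * current + healthy_capacity - 1) // healthy_capacity


# A destination that normally has 8 units of end-to-end capacity loses 2 of them.
print(paths_to_keep(local_paths=8, healthy_capacity=8, current_capacity=6))  # 6
```

Rounding up errs on the side of keeping a path, so a destination with any remaining capacity is never cut off entirely by this rule.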