Precision Time Protocol for AI Back-End Networks: Modeling and Deployment Insights.pdf

Uploaded by: 明**** ID: 1012029 2025-12-21 16 pages 1.08 MB

PTP in AI Networks
Ahmad Byagowi, Turba.ai
Amit Oren, Broadcom
Bhaskar Chinni, Broadcom
OCP Special Focus: Artificial Intelligence (AI)

Agenda
Introduction: Need for time synchronization in AI networks
Phantom Jam, Phantom Traffic and Phantom Delay
How TCP determines channel capacity
Use cases & benefits of delay awareness
Test data
Conclusion

Phantom Jam
An emerging behavior of cascading controllers
Source: https:/

How TCP Determines Channel Capacity?
Additive Increase, Multiplicative Decrease
Source: https:/

Potential Solution
Open Loop backed with Time Slices instead of independent controllers

Importance of network for AI workloads
[Figure: XPUs with attached HBM stacks. 4 x HBM3E (9.6 Tbps): 38.4 Tbps. 8 x HBM4 (12.8 Tbps): 102.4 Tbps.]
Besides improvements in the network speeds, efficiency is also important.
Efficiency means effective traffic scheduling.
One-way latency (OWL) can be an effective tool for a traffic scheduler.
OWL requires precision time in all the nodes.
Precision time is a product of time synchronization.

PTP for Network Efficiency (for OWL capability)
OWL from host A to host B is the time between A's NIC transmit timestamp and B's NIC receive timestamp for the same packet.
Unlike RTT/2, OWL captures asymmetry (different paths/queuing in each direction), which is common in Clos/leaf-spine fabrics.

Why OWL matters in AI workloads:
Collectives (e.g., ring/tree all-reduce) and MoE token routing are barrier-sensitive; tail OWL (p99/p99.9) often controls step time even when average latency is low.
Microbursts (incast to a single ToR egress) can create millisecond-class queueing spikes that dominate p99 OWL.

Production-grade measurement patterns (hardware-assisted):
Clock sync: Use PTP (IEEE 1588/802.1AS) with boundary/transparent clocks so both endpoints' NIC PHCs are al…
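
As a minimal sketch of the OWL definition in the slide text above (receiver NIC timestamp minus sender NIC timestamp for the same packet), the Python below computes per-direction OWL, compares it with RTT/2, and reads off a tail percentile. The timestamp values are made-up sample data, the helper names `one_way_latency_ns` and `percentile` are illustrative, and the code assumes both NIC clocks are already synchronized (for example via PTP). It does not read real hardware timestamps.

```python
import math


def one_way_latency_ns(tx_ns: int, rx_ns: int) -> int:
    """OWL = receiver NIC RX timestamp minus sender NIC TX timestamp.

    Only meaningful if both clocks share a common time base.
    """
    return rx_ns - tx_ns


def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical (tx_ns, rx_ns) pairs; the last A->B packet hits a microburst.
a_to_b = [(1_000, 3_000), (2_000, 4_100), (3_000, 5_050), (4_000, 29_000)]
b_to_a = [(1_500, 2_500), (2_500, 3_500), (3_500, 4_600), (4_500, 5_550)]

owl_fwd = [one_way_latency_ns(tx, rx) for tx, rx in a_to_b]
owl_rev = [one_way_latency_ns(tx, rx) for tx, rx in b_to_a]

# RTT/2 averages the two directions and hides the asymmetry between them.
rtt_half = (owl_fwd[0] + owl_rev[0]) / 2

print("forward OWL (ns):", owl_fwd)
print("reverse OWL (ns):", owl_rev)
print("RTT/2 for first packet pair (ns):", rtt_half)
print("forward p50 (ns):", percentile(owl_fwd, 50))
print("forward p99 (ns):", percentile(owl_fwd, 99))
```

In this sample the forward and reverse directions differ by about a factor of two, which RTT/2 cannot show, and a single delayed packet sets the forward p99 while the median stays low.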

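
The slide text also names additive increase, multiplicative decrease (AIMD) as the way TCP determines channel capacity. The sketch below is a deliberately simplified model of that idea, not a TCP implementation: the function names, the parameters `alpha` and `beta`, and the rule that a loss occurs whenever the window exceeds a fixed hypothetical `capacity` are all assumptions made for illustration. Slow start, timeouts, and real loss detection are left out.

```python
def aimd_step(cwnd: float, loss: bool, alpha: float = 1.0,
              beta: float = 0.5, min_cwnd: float = 1.0) -> float:
    """One AIMD update: add alpha when no loss, multiply by beta on loss."""
    if loss:
        return max(min_cwnd, cwnd * beta)
    return cwnd + alpha


def simulate(capacity: float, rounds: int, cwnd: float = 1.0):
    """Run AIMD against a fixed capacity (assumed loss model: cwnd > capacity)."""
    trace = []
    for _ in range(rounds):
        loss = cwnd > capacity
        cwnd = aimd_step(cwnd, loss)
        trace.append(cwnd)
    return trace


trace = simulate(capacity=4, rounds=8)
print(trace)
```

The window climbs linearly until it overshoots the capacity, halves, and climbs again, so the sender only learns the capacity by probing past it.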
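The report centers on time synchronization and delay awareness in AI networks. Key points:
1. **Importance of time synchronization**: In AI networks, precise time synchronization is essential for avoiding delay and traffic problems.
2. **TCP congestion control**: TCP determines channel capacity through additive increase, multiplicative decrease (AIMD).
3. **Delay awareness**: Delay awareness is essential for effective traffic scheduling and for AI workloads.
4. **One-way latency (OWL)**: OWL is an effective measure of network latency that captures asymmetry in paths and queuing.
5. **PTP synchronization**: Use PTP (IEEE 1588/802.1AS) for clock synchronization to achieve nanosecond-level time synchronization.
6. **Hardware timestamping**: Enable hardware timestamping on the network interface card (NIC) to record transmit and receive times.
7. **OWL measurement**: Measure OWL by recording each packet's transmit and receive times and computing the difference.
8. **Optimization recommendations**: Mitigate high OWL by limiting queues, adjusting topology and link speeds, and using path/QoS diversity.
9. **Conclusion**: The report advocates open-loop systems, promotes the use of OWL in AI, and suggests considering upgrades to hardware that supports precise time synchronization.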