INSPECT: A Data Analysis Tool for Proactive Link Failure Detection on Meta AI Infrastructure.pdf

Uploaded by: 明**** ID: 1011607 2025-12-21 14 pages 1.58MB

INSPECT: Proactive Link Failure Detection Tool

Granthana Rangaswamy, Arushi Sharma, Matt Bergeron, Herman Chin, Salina Dbritto, Yuvin Weerasinghe, Abhishek Tiwari

ARTIFICIAL INTELLIGENCE (AI)

Preview

This presentation introduces INSPECT, a Parametric Analysis Tool that monitors high-speed interconnect performance for Meta Datacenter AI racks. The following topics will be discussed:

Why is Parametric Analysis needed?
Parametric Analysis Implementation
Data Collection
Data Analysis
Introducing INSPECT
INSPECT Use-Cases
Call to Action

Why is Parametric Analysis Needed?

Meta's AI clusters will have millions of SerDes operating at 112 Gbps, 224 Gbps and beyond. As next-generation AI systems emerge, the impact of SerDes-related issues is expected to grow due to shrinking margins, increasing speeds and more complexity. To minimize unplanned resource unavailability and job restarts, it's crucial to proactively identify SerDes-related anomalies.

[Figure: Compute Bank, Switch Bank, Backplane]

SerDes and AI clusters

High-speed fabric channels are typically point to point. The channel behaves like a low-pass filter, attenuating the high-frequency components. Compensating for inter-symbol interference (ISI) is critical and is mainly done in the digital domain. The picture above is a sample of the DAC/ADC-based architecture and numerous methods that can be used to equalize a lossy channel. Pic courtesy: Circuits and Systems for Signal Processing (CASSP) Lab.

Parametric Analysis Implementation

Parametric Data Collection

SerDes data:
Forward Error Correction (FEC) statistics
Signal to Noise Ratio (SNR)
Bit Error Rate (BER)
Transmitter Equalization Parameters
Feed Forward Equalization (FF

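Summary: The document introduces INSPECT, a tool developed by Meta to monitor high-speed interconnect performance, with a particular focus on the SerDes in datacenter AI racks. Key points:

1. **Purpose**: Reduce resource unavailability and job restarts caused by SerDes problems.
2. **Necessity**: As AI systems evolve, the impact of SerDes-related issues is expected to grow.
3. **Data collection**: Covers SerDes data (such as FEC statistics, SNR, BER and equalization parameters) and system-level data (such as temperature, voltage and link speed).
4. **Data analysis**: Used for proactive maintenance, targeted repair, fast root-cause analysis and improved serviceability.
5. **Use cases**: Initial screening, repair of systems in production, and system-wide inspection.
6. **Call to action**: Community collaboration is needed to support tools like INSPECT across different ASIC platforms.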
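The preview does not show how INSPECT analyzes the collected parameters, so the following is only a minimal sketch of what proactive screening on such data could look like. It assumes each lane reports an SNR reading and FEC counters, estimates pre-FEC BER as corrected bits divided by bits received, and flags lanes whose BER exceeds a limit or whose SNR falls well below the fleet median. All names (`LaneSample`, `flag_marginal_lanes`) and both thresholds are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class LaneSample:
    """One parametric reading for a SerDes lane (field names are hypothetical)."""
    lane_id: str
    snr_db: float
    fec_corrected_bits: int
    bits_received: int

    @property
    def pre_fec_ber(self) -> float:
        # Assumption: pre-FEC BER is estimated as corrected bits / total bits.
        if self.bits_received == 0:
            return 0.0
        return self.fec_corrected_bits / self.bits_received


def flag_marginal_lanes(samples, ber_limit=1e-6, mad_factor=3.0):
    """Return (lane_id, reasons) pairs for lanes that look marginal.

    ber_limit and mad_factor are illustrative values, not thresholds
    from the source presentation.
    """
    snrs = [s.snr_db for s in samples]
    med = median(snrs)
    # Median absolute deviation: a robust measure of fleet-wide SNR spread.
    mad = median([abs(x - med) for x in snrs])

    flagged = []
    for s in samples:
        reasons = []
        if s.pre_fec_ber > ber_limit:
            reasons.append("high pre-FEC BER")
        # Only SNR values below the fleet median are treated as suspicious.
        if mad > 0 and (med - s.snr_db) > mad_factor * mad:
            reasons.append("low SNR vs. fleet")
        if reasons:
            flagged.append((s.lane_id, reasons))
    return flagged


if __name__ == "__main__":
    fleet = [
        LaneSample("a", 20.0, 10, 10**9),
        LaneSample("b", 20.5, 2000, 10**9),
        LaneSample("c", 19.8, 10, 10**9),
        LaneSample("d", 20.2, 10, 10**9),
        LaneSample("e", 14.0, 5000, 10**9),
    ]
    for lane_id, reasons in flag_marginal_lanes(fleet):
        print(lane_id, ", ".join(reasons))
```

Comparing each lane against the fleet median rather than a fixed SNR floor is one possible design choice: it adapts to different link speeds and platforms, at the cost of missing problems that affect every lane equally.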