
Fight fire with fire: AI-assisted test/debug flow and log analysis for AI GPU systems.pdf

Uploaded by: 明**** | ID: 1011748 | 2025-12-21 | 15 pages | 1.48 MB

Fight fire with fire: AI-assisted test/debug flow and log analysis for AI GPU systems
Tommy Yan, GPU Project Lead, Microsoft Azure
Anna Mary Mathew, Director, Microsoft Azure

TEST & VALIDATION

- AI Infrastructure scaling and introduction of new technologies creates unique test validation framework that has massive validation data being created for post processing
- Validation data uses heterogenous formats
- Debug with massive data is becoming even more complex
- Few of the key areas of debug are
  - Rack level connectivity issues
  - Power envelope worst case scenarios
  - Performance variation at cluster level

Problem statement

AI assisted System Test/Debug Flow and Log Analysis

Interested Logs: File patterns to scan (e.g., *BMCSELListDetail*.csv).
Error Match Type
ERROR: Flag if log line contains any error_text keyword, excluding whitelist_text.
Match Text: Keywords that indicate a problem (error, fail, critical,).
Whitelist Text: Known safe/irrelevant phrases to ignore (non-critical, Correctable error,).
PASS: All pass_text keywords must be present in each log entry.
Stop-on-Fail Flag: Halt test flow on detection if true.

Define Error Signature File

For interested logs:

Pre-Search Treatment
07 00 ca 24 c2 96 68 37 01 00 02 02 10 00 ff ff
RecordName: DramTest Error OEM Event
EvtD 1: 1st Error ID (DimmDtrResult): DTR_STATUS_NO_FAILURE
EvtD 2: 2nd Error ID (Fail count): 255

Log Search

Known Good Log Compare

Result Categorization
- Group by message patterns
- Separate matched vs missing signatures

Result De-duplication
- Fuzzy matching (80% threshold)
- Merge similar error signatures

Matched Result

Post-Search Treatment

Result Analysis
- Stop-on-fail as behavior
- Triggered as ne
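The error signature fields described in the slides (file pattern, ERROR/PASS match type, match text, whitelist text, stop-on-fail) can be illustrated with a small Python sketch. This is not the presenters' code: the `Signature` class, the `line_fails` and `scan` functions, and the sample log lines are hypothetical names and data invented for illustration. It also assumes one possible reading of "excluding whitelist_text", namely that whitelisted phrases are masked out of a line before the error keywords are checked.

```python
import fnmatch
import re
from dataclasses import dataclass, field


@dataclass
class Signature:
    """Hypothetical in-memory form of one error-signature entry."""
    file_pattern: str                  # "Interested Logs" glob
    match_type: str                    # "ERROR" or "PASS"
    match_text: list                   # keywords to look for
    whitelist_text: list = field(default_factory=list)
    stop_on_fail: bool = False


def line_fails(sig, line):
    """Return True if this log line violates the signature."""
    text = line.lower()
    if sig.match_type == "ERROR":
        # Assumption: whitelisted phrases are masked as whole words, so
        # "Correctable error" is ignored but "Uncorrectable error" is not.
        for phrase in sig.whitelist_text:
            text = re.sub(r"\b" + re.escape(phrase.lower()) + r"\b", "", text)
        return any(k.lower() in text for k in sig.match_text)
    # PASS: every keyword must be present in the entry.
    return not all(k.lower() in text for k in sig.match_text)


def scan(sig, filename, lines):
    """Return (failing_lines, halted) for one log file."""
    if not fnmatch.fnmatch(filename, sig.file_pattern):
        return [], False               # not an interested log
    failures = []
    for line in lines:
        if line_fails(sig, line):
            failures.append(line)
            if sig.stop_on_fail:
                return failures, True  # halt the flow on first detection
    return failures, False


# Sample data below is invented for illustration only.
sig = Signature(
    file_pattern="*BMCSELListDetail*.csv",
    match_type="ERROR",
    match_text=["error", "fail", "critical"],
    whitelist_text=["non-critical", "Correctable error"],
)
sample_lines = [
    "Correctable error on DIMM A1",
    "Uncorrectable error on DIMM A1",
    "Fan speed non-critical threshold",
    "Link training FAILED",
    "Boot complete",
]
failures, halted = scan(sig, "node1_BMCSELListDetail_0.csv", sample_lines)
```

Masking on word boundaries is a design choice in this sketch: a plain substring check would let the whitelist entry "Correctable error" also hide "Uncorrectable error".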

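Based on the report, the main points are summarized as follows:
- **AI-assisted test/debug flow and log analysis**: Tommy Yan and Anna Mary Mathew of Microsoft Azure present an AI-assisted test and debug flow for analyzing logs from AI GPU systems.
- **Test and validation challenges**: The scaling of AI infrastructure and the introduction of new technologies call for a unique test validation framework that generates massive amounts of validation data, and debugging with that much data is becoming more complex.
- **Key debug areas**: Rack-level connectivity issues, worst-case power envelope scenarios, and performance variation at cluster level.
- **AI-assisted system test/debug flow**: Covers scanning the logs of interest, error match types, defining the error signature file, known-good log comparison, result analysis, and notification.
- **Debug case studies**: Include GPU link-related failures and resolving validation confusion.
- **Resources**: A GitHub link to the log analysis code and a link for participating in the OCP test and validation enablement program are provided.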
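The known-good log comparison and result de-duplication steps named in the slides can be sketched as follows. This is a minimal illustration, not the code referenced in the report: the function names `compare_with_known_good` and `dedupe` and all sample messages are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is swapped in as one possible fuzzy matcher, with its similarity ratio compared against the 80% threshold mentioned in the slides.

```python
from difflib import SequenceMatcher


def compare_with_known_good(known_good, current):
    """Split known-good signatures into those found in the current log
    (matched) and those absent from it (missing)."""
    good, cur = set(known_good), set(current)
    return {"matched": sorted(good & cur), "missing": sorted(good - cur)}


def dedupe(messages, threshold=0.8):
    """Greedily merge similar messages.

    Each message joins the first group whose representative has a
    similarity ratio of at least `threshold`; otherwise it starts a
    new group. Returns a list of (representative, count) pairs.
    """
    groups = []  # each entry is [representative, count]
    for msg in messages:
        for group in groups:
            if SequenceMatcher(None, group[0], msg).ratio() >= threshold:
                group[1] += 1
                break
        else:
            groups.append([msg, 1])
    return [(rep, count) for rep, count in groups]


# Sample data below is invented for illustration only.
comparison = compare_with_known_good(
    known_good=["BMC ready", "GPU 0 enumerated", "GPU 1 enumerated"],
    current=["BMC ready", "GPU 0 enumerated"],
)
merged = dedupe([
    "GPU 3 link 2 training failed",
    "GPU 5 link 2 training failed",
    "PSU 1 power envelope exceeded",
])
```

The greedy grouping depends on message order and on which message becomes the representative, so a production tool might prefer normalizing variable fields (IDs, timestamps) before comparing.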