
OCP Diagnostics and Debug Workstream Introduction.pdf

Uploaded by: 明**** | ID: 1011630 | 2025-12-21 | 10 pages | 1.27 MB

OCP Data Center Diagnostics and Debug Workstream Introduction
Marko Bartscherer, Intel
Enrico Carrieri, Intel

Problem Statement and Motivation

With the rise of large AI systems, nodes and components are becoming more diverse and increasing in size and complexity.
- Components within a node may be from multiple suppliers
- Very complex GPU silicon (2 kW TDP)
- Nodes with 50+ complex components
- Clusters with 1000s of nodes working on 1 job

[Figure: node and cluster block diagram. Labels: CPU, PCIe Switches, PCIe Retimers, GPU, CEM Card GPU, CEM Card NIC, EDSFF, NVMe, Switch, Vendor 1 through Vendor 5, Management Server.]

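According to the report, the main content is summarized as follows:
- **AI system growth**: With the rise of large AI systems, nodes and components are becoming more diverse and complex.
- **Node and component complexity**: A node may contain 50+ complex components from multiple suppliers, with GPU silicon reaching 2 kW TDP.
- **Problem and motivation**: Data centers lack a standardized way to extract data and debug; existing solutions are mostly vendor-specific.
- **Goals**: Develop common debug and troubleshooting tools that support remote debug and closed-rack debug, with a plug-in architecture for vendor-specific payloads.
- **Workstream**: Create specification documents covering the interfaces, protocols, APIs, and data models used to communicate diagnostic and debug data within the data center.
- **Scalability**: Ensure the solution scales across the whole data center life cycle, including access ports, diagnostic and debug capabilities, and security and privacy requirements.
- **Participation and target**: All members are encouraged to join the OCP Diagnostics and Debug workstream; the target is to publish the specification by the end of 2026.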
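
To make the plug-in goal in the summary concrete, here is a minimal Python sketch of how a common tool might dispatch vendor-specific payloads to registered decoders. Everything in it is an assumption made for illustration: the names `DebugRecord`, `register_decoder`, and `collect`, the vendor label `vendor1`, and the payload layout are hypothetical and do not come from the OCP workstream, which has not yet published a specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DebugRecord:
    """Hypothetical vendor-neutral envelope for one diagnostic payload."""
    vendor: str
    component: str
    summary: str


# Registry mapping a vendor name to its decoder plug-in (illustrative only).
_DECODERS: Dict[str, Callable[[str, bytes], DebugRecord]] = {}


def register_decoder(vendor: str):
    """Decorator that registers a vendor-specific payload decoder."""
    def wrap(fn):
        _DECODERS[vendor] = fn
        return fn
    return wrap


@register_decoder("vendor1")
def decode_vendor1(component: str, payload: bytes) -> DebugRecord:
    # Made-up payload layout: one status byte followed by ASCII text.
    status = payload[0]
    text = payload[1:].decode("ascii")
    return DebugRecord("vendor1", component, f"status={status} {text}")


def collect(vendor: str, component: str, payload: bytes) -> DebugRecord:
    """Common entry point: use a plug-in if one exists, else keep a raw hex dump."""
    decoder = _DECODERS.get(vendor)
    if decoder is None:
        return DebugRecord(vendor, component, "undecoded:" + payload.hex())
    return decoder(component, payload)
```

Calling `collect("vendor1", "GPU0", b"\x02link down")` goes through the registered decoder, while a vendor with no plug-in falls back to a hex dump, so the common tool can still carry data it cannot interpret.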