
Panel Discussion: Advances in Hardware Fault Management for Reliability at Scale.pdf

Uploaded by: 明**** | ID: 1011675 | 2025-12-21 | 23 pages | 1.82 MB

Hardware Fault Management - Workstreams Update - HWFM, FMFM, RAS API
Yogesh Varma, Drew Walton (Microsoft), Vilas Sridharan (AMD), Carlos Vallin (Microsoft), Shubhada Pugaonkar (Intel), Anil Agrawal (Meta)
Hardware Management / HW Fault MGMT

Panel on RAS API, FMFM and HWFM
-Yogesh Varma, Co-Lead OCP: RAS API, HWFM and FMFM
-Vilas Sridharan, Senior Fellow, AMD
-Carlos Vallin, Principal Engineer, Microsoft
-Shubhada Pugaonkar, Principal Engineer, Intel
-Anil Agrawal, RAS Lead, Meta
-Drew Walton, Principal Engineer, Microsoft

Si and Data Center Reliability OCP - Initiatives and Future
Yogesh Varma, PhD, Co-Lead OCP: RAS API, HWFM and FMFM

Facets of Fleet HW Fault Mgmt
-Scalability
-Vendor Agnostic
-Si Agnostic
-Observability
-Testability
-RAS Feature Rollout
-Standardized Logging
-In-band and OOB Support
-Serviceability
-Discoverability
-Configurability
-Debuggability
Calls for a Holistic DC Reliability Framework

OCP Hardware Fault Management
Today: Standard framework for errors classification, logging formats, signaling interface, handling actions and RAS configuration
Related OCP Activities:
-Datacenter Debug and Diagnostic
-CPU/GPU and System RAS and Manageability
-Cloud Infrastructure Management
-Si Fault Detection and Mitigation (SDC)
Future: an OCP Standard End-to-End Data Center Reliability Framework
Industry is aligning with OCP Reliability Initiatives, join us!

Hardware Fault Management
Create a standard scalable fault handling framework by collaborating with stakeholders to build a shared knowledge-base and by enhancing exist…

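Based on the report, its main content can be summarized as follows:
1. **OCP Hardware Fault Management (HWFM) and Fleet Memory Fault Management (FMFM)**: OCP is working to create a standard hardware fault management framework covering error classification, logging formats, signaling interface, handling actions and RAS configuration.
2. **RAS API**: Defines a standardized interface between data center devices and RAS management software, supporting both standardized and vendor-specific actions.
3. **FMFM framework**: Standardizes the data collected, provides a framework for memory error analysis, and helps memory and CPU vendors work together on error analysis.
4. **HWFM framework**: Covers error classification, logging formats, signaling interface, handling actions and RAS configuration, and supports both in-band and out-of-band error handling.
5. **Standardization and interoperability**: Standardized interfaces and frameworks allow components from different vendors to interoperate.
6. **Future direction**: OCP is working toward an end-to-end data center reliability framework and calls on the industry to take part.
7. **Activities and meetings**: OCP holds regular meetings, including the HWFM, FMFM and RAS API workstreams, and encourages experts to participate and contribute.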
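To make the ideas in the summary more concrete (error classification, a common logging format, in-band and out-of-band reporting, and handling actions), here is a minimal Python sketch. It is purely illustrative: the class names, field names, severity classes and the action table are all assumptions made for this example and are not taken from the OCP HWFM, FMFM or RAS API specifications.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class Severity(str, Enum):
    # Illustrative classes only; the real OCP classification may differ.
    CORRECTED = "corrected"
    UNCORRECTED_RECOVERABLE = "uncorrected_recoverable"
    FATAL = "fatal"


class Channel(str, Enum):
    IN_BAND = "in_band"          # reported through the host OS
    OUT_OF_BAND = "out_of_band"  # reported through a management controller


@dataclass(frozen=True)
class FaultRecord:
    """Hypothetical vendor-agnostic fault record."""
    vendor: str
    component: str
    severity: Severity
    channel: Channel
    detail: str


# Hypothetical policy table mapping each error class to a handling action.
ACTIONS = {
    Severity.CORRECTED: "log_only",
    Severity.UNCORRECTED_RECOVERABLE: "isolate_and_continue",
    Severity.FATAL: "reset_and_service",
}


def handle(record: FaultRecord) -> dict:
    """Classify a record and attach the handling action chosen for it."""
    entry = asdict(record)
    entry["severity"] = record.severity.value
    entry["channel"] = record.channel.value
    entry["action"] = ACTIONS[record.severity]
    return entry


def to_log_line(record: FaultRecord) -> str:
    """Serialize to one JSON line so every vendor logs the same shape."""
    return json.dumps(handle(record), sort_keys=True)


if __name__ == "__main__":
    rec = FaultRecord("vendor_a", "dimm0", Severity.CORRECTED,
                      Channel.OUT_OF_BAND, "single-bit ECC error")
    print(to_log_line(rec))
```

The point of the sketch is the separation of concerns: the record shape is the same regardless of vendor or reporting channel, while the policy table can be changed without touching the devices that emit the records.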