Enhancing RAS in AI Hardware and High-Performance Computing with Real-Time Health Monitoring.pdf

Uploaded by: 明**** | ID: 1011337 | 2025-12-21 | 12 pages | 1.18 MB

OCP EMEA Summit, 29 April | Dublin, Ireland
Guy Gozlan, proteanTecs
Enhancing RAS in High-Performance Computing with Real-Time Health Monitoring

Semiconductors and RAS
- Physics: Smaller geometries, complex architectures
- Software stress: High-performance applications with increasing/changing workloads
- Hyper-competition: Less margins in design, less time to test, shorter time to tape out
- Cost: Cannot keep up with demand, refresh delayed (4-6 years), HW needs to last longer, more time for failure
- Operational: Lower operational voltages, increased workload demands, unpredictable future workloads
- Scale: High volumes and all connected via system clusters

Functional failures | Silent data corruption | System-wide errors

Current Approaches
- Slow Response
- Lacking Location
- Complex and expensive integration
- BIST: Running only at startup

proteanTecs Multi-Pillar Technology
Native solution | Ecosystem agnostic | Smart integration
- IP & EDA: On-chip HW monitoring system, integration & implementation
- Software Applications: Cloud & edge analytics for actionable insights
In-Production Testing | In-Field
On-Cloud (SW), On-Board (SW), On-Tester (SW), On-Cloud (SW), On-Board (SW), In-Chip (FW)
- Real path monitoring
- High-speed clock sampling
- PPA adherent
- Full embedded HW system

Monitoring Margin to Timing Failure
High-coverage & continuous monitoring of actual performance limiting paths with on-chip Agents
- At test and in mission-mode
- Extreme high coverage of performance limiters paths
Sensitive to:
- Workload stress
- Latent defects
- Operating conditions
- DC IR drops & local Vdroops
- Hot spots
- Aging
Legend: Sufficient Timing Margin | Low Timing Margin | Critically low Timing Margin

Demonstration in 5nm Communication System
This slide will include a 90 second video of the RTHM running in a real customer system (with voiceover by the speaker to explain what we're seeing)
Performance Index
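
The three-level legend in the slide text (Sufficient, Low, Critically low Timing Margin) can be read as a threshold classification of a measured margin. The following is a minimal sketch of that idea only, not proteanTecs' method: the unit (picoseconds), both threshold values, and the names `classify_margin` and `worst_level` are hypothetical and do not come from the slides.

```python
from enum import Enum
from typing import Iterable


class MarginLevel(Enum):
    # Labels copied from the slide legend.
    SUFFICIENT = "Sufficient Timing Margin"
    LOW = "Low Timing Margin"
    CRITICALLY_LOW = "Critically low Timing Margin"


# Hypothetical thresholds in picoseconds; the slides give no numbers.
LOW_THRESHOLD_PS = 50.0
CRITICAL_THRESHOLD_PS = 20.0


def classify_margin(margin_ps: float) -> MarginLevel:
    """Map one measured timing margin to a legend level."""
    if margin_ps < CRITICAL_THRESHOLD_PS:
        return MarginLevel.CRITICALLY_LOW
    if margin_ps < LOW_THRESHOLD_PS:
        return MarginLevel.LOW
    return MarginLevel.SUFFICIENT


def worst_level(margins_ps: Iterable[float]) -> MarginLevel:
    """Report the level of the smallest margin among the monitored paths."""
    return classify_margin(min(margins_ps))
```

Reporting the level of the smallest margin reflects the idea that the path closest to a timing failure limits the whole chip; a real system would presumably also track trends over time rather than single readings.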

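Based on the report, its main content can be summarized as follows:

- **Topic**: Enhancing reliability, availability, and serviceability (RAS) in high-performance computing through real-time health monitoring.
- **Challenges**: Design margins and test time are shrinking, hardware refresh is delayed so hardware must last longer, operating voltages are dropping, workload demands are increasing, system scale is growing, and the risk of functional failures is rising.
- **Solution**: proteanTecs' multi-pillar technology, comprising:
  - A native, ecosystem-agnostic solution with smart integration.
  - An on-chip hardware monitoring system.
  - Software applications, including cloud and edge analytics.
  - Real-time health monitoring (RTHM) in production.
- **Benefits**:
  - High-coverage, continuous monitoring of performance-limiting paths.
  - An event-based algorithm that provides a Performance Index (PI).
  - Failure prediction that helps avoid functional failures, data corruption, and system-level errors.
- **Data**: Support for more than 100 designs from 28nm to 2nm, with a broad global footprint.

Key points:

- RAS challenges: shrinking margins and test time, and hardware that must last longer.
- proteanTecs technology: chip-level monitoring, software analytics, RTHM.
- Benefits: high-coverage monitoring, a Performance Index, failure prediction and avoidance.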