
OCP Silent Data Corruption Working Group Update.pdf

Uploaded by: 明**** | ID: 1011663 | 2025-12-21 | 19 pages | 1.05 MB

Silent Data Corruption Updates
SDC Challenges in AI
Harish Dixit, Meta
Cyril Meurillon, NVIDIA
Hardware Management

Silent Data Corruption (Refresher)
Defects in silicon
2+2=5
Silent errors in compute units
Hard to detect
Undetected for months/years
Significant impact to services

What makes SDCs different?
Faulty Device
Typical Fault Management: ECCs, Logs, Counters, RAS Features
SDCs in AI workloads may cause numeric explosions, e.g. NaN, or subtly undermine model accuracy
Got NaN?

SDC Challenges in AI
SDC: Hardware faults that go undetected, subtly undermining AI model accuracy and trustworthiness.
Growing Urgency: AI/ML at HPC scale (billions of parameters, thousands of nodes) amplifies SDC impact.
Difficulty in Correlation: Hard to link low-level hardware errors to high-level AI performance.
Insidious Nature: Produces incorrect outputs without triggering alarms.
Impact on Trust: Compromises the integrity & reliability of AI systems, especially in critical applications.

OCP Server Resilience, SDC Working Group
Drive solutions and best practices that prevent and detect SDCs.
Create awareness about SDC challenges across the computing community.
Partner & engage with the academic community to actively address growing SDC challenges.
Specification 1.0: released in 2024

SDC Impact on AI Training
Training Implications:
NaN Propagation: SDCs can lead to Not-a-Number values, propagating across clusters and causing halts and significant debug time.
Corrupted Gradients: Subtle corruptions can cause training to stall or diverge.
Computational Inefficiency: Wastes weeks/months of valuable computational resources

Inference Implications:
Incorrect Results: Corrupted devices yield inaccurate predictions, directly affecting critical decisions.
Costly Triage: Identifying and quarantini...
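
The NaN-propagation failure mode listed under the training implications can be illustrated with a small guard that refuses to apply an update when any gradient entry is non-finite. This is only a sketch and not a method taken from the slides: the names find_non_finite and guarded_step, the plain-list representation of parameters, and the learning rate are all assumptions chosen for illustration.

```python
import math


def find_non_finite(grads):
    """Return the indices of gradient entries that are NaN or +/-Inf."""
    return [i for i, g in enumerate(grads) if not math.isfinite(g)]


def guarded_step(params, grads, lr=0.01):
    """Apply one SGD-style update unless the gradients contain NaN/Inf.

    Returns (new_params, bad_indices). When bad_indices is non-empty the
    parameters are returned unchanged, so the corruption is not written
    into the model state and cannot spread further from this step.
    """
    bad = find_non_finite(grads)
    if bad:
        return list(params), bad
    return [p - lr * g for p, g in zip(params, grads)], []


if __name__ == "__main__":
    params = [1.0, 2.0]
    # Hypothetical corrupted gradient: the second entry came back as NaN.
    new_params, bad = guarded_step(params, [0.5, float("nan")])
    print("skipped update, bad indices:", bad)
    # Clean gradient: the update is applied normally.
    new_params, bad = guarded_step(params, [0.5, 1.0])
    print("updated params:", new_params)
```

A guard like this only catches corruptions that surface as NaN or Inf. The subtle corruptions that the slides describe as causing training to stall or diverge would pass through it undetected.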

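Based on the highlighted content, the document centers on the challenges of Silent Data Corruption (SDC) in AI workloads and the solutions to them. The key points are:
1. SDC: hardware faults that go undetected and subtly undermine AI model accuracy and trustworthiness.
2. SDC impact: at HPC scale, the impact of SDC on AI/ML is amplified.
3. SDC challenge: it is hard to link low-level hardware errors to high-level AI performance.
4. SDC impact on AI training: NaN propagation, corrupted gradients, computational inefficiency.
5. SDC impact on AI inference: incorrect results, costly triage, reduced model effectiveness.
6. Solutions: a multi-layered strategy that combines infrastructure-level and workload-level solutions, and software and hardware mechanisms.
7. New infrastructure solutions: proactive testing, analytical methods.
8. New workload-level solutions: algorithmic fault tolerance, divergence detection and gradient clipping.
9. New optimization techniques: deterministic training, parameter vulnerability factors.
10. Open research: cross-disciplinary collaboration to address metric gaps, detection, and scalable solutions.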
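
The summary names algorithmic fault tolerance as a workload-level solution but gives no detail. One classic form of the idea is a checksum test on matrix multiplication: for C = A x B, the column sums of C must equal the column sums of A multiplied by B. The sketch below shows that check in pure Python. It is an illustrative assumption, not the scheme presented in the deck, and the function names and tolerance are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_sums(m):
    """Sum each column of a matrix given as a list of rows."""
    return [sum(col) for col in zip(*m)]


def checksum_ok(a, b, c, tol=1e-9):
    """Check a claimed product c against the column-checksum identity.

    The expected checksum (column sums of a, times b) costs one
    vector-matrix product, far less than recomputing a @ b. A NaN in c
    also fails the check, because comparisons with NaN are False.
    """
    expected = matmul([column_sums(a)], b)[0]
    actual = column_sums(c)
    return all(
        abs(e - x) <= tol * max(1.0, abs(e)) for e, x in zip(expected, actual)
    )


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = matmul(a, b)
    print("clean result passes:", checksum_ok(a, b, c))
    c[0][1] += 1.0  # simulate a silent corruption in one output element
    print("corrupted result passes:", checksum_ok(a, b, c))
```

A check of this kind detects that some element in a column is wrong, but not which row it is in. Locating or correcting the error would need an additional row checksum, which is omitted here to keep the sketch short.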