Zikang Xu, Yiming Zhang and Zhirong Shen
A Fail-Slow Detection Framework for HBM Devices
ASP-DAC 2025

Outline
Background
Unsuccessful Attempts and Lessons
A Fail-Slow Detection Framework for HBM Devices
Conclusion

Background: Memory Wall
The gap between computing power and memory bandwidth is continuously widening in modern systems [1].
Processors are improving exponentially, but memory bandwidth is increasing slowly.
The memory wall has become one of the major obstacles in training LLMs.
[Figure: processor performance vs. memory performance, with the gap between them labeled "Memory wall"; axis: performance]
[1] Micron Inc., Micron's Perspective on Impact of CXL on DRAM Bit Growth Rate

Background: High-Bandwidth Memory
HBM is a promising technology for overcoming the memory wall.
Saves substantial physical space by stacking dies vertically.
Offers significantly higher data transfer rates.
Reduces power consumption.
Each pseudo channel can be accessed independently.
[Figure: HBM stack structure; labels: die, buffer die, TSVs, SID0, SID1]

Background: Fail-slow Faults
Recent Studies of Fail-slow Faults
A survey of fail-slow faults
Research on and detection of fail-slow faults in HDDs and SSDs
A fail-slow detection framework for cloud storage systems
Fail-slow faults in memory have been less studied.
Existing studies focus mainly on theoretical speculation and lack robust validation, replication, and detection tools.

Outline
Background
Unsuccessful Attempts and Lessons
A Fail-Slow Detection Framework for HBM Devices
Conclusion

Design Goals
A practical HBM fail-slow detection framework should have several properties.
General. Because HBM devices are diverse, our framework aims to be applicable to all HBM devices with little or no modification.
Non-intrusive. If possible, we do not wish to modify or affect the user code. We prefer to use existing workloads and external performance statistics for testing.
Accurate. This framework should be able t