
Microsoft presents power stabilization solutions for AI training datacenters.pdf

Uploaded by: 明**** | ID: 1011841 | 2025-12-21 | 15 pages | 1.93 MB

Power stabilization for AI training datacenters
Microsoft, OpenAI, NVIDIA
A true cross-company collaboration
https://aka.ms/PowerStabilization

as nodes simultaneously transition between compute-intensive (high power) and communication-intensive (low power) phases
Problem statement

Power oscillations can cause generation equipment damage and flicker
Can de-stabilize the interconnection
Previously seen failures by the industry
High Frequencies: 3-30 Hz
Low Frequencies: 0.1-2 Hz

Time domain
  Max permitted rate of increase in power demand (MW/s)
  (MW/s)
  Allowed short-term deviation in power draw before ramp constraints are triggered
Frequency domain
  For each range

Exploring solutions across the stack

GPU power shaping to meet datacenter requirements
NVIDIA GB200 implementation
  Bounding box controller
    Minimum power floor/MPF: 20%-90% of TDP
  Ramp rate controller
    Ramp-up/down rates, hysteresis for ramp down
  In-band and out-of-band support
  Cumulative lifetime associated with the feature
Accounting for the EDPp and MPF range, achievable swing of 20%
Can work in synergy with other mitigation methods
GPU power smoothing / Min Power Floor (MPF)

Power-hungry secondary workload
  Artificial workload, or a low-priority useful job
  Low context to avoid performance impact to the training job
  5% impact achieved using MPS
  Ability to increase consumption up to 100% of the TDP
Telemetry
  GPU activity counters
  Fine-grained telemetry
Start-up and back-off mechanism
Trade-off between potential performance impact and power swing masking granularity
Software Mitigation (Firefly)

An energy-storage solution
  1. That directly measures the load, has enough capacitance to support the workload,
  2. Meets the sudden rise/drop needs in power, and switches modes between charging and discharging quickly.
Ene
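The GPU power shaping controls named in the slides (a minimum power floor, ramp-up/down rate limits, and hysteresis before ramping down) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function `shape_power` and all of its parameters are hypothetical, power is modeled in discrete time steps, and nothing here reflects the actual NVIDIA GB200 implementation or any NVIDIA API.

```python
def shape_power(demand_w, tdp_w, mpf_frac, ramp_up_w, ramp_down_w, hold_steps):
    """Shape a per-step power demand trace (watts) with a floor and ramp limits.

    Hypothetical model, not an NVIDIA interface:
      mpf_frac    - minimum power floor as a fraction of TDP
      ramp_up_w   - largest allowed increase per step
      ramp_down_w - largest allowed decrease per step
      hold_steps  - steps to hold power before ramping down (hysteresis)
    """
    # The slides give 20%-90% of TDP as the MPF range.
    if not 0.2 <= mpf_frac <= 0.9:
        raise ValueError("mpf_frac must be within 0.2..0.9")
    floor_w = mpf_frac * tdp_w
    current = floor_w        # assume the device starts at the floor
    steps_below = 0          # consecutive steps with demand below current draw
    shaped = []
    for want in demand_w:
        # Bounding box: never below the floor, never above TDP.
        target = min(max(want, floor_w), tdp_w)
        if target > current:
            steps_below = 0
            current = min(target, current + ramp_up_w)
        elif target < current:
            steps_below += 1
            # Hysteresis: only start ramping down after the hold period.
            if steps_below > hold_steps:
                current = max(target, current - ramp_down_w)
        else:
            steps_below = 0
        shaped.append(current)
    return shaped


# A compute phase (1000 W) followed by a communication phase (100 W).
demand = [1000.0] * 3 + [100.0] * 6
shaped = shape_power(demand, tdp_w=1000.0, mpf_frac=0.5,
                     ramp_up_w=250.0, ramp_down_w=100.0, hold_steps=2)
print(shaped)
```

In this toy trace the raw demand swings by 900 W in a single step, while the shaped draw never falls below the 500 W floor and never changes by more than the configured ramp rates. Whenever the shaped value exceeds the demand, the device is deliberately burning extra power, which is the cost of smoothing.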

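Based on the flagged content, the document is mainly about power stabilization solutions for AI training datacenters. Key points:
1. **Problem**: Power oscillations can damage equipment, cause flicker, and affect grid stability.
2. **Frequency ranges**: high frequencies (3-30 Hz) and low frequencies (~0.1-2 Hz).
3. **Solutions**:
   - **GPU power shaping**: implemented on NVIDIA GB200, including a minimum power floor (MPF) and a ramp rate controller.
   - **Software mitigation (Firefly)**: a power-hungry secondary workload (artificial, or a low-priority useful job) that fills in the low-power phases.
   - **Energy storage**: directly measures the load and switches quickly between charging and discharging.
4. **Combined approach**: combine energy storage, GPU power smoothing, and Firefly for a more stable power supply.
5. **Importance**: solutions at the datacenter power distribution hierarchy and in the DC domain are critical.
6. **Future directions**: improve specifications and guidance, and establish standards for items such as telemetry, load signaling, sub-synchronous oscillation mitigation, and energy storage devices.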
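The two frequency ranges named in the problem statement (0.1-2 Hz and 3-30 Hz) suggest checking how much of a measured power trace oscillates inside each band. The slides do not say how this is measured, so the following is only one possible sketch: the function `band_peak_amplitude`, the sample rate, and the synthetic trace are all assumptions, and a plain discrete Fourier transform from the standard library is used for simplicity.

```python
import cmath
import math


def band_peak_amplitude(samples, sample_rate_hz, f_lo_hz, f_hi_hz):
    """Largest sinusoidal amplitude of `samples` within [f_lo_hz, f_hi_hz].

    Hypothetical helper: a direct O(n^2) DFT, fine for short traces only.
    """
    n = len(samples)
    mean = sum(samples) / n
    peak = 0.0
    # Skip k = 0 (the DC level) and stay strictly below the Nyquist bin.
    for k in range(1, (n + 1) // 2):
        freq = k * sample_rate_hz / n
        if f_lo_hz <= freq <= f_hi_hz:
            coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(samples))
            # Scale the bin magnitude to a single-sided amplitude.
            peak = max(peak, 2.0 * abs(coeff) / n)
    return peak


# Synthetic trace: 50 MW average with a 10 MW swing at 1 Hz,
# sampled at 100 Hz for 2 seconds (all values are illustrative).
rate_hz = 100.0
trace_mw = [50.0 + 10.0 * math.sin(2.0 * math.pi * 1.0 * i / rate_hz)
            for i in range(200)]

low_band = band_peak_amplitude(trace_mw, rate_hz, 0.1, 2.0)
high_band = band_peak_amplitude(trace_mw, rate_hz, 3.0, 30.0)
print(f"0.1-2 Hz peak: {low_band:.3f} MW, 3-30 Hz peak: {high_band:.3f} MW")
```

Here the 1 Hz swing falls entirely in the low band, so the low-band peak is about 10 MW and the high-band peak is essentially zero. A real training trace is closer to a square wave and would also show harmonics at multiples of the iteration frequency.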