
Power Virus and Characterization Solutions from AI Accelerators to Systems.pdf

Uploaded by: 明**** | ID: 1011733 | 2025-12-21 | 20 pages | 1.21 MB

Power Virus and Perf evaluation strategy for AI accelerator systems

Jeremy (Jinghan) Yang, Hardware Systems Engineer, Meta
Samu Chakki, Hardware Systems Engineer, Meta
Richa Mishra, Hardware Systems Engineer, Meta

SERVER: AI HW SW CO-DESIGN / NIC / HPC

Agenda
- Why do we need Power Virus and Application Power Characterization in AI accelerator systems
- Strategy overview
- Engineering workstreams
- Next steps and call for industry collaboration

Context
With the fast growth of compute demand for AI accelerators and systems, we see a dramatic power increase from silicon and module, to compute/network nodes, all the way to rack and beyond. Power Virus and application workload characterization deliver coverage to support:
- Power supply/VR stability
- Thermal characteristics
- Cooling
- System qualification
- Reliability
This effort will help refine TDP spec points. CSPs can further optimize efficient power capacity planning.

Source: Practices and insights into liquid cooling on Meta's AI training platforms. Authors: Cheng Chen, Yin Hang, Noman Mithani, Chris Malone, Yueming Li, Wenying Zhang, John Fernandes, Kalpak Dhake, Jaret Wyatt, Jarrod Clow, Darron Young

Look into AI accelerator Power Definition
[Figure: power versus time, with labels Pmax/EDP, peak power, average power, and kernel duration. Two annotations read "0.90 x", one with model assumptions Temp: 85C, Compute, TT part, and one with model assumptions Temp: 85C, Compute, CIP, TT part.]

Monitoring and Telemetry
- Sideband report: accelerator, module, system platform, and rack-level power, thermal, current, voltage sensors, and errors.
- Inband report: IO bandwidth, throughput, latency; processing core utilization, performance counters; error statu…
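The power-definition slide distinguishes peak power from average power over different time scales. As a rough illustration of that distinction, the sketch below computes both from a sampled power trace. The trace values, the fixed sampling interval, and the function names are hypothetical and do not come from the talk; it simply assumes telemetry delivers evenly spaced power samples in watts.

```python
def windowed_peak_power(samples_w, window):
    """Highest moving-average power over any `window` consecutive samples."""
    if window <= 0 or window > len(samples_w):
        raise ValueError("window must be between 1 and len(samples_w)")
    running = sum(samples_w[:window])  # sum of the first window
    best = running
    for i in range(window, len(samples_w)):
        # Slide the window by one sample: add the new one, drop the oldest.
        running += samples_w[i] - samples_w[i - window]
        best = max(best, running)
    return best / window


def average_power(samples_w):
    """Mean power over the whole trace."""
    return sum(samples_w) / len(samples_w)


# Hypothetical trace in watts, one sample per fixed interval (assumption).
trace = [400, 420, 900, 950, 430, 410, 880, 400]

print(windowed_peak_power(trace, 1))  # 950.0  (instantaneous peak)
print(windowed_peak_power(trace, 2))  # 925.0  (peak over a 2-sample window)
print(average_power(trace))           # 598.75 (average over the trace)
```

Widening the window lowers the reported peak toward the average, which is one way to see why a power limit only has meaning alongside the time scale it is measured over.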

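Based on the report, the main content is summarized as follows:

1. **Growing power demand of AI accelerator systems**: With the rapid growth in AI compute demand, power consumption has increased significantly from the silicon up to the full system.
2. **Power Virus and performance evaluation strategy**: The talk proposes using Power Virus and application workload characterization to support power supply stability, thermal characteristics, cooling, system qualification, and reliability.
3. **Strategy overview**: Covers Power Virus tool development, model workload monitoring, test infrastructure setup, data analysis, and AI accelerator power stress tools.
4. **Engineering workstreams**: Test content from bare metal to application, test automation/orchestration, monitoring infrastructure, and data analysis at scale.
5. **Key data**: For example, TDP is considered on a time scale of hundreds of milliseconds to seconds, and application power on a time scale of hundreds of milliseconds to seconds.
6. **Next steps and call for industry collaboration**: Emphasizes working with industry to optimize power capacity planning and to connect production workloads with power characteristics.

Key points:
- Power Virus and performance characterization are critical for AI accelerator systems.
- Testing and data analysis are needed across multiple levels.
- Industry collaboration is emphasized to optimize power management.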