
Exploring Energy Efficiency in Scientific and Industrial AI Workloads.pdf

Uploaded by: 明**** | ID: 1011837 | 2025-12-21 | 22 pages | 2.70 MB

Akshay Sharma, Alexander Newkirk, Alexander Sim, Zhengji Zhao, and Won Young Park
Lawrence Berkeley National Laboratory

Exploring Energy Efficiency in Scientific and Industrial AI Workloads

OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

Goal: Analyze the trade-offs between energy savings and runtime performance of large-scale AI workloads.

Methodology:
- Select models with open-source code and training data.
- Set up the models to run on Perlmutter, a supercomputer system at the National Energy Research Scientific Computing Center (NERSC), LBNL.
- Train AI models from scratch across multiple GPU nodes.
- Measure the energy savings from GPU power capping against the training runtime.

System: NERSC provided the HPC and storage facilities. Perlmutter is a supercomputer named after Saul Perlmutter, a Nobel Prize winner at Berkeley Lab. It is a heterogeneous system based on the HPE Cray Shasta platform.

Hardware Architecture - Perlmutter
Overview:
- HPE (Hewlett Packard Enterprise) Cray EX supercomputer with 3,072 CPU-only and 1,792 GPU-accelerated nodes.
- GPU node specs: 1 AMD Milan CPU with 64 cores (2 logical cores each) and 4 x NVIDIA A100 GPUs (40 GB).
- GPU power cap range: 100 W to 400 W (TDP), applied through SLURM: #SBATCH -gpu-power

https://www.nersc.gov/what-we-do/computing-for-science/perlmutter (Credit: Thor Swift, Berkeley Lab)
Source: https://docs.nersc.gov/systems/perlmutter/architecture/

NERSC provides tools for profiling jobs, through which we obtained a time series of power consumption for all nodes and their components (GPU, CPU, and memory). Node and component power is measured by Cray power monitor counters, and the data is available at a granularity of seconds. The metrics
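
Comparing a power-capped run with an uncapped baseline comes down to integrating each sampled power series into energy and setting the energy savings against the runtime increase. The Python sketch below illustrates that arithmetic under stated assumptions. It assumes samples are uniformly spaced in seconds, that job energy is the sum of per-node energies, and that trapezoidal integration is acceptable. These are illustrative choices; the preview does not give the study's actual procedure. The helper names (`energy_joules`, `runtime_seconds`, `job_energy_joules`, `tradeoff`) are hypothetical and not part of any NERSC tool, and the numbers in the example are synthetic.

```python
# Minimal sketch: energy vs. runtime trade-off from sampled power data.
# Assumptions (not from the slides): uniform sampling interval in seconds,
# power values in watts, hypothetical helper names, synthetic example data.


def energy_joules(power_w: list[float], interval_s: float) -> float:
    """Integrate one uniformly sampled power series (W) with the trapezoidal rule."""
    return sum(0.5 * (p0 + p1) * interval_s for p0, p1 in zip(power_w, power_w[1:]))


def runtime_seconds(power_w: list[float], interval_s: float) -> float:
    """Elapsed time covered by the samples (first to last sample)."""
    return max(len(power_w) - 1, 0) * interval_s


def job_energy_joules(per_node_w: list[list[float]], interval_s: float) -> float:
    """Total job energy: sum of each node's integrated power series."""
    return sum(energy_joules(series, interval_s) for series in per_node_w)


def tradeoff(base_j: float, base_s: float,
             capped_j: float, capped_s: float) -> tuple[float, float]:
    """Return (energy savings %, runtime increase %) of a capped run vs. a baseline."""
    savings_pct = 100.0 * (base_j - capped_j) / base_j
    slowdown_pct = 100.0 * (capped_s - base_s) / base_s
    return savings_pct, slowdown_pct


if __name__ == "__main__":
    # Synthetic, illustrative numbers only (one node, 1 s samples).
    baseline = [1600.0] * 101  # 100 s at a steady 1,600 W
    capped = [1000.0] * 121    # 120 s at a steady 1,000 W
    e_b, t_b = energy_joules(baseline, 1.0), runtime_seconds(baseline, 1.0)
    e_c, t_c = energy_joules(capped, 1.0), runtime_seconds(capped, 1.0)
    savings, slowdown = tradeoff(e_b, t_b, e_c, t_c)
    print(f"energy savings {savings:.1f}%, runtime increase {slowdown:.1f}%")
```

With these synthetic numbers the script prints `energy savings 25.0%, runtime increase 20.0%`. Because energy is power multiplied by time, a cap saves energy only when the drop in power outweighs the stretch in runtime, and that balance is the trade-off the study sets out to measure.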

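The GPU node specifications and the 100 W to 400 W power cap range listed above also bound how much power the GPUs can draw. The rough Python calculation below counts only GPU power and ignores CPU, memory, and other node components. It also treats the cap as an upper limit rather than the actual draw. The constant and function names are illustrative, not taken from NERSC tooling.

```python
# Back-of-the-envelope GPU power ceiling for Perlmutter GPU nodes.
# Figures from the slides: 1,792 GPU nodes, 4 A100 GPUs per node,
# per-GPU cap configurable from 100 W to 400 W (TDP).
# Only GPU power is counted; a cap is a ceiling, not the actual draw.

GPU_NODES = 1792
GPUS_PER_NODE = 4
CAP_MIN_W = 100
CAP_MAX_W = 400  # TDP


def gpu_power_ceiling_w(cap_w: int, nodes: int = 1) -> int:
    """Upper bound on total GPU power (W) for `nodes` nodes at a per-GPU cap."""
    if not CAP_MIN_W <= cap_w <= CAP_MAX_W:
        raise ValueError(f"cap must be within {CAP_MIN_W}-{CAP_MAX_W} W")
    return cap_w * GPUS_PER_NODE * nodes


if __name__ == "__main__":
    for cap in (CAP_MAX_W, 300, 200, CAP_MIN_W):
        per_node = gpu_power_ceiling_w(cap)
        system_kw = gpu_power_ceiling_w(cap, GPU_NODES) / 1e3
        print(f"{cap} W/GPU -> {per_node} W per node, {system_kw:.1f} kW system-wide")
```

At the 400 W TDP this gives 1,600 W of GPU power per node and about 2.87 MW across all 1,792 GPU nodes (7,168 GPUs). At the 100 W floor it gives 400 W per node and about 0.72 MW. Since a cap limits power rather than energy, the runtime effect of capping still determines whether a lower cap actually saves energy.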