Exploring Energy Efficiency in Scientific and Industrial AI Workloads
Akshay Sharma, Alexander Newkirk, Alexander Sim, Zhengji Zhao, and Won Young Park
Lawrence Berkeley National Laboratory
OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

Goal: Analyze trade-offs between energy savings and runtime performance of large-scale AI workloads.

Methodology:
- Select models with open source code and training data.
- Set up the models to run on Perlmutter, a supercomputer system at the National Energy Research Scientific Computing Center (NERSC), LBNL.
- Train AI models from scratch across multiple GPU nodes.
- Measure energy savings with GPU power capping vs. runtime for training.

System: NERSC provided HPC and storage facilities. Perlmutter is a supercomputer named after Saul Perlmutter, a Nobel Prize winner at Berkeley Lab. Perlmutter is a heterogeneous system based on the HPE Cray Shasta platform.

Hardware Architecture - Perlmutter

Overview:
- HPE (Hewlett Packard Enterprise) Cray EX Supercomputer
- 3,072 CPU-only and 1,792 GPU-accelerated nodes.

GPU node specs:
- 1 AMD Milan CPU with 64 cores, 2 logical cores each
- 4 x NVIDIA A100 GPUs (40 GB)
- GPU power cap range: 100 W to 400 W (TDP)
- Applied through SLURM: #SBATCH --gpu-power

https://www.nersc.gov/what-we-do/computing-for-science/perlmutter (Credit: Thor Swift, Berkeley Lab)
Source: https://docs.nersc.gov/systems/perlmutter/architecture/

NERSC provides tools for profiling jobs, through which we obtained a time series of power consumption for all the nodes and their components (GPU, CPU, and memory). The node and component power is measured by Cray power monitor counters. The data is available at a resolution of seconds. The metrics