
Accelerating High-Performance Computing and AI Applications by Offloading Computation and Communication to SmartNICs.pdf

Uploaded by: 明**** | ID: 1012021 | 2025-12-21 | 34 pages | 1.85 MB

Accelerating HPC and AI Applications by Offloading Computation and Communication to SmartNICs

Dhabaleswar K. (DK) Panda, The Ohio State University

**Trends in Modern HPC Clusters**: Accelerators (such as GPUs): high compute power, high peak memory bandwidth (H100: 900 GB/s NVLINK); High Performance Interconnects: InfiniBand (DPUs), Omni-Path, EFA.

Better Performance. Catch: Who will progress communication? (Can we dedicate this task to DPU cores?)

**Concept of Non-blocking Collectives**: [Figure: four Application Processes and four Communication Support Entities, with the labels Computation, Communication, Schedule Operation, and Check if Complete.]

Network Based Computing Laboratory, SIAM-PP (Mar 24)

**Major Opportunity and Benefits**: Overlap of computation with communication; reducing the overall application execution time.

**Major Challenges**: Suitable Host-DPU-Network communication mechanisms; efficient non-blocking collective algorithm offload; load balancing across ARM cores to take care of the offloading tasks; re-designing applications/middleware using the offloaded strategies to extract higher performance benefits.

**MVAPICH2-DPU Library Release**: Supports all features available with the MVAPICH2 release (http://mvapich.cse.ohio-state.edu); novel framework to offload non-blocking collectives to DPU; offloads non-blocking Alltoall (MPI_Ialltoall) to DPU; offloads non-blocking Broadcast (MPI_Ibcast) to DPU; available from X-ScaleSolutions as a commercial product, please contact contactusx-.

**Total Execution Time with osu_Ialltoall (32 nodes), BF-2**: Benefits in Tota
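The "Schedule Operation" / "Check if Complete" pattern named in the non-blocking collectives slide can be illustrated with a small single-machine simulation. This is only a sketch under stated assumptions: it does not use MPI or the MVAPICH2-DPU library, a Python worker thread stands in for the communication support entity (the role the slides propose giving to DPU cores), and the names `fake_alltoall` and `run_overlapped` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_alltoall(send_rows):
    """Hypothetical stand-in for an all-to-all exchange.

    send_rows[i][j] is what rank i sends to rank j, so the data
    received by rank j is column j of send_rows (a transpose).
    """
    time.sleep(0.05)  # pretend the exchange takes some time
    return [list(col) for col in zip(*send_rows)]


def run_overlapped(send_rows, compute_steps):
    """Schedule the exchange, compute while it progresses, then wait for it."""
    with ThreadPoolExecutor(max_workers=1) as support_entity:
        # "Schedule Operation": returns immediately with a request handle.
        request = support_entity.submit(fake_alltoall, send_rows)

        acc = 0
        completed_early = False
        for step in range(compute_steps):
            acc += step * step          # computation independent of the exchange
            if request.done():          # "Check if Complete" (non-blocking poll)
                completed_early = True

        # Blocks only if the exchange is still in flight.
        recv_rows = request.result()
    return acc, recv_rows, completed_early


if __name__ == "__main__":
    acc, recv_rows, _ = run_overlapped([[1, 2], [3, 4]], 1000)
    print(acc, recv_rows)
```

In a real MPI program the same roles would be played by a non-blocking call such as `MPI_Ialltoall`, a completion test, and a final wait; the point of the sketch is only that the application thread keeps computing while another entity progresses the communication.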

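Based on the report, its main content can be summarized as follows:

1. **MVAPICH project overview**: MVAPICH is a high-performance open-source MPI library that supports multiple interconnects and platforms. It has been developed since 2001, supports the latest MPI-4.1 standard, and is used by more than 3,450 organizations worldwide.
2. **Acceleration strategy and benefits**: Offloading computation and communication to SmartNICs (such as the BlueField-3 DPU) enables offloaded non-blocking collective and non-blocking point-to-point communication, which significantly improves application performance.
3. **Non-blocking collective communication**: Examples include Ialltoall with P3DFFT and Ibcast with HPL. DPU offload lets computation and communication overlap, reducing overall application execution time.
4. **Non-blocking point-to-point communication**: For applications such as a 3D stencil, MPI_Isend/MPI_Irecv are offloaded via GVMI to improve data-exchange efficiency.
5. **PETSc optimization**: The solver algorithms are modified so that the vector multiply-add (VMA), distributed dot product (DDOT), and matrix-vector (MATVEC) operations are offloaded to the DPU, reducing data-movement cost.
6. **Deep learning training acceleration**: The DPU is used to offload data augmentation and model validation, accelerating deep neural network training. For example, training the ResNet-20v1 model on the CIFAR10 dataset improves by up to 19%.
7. **Conclusion**: DPU technology offers a new way to accelerate MPI, OpenSHMEM, and deep learning applications, but the additional cost of the DPU has to be taken into account.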
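The claim that overlapping computation with communication reduces overall execution time can be made concrete with a simple timing model. The model below is an assumption made for illustration and does not come from the report: it treats one phase as a computation time plus a communication time, with a fraction `overlap` of the shorter one hidden behind the longer one. The function name `exec_time` and the numbers in the usage lines are hypothetical.

```python
def exec_time(t_comp, t_comm, overlap):
    """Estimated phase time under an assumed, simplified overlap model.

    overlap = 0.0 -> no overlap: time is t_comp + t_comm.
    overlap = 1.0 -> ideal overlap: time is max(t_comp, t_comm).
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    hidden = overlap * min(t_comp, t_comm)  # time hidden behind the other activity
    return t_comp + t_comm - hidden


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the report.
    print(exec_time(10, 4, 0.0))  # 14.0 (blocking: the two costs add up)
    print(exec_time(10, 4, 1.0))  # 10.0 (communication fully hidden)
```

Under this model the best case is bounded by the larger of the two costs, so offloading helps most when there is enough independent computation to hide the communication behind.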
**DPU-accelerated HPC applications** **MVAPICH boosts AI performance** **A new chapter in deep learning acceleration**