Optimizing Triton for RISC-V Heterogeneous AI Computing
Aries Wu, CTO, Terapines (Wuhan) Technology
9/22/2025, Terapines Confidential

Agenda
01 Intro to Triton
02 Heterogeneous SoC
03 Code Generation
04 Optimization

What is Triton DSL?
- Open-source GPU programming language for neural networks, originally designed by OpenAI
- Adopted by many other chip vendors
- DSL embedded in Python
- Performance is on par with CUDA on NVIDIA GPUs
- Block-level programming language

What is a block level language?

Triton, block level:

    @jit
    def add(X, Y, Z, N):
        pid = program_id(0)
        idx = pid * N + arange(N)
        mask = idx < N
        x = load(X + idx, mask=mask)
        y = load(Y + idx, mask=mask)
        store(Z + idx, x + y, mask=mask)

CUDA, thread level:

    __global__ void add(int *x, int *y, int *z) {
        int tid = threadIdx.x;
        z[tid] = x[tid] + y[tid];
    }

C, whole data:

    void add(int *x, int *y, int *z, int N) {
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];
    }

* In the Triton version, N is the block size.

Matrix-Vector Multiplication in Triton
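    import triton
    import triton.language as tl

    # Y = A X, where A is a matrix of M x N, X is a vector of N.
    @triton.jit
    def gemv_kernel(Y, A, X, M, N, stride_am,
                    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr):
        # Each program instance computes BLOCK_SIZE_M consecutive rows of Y.
        start_m = tl.program_id(0)
        rm = start_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
        rn = tl.arange(0, BLOCK_SIZE_N)
        # Pointers to the first BLOCK_SIZE_M x BLOCK_SIZE_N tile of A and the first tile of X.
        A = A + (rm[:, None] * stride_am + rn[None, :])
        X = X + rn
        acc = tl.zeros((BLOCK_SIZE_M,), dtype=tl.float32)
        # The loads are unmasked, so M and N must be multiples of the block sizes.
        for n in range(N, 0, -BLOCK_SIZE_N):
            a = tl.load(A)
            x = tl.load(X)
            acc += tl.sum(a * x[None, :], axis=1)
            # Advance both tiles along the N dimension.
            A += BLOCK_SIZE_N
            X += BLOCK_SIZE_N
        Y = Y + rm
        tl.store(Y, acc)

Typical RISC-V based DSAs
- RISC-V Controller + NPU
- RISC-V + RVV + AME/IME/
- RISC-V Controller + NPU + DSP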
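As a rough illustration of what "block level" means for the gemv_kernel on the Matrix-Vector Multiplication slide, the sketch below emulates the kernel's decomposition in plain Python with no Triton or GPU required. It is not the speaker's code: the names `gemv_program` and `gemv_blocked` are hypothetical, the launch grid is replaced by an ordinary loop, and, like the slide's kernel, it assumes M and N are exact multiples of the block sizes.

```python
def gemv_program(pid, A, X, N, block_m, block_n):
    """Emulate one program instance: it owns rows [pid*block_m, (pid+1)*block_m)."""
    first_row = pid * block_m
    acc = [0.0] * block_m  # plays the role of tl.zeros((BLOCK_SIZE_M,))
    # Walk along the N dimension one tile at a time, as the kernel's loop does
    # by advancing its A and X pointers by BLOCK_SIZE_N per iteration.
    for n0 in range(0, N, block_n):
        x_tile = X[n0:n0 + block_n]
        for r in range(block_m):
            a_tile = A[first_row + r][n0:n0 + block_n]
            # Row-wise reduction, the analogue of tl.sum(a * x[None, :], axis=1)
            acc[r] += sum(a * x for a, x in zip(a_tile, x_tile))
    return acc


def gemv_blocked(A, X, block_m, block_n):
    """Run every program instance in turn (hypothetical stand-in for the launch grid)."""
    M, N = len(A), len(X)
    # Mirrors the unmasked loads in the kernel: sizes must divide evenly.
    assert M % block_m == 0 and N % block_n == 0
    Y = [0.0] * M
    for pid in range(M // block_m):
        Y[pid * block_m:(pid + 1) * block_m] = gemv_program(
            pid, A, X, N, block_m, block_n)
    return Y


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
X = [1, 0, -1, 2]
print(gemv_blocked(A, X, block_m=2, block_n=2))  # [6.0, 14.0, 22.0, 30.0]
```

The result does not depend on the block sizes; only the split of work between program instances and loop iterations changes, and that split is what a backend is free to map onto different hardware units.
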

A Triton Compiler Stack for RISC-V + NPU
[Diagram: Our MLIR Triton Compiler]

RISC-V + NPU Optimization Strategies
- Recognize coarse-grained operators that can be mapped onto the NPU
- Operator fusion and decomposition
- Fall back unsupported NPU operators onto RISC-V
- Generate async DMA data fetch fo