China Mobile Jiutian AI Platform GPU Inference Practice

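China Mobile Research Institute
Yin Lu
December 2020

AI inference scenarios

Frameworks (Caffe2, Chainer, MXNet, PaddlePaddle, PyTorch, TensorFlow, Theano) -> Training -> DNN model -> Inference

Why inference efficiency is low

- One model per GPU/node: some systems are overloaded while others sit idle.
- Single framework only: the solution supports only models from one framework.
- Custom development required: developers have to redo custom development for each application.

NVIDIA Triton

- Maximizes real-time inference performance on GPUs
- Quickly deploys and manages multiple models
- Scales easily to different GPU architectures and to multi-GPU nodes (the slide shows Tesla T4, Tesla V100, and Tesla P4)
- Integrates with orchestration systems and supports metrics monitoring
- Open source

Model formats and features supported by NVIDIA Triton

- TensorFlow GraphDef/SavedModel
- TensorFlow and TensorRT GraphDef
- TensorRT Plans
- Caffe2 NetDef (ONNX import)
- Multi-GPU support
- Concurrent model execution
- HTTP REST API and gRPC
- Python/C++ client libraries

Available metrics

Category | Name | Use case | Granularity | Frequency
GPU utilization | Power usage | Proxy for load on the GPU | Per GPU | Per second
GPU utilization | Power limit | Maximum GPU power limit | Per GPU | Per second
GPU utilization | GPU utilization | GPU utilization rate (0.0-1.0) | Per GPU | Per second
GPU memory | GPU total memory | Total GPU memory, in bytes | Per GPU | Per second
GPU memory | GPU used memory | Used GPU memory, in bytes | Per GPU | Per second
Count (GPU & CPU) | Request count | Number of inference requests | Per model | Per request
Count (GPU & CPU) | Execution count | Number of model inference executions | Per model | Per request
Count (GPU & CPU) | Inference count | Number of inferences (one request counts as "batch size" inferences) | Per model | Per request
Latency (GPU & CPU) | Latency: request time | End-to-end inference request handling time | Per model | Per request
Latency (GPU & CPU) | Latency: compute time | Time a request spends executing the inference model | Per model | Per request
Latency (GPU & CPU) | Latency: queue time | Time a request spends waiting in the queue before being executed | Per model | Per request

Dynamic batching

Triton Inference Server combines inference requests according to user-defined settings in order to optimize performance. Requests are grouped until either:

1) the batch reaches the maximum size the model allows, or
2) the user-defined maximum wait time is reached.

Example: eight client requests are sent to Triton Inference Server. The dynamic batcher waits 10 ms to assemble them into a single request with batch size 8, and then sends them to the GPU together for inference.

Concurrent model execution: ResNet-50 (TensorRT Inference Server, V100 16 GB GPU)

Scenario 1

Example: twelve TensorRT FP16 ResNet-50 instances (each needing 1.33 GB of GPU memory) are loaded onto the GPU and can execute concurrently on a 16 GB V100. When 14 concurrent inference requests arrive, each instance serves one request at a time; the other two requests enter the queue and are executed once those 12 requests have completed.

Concurrent model execution: ResNet-50 & Deep Recommender (TensorRT Inference Server, V100 16 GB)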
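The dynamic batching rule from the slides (dispatch a batch once it reaches the maximum size the model allows, or once the user-defined wait time has passed) can be illustrated with a small offline simulation. This is only a sketch and not Triton's scheduler: the function `form_batches` and its parameters are hypothetical names chosen for illustration, and arrival times are assumed to be known in advance.

```python
from typing import List


def form_batches(arrivals_ms: List[float], max_batch_size: int,
                 max_wait_ms: float) -> List[List[float]]:
    """Group request arrival times (in ms) into batches.

    A batch opens when its first request arrives. It is dispatched as soon as
    it holds max_batch_size requests, or once max_wait_ms has elapsed since it
    opened, whichever comes first. (Hypothetical helper, for illustration.)
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in sorted(arrivals_ms):
        # The wait window of the open batch expired before this request came in.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch it immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Eight requests arriving within the 10 ms window form a single batch of 8.
arrivals = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
batches = form_batches(arrivals, max_batch_size=8, max_wait_ms=10.0)
request_count = sum(len(b) for b in batches)  # requests received
execution_count = len(batches)                # batched executions on the GPU
print(request_count, execution_count)
```

In this toy model the eight requests count as eight requests but only one execution, which is the relationship between the request count and execution count metrics listed in the metrics table.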

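The arithmetic behind the ResNet-50 concurrency example can be checked with a few lines of Python. This is a simplified sketch: it assumes the per-instance memory figure from the slide is the only GPU memory consumer, and the helper names `max_instances` and `schedule` are hypothetical.

```python
from typing import Tuple


def max_instances(gpu_memory_gb: float, per_instance_gb: float) -> int:
    """How many model instances fit in GPU memory (ignores other memory use)."""
    return int(gpu_memory_gb // per_instance_gb)


def schedule(num_requests: int, num_instances: int) -> Tuple[int, int]:
    """Split concurrent requests into (running now, waiting in the queue).

    Each instance serves one request at a time; the remainder is queued.
    """
    running = min(num_requests, num_instances)
    queued = num_requests - running
    return running, queued


instances = max_instances(16.0, 1.33)   # 16 GB V100, 1.33 GB per instance
running, queued = schedule(14, instances)
print(instances, running, queued)
```

With the figures from the slide this gives 12 instances, 12 requests running, and 2 requests queued, matching the scenario described in the example.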