vLLM-compile: Bringing Compiler Optimizations to LLM Inference, by Zhang Jiaju (张家驹). PDF, 29 pages.
Zhang Jiaju (张家驹), CTO, Red Hat Greater China

vLLM: The Open GenAI Inference Platform
  vLLM aims to become the Linux of GenAI Inference
2 Year Journey Of vLLM
Outline
What is torch.compile?
  Just-in-time compiler for PyTorch code
Why use torch.compile?
Goals for vLLM x torch.compile
torch.compile performance benefits in vLLM
torch.compile piecewise cudagraphs
Custom torch.compile passes in vLLM
Custom torch.compile passes in vLLM: OOT Platform
torch.compile caching
vLLM startup time: recent progress
Future work for vLLM x torch.compile
  Faster startup
  Improving torch.compile speed
  Trace only one transformer layer, overlap with weight loading
  MORE FUSION
  RoPE+cache(+quant)
  Collective fusion/overlap with compute
  Helion kernel integration
  vLLM IR: a semantic intermediate representation
vLLM IR: why do we need it
vLLM IR: Separating Semantics From Implementation
vLLM IR: Sneak Peek
vLLM IR: OOT Platform Adaptation
vLLM IR: Benefits
Thanks!