Huiqiang Jiang (姜慧强): Optimization and Practice of KV-Cache-Centric Efficient Long-Context Methods (57 pages)
Speaker: Huiqiang Jiang (姜慧强)
Research SDE in Microsoft Research Asia (Shanghai)
System-Algorithm Co-design
Efficient methods to accelerate inference/training

01 Applications and inference challenges of long-context large language models
02 Current mainstream inference optimization methods and techniques
03 A KV-cache-centric LLM inference architecture
04 KV-cache-centric efficient long-context methods
05 Summary and outlook

01 Applications and inference challenges of long-context large language models

Long-context scenarios: Massive Pages of Docs, Extended Meeting Time, Lengthy Codebases, Complex Reasoning, Endless Agentic History, Lifelong Personalization.

Almost all latest models can process contexts exceeding 100K tokens.
https://lifearchitect.ai/models/#context-windows
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

[Figure: context-length scale. Labels: 10M tokens, PyTorch repository code, Lord of the Rings trilogy (1ps), 500 reasoning iterative*]

Long Prefilling Latency: 30 minutes to process 1M tokens on an A100 for an 8B LLM.
Large GPU Memory Consumption: 62GB of GPU memory is required for 512K tokens in fp16.

Long Prefilling Latency: MInference
Large GPU Memory Consumption: RetrievalAttention (alignment between ANNS and attention)

[Figure: labels Keys & Values, Prefill, Decode, Compress, Prefix Caching, Sparse Atten., KV Cache Storage, Prompts, Tokens gen., Compute. Callouts: LLMLingua Series: prompt compression; SCBench: explore bound of KV caching; MInference 1.0/MMInference: dynamic sparse prefilling]

02 Current mainstream inference optimization methods and techniques

(a) Prefix caching is widely used in LLM frameworks.
(b) Prefix caching is widely used in LLM APIs.
RadixAttention, Automatic Prefix Caching, Prompt Caching, Context Caching, Prompt Caching

03 A KV-cache-centric LLM inference architecture

Long-context methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.

(a) Long context is shared in real-world scenarios: Repo-level Code Debugging / Long-document QA, Multi-turn Dialogue, Self-play Reasoning.
(b) Prefix caching is widely used in LLM frameworks.
(c) Prefix caching is widely used in LLM APIs.
RadixAttention, Automatic Prefix Caching, P
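The idea behind prefix caching, as named in sections 02 and 03, can be illustrated with a toy sketch. This is not the RadixAttention or Automatic Prefix Caching implementation; it is a minimal token-level trie, and the names `TrieNode`, `PrefixCache`, and `prefill` are hypothetical. It only counts how many leading tokens of a new request could reuse stored keys and values, assuming a second request shares a long document prefix with the first.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One node per token; `kv` is a placeholder for that token's cached keys/values."""
    children: dict = field(default_factory=dict)
    kv: object = None


class PrefixCache:
    """Toy token-level prefix cache (hypothetical, for illustration only)."""

    def __init__(self):
        self.root = TrieNode()

    def match(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for tok in tokens:
            nxt = node.children.get(tok)
            if nxt is None:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens):
        """Record a KV placeholder for every token of a processed prompt."""
        node = self.root
        for pos, tok in enumerate(tokens):
            if tok not in node.children:
                node.children[tok] = TrieNode(kv=("kv", pos, tok))
            node = node.children[tok]


def prefill(cache, tokens):
    """Return (reused, computed): tokens served from cache vs. newly prefilled."""
    reused = cache.match(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
doc = list(range(1000))                      # token IDs of a shared long document
print(prefill(cache, doc + [1001, 1002]))    # (0, 1002): nothing cached yet
print(prefill(cache, doc + [2001]))          # (1000, 1): document prefix is reused
```

In this sketch the second request only prefills its one new token, which is the multi-request behavior that single-request benchmarks would not exercise.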
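The memory figure quoted in section 01 (62GB for 512K tokens in fp16) can be sanity-checked with simple arithmetic. The slides do not state the model configuration, so the sketch below assumes a Llama-3-8B-style layout (32 layers, 8 key/value heads, head dimension 128) and reads 512K as 512,000 tokens; the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for `num_tokens` tokens."""
    # Factor 2 covers keys and values; bytes_per_elem=2 corresponds to fp16.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


# Assumed Llama-3-8B-style configuration (not stated in the slides).
size = kv_cache_bytes(num_tokens=512_000, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 62.5 GiB
```

Under these assumptions each token costs 128 KiB of KV cache, so the total grows linearly with context length and is roughly in line with the quoted figure.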