DeepSeek technical report, 2026 (English version, 33 pages): Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng¹,², Wangding Zeng², Damai Dai², Qinyu Chen², Bingxuan Wang², Zhenda Xie², Kezhao Huang², Xingkai Yu², Zhewen Hao², Yukun Li², Han Zhang², Huishuai Zhang¹, Dongyan Zhao¹, Wenfeng Liang²

¹Peking University  ²DeepSeek-AI

zhanghuishuai, chengxin, zengwangding,

Abstract

While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 → 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling pr