MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data and Training Recipes

Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun

MiniCPM-V Team, OpenBMB

MiniCPM-V 4.5 Code | MiniCPM-V 4.5 Model

Abstract

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address these challenges, we present MiniCPM-V 4.5, an 8B-parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy, and training method: a unified 3D-Resampler model architecture for highly compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results on the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, as well as significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B parameters, using just 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
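To make the compact-encoding idea concrete, the sketch below shows a perceiver-style resampler that attends jointly over the spatial and temporal axes of a group of frames, in the spirit of the 3D-Resampler named in the abstract. This is an illustrative assumption, not the authors' implementation: the class name Resampler3D, the dimensions, and the use of PyTorch's nn.MultiheadAttention are all ours.

```python
# Minimal sketch of a perceiver-style 3D resampler (assumed design,
# not MiniCPM-V 4.5's actual code): a fixed set of learnable queries
# cross-attends over the flattened spatio-temporal features of a
# frame group, compressing it to a constant number of tokens.
import torch
import torch.nn as nn

class Resampler3D(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable queries: the output is always `num_queries` tokens,
        # regardless of how many frames are in the input group.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, frames, patches, dim) — encoder features
        # for a group of video frames, or frames=1 for a single image.
        b, f, p, d = frame_feats.shape
        # Flatten time and space into one key/value sequence so the
        # queries attend jointly over both axes ("3D" resampling).
        kv = frame_feats.reshape(b, f * p, d)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, kv, kv)
        return self.norm(out)  # (batch, num_queries, dim)

# E.g., 6 frames x 1024 patch tokens -> 64 tokens per group,
# a 96x compression of the visual token count (illustrative numbers).
feats = torch.randn(2, 6, 1024, 1024)
print(Resampler3D()(feats).shape)  # torch.Size([2, 64, 1024])
```

The key property of this design is that the output length is fixed by the number of learned queries, so adding frames to a group grows only the cross-attention key/value sequence, not the token sequence fed to the LLM.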
1 Introduction

Multimodal Large Language Models (MLLMs) [1, 2, 3, 4, 5, 6, 7] are