Gemma 3 Technical Report
Gemma Team, Google DeepMind
2025-03-12

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages, and longer context of at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span of local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction-finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following, and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
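To give a rough sense of why this architectural change matters, the sketch below estimates KV-cache size for a hypothetical decoder that interleaves sliding-window (local) and full-context (global) attention layers. All configuration values here (layer count, head dimensions, the local-to-global ratio, and the window size) are illustrative assumptions rather than the actual Gemma 3 settings.

```python
# Back-of-the-envelope KV-cache sizing for a hypothetical decoder that
# interleaves sliding-window ("local") and full-context ("global") attention
# layers. All values below are illustrative assumptions, not the actual
# Gemma 3 configuration.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   local_per_global, window, bytes_per_value=2):
    """Total bytes of cached keys and values across all layers.

    Global layers keep KV entries for the full context, while local layers
    only keep the last `window` positions.
    """
    per_position = 2 * n_kv_heads * head_dim * bytes_per_value  # keys + values
    total = 0
    for layer in range(n_layers):
        # Interleave `local_per_global` local layers before each global layer.
        is_global = (layer % (local_per_global + 1)) == local_per_global
        cached_positions = context_len if is_global else min(window, context_len)
        total += cached_positions * per_position
    return total

if __name__ == "__main__":
    cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=128_000)
    all_global = kv_cache_bytes(local_per_global=0, window=128_000, **cfg)
    mostly_local = kv_cache_bytes(local_per_global=5, window=1024, **cfg)
    print(f"all-global KV cache:        {all_global / 2**30:.1f} GiB")
    print(f"5:1 local/global KV cache:  {mostly_local / 2**30:.1f} GiB")
```

Under these assumed settings, replacing most global layers with short-window local layers shrinks the cache at a 128K-token context from roughly 16 GiB to under 3 GiB, since only the sparse global layers must retain keys and values for every position.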
1. Introduction

We present the newest version of Gemma open language models (Gemma Team, 2024a), co-designed with the family of Gemini frontier models (Gemini Team, 2023). This new version comes in sizes comparable to Gemma 2 (Gemma Team, 2024b), with the addition of a 1B model. These models are designed to run on standard consumer-grade hardware such as phones, laptops, and high-end GPUs. This version brings several new abilities to the Gemma family, namely multimodality, long context, and multilinguality, while preserving or surpassing the performance of prior versions.

In terms of multimodality, most Gemma 3 models are compatible with a tailored version of the SigLIP vision encoder (Zhai et al., 2023). The language models treat images as a sequence of soft tokens encoded by SigLIP. We reduce the inference cost of im