Table of contents

Introduction
Key terms at a glance
The evolution of large language models
Challenges of inference serving
A full-stack approach to inference performance
A dual approach to model efficiency
What is Red Hat AI?
Optimizing models with Red Hat
1: Optimizing the inference runtime (vLLM)
2: Optimizing the AI model
Red Hat AI
Next steps

Introduction

Optimizing AI model inference is among the most effective ways to cut infrastructure costs, reduce latency, and improve throughput, especially as organizations deploy large models in production. This e-book introduces the fundamentals of
inference performance engineering and model optimization, with a focus on quantization, sparsity, and other techniques that help reduce compute and memory requirements, as well as runtime systems like Virtual Large Language Model (vLLM), which offer benefits for efficient inference. It also outlines the advantages of using Red Hat's open approach, validated model repository, and tools such as the LLM Compressor and Red Hat AI Inference Server. Whether you're running on graphics processing units (GPUs), Tensor Processing Units (TPUs), or other accelerators, this guide offers practical insight to help you build
smarter, more efficient AI inference systems.

Key terms at a glance

Understanding model components

Activations are temporary data generated as a model processes information (input tokens), similar to intermediate results produced during a calculation. They typically require high precision for accurate results.

Weights are the learned parameters or settings of an AI model, much like configuration files or settings in traditional software. They determine how the model analyzes and predicts data and can often function effectively at reduced precision.
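To make the distinction concrete, here is a minimal sketch of a single linear layer in NumPy (illustrative only, not taken from Red Hat's tooling): the weight matrix is a fixed, learned parameter that can be stored at reduced precision, while the activations are temporary values recomputed for every input.

import numpy as np

# Weights: learned parameters, fixed after training. They can often be stored
# at reduced precision (here float16) with little loss in accuracy.
weights = np.random.randn(4, 3).astype(np.float16)

# An incoming batch of two input vectors (for example, token embeddings).
inputs = np.random.randn(2, 4).astype(np.float32)

# Activations: temporary intermediate results produced while processing the
# input. They are recomputed for every request and kept at higher precision.
activations = inputs @ weights.astype(np.float32)

print(weights.nbytes)     # memory used by the reduced-precision weights (24 bytes)
print(activations.shape)  # (2, 3) intermediate results for this batch

In a real model this same pattern repeats across billions of weights, which is why storing weights at lower precision while keeping activation math at higher precision can yield large memory savings.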
Quantization reduces the size and resource requi