SCALING LARGE LANGUAGE MODEL TRAINING USING HYBRID GPU-BASED COMPRESSION IN MVAPICH
Aamir Shafi, Research Scientist
Lang Xu, Ph.D. Student
Network Based Computing Laboratory, The Ohio State University
http://nowlab.cse.ohio-state.edu/
2024 OFA Virtual Workshop
Presentation Outline
- Introduction & Background
- Motivation & Challenges
- Hybrid Compression Design
- Performance Evaluation
- Conclusion

Training Large Language Models
- Large Language Models (LLaMA2, GPT4, Claude3) are powerful in various areas (dialogue systems, knowledge bases, ...)
- Model capability scales with the number of parameters (from 100-million-parameter BERT to 500-billion-parameter Megatron-Turing NLG)
- Training billion-parameter models requires:
  - Parallelism strategies (scaling up to thousands of GPUs)
  - Memory optimization (fitting models within GPU memory)
  - Efficient communication (reducing interconnect bandwidth pressure)
Parallelism Strategies
- Data Parallelism (DP): maintains a full model replica on each DP rank and takes a mini-batch as input
  - Data-intensive gradient synchronization using Allreduce (see the sketch after this list)
- Pipeline Parallelism (PP): shards model layers across devices and executes them in pipeline order
  - Point-to-point communication passing activations and gradients
- Tensor Parallelism (TP): distributes matrix multiplications over different devices
  - Frequent Allreduce and Allgather communication to ensure correctness
- 3D parallelism combines DP + PP + TP (Megatron-LM)
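To make the DP communication pattern concrete, below is a minimal sketch of gradient synchronization with Allreduce in PyTorch. The linear layer, tensor shapes, NCCL backend, and torchrun-style launch are illustrative assumptions and are not taken from the slides; the sketch only shows where the data-intensive Allreduce over gradients happens in each training step.

# Minimal sketch: data-parallel gradient synchronization via Allreduce.
# Assumes a launch like `torchrun --nproc_per_node=<GPUs> dp_allreduce_sketch.py` (hypothetical script name).
import os
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all DP ranks (what DDP does under the hood)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients across ranks
            param.grad /= world_size                           # average them

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")           # backend choice depends on the cluster stack
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for a transformer layer
    batch = torch.randn(8, 1024, device="cuda")       # local mini-batch on this DP rank
    model(batch).sum().backward()                     # local forward/backward pass
    allreduce_gradients(model)                        # data-intensive Allreduce across DP ranks
    dist.destroy_process_group()

The volume of this Allreduce grows with model size, which is why the slides emphasize reducing interconnect bandwidth pressure during training.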
Memory Optimization
- DeepSpeed ZeRO Optimizer: a novel memory optimization technology for large-scale distributed deep learning
- Enables training models with billions of parameters across GPUs
- Each GPU only upda