AIM369
Scalable and resilient distributed AI with Ray on SageMaker HyperPod
Shreyas Adiyodi (He/him)
Florian Gauter (He/him)
Mark Vinciguerra (He/him)

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Please use this QR Code throughout the session to participate

Agenda
Common distributed AI training & inference challenges
What is SageMaker HyperPod
What is Ray
SageMaker HyperPod and Ray Architecture
Demo
Discussion

Foundation model compute demand outpaces FM development efficiency innovations
Data source: Epoch (2025)
LoRA and other parameter-efficient fine-tuning techniques
Flash Attention 2.0 for optimized GPU memory access
Mixture of Experts architecture for efficient layer activation
Disaggregated serving for higher per-GPU throughput
GRPO for streamlined reinforcement learning
MARKET ADOPTION MILESTONES
[Chart. Date axis: Oct 20, 2022; May 5, 2023; Nov 21, 2023; Jun 8, 2024; Dec 25, 2024; Jul 13, 2025. Value axis: 100 million; 1 billion; 10 billion; 100 billion. Annotation: Disclosure required at 100 billion petaFLOP under the Executive Order. Models shown: Grok-3, Claude 3.7 Sonnet, DeepSeek-R1, DeepSeek-V3, Amazon Nova Pro, GPT-4o, Grok-1, Llama 2-70B.]

Unique challenges to manage hardware resources efficiently for training at scale
Fault-tolerant strategies for distributed training
Networking
Cluster provisioning & management
Large-scale data handling

Unique challenges for inference workloads at scale
Resource Utili