
Architecting the AI Fabric: Scalable Networking for Next-Generation AI Servers, Racks, and Clusters.pdf

Uploaded by: 明**** · ID: 1011524 · 2025-12-21 · 24 pages · 3.31 MB

Architecting the AI Fabric
Jalpa Patel, Technical Program Manager, Meta

Agenda: AI clusters; larger AI workloads; software requirements; hardware and network requirements; data center; challenges ahead of us.

Running larger AI workloads (Llama, scale): software infra, hardware and network infra, DC infra.

Scale: software infra
- Job scheduling
- Checkpointing
- Fault tolerance

Model distribution on GPUs: tensor parallel, pipeline parallel, data parallel, synchronization.

Mitigating the impact of network latency
1. Find the model sharding combination least sensitive to network latency; co-design model sharding with network latency and routing artifacts.
2. Modeling, simulation and validation: topology-aware model parallelism assignment; topology awareness in the job scheduler and in model parallelism assignment.
3. New collective algorithms: collective library changes, topology awareness.

Mitigating the impact of new collective algorithms
New collective algorithms cause:
- More congested, new collective patterns within the building.
- A lot more data across the buildings, so routing needs to be perfect.
This means network routing efficiency must be higher than it is today. Two directions of solutions:
- Packet spraying and reassembly
- Collective-software-based load balancing

Scale: hardware and network infra
- Network
- Fleet health
- HW health

Avenues of flexibility: technology (DSF, NSF)
- Forwarding requirements
- DLB/ECMP scalability
- Low latency
- Less cost
- Easier cabling fit
- Distance limitations
- VoQ scalability
- Load balance in HW
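The slides name packet spraying and reassembly as one direction for raising routing efficiency, alongside DLB/ECMP scalability. The sketch below is a toy model rather than anything taken from the deck: it compares per-link load when each flow is pinned to one link by a hash (ECMP-style) with per-packet round-robin spraying. The function names, the CRC32 hash, the 4096-byte packet size, and the sample flows are all illustrative assumptions.

```python
import zlib


def ecmp_link(flow, num_links):
    """Hash the flow tuple to one link; every packet of the flow uses it."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_links


def ecmp_load(flows, num_links):
    """Bytes carried per link when each flow is pinned to its hashed link."""
    load = [0] * num_links
    for flow, nbytes in flows:
        load[ecmp_link(flow, num_links)] += nbytes
    return load


def spray_load(flows, num_links, packet_bytes=4096):
    """Bytes carried per link when packets are sprayed round-robin."""
    load = [0] * num_links
    next_link = 0
    for _, nbytes in flows:
        full, rest = divmod(nbytes, packet_bytes)
        for size in [packet_bytes] * full + ([rest] if rest else []):
            load[next_link] += size
            next_link = (next_link + 1) % num_links
    return load


def reassemble(packets):
    """Restore sender order from (sequence_number, payload) pairs.

    Sprayed packets can arrive out of order, so the receiver sorts by
    sequence number before handing data to the collective library.
    """
    return b"".join(payload for _, payload in sorted(packets))


if __name__ == "__main__":
    # Four hypothetical 1 MiB flows between made-up endpoints.
    flows = [((f"10.0.0.{i}", "10.0.1.1", 1000 + i, 4791, "udp"), 1 << 20)
             for i in range(4)]
    print("ECMP  bytes per link:", ecmp_load(flows, 4))
    print("Spray bytes per link:", spray_load(flows, 4))
```

With a few large flows, hashing can place several of them on one link while others sit idle; spraying evens out the load at the cost of reordering, which is why reassembly appears alongside it.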

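Key points of "Architecting the AI Fabric", summarized:
1. **AI workload architecture**: covers the software, hardware, and network requirements of larger-scale AI workloads.
2. **LlamaScale**: presents a scalable architecture for running larger AI workloads.
3. **Technical challenges and solutions**:
   - **Model sharding**: find the sharding combination least sensitive to network latency.
   - **Topology-aware model parallelism**: build topology awareness into the job scheduler and into model parallelism assignment.
   - **New collective algorithms**: mitigate the impact of network latency through collective library changes and topology awareness.
4. **Network routing efficiency**: raise routing efficiency to cope with growing data volumes and the traffic patterns of new collective algorithms.
5. **Hardware and network infrastructure**:
   - **Multiple GPU/accelerator types**: including Nvidia H100 and AMD MI300X.
   - **Data center types**: several types, such as Type1, Type2, Type3, and Type4.
6. **Service types**: including GenAI and R inference.
7. **Data center infrastructure**: including data storage, network, and AI zones.
8. **Challenges**: including scalability, heterogeneous accelerators, and asynchronous training.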
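Topology-aware model parallelism assignment can be illustrated with a minimal sketch. It rests on a common rule of thumb that is not stated in the deck: tensor-parallel traffic is the most latency-sensitive, so tensor-parallel peers are placed closest together, pipeline stages next, and data-parallel replicas farthest apart. The rack/host/GPU hierarchy, the function name, and the parameters are hypothetical.

```python
from itertools import product


def assign_parallel_groups(num_racks, hosts_per_rack, gpus_per_host, tp, pp):
    """Map each GPU (rack, host, gpu) to (dp_index, pp_index, tp_index).

    GPUs are enumerated in topology order, so consecutive ranks are
    physically close. The tensor-parallel index varies fastest, then the
    pipeline index, then the data-parallel index.
    """
    gpus = list(product(range(num_racks),
                        range(hosts_per_rack),
                        range(gpus_per_host)))
    if len(gpus) % (tp * pp) != 0:
        raise ValueError("GPU count must be a multiple of tp * pp")
    placement = {}
    for rank, gpu in enumerate(gpus):
        tp_index = rank % tp
        pp_index = (rank // tp) % pp
        dp_index = rank // (tp * pp)
        placement[gpu] = (dp_index, pp_index, tp_index)
    return placement


if __name__ == "__main__":
    # 2 racks x 2 hosts x 4 GPUs with tp=4, pp=2 gives 2 data-parallel replicas.
    for gpu, roles in assign_parallel_groups(2, 2, 4, tp=4, pp=2).items():
        print(gpu, "->", roles)
```

In this configuration each tensor-parallel group stays inside one host, each pipeline spans the hosts of one rack, and only data-parallel synchronization crosses racks. A real scheduler would also have to weigh failures, fragmentation, and actual link speeds.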