Slide 1
Methodology and Observation of Congestion Control Impact on MoE Training Job Completion Time
Alex Bortok, Lead Product Manager
OCP Global Summit 2025

Slide 2
AI Data Center Fabric Test Methodology
Job Completion Time: Topologies, Algorithms, Data sizes, RDMA message sizes
Performance Isolation: Noisy neighbors, Parallel collectives
Load Balancing: ECMP hashing, Traffic Engineering, Q-Pair awareness, Parallel Q-Pairs, Dynamic Load Balancing
Congestion Control: PFC, ECN/DCQCN
Keysight, Issue 2024

Slide 3
AI Training Job: 2022-2023
3D Model Partitioning
[Diagram labels: Pipeline Parallel; Tensor Parallel (Attention, Feed Forward); Data Parallel (AllReduce)]

Slide 4
AI Training Job: 2024-2025
Mixture of Experts
[Diagram labels: Pipeline Parallel; Tensor Parallel (Attention, Feed Forward); Tensor Parallel (Attention, FFN Expert); Data Parallel (AllReduce); AlltoAll-v; Expert 1, Expert 2]

Slide 5
DP vs EP Collective Patterns
DP: AllReduce. One/two neighbors. Small # of QPs. Bandwidth per QP is concentrated.
EP: AlltoAll. All neighbors. Large # of QPs. Bandwidth per QP is spread thin.
Opposites

Slide 6
Collective to QP Mapping
Per-QP Bandwidth
[Figure: AlltoAll BW and AllReduce BW; axes: Source Ranks and Destination Ranks, each 0 to 7]

Slide 7
Experiment Setup
KAI Data Center Builder
4 x 12.8T switches
1 x 8x400GE AresONE
8 ranks x 400GE
Fat Tree (Clos) 1:1
PFC, ECN & DCQCN

Slide 8
Experiment 1. 10 x AllReduce
DCQCN=ON
PFC Rx=0
ECN-CE Rx=0
[Figure: FTC CDF]

Slide 9
Experiment 2. 10 x AlltoAll
DCQCN=ON
PFC Rx=0
ECN-CE Rx=80 to 800 per port
[Figure: FTC CDF]

Slide 10
Experiment 3. 10 x (AllReduce, AlltoAll)
DCQCN=ON
PFC Rx=0
ECN-CE Rx=1K to 10K per port
[Figure: FTC CDF; series: AllReduce, AlltoAll]

Slide 11
Experiment 3. 10 x (AllReduce, AlltoAll) cont.
DCQCN=ON

Slide 12
Options to improve performance
Remove congestion Rail-o