
Design, Build and Test an OCP AI Cloud Network with Industry Leading Performance.pdf

Uploaded by: 明**** | ID: 1012015 | 2025-12-21 | 13 pages | 1.14 MB

Design, Build and Test an OCP AI Cloud Network with Industry Leading Performance
NETWORKING
OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

Panel Discussion
- JM Hands, CEO, FarmGPU
- Matt Roman, Sr Director, PLM, Celestica
- Marc Austin, CEO, Hedgehog

Design
- OCP Networking
- 2U 64-port 800GbE Data Center Switch
- AI/ML & Big Data Analytics
- Hyperscale Data Centers & Cloud Computing
- High-Performance Computing (HPC)
- Network Backbone (800GbE Data Center Leaf/Spine)
- Celestica DS5000 800GbE Switch
- OCP Networking Software

Build: 17 Day Crash Course on AI Networking
[Timeline figure, "17 Days". Date labels: May 23, Aug 1, Aug 20, Jul 16, Jul 17, Aug 15. Milestones: Equipment Ordered, NCCL Test, Equipment On Site, Optics Issue, Lots of Collaboration, Go Live, Sold Out.]

AI Network is a Lot More Than a Switch

| Component | Lesson Learned | Better Next Time |
| --- | --- | --- |
| Cabling | Easy to make mistakes, different types of MPO, dust, etc. | Use host and switch software to confirm cabling |
| Optics | Very little interoperability, need to validate EVERY optic with switch | Validate BOM to ensure compatible optics. Management software to provide detailed optics status. Software to identify anomalies. |
| BIOS | Disable IOMMU and PCIe ACS for max performance on NCCL | Management software to validate host BIOS settings |
| OS kernel | Blackwell NVIDIA driver workaround for Ubuntu 24.04/Kernel 6.8 | Management software to validate versions and check known issues |
| Drivers | Mellanox OFED drivers, RDMA setup, Blackwell support | Management software to automate configuration of host networking |
| Kernel module | nvidia-peermem, DOCA | (See above) |
| NIC | MST tools, disable autoneg, 400G force link, tur | |
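The BIOS lesson above (disable PCIe ACS for NCCL performance, then have software validate the setting) can be sketched as a small host-side check. This is a minimal sketch, not the tooling used by the presenters: it assumes the text comes from `lspci -vvv` run as root, and it treats an `ACSCtl:` line containing `SrcValid+` as a sign that ACS is still enabled on that device. The function name `devices_with_acs_enabled` and the `SAMPLE` text are hypothetical and exist only for illustration.

```python
import re

# Device header lines in `lspci -vvv` output start in column 0 with a bus
# address such as "17:00.0" (or "0000:17:00.0" when PCI domains are shown).
_DEVICE_RE = re.compile(
    r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.IGNORECASE
)


def devices_with_acs_enabled(lspci_text):
    """Return bus addresses whose ACSCtl line reports SrcValid+."""
    enabled = []
    current = None  # bus address of the device block being scanned
    for line in lspci_text.splitlines():
        match = _DEVICE_RE.match(line)
        if match:
            current = match.group(1)
            continue
        stripped = line.strip()
        if (
            current is not None
            and stripped.startswith("ACSCtl:")
            and "SrcValid+" in stripped
        ):
            enabled.append(current)
    return enabled


# Hypothetical sample shaped like `lspci -vvv` output, for illustration only.
SAMPLE = (
    "17:00.0 PCI bridge: Example bridge A\n"
    "\tCapabilities: [148 v1] Access Control Services\n"
    "\t\tACSCap:\tSrcValid+ TransBlk+ ReqRedir+\n"
    "\t\tACSCtl:\tSrcValid+ TransBlk- ReqRedir+\n"
    "18:00.0 PCI bridge: Example bridge B\n"
    "\t\tACSCtl:\tSrcValid- TransBlk- ReqRedir-\n"
)

if __name__ == "__main__":
    for address in devices_with_acs_enabled(SAMPLE):
        print(f"ACS appears enabled on {address}")
```

On a real host the input would come from something like `subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout`. An empty result only suggests that no device reports ACS source validation; it does not confirm the IOMMU setting.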

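Summary of the document's main points:
- **Designing, building and testing an OCP AI cloud network**: how to design, build and test an OCP AI cloud network with industry-leading performance.
- **Key participants**: JM Hands of FarmGPU, Matt Roman of Celestica, and Marc Austin of Hedgehog.
- **Technical details**:
  - A 2U 64-port 800GbE data center switch is used.
  - The build uses the Celestica DS5000 800GbE switch and OCP networking software.
  - Network testing includes NCCL tests and RoCE QPN Hashing Mode.
- **Challenges and solutions**:
  - Ensuring optics compatibility and validating optics.
  - Using management software to validate hardware and software settings.
- **Future plans**:
  - Hedgehog will provide an AI training reference network design covering AI training fabrics, SONiC, and the Celestica DS5000 switch.
  - The reference design is expected in Q4 2025.
- **Resources**:
  - Hedgehog code and documentation are available on GitHub and the Hedgehog website.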
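The summary mentions NCCL tests without saying how their results are read. As a rough illustration, the nccl-tests suite reports an algorithm bandwidth (data size divided by elapsed time) and, for all-reduce, a bus bandwidth scaled by 2(n-1)/n for n ranks. The sketch below applies that arithmetic. The function name and the input numbers are hypothetical and are not results from the talk.

```python
def allreduce_bus_bandwidth(size_bytes, time_s, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce of size_bytes.

    algbw = size / time
    busbw = algbw * 2 * (n - 1) / n   (all-reduce correction factor)
    """
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    algbw = size_bytes / time_s / 1e9  # bytes/s converted to GB/s
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


if __name__ == "__main__":
    # Hypothetical inputs: a 4 GB all-reduce taking 0.5 s across 8 ranks.
    algbw, busbw = allreduce_bus_bandwidth(4e9, 0.5, 8)
    print(f"algbw = {algbw:.1f} GB/s, busbw = {busbw:.1f} GB/s")
```

Bus bandwidth is the more useful figure to compare against link speed, because the correction factor removes the dependence on the number of ranks.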