Keynote Sessions

Time | Topic | Speaker(s)
13:30-13:40 | Opening Remarks | 李曦鹏
13:40-14:20 | GPU Programming and Optimization: Best Practices | 刘冰 / 郑鹏
14:20-14:50 | Full-Lifecycle Large Language Model Development in NVIDIA NeMo, Using LLaMa2 as an Example | 姚鑫 / 颜子杰
14:50-15:20 | TensorRT Hackathon 2023 Recap: In-Depth Analysis of Typical AIGC and LLM Inference Cases | 季光 / 陈庾
15:20-15:40 | Vector Database Acceleration Strategies and Practice | 王雍 / 张静蓉
15:40-16:00 | Latest Optimization Strategies and Practice for Recommender Systems, Using HPS as an Example | 魏英灿 / 王泽寰

Breakout Discussions and Q&A

Group | Topic | Location | Experts
1 | GPU experts: Tensor Core programming Q&A; Nsight example walkthrough | 悦府 Hall 10 | 刘冰 / 郑鹏 / 郁凡 / 王猛
2 | LLM training: LLM resource analysis; NVIDIA NeMo code walkthrough | 悦府 Hall 11 | 颜子杰 / 陶砺 / 姚鑫
3 | TRT LLM and diffusion models: TRT LLM code walkthrough; demoDiffusion code walkthrough | 悦府 Hall 12 | 季光 / 薛博阳 / 陈庾 / 方杰
4 | Vector databases: Top-k Q&A; RAFT deep dive | Open area, left side (opposite 悦府 Hall 12) | 王雍 / 张静蓉 / 董建兵
5 | Recommender system training and inference | Open area, right side (opposite 悦府 Hall 10) | 魏英灿 / 王泽寰 / 张耀斌 / 孙凯

Welcome Remarks
李曦鹏, General Manager, Developer and Technology, NVIDIA Asia-Pacific

GPU Programming and Optimization: Best Practices
刘冰 & 郑鹏
Petrick Liu (刘冰), Devtech | Perkz Zheng (郑鹏), Devtech
Agenda
- CUDA Optimization Fundamentals
  - Understand what is Global Memory Coalesced Access
  - Understand what is Shared Memory Bank Conflict
  - What are ILP and TLP
- Case Study
  - Why fuse the MHA
  - FMHA as example

GPU Architecture
GPU: a massive-throughput machine; the goal is to keep throughput at its maximum.
[Figures: full GH100 with 144 SMs; GH100 streaming multiprocessor]
H100 SXM5:
- DRAM: 3352 GB/s
- FP32 non-Tensor: 66.9 TFLOPS
- FP16 dense-Tensor: 984.9 TFLOPS
- FP8 dense-Tensor: 1978.9 TFLOPS
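A quick roofline-style reading of the numbers above (back-of-the-envelope arithmetic, not from the slides): the FP16 Tensor Core ridge point is

    984.9 TFLOPS / 3352 GB/s ≈ 294 FLOP/byte

so a kernel must perform roughly 294 FP16 operations per byte of DRAM traffic before it becomes compute-bound; anything less is limited by memory bandwidth. This is why the memory-access fundamentals that follow (coalescing, bank conflicts) come first.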
Understand what is Global Memory Coalesced Access
Typical example: global memory loads and stores by the threads of a warp are coalesced by the device into as few transactions as possible. The access unit is 32 bytes (also known as a sector).
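A minimal sketch of the difference (our own illustration, not code from the talk; the kernel names and the stride of 32 are arbitrary): with consecutive indexing, the 32 threads of a warp read 32 adjacent floats, i.e. 128 contiguous bytes served as four 32-byte sectors, whereas a stride of 32 floats puts every thread of the warp into its own sector.

#include <cuda_runtime.h>

// Fully coalesced: thread i touches element i, so a warp reads
// 32 consecutive floats = 128 contiguous bytes = 4 x 32-byte sectors.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: with stride = 32 floats (128 bytes), every thread of the warp
// falls into a different 32-byte sector, so the same warp-wide load now
// costs 32 sectors instead of 4.
__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                 // 16M floats, 64 MiB per buffer
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    copy_coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copy_strided<<<(n / 32 + 255) / 256, 256>>>(in, out, n, 32);

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Profiling the two kernels in Nsight Compute makes the cost visible directly as the number of sectors per memory request.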