Ying Wen (温颖) - Self-Improvement and Reasoning Enhancement of Large Models via Reinforcement Feedback

ML-Summit 2025 Global Machine Learning Technology Conference (www.ml-summit.org)

Ying Wen (温颖)
Tenure-track Associate Professor, School of Artificial Intelligence, Shanghai Jiao Tong University

Ying Wen is a tenure-track associate professor and doctoral supervisor at the School of Artificial Intelligence, Shanghai Jiao Tong University. His research covers multi-agent learning, reinforcement learning, and the application of game theory to both. He received his PhD and his research master's degree from the Department of Computer Science at University College London in 2020 and 2016, respectively. He was selected for the Shanghai Overseas High-Level Talent program, and as principal investigator he leads a National Key R&D Program project and a Shanghai Youth Science and Technology Talent Sailing Program project. More than forty of his papers have been published at leading international conferences in the field, including ICML, NeurIPS, ICLR, IJCAI, and AAMAS, and he received the CoRL 2020 Best System Paper Award and the AAMAS 2021 Blue Sky Track Best Paper Award. For many consecutive years he has served as a PC member or reviewer for well-known international conferences and journals, including ICML, NeurIPS, IJCAI, AAAI, IROS, ICAPS, and Operational Research.

Talk title: Self-Improvement and Reasoning Enhancement of Large Models via Reinforcement Feedback

2025 Global Machine Learning Technology Conference
Self-Improvement and Reasoning Enhancement of Large Models via Reinforcement Feedback
Ying Wen, Shanghai Jiao Tong University

Reinforcement Learning (RL)
A method to find a policy with high rewards.
Reward defines the optimal state and action distribution given the dynamics.
Key Concepts:
Environment (state/observation, action, and the dynamics)
Reward (scalar for each step or episode)

Progress in RL Over the Past Decade
Champion-level drone racing
Discovering faster matrix multiplication algorithms
AlphaGo Zero, AlphaZero, and AlphaStar

Reward/Value Attempt 1: AlphaZero-like MCTS + SFT
Tree Search to Enhance LLM Reasoning and Training

From linear decoding to principled decoding
1. How to select between candidate steps? Ev
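The key concepts on the Reinforcement Learning slide above (an environment with states, actions, and dynamics; a scalar reward per step; a policy judged by the reward it collects) can be illustrated with a minimal sketch. Everything here is a hypothetical toy that does not come from the talk: the `ChainEnv` class, its reward values, and the two policies are assumptions chosen only to make the vocabulary concrete.

```python
import random
from dataclasses import dataclass


@dataclass
class ChainEnv:
    """Hypothetical toy environment: walk along positions 0..goal."""
    goal: int = 4
    max_steps: int = 20

    def reset(self) -> int:
        self.pos = 0
        self.steps = 0
        return self.pos  # the state (here also the observation)

    def step(self, action: int):
        """Dynamics: action is -1 or +1; position is clipped to [0, goal]."""
        self.pos = max(0, min(self.goal, self.pos + action))
        self.steps += 1
        done = self.pos == self.goal or self.steps >= self.max_steps
        # Scalar reward for each step (assumed values, for illustration only).
        reward = 1.0 if self.pos == self.goal else -0.1
        return self.pos, reward, done


def episode_return(env, policy) -> float:
    """Roll out one episode and sum the per-step rewards."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


def always_right(state: int) -> int:
    """A policy that always moves toward the goal."""
    return 1


def make_random_policy(seed: int):
    """A policy that ignores the state and moves left or right at random."""
    rng = random.Random(seed)
    return lambda state: rng.choice([-1, 1])


if __name__ == "__main__":
    env = ChainEnv()
    print("always_right:", episode_return(env, always_right))
    print("random:", episode_return(env, make_random_policy(0)))
```

In this toy setting the `always_right` policy pays three step penalties and then collects the goal reward, for a return of about 0.7, which is the highest return any policy can reach here. Finding such a policy from reward alone is the problem RL addresses.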
