Speechocean (海天瑞声): Exploration and Practice in Large Model Data
Huang Yukai, CTO of Speechocean and General Manager of the Autonomous Driving Business Unit

Table of Contents
Trends in large model data
Speechocean's exploration in large model data
DOTS-LLM empowering large model data production

Large Models: Embracing a New Round of Technological Revolution

"The development of AI is as fundamental as the creation of the microprocessor, the personal computer, the Internet, and the mobile phone. Entire industries will reorient around it. Businesses will distinguish themselves by how well they use it."
Bill Gates

"If you look at these large language models, they have about a trillion connections, and things like OpenAI's GPT-4 know much more than we do. They have sort of common sense knowledge about everything, and so they probably know a thousand times as much as a person."
Geoffrey Hinton

"This is the iPhone moment of artificial intelligence… What OpenAI has done, what the team over there has done, genuinely, one of the greatest things that has ever been done for computing."
Jensen Huang

Data source: Perplexity.AI, CICC Research

Future Trends in the Large Model Field
One of the fastest AI applications to reach deployment: AI Answer
Multimodality
A data-centric AI strategy
The "ice cream model" of China's future large-model market
Open source vs. closed source & private deployment vs. public cloud
Mid-sizing/linearization of models
A boom in AI applications
From Copilot to Agent
Data source: CICC Research
Data source: Gartner, Twitter, ARK Invest; Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022; CICC Research

[Figure: training-scale chart on log axes, parameters (billions) vs. tokens (billions), contrasting GPT-3's $4.6M training run with a projected $0.6M run: 0.13x the cost, 57x the parameters, 720x the tokens]

"Training the state-of-the-art GPT-3 cost $4.6 million in 2020. According to our forecast, by 2030 the cost of training an AI model with 57 times the parameters and 720 times the tokens of GPT-3 will fall from $17 billion today to $600,000. Wikipedia's 4.2 billion words today represent roughly 5.6 billion tokens. By 2030, training a model on 162 trillion words, or 216 trillion tokens, should be possible. In a world of cheap compute, data will become the primary constraint."
ARK Invest

"We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant… We find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every d