Current Best Practices for Training LLMs from Scratch

UTC NETWORK PTE LTD
WWW.UTC.GROUP

Table of Contents

Introduction
BUILD VS. BUY PRE-TRAINED LLM MODELS
THE SCALING LAWS
HARDWARE
DATASET COLLECTION
DATASET PRE-PROCESSING
PRE-TRAINING STEPS
MODEL EVALUATION
BIAS AND TOXICITY
INSTRUCTION TUNING
REINFORCEMENT LEARNING THROUGH HUMAN FEEDBACK (RLHF)
Conclusion

Introduction

Although we're only a few years removed from the transformer breakthrough, LLMs have already grown massively in performance, cost, and promise. But many of the critical details and key decision points are often passed down by word of mouth.

The goal of this white paper is to distill the best practices for training your own LLM from scratch. We'll cover everything from scaling and hardware to dataset selection and model training, letting you know which tradeoffs to consider and flagging some potential pitfalls along the way. This is meant to be a fairly exhaustive look at the key steps and considerations you'll make when training an LLM from scratch.

The first question you should ask yourself is whether training one from scratch is right for your organization. As such, we'll start there:

BUILD VS. BUY PRE-TRAINED LLM MODELS

Before starting LLM pre-training, the first question you need to ask is whether you should pre-train