{"id":1351,"date":"2025-01-22T07:02:57","date_gmt":"2025-01-22T07:02:57","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/22\/understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7\/"},"modified":"2025-01-22T07:02:57","modified_gmt":"2025-01-22T07:02:57","slug":"understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/22\/understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7\/","title":{"rendered":"Understanding the Evolution of ChatGPT: Part 3\u2014 Insights from Codex and InstructGPT"},"content":{"rendered":"<p>    Understanding the Evolution of ChatGPT: Part 3\u2014 Insights from Codex and InstructGPT<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Mastering the art of fine-tuning: Learnings for training your own\u00a0LLMs.<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AHNiOu0hotXxagUamrvGHnA.jpeg?ssl=1\"><figcaption>(Image from <a href=\"https:\/\/www.istockphoto.com\/photo\/success-transformation-gm1417761158-464748077\">Unsplash<\/a>)<\/figcaption><\/figure>\n<p>This is the third article in our GPT series, and also the most practical one: finally, we will talk about how to effectively fine-tune LLMs.<\/p>\n<p>It is <strong>practical<\/strong> in the sense that, if you were asked to train your own LLM today, you could skip pre-training and jump straight into using an open-source LLM or SLM; however, you would very likely still need to finetune it a bit on your own data and task, and that is where this article can\u00a0help.<\/p>\n<p>More specifically, we will focus on two finetuned models, Codex and InstructGPT, as they represent two types of 
challenges in LLM finetuning:<\/p>\n<ul>\n<li>Codex needs to adapt a pretrained LLM to a different modality (source code), as programming languages have many characteristics that differ from natural language;<\/li>\n<li>InstructGPT aims to make the model more aligned with human preferences, which cannot be achieved automatically by traditional language modeling objectives.<\/li>\n<\/ul>\n<p>As we will see later, both challenges demand creativity and carefulness at every stage of the finetuning process: how to collect high-quality data, how to modify model architectures, how to effectively initialize your model, how to determine a proper objective, and how to properly evaluate\u00a0it.<\/p>\n<p>Below is the outline for this\u00a0article:<\/p>\n<ul>\n<li>\n<strong>Overview<\/strong>: why we need finetuning and what makes it so challenging; GPT-3.5 and its finetuned versions.<\/li>\n<li>\n<strong>Codex<\/strong>: how to evaluate code generation properly, how to collect data and how to adapt the model to process programming languages.<\/li>\n<li>\n<strong>InstructGPT and ChatGPT<\/strong>: how to evaluate alignment, why RLHF works, and how it is implemented in InstructGPT.<\/li>\n<li>\n<strong>Summary: <\/strong>best practices in LLM finetuning.<\/li>\n<\/ul>\n<p>Below are the links to our previous articles if you are interested:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/medium.com\/towards-data-science\/understanding-the-evolution-of-gpt-part-1-an-in-depth-look-at-gpt-1-and-what-inspired-it-b7388a32e87d\">Part 1: An In-Depth Look at GPT-1 and What Inspired It<\/a>: where we cover the <strong>pre-training plus finetuning<\/strong> paradigm and its evolution from CV to NLP, previous pre-training efforts such as <strong>Word2Vec<\/strong> and <strong>GloVe<\/strong>, <strong>decoder-only Transformers<\/strong>, <strong>auto-regressive<\/strong> vs. 
<strong>auto-encoding<\/strong> LM, and key innovations of\u00a0<strong>GPT-1<\/strong>.<\/li>\n<li>\n<a href=\"https:\/\/medium.com\/towards-data-science\/understanding-the-evolution-of-chatgpt-part-2-gpt-2-and-gpt-3-77a01ed934c5\">Part 2: GPT-2 and GPT-3<\/a>: where we cover how GPT models were scaled up from 117M to 175B parameters, under the philosophy of exploring <strong>task-agnostic pre-training<\/strong> via the <strong>scaling hypothesis<\/strong> and <strong>in-context learning<\/strong>.<\/li>\n<\/ul>\n<h3>Overview<\/h3>\n<p>As we explained in <a href=\"https:\/\/medium.com\/towards-data-science\/understanding-the-evolution-of-chatgpt-part-2-gpt-2-and-gpt-3-77a01ed934c5\">our second article<\/a>, both GPT-2 and GPT-3 can be considered as OpenAI\u2019s experiments to test the potential of task-agnostic pre-training. While doing so, the authors also mentioned finetuning as a promising direction for future studies, as it might help the model to further improve its performance on certain\u00a0tasks.<\/p>\n<h4>Why is Finetuning Needed?<\/h4>\n<p>The reasons are three-fold.<\/p>\n<p>The first reason is of course performance. Pre-trained models are more like generalists that can perform a wide range of tasks reasonably well, but they might still struggle to beat specialists trained on a particular task. If our goal is to have such a specialized model to help us on a very specific task, then finetuning should definitely be considered.<\/p>\n<p>Another reason is that, although generally powerful, GPT-3 models are not always reliable in following human instructions, especially when those instructions become complex. As the authors explained in the InstructGPT paper, this is because the pre-training objective focuses mainly on language modeling, that is, predicting the next token, and this capability does not directly translate to instruction-following. 
Thus, some special finetuning strategies are\u00a0needed.<\/p>\n<p>There are also concerns about safety and ethics, for a very similar reason: auto-regressive language modeling alone is not sufficient to keep the model from generating harmful or biased answers. Here, too, finetuning gives us better control over the generation process.<\/p>\n<h4>Challenges in Finetuning<\/h4>\n<p>Broadly speaking, there are two types of challenges in finetuning LLMs: the need to adapt to a new modality, and the need to align the model with human preferences.<\/p>\n<p>Codex is an example of the former case: the pre-trained model needs to be applied to a different modality with its own characteristics. To process source code, for example, it needs to understand the basic syntax of a specific programming language, handle static and dynamic types and even infer types, and correctly handle indentation in languages like\u00a0Python.<\/p>\n<p>The latter case is trickier in some ways, as \u201calignment\u201d itself is a rather vague and controversial concept, and it has to be defined more clearly and translated into a set of measurable aspects before we can actually finetune towards that goal. Moreover, even once we have worked out a definition of alignment, achieving that goal is non-trivial, as there are no ready-to-use training objectives directly connected to\u00a0it.<\/p>\n<p>On top of that, we also need to collect high-quality domain-specific training data and rethink the evaluation process, including the evaluation dataset as well as the evaluation metrics to\u00a0use.<\/p>\n<p>In later sections, we will see how Codex and InstructGPT handled these issues. 
In particular, we will highlight how they implemented every step with both creativity and carefulness, from which anyone who wants to finetune their own LLM can learn something.<\/p>\n<h4>GPT-3.5<\/h4>\n<p>The GPT-3.5 series typically refers to the models finetuned on top of GPT-3, including the following variants (see\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/GPT-3#GPT-3.5\">wiki<\/a>):<\/p>\n<ul>\n<li>code-davinci-002: a version of\u00a0Codex.<\/li>\n<li>text-davinci-002: a transitional model from GPT-3 to InstructGPT.<\/li>\n<li>text-davinci-003: more similar to InstructGPT.<\/li>\n<\/ul>\n<p>Overall, GPT-3.5 could be considered as finetuned GPT-3 with enhanced instruction following, better generation quality, and better steerability. It is the foundation of ChatGPT, which demonstrates the potential of effectively finetuning LLMs on specialized tasks.<\/p>\n<p>In the following sections, we will dive deeper into Codex and InstructGPT. 
Rather than covering every detail of their finetuning process, we will mainly focus on the aspects that best showcase the importance of creativity and carefulness.<\/p>\n<h3>Codex<\/h3>\n<p>The <a href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">Codex<\/a> model was released in 2021 and specializes in writing Python code.<\/p>\n<p>Below are a few aspects that we want to highlight.<\/p>\n<h4>Evaluation of Code Generation<\/h4>\n<p>When building a model for a new task, the first thing that often comes to mind is how to evaluate that task properly.<\/p>\n<p>This is important because, without an effective evaluation protocol, we cannot determine if we are really making any progress, and sometimes we cannot even identify the gaps in our current model in the first\u00a0place.<\/p>\n<p>In the case of Codex, the authors first realized that standard match-based metrics such as the BLEU score are not suitable for measuring code generation performance.<\/p>\n<p>In case you are not familiar with the <strong>BLEU score<\/strong>: it is widely used for evaluating text generation tasks such as machine translation. It compares overlapping n-grams between the generated text and the reference text to compute a precision score, with a brevity penalty for outputs that are too\u00a0short.<\/p>\n<p>However, the same coding problem might be solved with different data structures or algorithms. 
For example, computing the n-th Fibonacci number can be implemented with either a top-down or a bottom-up DP algorithm, resulting in very different\u00a0code:<\/p>\n<pre>def fib_top_down(n, memo=None):<br>    # Top-down DP: recurse from n and cache each result in memo.<br>    if memo is None:  # avoid sharing a mutable default across calls<br>        memo = {}<br>    if n in memo:<br>        return memo[n]<br>    if n &lt;= 1:<br>        return n<br>    memo[n] = fib_top_down(n-1, memo) + fib_top_down(n-2, memo)<br>    return memo[n]<br><br>def fib_bottom_up(n):<br>    # Bottom-up DP: fill the table from the base cases up to n.<br>    if n &lt;= 1:<br>        return n<br>    dp = [0] * (n + 1)<br>    dp[0], dp[1] = 0, 1<br>    for i in range(2, n + 1):<br>        dp[i] = dp[i-1] + dp[i-2]<br>    return dp[n]<\/pre>\n<p>In that case, if we evaluate both solutions against a given reference solution using BLEU score, it is very likely that one or even both solutions will have very low BLEU scores, even though both solutions are\u00a0correct.<\/p>\n<p>An alternative is to evaluate by what the authors called \u201c<strong>functional correctness<\/strong>\u201d, for example the <strong><em>pass@k<\/em><\/strong> metric used by <a href=\"https:\/\/arxiv.org\/pdf\/1906.04908\">Kulal et al.<\/a>, where for each problem we generate <strong><em>k<\/em><\/strong> code samples and test each of them, and a problem is considered solved if any sample passes the unit tests. In the end, the total fraction of problems solved is reported. However, as the authors pointed out, calculating <strong><em>pass@k<\/em><\/strong> with this definition results in high variance due to the randomness in this process, especially when <strong><em>k<\/em><\/strong> is\u00a0small.<\/p>\n<p>To mitigate this issue, the authors propose another way to estimate <strong><em>pass@k<\/em><\/strong>: instead of generating k samples directly, they generate n \u2265 k samples per task. As more samples are generated and tested, the estimation process will be more reliable even if k is small. 
Then, based on how many samples are correct (say c samples pass the unit tests), an unbiased estimate of pass@k can be computed as\u00a0below:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/997\/1%2AXBJx-hmKq1skZcQeHTWLfw.png?ssl=1\"><figcaption>Figure 1. Left: the optimized pass@k definition. Right: a numerically stable script to calculate pass@k. (Image from <a href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">Codex\u00a0paper<\/a>.)<\/figcaption><\/figure>\n<p>where<\/p>\n<ul>\n<li>\n<em>C(n, k)<\/em> is the number of ways to choose k samples out of\u00a0n;<\/li>\n<li>\n<em>C(n-c, k)<\/em> is the number of ways to choose k samples out of the (n-c) incorrect samples;<\/li>\n<li>Thus, <em>C(n-c, k)\/C(n, k)<\/em> represents the probability that all k chosen samples are incorrect;<\/li>\n<li>Finally, <em>1 - C(n-c, k)\/C(n, k)<\/em> represents the probability that at least one sample is\u00a0correct.<\/li>\n<\/ul>\n<p>To further show that optimizing for BLEU score is not equivalent to optimizing for functional correctness, the authors also plot the BLEU score densities for correct (blue) and wrong (green) solutions for 4 random coding problems, where the distributions are clearly not separable:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/643\/1%2ANdevc5GUcClqJtSQe8qIOw.png?ssl=1\"><figcaption>Figure 2. BLEU score probability density for correct (blue) and wrong (green) solutions for 4 random problems. (Image from <a href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">Codex\u00a0paper<\/a>.)<\/figcaption><\/figure>\n<p>Beyond refining the evaluation metric, the authors also built a new dataset called <strong>HumanEval<\/strong>, which contains 164 hand-written programming problems. 
As shown in the example below, each problem includes a function signature, a docstring, a body and an average of 7.7 unit\u00a0tests:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/989\/1%2Ae80_CJB3tyHH0Mj8fPdQIQ.png?ssl=1\"><figcaption>Figure 3. Example problems from the HumanEval dataset. (Image from <a href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">Codex\u00a0paper<\/a>.)<\/figcaption><\/figure>\n<p>Note that, as the authors mention in the paper, it is important for these tasks to be hand-written, since otherwise the evaluation problems might overlap with the training data. To ensure the testing process does not pose any risk from malicious code, the authors also created a sandbox to execute the generated\u00a0code.<\/p>\n<h4>Training Data Collection<\/h4>\n<p>Moving to the training part, the first question is how to collect high-quality training data. For code generation, the good news is that we can leverage the vast number of code repositories on GitHub, but some data cleaning is still needed, as the paper mentions:<\/p>\n<blockquote><p>We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters.<\/p><\/blockquote>\n<p>Note that most of these cleaning strategies are specific to programming languages, so we might need to come up with other ideas when cleaning our own\u00a0data.<\/p>\n<h4>Adaptations in Finetuning<\/h4>\n<p>The most important adaptation is to the tokenizer, for the obvious reason that the distribution of words in GitHub code differs a lot from that of natural language. 
In the Codex paper, the authors noted that this is especially the case when <strong>encoding whitespace<\/strong>, making the original GPT-3 tokenizer less effective.<\/p>\n<p>To fix that issue, an additional set of tokens was added to the vocabulary to represent whitespace runs of different lengths. As mentioned in the paper, this simple modification enables representing code with approximately 30% fewer\u00a0tokens.<\/p>\n<p>So, if our model needs to handle an input corpus whose distribution differs from that of natural language, we might need to study that distribution and modify the tokenizer as\u00a0well.<\/p>\n<h4>Findings in Evaluation<\/h4>\n<p>Firstly, the figure below shows the pass rates of different models on the HumanEval dataset. Overall, all the Codex variants show significantly better performance compared to GPT-3,\u00a0where<\/p>\n<ul>\n<li>Codex (finetuned on code) solves 28.8% of the problems;<\/li>\n<li>Codex-S (finetuned on standalone functions) solves\u00a037.7%;<\/li>\n<li>Codex-S, when generating 100 samples and selecting the one with the highest mean log-probability, solves\u00a044.5%;<\/li>\n<li>The Codex-S oracle, which selects the sample that passes the unit tests, solves an impressive 77.5% of the problems.<\/li>\n<\/ul>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/621\/1%2A5Ba-BQIYtd3yNNw5KmDmag.png?ssl=1\"><figcaption>Figure 4. Codex pass rates. (Image from <a href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">Codex\u00a0paper<\/a>.)<\/figcaption><\/figure>\n<p>Plus, a scaling law similar to that of GPT-3 is also observed, suggesting better performance can be achieved with even larger\u00a0models:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/600\/1%2Ael_T_TVyqx_e11qWgz_QBA.png?ssl=1\"><figcaption>Figure 5. Test loss vs. number of parameters. 
(Image from <a href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">Codex\u00a0paper<\/a>.)<\/figcaption><\/figure>\n<p>The authors also noticed that higher temperatures are preferred for larger k, highlighting the importance of careful hyper-parameter tuning:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/653\/1%2AF1d5EKrrTe-kNmBbhRhTfw.png?ssl=1\"><figcaption>Figure 6. Higher temperatures are preferred for larger k. (Image from <a href=\"https:\/\/arxiv.org\/pdf\/2107.03374\">Codex\u00a0paper<\/a>.)<\/figcaption><\/figure>\n<h3>InstructGPT and\u00a0ChatGPT<\/h3>\n<h4>Evaluation of Alignment<\/h4>\n<p>Properly evaluating \u201calignment\u201d is also challenging, as alignment is not as clearly defined as properties such as accuracy. In this work, the authors define alignment as <strong>whether the models are \u201chelpful, honest, and harmless\u201d<\/strong>, and convert these into more measurable properties:<\/p>\n<ul>\n<li>\n<strong>Helpful<\/strong>: by measuring whether the model can <strong>follow instructions<\/strong> and even infer intentions from a few-shot\u00a0prompt.<\/li>\n<li>\n<strong>Honest<\/strong>: by measuring truthfulness, or in the authors\u2019 words, \u201cif the model\u2019s statements about the world are true\u201d. 
More specifically, they measure it by the <strong>hallucination rate<\/strong> on closed-domain tasks and by performance on the TruthfulQA dataset.<\/li>\n<li>\n<strong>Harmless<\/strong>: by measuring \u201cif an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content\u201d, and benchmarking on datasets designed to measure <strong>bias<\/strong> and <strong>toxicity<\/strong>.<\/li>\n<\/ul>\n<p>On top of that, to make sure the finetuning process does not cause severe regressions in pre-training performance, the evaluation process also needs to reflect quality on both the pre-training and finetuning objectives. For that reason, InstructGPT was evaluated on two separate datasets:<\/p>\n<ul>\n<li>\n<strong>Evaluations on API distribution<\/strong>: this is mainly for evaluating the finetuning quality, by asking human labelers to rate which output is preferred;<\/li>\n<li>\n<strong>Evaluations on public NLP datasets<\/strong>: this evaluates both the pre-training and finetuning quality, including traditional NLP datasets as well as datasets for evaluating model safety like truthfulness, toxicity and\u00a0bias.<\/li>\n<\/ul>\n<p>Next, we will briefly explain how RLHF works and how it is implemented in InstructGPT.<\/p>\n<h4>RLHF (Reinforcement Learning from Human Feedback)<\/h4>\n<p>The figure below shows the 5 elements in a typical Reinforcement Learning scenario:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/308\/1%2AzEels8f7xCaTTbxwsITggg.png?ssl=1\"><figcaption>Figure 7. Five elements in RL: Agent, Environment, Reward, State and Action. 
(Image from\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning\">wiki<\/a>.)<\/figcaption><\/figure>\n<p>Now imagine you are teaching your puppy to sit, where you can find all the 5 elements:<\/p>\n<ul>\n<li>\n<strong>Agent<\/strong>: Your puppy learning this new command\u00a0\u201csit\u201d.<\/li>\n<li>\n<strong>Environment<\/strong>: Everything around your\u00a0puppy.<\/li>\n<li>\n<strong>State<\/strong>: The situation your puppy is in (whether it is sitting or\u00a0not).<\/li>\n<li>\n<strong>Reward<\/strong>: A treat that you give your puppy when it follows your\u00a0command.<\/li>\n<li>\n<strong>Action<\/strong>: What your puppy could do, like sitting, jumping or\u00a0barking.<\/li>\n<\/ul>\n<p>Reinforcement Learning works like this: in the beginning, your puppy (agent) doesn\u2019t understand what \u201csit\u201d means, but it tries different things like running, sitting or even barking (actions) in your house (environment). Every time it sits, it gets a treat (reward). Over time, your puppy learns that sitting earns a treat, and it appears to finally understand \u201csit\u201d.<\/p>\n<p>Training a model with RL follows a very similar <strong>trial-and-error<\/strong> approach. The key to RL is having a well-designed reward. This reward must be closely aligned with the goal; otherwise the agent will not be able to learn the desired behaviors. Meanwhile, producing such a reward should be as easy and quick as possible: if the reward is too slow or too complicated to calculate, the RL process will also become extremely slow, making it less useful in practical tasks.<\/p>\n<p>For example, in a game, every action the agent takes will automatically get a score from the environment, and this score is directly connected to your agent\u2019s performance in playing this\u00a0game.<\/p>\n<p>However, in many real-world applications, there is no ready-to-use reward like a score in a game. 
Instead, researchers have to put great effort into defining a proper reward function. Moreover, some desired behaviors are very difficult to translate into reward functions; for example, how could you define a reward function to guide the agent to answer questions more politely?<\/p>\n<p>This leads to <strong>RLHF<\/strong>: <strong>Reinforcement Learning from Human Feedback<\/strong>.<\/p>\n<p>Again in the puppy training example, imagine your puppy finally learns to sit, but sometimes it also barks while sitting, or it will jump onto the couch first instead of sitting quietly on the\u00a0floor.<\/p>\n<p>What can you do in that\u00a0case?<\/p>\n<p>With <strong>RLHF<\/strong>, you don\u2019t just give your puppy a treat every time it sits. Instead, you give treats by <strong>comparing<\/strong> its behaviors. For example, if the puppy sits quietly on the floor, it gets a bigger reward than if it sits while barking or after jumping onto the couch. This way, your puppy learns that sitting quietly on the floor is better, even though you didn\u2019t explicitly explain what \u201cquiet\u201d\u00a0means.<\/p>\n<p>As we mentioned before, having an easy and fast reward is the key to RL, which makes it unrealistic to involve a human in the training loop to provide direct feedback. 
To overcome this issue, we can collect some human feedback first, and then use this feedback to learn a reward function that mimics human preferences when comparing two\u00a0actions.<\/p>\n<p>In summary, RLHF typically involves three\u00a0stages:<\/p>\n<ul>\n<li>\n<strong>Collect human feedback<\/strong>: sample model outputs, and ask human judges to compare which is\u00a0better.<\/li>\n<li>\n<strong>Learn a reward model<\/strong> by mimicking the human judges&#8217; preferences.<\/li>\n<li>\n<strong>Train a better policy<\/strong> using the learned reward model in the RL\u00a0process.<\/li>\n<\/ul>\n<p>In case you are not familiar with RL terminology: a <strong>policy<\/strong> refers to the agent\u2019s strategy to choose actions based on the state of the environment.<\/p>\n<p>Next we will cover how this RLHF approach is implemented in finetuning InstructGPT.<\/p>\n<h4>Implementation of RLHF in InstructGPT<\/h4>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2203.02155\">InstructGPT<\/a> and <a href=\"https:\/\/openai.com\/index\/chatgpt\/\">ChatGPT<\/a> were trained using the same methods (see this <a href=\"https:\/\/openai.com\/index\/chatgpt\/\">blog<\/a>), with RLHF being the key element in finetuning.<\/p>\n<p>The training process largely follows the steps we have introduced in the previous section, with special care given to data quality and implementation details, which in my opinion <strong>are equally important in making InstructGPT such a\u00a0success<\/strong>.<\/p>\n<p>Now let me break it\u00a0down.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/998\/1%2AbkxVWz5OjA948ggkvcPi3A.png?ssl=1\"><figcaption>Figure 8. An illustration of the RLHF steps in training InstructGPT\/ChatGPT. 
(Image from <a href=\"https:\/\/arxiv.org\/pdf\/2203.02155\">InstructGPT paper<\/a>.)<\/figcaption><\/figure>\n<p><strong>Step 1: Collect demonstration data and train a supervised policy<\/strong><\/p>\n<p>In this step, human labelers were asked to provide high-quality demonstrations of the desired behavior for each\u00a0prompt.<\/p>\n<p><strong>Prompt dataset<\/strong>: To begin with, you need to have a prompt dataset from which you can sample individual prompts, and ideally that prompt dataset should be both useful and\u00a0diverse.<\/p>\n<p>To do that, the authors took an iterative approach: in the very beginning, labelers were asked to manually write some seed prompts, and these data were used to train a model via supervised learning. This model was later deployed to the OpenAI API to collect text prompts from users, which later formed the prompt\u00a0dataset.<\/p>\n<p>The table below shows the distribution of this prompt dataset, as diversity is very important in making sure the model will be trained on various\u00a0tasks:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/339\/1%2AAHwKUSFYBNHdiM1rfNaOYw.png?ssl=1\"><\/figure>\n<p><strong>Human data collection<\/strong>: human data are needed at three points in the RLHF process: writing demonstrations in Step 1, providing comparison data in Step 2, and conducting final evaluations after finetuning.<\/p>\n<p>In the paper the authors mentioned many practices to ensure data\u00a0quality:<\/p>\n<ul>\n<li>Firstly, high-quality data come from good labelers. 
To ensure their ability in data labeling, a screening test was conducted to select labelers who were \u201csensitive to the preferences of different demographic groups, and were good at identifying outputs that were potentially harmful\u201d.<\/li>\n<li>Secondly, to ensure consistency between all the labelers, an onboarding process was set up to train all labelers, and detailed instructions for each task were provided. The authors also mentioned that they set up a shared chat room to answer questions from labelers.<\/li>\n<li>Finally, to see how the model generalizes to the preferences of different labelers, a separate group of labelers who did not go through the screening test were hired for evaluation.<\/li>\n<\/ul>\n<p>Based on these human demonstration data, a pretrained GPT-3 model was finetuned using supervised learning in the first step. This model is referred to as the baseline policy, which will be used to produce comparison outputs in Step 2 and initialize the PPO algorithm in Step\u00a03.<\/p>\n<p><strong>Step 2: Collect comparison data and train a reward\u00a0model<\/strong><\/p>\n<p><strong>Comparison data collection<\/strong>: Once the baseline policy is available, it is used to generate outputs for some sampled prompts, and these outputs will be reviewed and ranked by human labelers from the best to the worst. To speed up this ranking process, a set of K outputs will be shown simultaneously to the human labelers, where K ranges from 4 to\u00a09.<\/p>\n<p><strong>Reward model training<\/strong>: The reward model was initialized from the supervised baseline policy, by removing the final unembedding layer and training on the comparison data. In particular, the authors mention that <strong>training all comparisons from each prompt as a single batch element<\/strong> rather than shuffling the comparisons can help alleviate overfitting. The reward model has 6B parameters and was trained to assign a scalar score to each prompt-response pair. 
Note that we need to seek a balance when deciding the size of this reward model: it needs to be sufficiently large to accurately mimic human preferences, but it cannot be too large, since it needs to support fast inference during the RL\u00a0process.<\/p>\n<p><strong>Step 3: Optimize a policy using the reward model with\u00a0PPO<\/strong><\/p>\n<p>At this point we have everything ready to finetune the model with RLHF: the initial policy and the reward model. The training in this step follows a typical RL process: in each episode, a new prompt is sampled (the \u201c<strong>state\u201d<\/strong>) and new outputs will be generated (the model\u2019s \u201c<strong>action\u201d<\/strong>) by the current policy (the \u201c<strong>agent\u201d<\/strong>), and then the reward model will calculate a reward for the output (\u201c<strong>reward\u201d<\/strong>), according to which the policy will be updated using\u00a0PPO.<\/p>\n<p>Don\u2019t worry if you are not familiar with <strong>PPO<\/strong>: it is simply a policy-gradient method designed to keep each policy update small, so that the agent changes its strategy <strong>slowly<\/strong>.<\/p>\n<p>A few things to mention\u00a0here:<\/p>\n<ul>\n<li>A per-token KL penalty from the supervised (SFT) model is added to mitigate over-optimization of the reward\u00a0model.<\/li>\n<li>The authors further experimented with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets (such regressions are often called \u201c<strong>the alignment tax<\/strong>\u201d); the resulting models are referred to as \u201cPPO-ptx\u201d. 
In this paper, <strong>InstructGPT<\/strong> actually refers to the PPO-ptx\u00a0models.<\/li>\n<\/ul>\n<p>Note that Step 2 and Step 3 can be iterated continuously:<\/p>\n<ul>\n<li>With an updated policy (from Step 3), we can generate new outputs and collect more comparison data, which can be used to train a new reward model by repeating Step\u00a02;<\/li>\n<li>With a new reward model (from Step 2), we can get a better policy by repeating Step\u00a03.<\/li>\n<\/ul>\n<h4>Findings in Evaluation<\/h4>\n<p>Due to space limitations, we will not go through all the evaluation results in this article; instead, we will just highlight several new findings.<\/p>\n<p>As perhaps the most important finding, results show that <strong>RLHF can indeed improve alignment<\/strong>. The figure below shows the win rate against the supervised 175B GPT-3 model, evaluated by human judges. According to this figure, both PPO and PPO-ptx significantly outperform the GPT baselines, where even the 1.3B PPO models are better than the 175B GPT-3. This result clearly demonstrates the effectiveness of\u00a0RLHF.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AXormFyyLdwnAsd_IryWVxw.png?ssl=1\"><figcaption>Figure 9. Human evaluation results. 
(Image from <a href=\"https:\/\/arxiv.org\/pdf\/2203.02155\">InstructGPT paper<\/a>.)<\/figcaption><\/figure>\n<p>The authors also found that InstructGPT shows <strong>improvements in truthfulness<\/strong> (hallucination rate reduced from 41% to 21%), <strong>slight improvements in toxicity<\/strong> (25% fewer toxic outputs), but <strong>no significant improvements in reducing\u00a0bias<\/strong>.<\/p>\n<p>Another finding is that PPO-ptx can minimize performance regressions on public NLP datasets, as shown in the figure\u00a0below.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/822\/1%2ATqwG6rtyesfBLkcFQxsecA.png?ssl=1\"><figcaption>Figure 10. Few-shot performance on public NLP datasets. (Image from <a href=\"https:\/\/arxiv.org\/pdf\/2203.02155\">InstructGPT paper<\/a>.)<\/figcaption><\/figure>\n<h3>Summary<\/h3>\n<p>Training an LLM usually involves multiple stages like pre-training, supervised finetuning, and alignment with RLHF. For our tasks at hand, we can usually start from an open-source, pre-trained LLM and finetune it on domain-specific data.<\/p>\n<p>A few questions to ask while finetuning your own LLMs (though this is not meant to be an exhaustive list):<\/p>\n<ul>\n<li>Do we have a clear definition of the model\u2019s desired behaviors? How can we evaluate such behaviors? If there are no available metrics, can we create one ourselves?<\/li>\n<li>Do we have available training data? If not, how can we collect such data ourselves? If human labelers are needed, how can we ensure their labeling\u00a0quality?<\/li>\n<li>What kind of cleaning or pre-processing is needed? Are there any heuristics we can use to check the data\u00a0quality?<\/li>\n<li>Does our data cover a wide range of scenarios?<\/li>\n<li>Do we need to modify our tokenizers? Do we need to modify the model structures? 
Do we need to add auxiliary finetuning objectives?<\/li>\n<li>Does finetuning lead to regression on pre-training performance? Can we strike a\u00a0balance?<\/li>\n<li>Does finetuning lead to some unexpected negative behaviors? How can we mitigate\u00a0that?<\/li>\n<li>How can we prevent overfitting during finetuning?<\/li>\n<li>What hyper-parameters can we tune during finetuning or during evaluation? Are there any heuristics we can leverage?<\/li>\n<\/ul>\n<p>At the end of the day, exploring a new task is always both challenging and exciting, and I hope the learnings from this article can help make it less challenging, more exciting, and ultimately more enjoyable\u00a0\ud83d\ude42<\/p>\n<p>Thanks for\u00a0reading!<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/understanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7\">Understanding the Evolution of ChatGPT: Part 3\u2014 Insights from Codex and InstructGPT<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Shirley Li<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Funderstanding-the-evolution-of-chatgpt-part-3-insights-from-codex-and-instructgpt-04ece2967bf7\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Understanding the Evolution of ChatGPT: Part 3\u2014 Insights from Codex and InstructGPT Mastering the art of fine-tuning: Learnings for training your own\u00a0LLMs. 
(Image from Unsplash) This is the third article in our GPT series, and also the most practical one: finally, we will talk about how to effectively fine-tune LLMs. It is practical in the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,367,88,71,260,92],"tags":[1404,7,1403],"class_list":["post-1351","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-chatgpt","category-deep-learning","category-large-language-models","category-nlp","category-thoughts-and-theory","tag-codex","tag-how","tag-instructgpt"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1351"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1351"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1351\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}