{"id":2121,"date":"2025-02-28T07:05:19","date_gmt":"2025-02-28T07:05:19","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/28\/how-llms-work-reinforcement-learning-rlhf-deepseek-r1-openai-o1-alphago\/"},"modified":"2025-02-28T07:05:19","modified_gmt":"2025-02-28T07:05:19","slug":"how-llms-work-reinforcement-learning-rlhf-deepseek-r1-openai-o1-alphago","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/28\/how-llms-work-reinforcement-learning-rlhf-deepseek-r1-openai-o1-alphago\/","title":{"rendered":"How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo"},"content":{"rendered":"<p>    How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<p class=\"wp-block-paragraph\">Welcome to part 2 of my LLM deep dive. If you\u2019ve not read Part 1, I highly encourage you to <a href=\"https:\/\/towardsdatascience.com\/how-llms-work-pre-training-to-post-training-neural-networks-hallucinations-and-inference\/\">check it out first<\/a>.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Previously, we covered the first two major stages of training an LLM:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Pre-training\u200a\u2014\u200aLearning from massive datasets to form a base model.<\/li>\n<li class=\"wp-block-list-item\">Supervised fine-tuning (SFT)\u200a\u2014\u200aRefining the model with curated examples to make it useful.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">Now, we\u2019re diving into the next major stage: <strong>Reinforcement Learning (RL)<\/strong>. 
While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.<\/p>\n<p class=\"wp-block-paragraph\">I\u2019ve drawn on <a href=\"https:\/\/www.youtube.com\/watch?app=desktop&amp;v=7xTGNNLPyMI\">Andrej Karpathy\u2019s widely popular 3.5-hour YouTube video<\/a>. Andrej is a founding member of OpenAI, so his insights are gold\u200a\u2014\u200ayou get the idea.<\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s go <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f680.png?ssl=1\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><\/p>\n<h2 class=\"wp-block-heading\">What\u2019s the purpose of reinforcement learning (RL)?<\/h2>\n<p class=\"wp-block-paragraph\">Humans and LLMs process information differently. What\u2019s intuitive for us\u200a\u2014\u200alike basic arithmetic\u200a\u2014\u200amay not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.<\/p>\n<p class=\"wp-block-paragraph\">This difference in cognition makes it challenging for human annotators to provide the \u201cperfect\u201d set of labels that consistently guide an LLM toward the right answer.<\/p>\n<p class=\"wp-block-paragraph\"><em>RL bridges this gap by allowing the model to <\/em><strong><em>learn from its own experience<\/em><\/strong><em>.<\/em><\/p>\n<p class=\"wp-block-paragraph\">Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback\u200a\u2014\u200areward signals\u200a\u2014\u200aon which outputs are most useful. 
Over time, it learns to align better with human intent.<\/p>\n<h2 class=\"wp-block-heading\">Intuition behind RL<\/h2>\n<p class=\"wp-block-paragraph\">LLMs are stochastic\u200a\u2014\u200ameaning their responses aren\u2019t fixed. Even with the same prompt, the output varies because it\u2019s sampled from a probability distribution.<\/p>\n<p class=\"wp-block-paragraph\">We can harness this randomness by generating thousands or even millions of possible responses <strong>in parallel<\/strong>. Think of it as the model exploring different paths\u200a\u2014\u200asome good, some bad. <strong>Our goal is to encourage it to take the better paths more often.<\/strong><\/p>\n<p class=\"wp-block-paragraph\">To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, <strong>reinforcement learning allows the model to <\/strong><strong><em>learn from itself.<\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.<\/p>\n<p class=\"wp-block-paragraph\">But how do we determine which responses are best? And how much RL should we do? 
The details are tricky, and getting them right is not trivial.<\/p>\n<h2 class=\"wp-block-heading\">RL is not \u201cnew\u201d\u200a\u2014\u200aIt can surpass human expertise (AlphaGo, 2016)<\/h2>\n<p class=\"wp-block-paragraph\">A great example of RL\u2019s power is DeepMind\u2019s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.<\/p>\n<p class=\"wp-block-paragraph\">In the <a href=\"https:\/\/discovery.ucl.ac.uk\/id\/eprint\/10045895\/1\/agz_unformatted_nature.pdf\">2016 Nature paper<\/a> (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate from), the model was able to reach human-level performance, <strong>but never surpass it<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">The dotted line represents Lee Sedol\u2019s performance\u200a\u2014\u200athe best Go player in the world.<\/p>\n<p class=\"wp-block-paragraph\"><strong><em>This is because SFT is about replication, not innovation\u200a\u2014\u200ait doesn\u2019t allow the model to discover new strategies beyond human knowledge.<\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately <strong>exceed human expertise<\/strong> (blue line).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f7f7f9\" data-has-transparency=\"true\" style=\"--dominant-color: #f7f7f9;\" loading=\"lazy\" decoding=\"async\" width=\"1003\" height=\"1024\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-11-1003x1024.png?resize=1003%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-598503 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-11-1003x1024.png 1003w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-11-294x300.png 294w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-11-768x784.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-11.png 1013w\" sizes=\"auto, (max-width: 1003px) 100vw, 1003px\"><figcaption class=\"wp-element-caption\">Image taken from <a href=\"https:\/\/discovery.ucl.ac.uk\/id\/eprint\/10045895\/1\/agz_unformatted_nature.pdf\">AlphaGo 2016 paper<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">RL represents an exciting frontier in AI\u200a\u2014\u200awhere models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.<\/p>\n<h2 class=\"wp-block-heading\">RL foundations recap<\/h2>\n<p class=\"wp-block-paragraph\">Let\u2019s quickly recap the key components of a typical RL setup:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f4f5f2\" data-has-transparency=\"false\" style=\"--dominant-color: #f4f5f2;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"460\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-12-1024x460.png?resize=1024%2C460&#038;ssl=1\" alt=\"\" class=\"wp-image-598504 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-12-1024x460.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-12-300x135.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-12-768x345.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-12-1536x689.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-12.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong><em>Agent<\/em><\/strong><em> <\/em>\u2014<em> 
<\/em>The learner or decision maker. It observes the current situation (<em>state<\/em>), chooses an action, and then updates its behaviour based on the outcome (<em>reward<\/em>).<\/li>\n<li class=\"wp-block-list-item\">\n<strong><em>Environment<\/em><\/strong>\u200a \u2014\u200aThe external system in which the agent operates.<\/li>\n<li class=\"wp-block-list-item\">\n<strong><em>State<\/em><\/strong> \u2014\u200a A snapshot of the environment at a given step <em>t<\/em>.\u00a0<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">At each time step, the agent performs an <strong><em>action<\/em><\/strong> in the environment that moves the environment to a new state. The agent also receives feedback indicating how good or bad the action was.<\/p>\n<p class=\"wp-block-paragraph\">This feedback is called a <strong><em>reward<\/em><\/strong>, and is represented numerically. A positive reward encourages that behaviour, and a negative reward discourages it.<\/p>\n<p class=\"wp-block-paragraph\">By using feedback from different states and actions, the agent gradually learns the optimal strategy to <strong>maximise the total reward<\/strong> over time.<\/p>\n<h3 class=\"wp-block-heading\">Policy<\/h3>\n<p class=\"wp-block-paragraph\">The policy is the agent\u2019s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.<\/p>\n<p class=\"wp-block-paragraph\"><strong>In mathematical terms, it is a function that gives the probability of each action for a given state\u200a\u2014\u200a<\/strong><strong><em>(\u03c0\u03b8(a|s))<\/em><\/strong><strong>.<\/strong><\/p>\n<h3 class=\"wp-block-heading\">Value function<\/h3>\n<p class=\"wp-block-paragraph\">An estimate of how good it is to be in a certain state, considering the long-term expected reward. 
For an LLM, the reward might come from human feedback or a reward model.\u00a0<\/p>\n<h3 class=\"wp-block-heading\">Actor-Critic architecture<\/h3>\n<p class=\"wp-block-paragraph\">It is a popular RL setup that combines two components:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Actor<\/strong>\u200a\u2014\u200aLearns and updates the <strong>policy<\/strong> (\u03c0\u03b8), deciding which action to take in each state.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Critic<\/strong>\u200a\u2014\u200aEvaluates the <strong>value function<\/strong> (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.\u00a0<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">How it works:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The <strong>actor<\/strong> picks an action based on its current policy.<\/li>\n<li class=\"wp-block-list-item\">The <strong>critic<\/strong> evaluates the outcome (reward + next state) and updates its value estimate.<\/li>\n<li class=\"wp-block-list-item\">The critic\u2019s feedback helps the actor refine its policy so that future actions lead to higher rewards.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">Putting it all together for LLMs<\/h3>\n<p class=\"wp-block-paragraph\">The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (eg. 
human feedback) tells the model how good or bad its generated text is.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The policy is the model\u2019s strategy for picking the next token, while the value function estimates how beneficial the current text context is, in terms of eventually producing high-quality responses.<\/p>\n<h2 class=\"wp-block-heading\">DeepSeek-R1 (published 22 Jan 2025)<\/h2>\n<p class=\"wp-block-paragraph\">To highlight RL\u2019s importance, let\u2019s explore <a href=\"https:\/\/towardsdatascience.com\/tag\/deepseek\/\" title=\"Deepseek\">DeepSeek<\/a>-R1, a reasoning model achieving top-tier performance while remaining open-source. <a href=\"https:\/\/arxiv.org\/abs\/2501.12948\">The paper introduced two models: <strong>DeepSeek-R1-Zero and DeepSeek-R1.<\/strong><\/a><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).<\/li>\n<li class=\"wp-block-list-item\">DeepSeek-R1 builds on it, addressing the challenges R1-Zero encountered.<\/li>\n<\/ul>\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\">\n<div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\">\n<p lang=\"en\" dir=\"ltr\">Deepseek R1 is one of the most amazing and impressive breakthroughs I\u2019ve ever seen \u2014 and as open source, a profound gift to the world. 
<img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f916.png?ssl=1\" alt=\"\ud83e\udd16\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1fae1.png?ssl=1\" alt=\"\ud83e\udee1\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><\/p>\n<p>\u2014 Marc Andreessen <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f1fa-1f1f8.png?ssl=1\" alt=\"\ud83c\uddfa\ud83c\uddf8\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> (@pmarca) <a href=\"https:\/\/twitter.com\/pmarca\/status\/1882719769851474108?ref_src=twsrc%5Etfw\">January 24, 2025<\/a>\n<\/p><\/blockquote>\n<p><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script>\n<\/div>\n<\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s dive into some of these key points.\u00a0<\/p>\n<h3 class=\"wp-block-heading\">1. RL algo: Group Relative Policy Optimisation (GRPO)<\/h3>\n<p class=\"wp-block-paragraph\">One key game changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). 
<a href=\"https:\/\/arxiv.org\/abs\/2402.03300\">GRPO was introduced in the DeepSeekMath paper in Feb 2024.<\/a>\u00a0<\/p>\n<p class=\"wp-block-paragraph\"><strong><em>Why GRPO over PPO?<\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">PPO struggles with reasoning tasks due to:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Dependency on a critic model.<br \/>PPO needs a separate critic model, effectively doubling memory and compute.<br \/>Training the critic can be complex for nuanced or subjective tasks.<\/li>\n<li class=\"wp-block-list-item\">High computational cost as RL pipelines demand substantial resources to evaluate and optimise responses.\u00a0<\/li>\n<li class=\"wp-block-list-item\">Absolute reward evaluations<br \/>When you rely on an absolute reward\u200a\u2014\u200ameaning there\u2019s a single standard or metric to judge whether an answer is \u201cgood\u201d or \u201cbad\u201d\u200a\u2014\u200ait can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.\u00a0<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\"><strong><em>How GRPO addressed these challenges:<\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">GRPO eliminates the critic model by using<strong> relative evaluation<\/strong>\u200a\u2014\u200aresponses are compared within a group rather than judged by a fixed standard.<\/p>\n<p class=\"wp-block-paragraph\">Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. 
Over time, performance converges toward higher quality.<\/p>\n<h4 class=\"wp-block-heading\"><strong>How does GRPO fit into the whole training process?<\/strong><\/h4>\n<p class=\"wp-block-paragraph\">GRPO modifies how loss is calculated while keeping other training steps unchanged:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Gather data (queries + responses)<\/strong><strong><br \/><\/strong>\u2013 For LLMs, queries are like questions<br \/>\u2013 The old policy (older snapshot of the model) generates several candidate answers for each query<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Assign rewards\u200a<\/strong>\u2014\u200aeach response in the group is scored (the \u201creward\u201d).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Compute the GRPO loss<\/strong><strong><br \/><\/strong>Traditionally, you\u2019ll compute a loss\u200a\u2014\u200awhich shows the deviation between the model prediction and the true label.<br \/><strong>In GRPO, however, you measure:<\/strong><strong><br \/><\/strong>a) How likely is the new policy to produce past responses?<br \/>b) Are those responses relatively better or worse?<br \/>c) Apply clipping to prevent extreme updates.<br \/>This yields a scalar loss.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Back propagation + gradient descent<\/strong><strong><br \/><\/strong>\u2013 Back propagation calculates how each parameter contributed to loss<br \/>\u2013 Gradient descent updates those parameters to reduce the loss<br \/>\u2013 Over many iterations, this gradually shifts the new policy to prefer higher reward responses<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Update the old policy occasionally to match the new policy<\/strong>.<br \/>This refreshes the baseline for the next round of comparisons.<\/li>\n<\/ol>\n<h3 class=\"wp-block-heading\">2. 
Chain of thought (CoT)<\/h3>\n<p class=\"wp-block-paragraph\">Traditional LLM training follows pre-training \u2192 SFT \u2192 RL. However, DeepSeek-R1-Zero <strong>skipped SFT<\/strong>, allowing the model to directly explore CoT reasoning.<\/p>\n<p class=\"wp-block-paragraph\">Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI\u2019s o1 model also leverages this, as noted in its September 2024 report: <strong>o1\u2019s performance improves with more RL (train-time compute) and more reasoning time (test-time compute).<\/strong><\/p>\n<p class=\"wp-block-paragraph\">DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning.\u00a0<\/p>\n<p class=\"wp-block-paragraph\"><strong><em>A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and better responses.<\/em><\/strong><\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"efeffa\" data-has-transparency=\"false\" style=\"--dominant-color: #efeffa;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"646\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-14-1024x646.png?resize=1024%2C646&#038;ssl=1\" alt=\"\" class=\"wp-image-598506 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-14-1024x646.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-14-300x189.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-14-768x484.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-14-1536x969.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-14.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image taken from <a 
href=\"https:\/\/arxiv.org\/abs\/2501.12948\">DeepSeek-R1 paper<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.<\/p>\n<p class=\"wp-block-paragraph\">The model also had an \u201caha moment\u201d (below)\u200a\u2014\u200aa fascinating example of how RL can lead to unexpected and sophisticated outcomes.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"eeeded\" data-has-transparency=\"true\" style=\"--dominant-color: #eeeded;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"645\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-15-1024x645.png?resize=1024%2C645&#038;ssl=1\" alt=\"\" class=\"wp-image-598507 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-15-1024x645.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-15-300x189.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-15-768x484.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-15.png 1502w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image taken from <a href=\"https:\/\/arxiv.org\/abs\/2501.12948\">DeepSeek-R1 paper<\/a><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Note: Unlike DeepSeek-R1, OpenAI does not show full exact reasoning chains of thought in o1 as they are concerned about a distillation risk\u200a\u2014\u200awhere someone comes in and tries to imitate those reasoning traces and recover a lot of the reasoning performance by just imitating. 
Instead, o1 shows only summaries of these chains of thought.<\/p>\n<h2 class=\"wp-block-heading\">Reinforcement Learning from Human Feedback (RLHF)<\/h2>\n<p class=\"wp-block-paragraph\">For tasks with verifiable outputs (e.g., math problems, factual Q&amp;A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there\u2019s no single \u201ccorrect\u201d answer?\u00a0<\/p>\n<p class=\"wp-block-paragraph\">This is where human feedback comes in\u200a\u2014\u200abut na\u00efve RL approaches are unscalable.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f9f6f5\" data-has-transparency=\"true\" style=\"--dominant-color: #f9f6f5;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"556\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-16-1024x556.png?resize=1024%2C556&#038;ssl=1\" alt=\"\" class=\"wp-image-598508 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-16-1024x556.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-16-300x163.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-16-768x417.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-16-1536x834.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-16.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s look at the na\u00efve approach with some arbitrary numbers.<\/p>\n<figure class=\"wp-block-image alignwide size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f8f2f2\" data-has-transparency=\"false\" style=\"--dominant-color: #f8f2f2;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"179\" 
src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-17-1024x179.png?resize=1024%2C179&#038;ssl=1\" alt=\"\" class=\"wp-image-598510 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-17-1024x179.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-17-300x52.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-17-768x134.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-17-1536x268.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-17.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">That\u2019s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI \u201creward model\u201d to learn human preferences, dramatically reducing human effort.\u00a0<\/p>\n<p class=\"wp-block-paragraph\"><strong><em>Ranking responses is also easier and more intuitive than absolute scoring.<\/em><\/strong><\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f0f2f1\" data-has-transparency=\"true\" style=\"--dominant-color: #f0f2f1;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"595\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-18-1024x595.png?resize=1024%2C595&#038;ssl=1\" alt=\"\" class=\"wp-image-598511 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-18-1024x595.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-18-300x174.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-18-768x446.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-18-1536x893.png 1536w, 
https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/image-18.png 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Upsides of RLHF<\/h2>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.<\/li>\n<li class=\"wp-block-list-item\">Ranking outputs is much easier for human labellers than generating creative outputs themselves.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">Downsides of RLHF<\/h2>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The reward model is an approximation\u200a\u2014\u200ait may not perfectly reflect human preferences.<\/li>\n<li class=\"wp-block-list-item\">RL is good at gaming the reward model\u200a\u2014\u200aif run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><strong><em>Do note that <a href=\"https:\/\/towardsdatascience.com\/tag\/rlhf\/\" title=\"Rlhf\">RLHF<\/a> is not the same as traditional RL.<\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.<\/p>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">And that\u2019s a wrap! 
I hope you enjoyed Part 2 <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f642.png?ssl=1\" alt=\"\ud83d\ude42\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> If you haven\u2019t already read Part 1\u200a\u2014\u200a<a href=\"https:\/\/towardsdatascience.com\/how-llms-work-pre-training-to-post-training-neural-networks-hallucinations-and-inference\/\">do check it out here<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">Got questions or ideas for what I should cover next? Drop them in the comments\u200a\u2014\u200aI\u2019d love to hear your thoughts. See you in the next article!<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/how-llms-work-reinforcement-learning-rlhf-deepseek-r1-openai-o1-alphago\/\">How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Clara Chong<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/how-llms-work-reinforcement-learning-rlhf-deepseek-r1-openai-o1-alphago\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo Welcome to part 2 of my LLM deep dive. 
If you\u2019ve not read Part 1, I highly encourage you to check it out first.\u00a0 Previously, we covered the first two major stages of training an LLM: Pre-training\u200a\u2014\u200aLearning from massive datasets to form a base [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1592,311,71,70,1787,1879],"tags":[199,134,103],"class_list":["post-2121","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-deepseek","category-getting-started","category-large-language-models","category-machine-learning","category-reinforcemect-learning","category-rlhf","tag-learning","tag-llm","tag-model"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2121"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2121"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2121\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2121"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2121"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2121"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}