{"id":1689,"date":"2025-02-06T07:02:23","date_gmt":"2025-02-06T07:02:23","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/06\/training-large-language-models-from-trpo-to-grpo\/"},"modified":"2025-02-06T07:02:23","modified_gmt":"2025-02-06T07:02:23","slug":"training-large-language-models-from-trpo-to-grpo","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/06\/training-large-language-models-from-trpo-to-grpo\/","title":{"rendered":"Training Large Language Models: From TRPO to\u00a0GRPO"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/towardsdatascience.com\/tag\/deepseek\/\" title=\"Deepseek\">DeepSeek<\/a> has recently made <strong>quite a buzz<\/strong> in the AI community, thanks to its impressive performance at relatively low cost. I think this is a perfect opportunity to dive deeper into how Large Language Models (LLMs) are trained. In this article, we will focus on the Reinforcement Learning (RL) side of things: we will cover TRPO, PPO, and, more recently, GRPO (don\u2019t worry, I will explain all these terms soon!).<\/p>\n<p class=\"wp-block-paragraph\">I have aimed to keep this article relatively easy to read and accessible by minimizing the math, so you won\u2019t need a deep Reinforcement Learning background to follow along. 
However, I will assume that you have some familiarity with Machine Learning and Deep Learning, and a basic understanding of how LLMs work.<\/p>\n<p class=\"wp-block-paragraph\">I hope you enjoy the article!<\/p>\n<h3 class=\"wp-block-heading\">The 3 steps of LLM\u00a0training<\/h3>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/1%2AgDF9kq-aLcUvG0XFBU2DGA.png?ssl=1\" alt=\"\"><figcaption class=\"wp-element-caption\">The 3 steps of LLM training\u00a0[1]<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Before diving into RL specifics, let\u2019s briefly recap the three main stages of training a Large Language Model:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Pre-training<\/strong>: the model is trained on a massive dataset to predict the next token in a sequence based on the preceding tokens.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Supervised Fine-Tuning (SFT)<\/strong>: the model is then <strong>fine-tuned<\/strong> on more targeted data so that it follows specific instructions.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Reinforcement Learning<\/strong> (often called <em>RLHF<\/em>, for Reinforcement Learning from Human Feedback): this is the focus of this article. 
The main goal is to further align the model\u2019s responses with human preferences by letting the model learn directly from feedback.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">Reinforcement Learning\u00a0Basics<\/h3>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2An2QZe4jUFk_OccaY.png?ssl=1\" alt=\"\"><figcaption class=\"wp-element-caption\">A robot trying to exit a maze!\u00a0[2]<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Before diving deeper, let\u2019s briefly revisit the core ideas behind Reinforcement Learning.<\/p>\n<p class=\"wp-block-paragraph\">RL is quite straightforward to understand at a high level: an <strong>agent<\/strong> interacts with an <strong>environment<\/strong>. The agent resides in a specific <strong>state<\/strong> within the environment and can take <strong>actions<\/strong> to transition to other states. Each action yields a <strong>reward<\/strong> from the environment: this is how the environment provides feedback that guides the agent\u2019s future actions.<\/p>\n<p class=\"wp-block-paragraph\">Consider the following example: a <strong>robot<\/strong> (the agent) navigates (and tries to exit) a <strong>maze<\/strong> (the environment).<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The <strong>state<\/strong> is the current situation of the environment (the robot\u2019s position in the maze).<\/li>\n<li class=\"wp-block-list-item\">The robot can take different <strong>actions<\/strong>: for example, it can move forward, turn left, or turn right.<\/li>\n<li class=\"wp-block-list-item\">Successfully navigating towards the exit yields a <strong>positive reward<\/strong>, while hitting a wall or getting stuck in the maze results in <strong>negative rewards<\/strong>.\n<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Easy! 
Now, let\u2019s draw an analogy with how RL is used in the context of LLMs.<\/p>\n<h3 class=\"wp-block-heading\">RL in the context of\u00a0LLMs<\/h3>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/1%2A8hJsPwyITye84NWqHAe2MA.png?ssl=1\" alt=\"\"><figcaption class=\"wp-element-caption\">Simplified RLHF Process\u00a0[3]<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">When used during LLM training, RL is defined by the following components:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Agent<\/strong>: the LLM itself.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Environment<\/strong>: everything external to the LLM, including user prompts, feedback systems, and other contextual information. This is basically the framework the LLM interacts with during training.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Actions<\/strong>: the model\u2019s responses to a query. More specifically, these are the <strong>tokens<\/strong> that the LLM decides to generate in response to a query.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>State<\/strong>: the current query being answered, along with the tokens the LLM has generated so far (i.e., the partial response).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Rewards<\/strong>: this is a bit trickier: unlike the maze example above, there is <strong>usually<\/strong> no clear-cut binary (right or wrong) reward. In the context of LLMs, rewards usually come from a separate <em>reward model<\/em>, which outputs a score for each (query, response) pair. This model is trained on human-annotated data (hence \u201cRLHF\u201d) where annotators rank different responses. 
The goal is for higher-quality responses to receive higher rewards.<\/li>\n<\/ul>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Note: in some cases, rewards can actually be simpler. For example, in DeepSeekMath, <strong>rule-based approaches<\/strong> can be used because math responses tend to be more deterministic (the answer is either correct or wrong).<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\"><strong>Policy<\/strong> is the final concept we need for now. In RL terms, a policy is simply the strategy for deciding which action to take. In the case of an LLM, the policy outputs a probability distribution over possible tokens at each step: in short, this is what the model uses to sample the next token to generate. Concretely, the policy is determined by the model\u2019s parameters (weights). During RL training, we adjust these parameters so the LLM becomes more likely to produce \u201cbetter\u201d tokens, that is, tokens that lead to higher reward scores.<\/p>\n<p class=\"wp-block-paragraph\">We often write the policy as:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2Aoj50mh1PYdcv5CBf.png?ssl=1\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">where <em>a<\/em> is the action (a token to generate), <em>s<\/em> is the state (the query and the tokens generated so far), and <em>\u03b8<\/em> denotes the model\u2019s parameters.<\/p>\n<p class=\"wp-block-paragraph\">This idea of finding the best policy is the whole point of RL! 
Since we don\u2019t have labeled data (like we do in supervised learning), <strong>we use rewards to adjust our policy to take better actions.<\/strong> <em>(In LLM terms: we adjust the parameters of our LLM to generate better tokens.)<\/em><\/p>\n<h3 class=\"wp-block-heading\">TRPO (Trust Region Policy Optimization)<\/h3>\n<h4 class=\"wp-block-heading\">An analogy with supervised learning<\/h4>\n<p class=\"wp-block-paragraph\">Let\u2019s take a quick step back to how supervised learning typically works: you have labeled data and use a loss function (like cross-entropy) to measure how close your model\u2019s predictions are to the true labels.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2AYaLNgOyGdTJdnmzK.png?ssl=1\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">We can then use algorithms like backpropagation and gradient descent to minimize our loss function and update the weights <em>\u03b8<\/em> of our model.<\/p>\n<p class=\"wp-block-paragraph\">Recall that our policy also outputs probabilities! In that sense, it is analogous to the model\u2019s predictions in supervised learning\u2026 We are tempted to write <em>something like<\/em>:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2AmCxBSj_yWyLAHEEV.png?ssl=1\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">where <em>s<\/em> is the current state and <em>a<\/em> is a possible action.<\/p>\n<p class=\"wp-block-paragraph\"><em>A(s, a)<\/em> is called the <strong>advantage function<\/strong> and measures how good the chosen action is in the current state compared to a baseline. This is very much like the notion of <strong>labels<\/strong> in supervised learning, but derived from <strong>rewards<\/strong> instead of explicit labeling. 
<em>To simplify<\/em>, we can write the advantage as:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2ABkfvw7w81TyeuKEs.png?ssl=1\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">In practice, the baseline is calculated using a <strong>value function<\/strong>. This is a common term in RL that I will explain later. What you need to know for now is that it measures the expected reward we would receive if we continued following the current policy from the state <em>s<\/em>.<\/p>\n<h4 class=\"wp-block-heading\">What is\u00a0TRPO?<\/h4>\n<p class=\"wp-block-paragraph\">TRPO (Trust Region Policy Optimization) builds on this idea of using the advantage function but adds a critical ingredient for <strong>stability<\/strong>: it <strong>constrains<\/strong> how far the new policy can deviate from the old policy at each update step (similar in spirit to capping the step size in gradient descent).<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">It introduces a KL divergence term (think of it as a measure of how much two probability distributions differ) between the current and the old policy:<\/li>\n<\/ul>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2A2pSGRrxLNHmL9Yns.png?ssl=1\" alt=\"\"><\/figure>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">It also divides the new policy by the old policy. 
This ratio, multiplied by the advantage function, gives us a sense of how beneficial each update is <strong>relative to the old policy<\/strong>.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Putting it all together, TRPO tries to <strong>maximize<\/strong> a surrogate objective (which involves the advantage and the policy ratio) subject to a <strong>KL divergence constraint<\/strong>.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2AO6IZum3z9DVSJTIS.png?ssl=1\" alt=\"\"><\/figure>\n<h3 class=\"wp-block-heading\">PPO (Proximal Policy Optimization)<\/h3>\n<p class=\"wp-block-paragraph\">While TRPO was a significant advancement, it is no longer widely used in practice, especially for training LLMs, because its constrained optimization step is computationally expensive.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Instead, PPO is now the preferred approach in most LLM training pipelines, including those behind ChatGPT, Gemini, and more.<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">It is actually quite similar to TRPO, but instead of enforcing <strong>a hard constraint on the KL divergence<\/strong>, PPO introduces a \u201c<strong>clipped<\/strong> surrogate objective\u201d that implicitly restricts policy updates and greatly simplifies the optimization process.<\/p>\n<p class=\"wp-block-paragraph\">Here is a breakdown of the PPO objective function we maximize to tweak our model\u2019s parameters.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2400\/1%2AoVi9NcQvD15nTp4G7rJueA.png?ssl=1\" alt=\"\"><figcaption class=\"wp-element-caption\">Image by the\u00a0Author<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">GRPO (Group Relative Policy Optimization)<\/h3>\n<h4 class=\"wp-block-heading\">How is the value 
function usually obtained?<\/h4>\n<p class=\"wp-block-paragraph\">Let\u2019s first talk more about the <strong>advantage<\/strong> and the <strong>value functions<\/strong> I introduced earlier.<\/p>\n<p class=\"wp-block-paragraph\">In typical setups (like PPO), a <strong>value model<\/strong> is trained alongside the policy. Its goal is to predict the value of each action we take (each token generated by the model), using the rewards we obtain (remember that the value should represent the expected cumulative reward).<\/p>\n<p class=\"wp-block-paragraph\">Here is how it works in practice. Take the query \u201cWhat is 2+2?\u201d as an example. Our model outputs \u201c2+2 is 4\u201d and receives a reward of 0.8 for that response. We then go backward and attribute <strong>discounted rewards<\/strong> to each prefix:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\u201c2+2 is 4\u201d gets a value of 0.8<\/li>\n<li class=\"wp-block-list-item\">\u201c2+2 is\u201d (1 token backward) gets a value of 0.8<em>\u03b3<\/em>\n<\/li>\n<li class=\"wp-block-list-item\">\u201c2+2\u201d (2 tokens backward) gets a value of 0.8<em>\u03b3\u00b2<\/em>\n<\/li>\n<li class=\"wp-block-list-item\">etc.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">where <em>\u03b3<\/em> is the discount factor (0.9 for example). We then use these prefixes and associated values to train the value model.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Important note: the value model and the reward model are two different things. The reward model is trained before the RL process and uses pairs of (query, response) and human ranking. 
The value model is trained concurrently with the policy and aims to predict the expected future reward at each step of the generation process.<\/p>\n<\/blockquote>\n<h4 class=\"wp-block-heading\">What\u2019s new in\u00a0GRPO<\/h4>\n<p class=\"wp-block-paragraph\">Even if, in practice, the reward model is often derived from the policy (training only the \u201chead\u201d), we still end up maintaining many models and handling multiple training procedures (policy, reward, value model). <strong>GRPO<\/strong> streamlines this by introducing a more efficient method.<\/p>\n<p class=\"wp-block-paragraph\">Remember what I said earlier?<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2ABkfvw7w81TyeuKEs.png?ssl=1\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">In PPO, we decided to use our value function as the baseline. GRPO chooses something else: <strong>for each query<\/strong>, it generates a group of responses (a group of size <em>G<\/em>) and uses their rewards to calculate each response\u2019s advantage as a <strong>z-score<\/strong>:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1600\/0%2A3cqWAmG6B5tNjU8A.png?ssl=1\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">where <em>r\u1d62<\/em> is the reward of the <em>i<\/em>-th response and <em>\u03bc<\/em> and <em>\u03c3<\/em> are the mean and standard deviation of the rewards in that group.<\/p>\n<p class=\"wp-block-paragraph\">This naturally eliminates the need for a separate value model. This idea makes a lot of sense when you think about it! <strong>It aligns with the value function we introduced before<\/strong>: the group mean also measures, in a sense, an \u201cexpected\u201d reward we can obtain. 
Also, this new method is well adapted to our problem because LLMs can easily generate multiple <strong>non-deterministic outputs<\/strong> by sampling with a non-zero <em>temperature<\/em> (a parameter that controls the randomness of token generation; higher values give more diverse outputs).<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">This is the main idea behind GRPO: getting rid of the value model.<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">Finally, GRPO adds a <strong>KL divergence<\/strong> term (to be exact, GRPO uses a simple approximation of the KL divergence to improve the algorithm further) directly into its objective, comparing the current policy to a <strong>reference policy<\/strong> (often the post-SFT model).<\/p>\n<p class=\"wp-block-paragraph\">See the final formulation below:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/2400\/1%2Auw87yMbJbxD2RsV2WWQPrg.png?ssl=1\" alt=\"\"><figcaption class=\"wp-element-caption\">Image by the\u00a0Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\"><strong>And\u2026 that\u2019s mostly it for GRPO!<\/strong> I hope this gives you a clear overview of the process: it still relies on the same foundational ideas as TRPO and PPO but introduces additional improvements to make training more efficient, faster, and cheaper, which are key factors behind <strong>DeepSeek\u2019s success<\/strong>.<\/p>\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n<p class=\"wp-block-paragraph\">Reinforcement Learning has become a cornerstone for training today\u2019s Large Language Models, particularly through PPO, and more recently GRPO. 
Each method rests on the same RL fundamentals\u200a\u2014\u200astates, actions, rewards, and policies\u200a\u2014\u200abut adds its own twist to balance stability, efficiency, and human alignment:<\/p>\n<p class=\"wp-block-paragraph\">\u2022 <strong>TRPO<\/strong> introduced strict policy constraints via KL divergence<\/p>\n<p class=\"wp-block-paragraph\">\u2022 <strong>PPO<\/strong> eased those constraints with a clipped objective<\/p>\n<p class=\"wp-block-paragraph\">\u2022 <strong>GRPO<\/strong> took an extra step by removing the value model requirement and using group-based reward normalization. Of course, DeepSeek also benefits from other innovations, like high-quality data and other training strategies, but that is for another time!<\/p>\n<p class=\"wp-block-paragraph\">I hope this article gave you a clearer picture of how these methods connect and evolve. I believe that Reinforcement Learning will become <strong>the main focus in training LLMs<\/strong> to improve their performance, surpassing pre-training and SFT in driving future innovations.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">If you\u2019re interested in diving deeper, feel free to check out the references below or explore my previous posts.<\/p>\n<p class=\"wp-block-paragraph\">Thanks for reading, and feel free to leave a clap and a comment!<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<p class=\"wp-block-paragraph\">Want to learn more about Transformers or dive into the math behind the Curse of Dimensionality? 
Check out my previous articles:<\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/towardsdatascience.com\/transformers-how-do-they-transform-your-data-72d69e383e0d\"><strong>Transformers: How Do They Transform Your Data?<\/strong><br \/><em>Diving into the Transformers architecture and what makes them unbeatable at language tasks<\/em><\/a><\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/towardsdatascience.com\/the-math-behind-the-curse-of-dimensionality-cf8780307d74\"><strong>The Math Behind \u201cThe Curse of Dimensionality\u201d<\/strong><br \/><em>Dive into the \u201cCurse of Dimensionality\u201d concept and understand the math behind all the surprising phenomena that arise\u2026<\/em><\/a><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Feel free to connect on <a href=\"https:\/\/www.linkedin.com\/in\/maxime-wolf\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">Follow me on <a href=\"https:\/\/github.com\/maxime7770\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a> for more content<\/li>\n<li class=\"wp-block-list-item\">Visit my website: <a href=\"http:\/\/maximewolf.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">maximewolf.com<\/a>\n<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<p class=\"wp-block-paragraph\">References:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">[1] \u201cFoundations of Large Language Models\u201d, 2025. 
<a href=\"https:\/\/arxiv.org\/pdf\/2501.09223\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/2501.09223<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[2] \u201cReinforcement Learning.\u201d Enaris. Available at: <a href=\"https:\/\/enaris.org\/material\/en\/Reinforcement%20Learning\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/enaris.org\/material\/en\/Reinforcement%20Learning\/index.html<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[3] Y. Gokhale. \u201cIntroduction to LLMs and the Generative AI Part 5: RLHF,\u201d <em>Medium<\/em>, 2023. Available at: <a href=\"https:\/\/medium.com\/@yash9439\/introduction-to-llms-and-the-generative-ai-part-5-rlhf-64e83fbcd795\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/medium.com\/@yash9439\/introduction-to-llms-and-the-generative-ai-part-5-rlhf-64e83fbcd795<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[4] L. Weng. \u201cAn Overview of Reinforcement Learning,\u201d 2018. Available at: <a href=\"https:\/\/lilianweng.github.io\/posts\/2018-02-19-rl-overview\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/lilianweng.github.io\/posts\/2018-02-19-rl-overview\/<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[5] \u201cDeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning\u201d, 2025. <a href=\"https:\/\/arxiv.org\/pdf\/2501.12948\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/2501.12948<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[6] \u201cDeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models\u201d, 2024. <a href=\"https:\/\/arxiv.org\/pdf\/2402.03300\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/2402.03300<\/a>\n<\/li>\n<li class=\"wp-block-list-item\">[7] \u201cTrust Region Policy Optimization\u201d, 2017. 
<a href=\"https:\/\/arxiv.org\/pdf\/1502.05477\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/pdf\/1502.05477<\/a>\n<\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/training-large-language-models-from-trpo-to-grpo\/\">Training Large Language Models: From TRPO to\u00a0GRPO<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Maxime Wolf<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/training-large-language-models-from-trpo-to-grpo\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Training Large Language Models: From TRPO to\u00a0GRPO Deepseek has recently made quite a buzz in the AI community, thanks to its impressive performance at relatively low costs. I think this is a perfect opportunity to dive deeper into how Large Language Models (LLMs) are trained. 
In this article, we will focus on the Reinforcement Learning [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1592,71,87,70,1498,1650],"tags":[199,1217,319],"class_list":["post-1689","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-deepseek","category-large-language-models","category-llm","category-machine-learning","category-model-training","category-reinfocement-learning","tag-learning","tag-reinforcement","tag-training"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1689"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1689"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1689\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1689"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1689"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}