{"id":2188,"date":"2025-03-04T07:02:32","date_gmt":"2025-03-04T07:02:32","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/03\/04\/how-to-train-llms-to-think-o1-deepseek-r1\/"},"modified":"2025-03-04T07:02:32","modified_gmt":"2025-03-04T07:02:32","slug":"how-to-train-llms-to-think-o1-deepseek-r1","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/03\/04\/how-to-train-llms-to-think-o1-deepseek-r1\/","title":{"rendered":"How to Train LLMs to \u201cThink\u201d (o1 &amp; DeepSeek-R1)"},"content":{"rendered":"\n<div>How to Train LLMs to \u201cThink\u201d (o1 &#038; DeepSeek-R1)<\/div>\n<p> \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<p class=\"wp-block-paragraph\" id=\"180c\">In September 2024, OpenAI released its o1 model, trained on large-scale reinforcement learning, giving it \u201cadvanced reasoning\u201d capabilities. Unfortunately, the details of how they pulled this off were never shared publicly. Today, however, DeepSeek (an AI research lab) has replicated this reasoning behavior and published the full technical details of their approach. In this article, I will discuss the key ideas behind this innovation and describe how they work under the hood.<\/p>\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\">\n<div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title='How to Train LLMs to \"Think\" (o1 &amp; DeepSeek-R1)' width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/RveLjcNl0ds?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div>\n<\/figure>\n<p class=\"wp-block-paragraph\" id=\"49b1\">OpenAI\u2019s o1 model marked a new paradigm for training large language models (LLMs). 
It introduced so-called\u00a0<strong>\u201cthinking\u201d tokens<\/strong>, which enable a sort of\u00a0<strong>scratch pad that the model can use to think<\/strong>\u00a0<strong>through problems<\/strong>\u00a0and user queries.<\/p>\n<p class=\"wp-block-paragraph\" id=\"e45d\">The major insight from o1 was that performance improves with increased\u00a0<strong>test-time compute<\/strong>. This is just a fancy way of saying that\u00a0<strong>the more tokens a model generates, the better its response<\/strong>. The figure below, reproduced from OpenAI\u2019s blog, captures this point nicely.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"040404\" data-has-transparency=\"false\" style=\"--dominant-color: #040404;\" fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"479\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/AIME-accuracy-1024x479.png?resize=1024%2C479&#038;ssl=1\" alt=\"Graphs displaying AIME accuracy scaling with train-time and test-time compute.\" class=\"wp-image-598647 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/AIME-accuracy-1024x479.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/AIME-accuracy-300x140.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/AIME-accuracy-768x359.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/AIME-accuracy.png 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">AIME accuracy scaling with train-time and test-time compute, respectively. Plots reillustrated from [1].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"87a4\">In the plots above, the y-axes are model performance on AIME (math problems), while the x-axes are train-time compute (left) and test-time compute (right). The left plot depicts the well-known neural scaling laws that kicked off the LLM rush of 2023. 
In other words, the\u00a0<strong>longer a model is\u00a0<em>trained<\/em>\u00a0(i.e. train-time compute)<\/strong>, the\u00a0<strong>better its performance<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"36d7\">On the right, however, we see a new type of scaling law. Here,\u00a0<strong>the more\u00a0<em>tokens<\/em>\u00a0a model generates<\/strong>\u00a0<strong>(i.e. test-time compute)<\/strong>,\u00a0<strong>the better its performance<\/strong>.<\/p>\n<h2 class=\"wp-block-heading\" id=\"623d\">\u201cThinking\u201d tokens<\/h2>\n<p class=\"wp-block-paragraph\" id=\"8d23\">A key feature of o1 is its so-called\u00a0<strong>\u201cthinking\u201d tokens<\/strong>.\u00a0These are\u00a0<strong>special tokens introduced during post-training, which delimit the model\u2019s chain of thought (CoT)<\/strong>\u00a0reasoning (i.e., thinking through the problem). These special tokens are important for two reasons.<\/p>\n<p class=\"wp-block-paragraph\" id=\"5770\"><strong>One<\/strong>, they clearly demarcate where the model\u2019s \u201cthinking\u201d starts and stops, so it can be easily parsed when spinning up a UI.\u00a0<strong>And two<\/strong>, they produce a human-interpretable readout of how the model \u201cthinks\u201d through the problem.<\/p>\n<p class=\"wp-block-paragraph\" id=\"09b0\">Although OpenAI disclosed that they used reinforcement learning to produce this ability, the exact details of\u00a0<em>how<\/em>\u00a0they did it were not shared. 
Today, however, we have a pretty good idea thanks to a recent publication from DeepSeek.<\/p>\n<h2 class=\"wp-block-heading\" id=\"03e5\">DeepSeek\u2019s paper<\/h2>\n<p class=\"wp-block-paragraph\" id=\"3dc7\">In January 2025, DeepSeek published \u201c<em>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning<\/em>\u201d [2].\u00a0While this paper caused its fair share of pandemonium, its central contribution was\u00a0<strong>unveiling the secrets behind o1<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"f372\">It introduces two models:\u00a0<strong>DeepSeek-R1-Zero<\/strong>\u00a0and\u00a0<strong>DeepSeek-R1<\/strong>. The former was trained exclusively with reinforcement learning (RL), and the latter was trained with a mixture of supervised fine-tuning (SFT) and RL.<\/p>\n<p class=\"wp-block-paragraph\" id=\"801f\">Although the headlines (and title of the paper) were about DeepSeek-R1, R1-Zero is important because, one, it generated training data for R1, and two, it demonstrates striking emergent reasoning\u00a0<strong>abilities that were not taught to the model<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"70d3\">In other words,\u00a0<strong>R1-Zero\u00a0<em>discovers<\/em>\u00a0CoT and test-time compute scaling through RL alone!<\/strong>\u00a0Let\u2019s discuss how it works.<\/p>\n<h2 class=\"wp-block-heading\" id=\"58bf\">DeepSeek-R1-Zero (RL only)<\/h2>\n<p class=\"wp-block-paragraph\" id=\"21c1\"><strong>Reinforcement learning (RL)<\/strong>\u00a0is a <a href=\"https:\/\/towardsdatascience.com\/tag\/machine-learning\/\" title=\"Machine Learning\">Machine Learning<\/a> approach in which, rather than training models on explicit examples,\u00a0<strong>models learn through trial and error<\/strong>\u00a0[3]. 
It works by passing the model a reward signal that has no explicit functional relationship with the model\u2019s parameters.<\/p>\n<p class=\"wp-block-paragraph\" id=\"96ab\">This is similar to how we often learn in the real world. For example, if I apply for a job and don\u2019t get a response, I have to figure out what I did wrong and how to improve. This is in contrast to supervised learning, which, in this analogy, would be like the recruiter giving me specific feedback on what I did wrong and how to improve.<\/p>\n<p class=\"wp-block-paragraph\" id=\"3829\">While training R1-Zero with RL involves many technical details, I want to highlight 3 key ones: the\u00a0<strong>prompt template<\/strong>,\u00a0<strong>reward signal<\/strong>, and\u00a0<strong>GRPO<\/strong>\u00a0(Group Relative Policy Optimization).<\/p>\n<h3 class=\"wp-block-heading\" id=\"2c27\">1) Prompt template<\/h3>\n<p class=\"wp-block-paragraph\" id=\"fdec\">The\u00a0<strong>template<\/strong>\u00a0used for training is given below, where <code>{prompt}<\/code>\u00a0is replaced with a question from a dataset of (presumably) complex math, coding, and logic problems. Notice the inclusion of <code>&lt;think&gt;<\/code> and <code>&lt;answer&gt;<\/code> tags via simple prompting.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">A conversation between User and Assistant. The user asks a question, and the \nAssistant solves it. The assistant first thinks about the reasoning process in \nthe mind and then provides the user with the answer. The reasoning process and \nanswer are enclosed within &lt;think&gt; &lt;\/think&gt; and &lt;answer&gt; &lt;\/answer&gt; tags, \nrespectively, i.e., &lt;think&gt; reasoning process here &lt;\/think&gt;\n&lt;answer&gt; answer here &lt;\/answer&gt;. User: {prompt}. Assistant:<\/code><\/pre>\n<p class=\"wp-block-paragraph\" id=\"15ab\">Something that stands out here is the minimal and relaxed prompting strategy. 
This was an intentional choice by DeepSeek to\u00a0<strong>avoid biasing model responses<\/strong>\u00a0and to\u00a0<strong>observe its natural evolution during RL<\/strong>.<\/p>\n<h3 class=\"wp-block-heading\" id=\"0494\">2) Reward signal<\/h3>\n<p class=\"wp-block-paragraph\" id=\"d21c\">The RL\u00a0<strong>reward<\/strong>\u00a0has\u00a0<strong>two components<\/strong>:\u00a0<strong>accuracy and format rewards<\/strong>. Since the training dataset consists of questions with clear right answers, a simple rule-based strategy is used to evaluate response accuracy. Similarly, a rule-based formatting reward is used to ensure reasoning tokens are generated in between the thinking tags.<\/p>\n<p class=\"wp-block-paragraph\" id=\"6770\">It\u2019s noted by the authors that a neural reward model isn\u2019t used (i.e. rewards are not computed by a neural net), because these may be prone to\u00a0<strong>reward hacking<\/strong>. In other words, the\u00a0<strong>LLM learns how to\u00a0<em>trick<\/em>\u00a0the reward model into maximizing rewards<\/strong>\u00a0while decreasing downstream performance.<\/p>\n<p class=\"wp-block-paragraph\" id=\"e77d\">This is just like how humans find ways to exploit any incentive structure to maximize their personal gains while forsaking the original intent of the incentives. This highlights the difficulty of producing good rewards (whether for humans or computers).<\/p>\n<h3 class=\"wp-block-heading\" id=\"9d9e\">3) GRPO (Group Relative Policy Optimization)<\/h3>\n<p class=\"wp-block-paragraph\" id=\"5b2c\">The final detail is how rewards are translated into model parameter updates. This section is quite technical, so the enlightened reader can feel free to skip ahead.<\/p>\n<p class=\"wp-block-paragraph\" id=\"4205\"><strong>GRPO<\/strong>\u00a0is an\u00a0<strong>RL approach that combines a collection of responses to update model parameters<\/strong>. 
To encourage stable training, the authors also incorporate clipping and KL-divergence regularization terms into the loss function. Clipping ensures optimization steps are not too big, and regularization ensures the model predictions do not change too abruptly.<\/p>\n<p class=\"wp-block-paragraph\" id=\"112e\">Here is the complete loss function with some (hopefully) helpful annotations.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"e9f0e3\" data-has-transparency=\"true\" style=\"--dominant-color: #e9f0e3;\" decoding=\"async\" width=\"1024\" height=\"480\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Annotated-GRPO-loss-function-1024x480.png?resize=1024%2C480&#038;ssl=1\" alt=\"Annotated GRPO loss function\" class=\"wp-image-598648 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Annotated-GRPO-loss-function-1024x480.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Annotated-GRPO-loss-function-300x141.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Annotated-GRPO-loss-function-768x360.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Annotated-GRPO-loss-function.png 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Annotated GRPO loss function. Image by author.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\" id=\"92a7\">Results (emergent abilities)<\/h3>\n<p class=\"wp-block-paragraph\" id=\"c484\">The most striking result of R1-Zero is that, despite its minimal guidance, it develops effective reasoning strategies that we might recognize.<\/p>\n<p class=\"wp-block-paragraph\" id=\"fd58\">For example, it learns\u00a0<strong>implicitly through the RL to improve responses through test-time compute<\/strong>\u00a0(recall the earlier insight from o1). 
This is depicted via the plot below from the R1 paper [2].<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f1f1f9\" data-has-transparency=\"true\" style=\"--dominant-color: #f1f1f9;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"561\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Average-response-length-increases-1024x561.png?resize=1024%2C561&#038;ssl=1\" alt=\"Graph demonstrating that average DeepSeek-R1-Zero response length increases with additional training\" class=\"wp-image-598649 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Average-response-length-increases-1024x561.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Average-response-length-increases-300x164.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Average-response-length-increases-768x421.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/03\/Average-response-length-increases.png 1291w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Average response length increases with additional training (and improved performance). Screenshot from DeepSeek R1 paper [2].<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"7216\">Another notable ability is reflecting on past responses with an uncanny human-like internal monologue. Here is an example from R1-Zero. Notice the\u00a0<strong>long CoT and additional verification step<\/strong>\u00a0before generating the final answer (I omitted some parts because the response was very long).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">User prompt: If it takes me 1.5 hours to write a codebase and my intern 8 \nhours. 
How long will it take us both to write 3 codebases?\n\n&lt;think&gt; To solve for how long it will take both you and your intern to write \nthree codebases together, we need to first find out how many codebases each of \nyou can write per hour individually.\n\n(long CoT) ...\n\nSo, it will take approximately 3.7895 hours for you and your intern to write \nthree codebases together.\n\nTo make sure everything makes sense, let us round the answer to a reasonable \nnumber of decimal places (e.g., round to two decimal places):\n\nt \u2248 3.79 hours\n\nHowever, if you need a more \"human-friendly\" answer, you could convert the \ndecimal part into minutes. Since 0.7895 of an hour is:\n\n0.7895 \u00d7 60 minutes \u2248 47.37 minutes\n\nSo, 3.7895 hours is approximately 3 hours and 47.37 minutes (or roughly 3 \nhours and 47 minutes if rounding to the nearest minute).\n\n## Verification (optional step for verification)\nTo verify, let us check how many codebases you can individually write in t = \n72\/19 hours:\n\nYour work in 72\/19 hours\n= (2\/3 codebases per hour) \u00d7 (72\/19 hours)\n= (2 \u00d7 72) \/ (3 \u00d7 19)\n= 144 \/ 57\n= 48\/19 codebases.\n\n(verification process) ...\n\nHowever, a more \"exact\" answer should be kept in fractional form unless \nspecified otherwise. Thus, the answer is: 72\/19hours which is approximately\n$3.7895$ hours or about $3.79$ hours when rounded to two decimal places).\n&lt;\/think&gt;\n\n72\/19 hours (or approximately 3.79 hours).<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"27c4\">Problems with R1-Zero<\/h3>\n<p class=\"wp-block-paragraph\" id=\"43d5\">Although the thinking tokens from R1-Zero give a human-readable window into the model\u2019s \u201cthought process,\u201d the authors report some issues. Namely, the learned CoT sometimes suffers from\u00a0<strong>readability issues and language mixing<\/strong>. 
This suggests (perhaps) that its reasoning begins to veer away from something easily interpretable by humans.<\/p>\n<h2 class=\"wp-block-heading\" id=\"decf\">DeepSeek-R1 (SFT + RL)<\/h2>\n<p class=\"wp-block-paragraph\" id=\"8ac6\">To mitigate R1-Zero\u2019s interpretability issues, the authors explore a multi-step training strategy that\u00a0<strong>utilizes both supervised fine-tuning (SFT) and RL<\/strong>. This strategy results in\u00a0<strong>DeepSeek-R1<\/strong>, a better-performing model that is getting more attention today. The entire training process can be broken down into 4 steps.<\/p>\n<h3 class=\"wp-block-heading\" id=\"81ce\">Step 1: SFT with reasoning data<\/h3>\n<p class=\"wp-block-paragraph\" id=\"6727\">To put the model on the right track for learning how to reason, the authors start with SFT. This\u00a0<strong>leverages thousands of long CoT examples<\/strong>\u00a0from various sources, including few-shot prompting (i.e., showing examples of how to think through problems), directly prompting the model to use reflection and verification, and refining synthetic data from R1-Zero [2].<\/p>\n<p class=\"wp-block-paragraph\" id=\"d096\">The\u00a0<strong>two key advantages<\/strong>\u00a0of this are,\u00a0<strong>one<\/strong>, the desired response format can be explicitly shown to the model, and\u00a0<strong>two<\/strong>, seeing curated reasoning examples unlocks better performance for the final model.<\/p>\n<h3 class=\"wp-block-heading\" id=\"99ad\">Step 2: R1-Zero style RL (+ language consistency reward)<\/h3>\n<p class=\"wp-block-paragraph\" id=\"2abb\">Next, an RL training step is applied to the model after SFT. This is done in the\u00a0<strong>same way as for R1-Zero, with an added component<\/strong>\u00a0to the reward signal that incentivizes language consistency. 
This was added to the reward because R1-Zero tended to mix languages, making it difficult to read its generations.<\/p>\n<h3 class=\"wp-block-heading\" id=\"4d66\">Step 3: SFT with mixed data<\/h3>\n<p class=\"wp-block-paragraph\" id=\"8f98\">At this point, the model likely performs on par with (or better than) R1-Zero on reasoning tasks. However, this intermediate model wouldn\u2019t be very practical because it tries to reason about any input it receives (e.g., \u201chi there\u201d), which is unnecessary for factual Q&amp;A, translation, and creative writing. That\u2019s why another SFT round is performed with both\u00a0<strong>reasoning (600k examples)<\/strong>\u00a0and\u00a0<strong>non-reasoning (200k examples)<\/strong>\u00a0data.<\/p>\n<p class=\"wp-block-paragraph\" id=\"fca1\">The\u00a0<strong>reasoning data<\/strong>\u00a0here is generated by the model resulting from Step 2. Additionally, it includes examples for which an LLM judge compares model predictions to ground-truth answers.<\/p>\n<p class=\"wp-block-paragraph\" id=\"b6df\">The\u00a0<strong>non-reasoning data<\/strong>\u00a0comes from two places: first, the SFT dataset used to train DeepSeek-V3 (the base model), and second, synthetic data generated by DeepSeek-V3. Note that examples are included that do not use CoT so that the model doesn\u2019t use thinking tokens for every response.<\/p>\n<h3 class=\"wp-block-heading\" id=\"7488\">Step 4: RL + RLHF<\/h3>\n<p class=\"wp-block-paragraph\" id=\"33c4\">Finally, another RL round is done, which includes (again) R1-Zero style reasoning training and RL from human feedback (RLHF). 
This latter component helps\u00a0<strong>improve the model\u2019s helpfulness and harmlessness<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"54dc\">The result of this entire pipeline is DeepSeek-R1, which excels at reasoning tasks and is an AI assistant you can chat with normally.<\/p>\n<h2 class=\"wp-block-heading\" id=\"e16e\">Accessing R1-Zero and R1<\/h2>\n<p class=\"wp-block-paragraph\" id=\"7fb4\">Another key contribution from DeepSeek is that the weights of the two models described above (and many other distilled versions of R1) were made publicly available. This means there are many ways to access these models, whether using an\u00a0<strong>inference provider or running them locally<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"9853\">Here are a few places that I\u2019ve seen these models.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/www.deepseek.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSeek<\/a>\u00a0(DeepSeek-V3 and DeepSeek-R1)<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/www.together.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">Together<\/a>\u00a0(DeepSeek-V3, DeepSeek-R1, and distillations)<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/hyperbolic.xyz\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hyperbolic<\/a>\u00a0(DeepSeek-V3, DeepSeek-R1-Zero, and DeepSeek-R1)<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/ollama.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Ollama<\/a>\u00a0(local) (DeepSeek-V3, DeepSeek-R1, and distillations)<\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/huggingface.co\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hugging Face<\/a>\u00a0(local) (all of the above)<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"83bd\">Conclusions<\/h2>\n<p class=\"wp-block-paragraph\" id=\"fef1\">The release of o1 introduced a new dimension by which LLMs can be 
improved:\u00a0<strong>test-time compute<\/strong>. Although OpenAI did not release its secret sauce for doing this, 5 months later, DeepSeek was able to replicate this reasoning behavior and publish the technical details of its approach.<\/p>\n<p class=\"wp-block-paragraph\" id=\"b54b\">While current reasoning models have limitations, this is a promising research direction because it has demonstrated that reinforcement learning (without humans) can\u00a0<strong>produce models that learn independently<\/strong>. This (potentially) breaks the implicit limitations of current models, which can only\u00a0<em>recall\u00a0<\/em>and<em>\u00a0remix<\/em>\u00a0information previously seen on the internet (i.e., existing human knowledge).<\/p>\n<p class=\"wp-block-paragraph\" id=\"5119\">The promise of this new RL approach is that models can surpass human understanding (on their own), leading to new scientific and technological breakthroughs that might take us decades to discover (on our own).<\/p>\n<p class=\"wp-block-paragraph\" id=\"221a\"><strong><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f5de.png?ssl=1\" alt=\"\ud83d\uddde\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> Get exclusive access to AI resources and project ideas<\/strong>:\u00a0<a href=\"https:\/\/the-data-entrepreneurs.kit.com\/shaw\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/the-data-entrepreneurs.kit.com\/shaw<\/a><\/p>\n<p class=\"wp-block-paragraph\" id=\"5ed1\"><strong><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f9d1-200d-1f393.png?ssl=1\" alt=\"\ud83e\uddd1\u200d\ud83c\udf93\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> Learn AI in 6 weeks by building it<\/strong>:\u00a0<a href=\"https:\/\/maven.com\/shaw-talebi\/ai-builders-bootcamp?promoCode=AI25\" rel=\"noreferrer noopener\" 
target=\"_blank\">https:\/\/maven.com\/shaw-talebi\/ai-builders-bootcamp<\/a><\/p>\n<h3 class=\"wp-block-heading\" id=\"a26a\">References<\/h3>\n<p class=\"wp-block-paragraph\" id=\"44a1\">[1]\u00a0<a href=\"https:\/\/openai.com\/index\/learning-to-reason-with-llms\/\" rel=\"noreferrer noopener\" target=\"_blank\">Learning to reason with LLMs<\/a><\/p>\n<p class=\"wp-block-paragraph\" id=\"731a\">[2]\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2501.12948\" rel=\"noreferrer noopener\" target=\"_blank\">arXiv:2501.12948<\/a><strong>\u00a0[cs.CL]<\/strong><\/p>\n<p class=\"wp-block-paragraph\" id=\"4baf\">[3]\u00a0<a href=\"https:\/\/youtu.be\/7xTGNNLPyMI\" rel=\"noreferrer noopener\" target=\"_blank\">Deep Dive into LLMs Like ChatGPT<\/a><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/how-to-train-llms-to-think-o1-deepseek-r1\/\">How to Train LLMs to \u201cThink\u201d (o1 &amp; DeepSeek-R1)<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Shaw Talebi<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/how-to-train-llms-to-think-o1-deepseek-r1\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to Train LLMs to \u201cThink\u201d (o1 &#038; DeepSeek-R1) In September 2024, OpenAI released its o1 model, trained on large-scale reinforcement learning, giving it \u201cadvanced reasoning\u201d capabilities. Unfortunately, the details of how they pulled this off were never shared publicly. 
Today, however, DeepSeek (an AI research lab) has replicated this reasoning behavior and published the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1916,311,1895,71,70,1787],"tags":[7,103,1499],"class_list":["post-2188","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-deepseek-r1","category-getting-started","category-large-model-training","category-large-language-models","category-machine-learning","category-reinforcemect-learning","tag-how","tag-model","tag-tokens"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2188"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2188"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2188\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2188"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2188"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2188"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}