{"id":3478,"date":"2025-05-01T07:02:24","date_gmt":"2025-05-01T07:02:24","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/01\/rl-from-one-example-why-1-shot-rlvr-might-be-the-breakthrough-weve-been-waiting-for\/"},"modified":"2025-05-01T07:02:24","modified_gmt":"2025-05-01T07:02:24","slug":"rl-from-one-example-why-1-shot-rlvr-might-be-the-breakthrough-weve-been-waiting-for","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/01\/rl-from-one-example-why-1-shot-rlvr-might-be-the-breakthrough-weve-been-waiting-for\/","title":{"rendered":"Reinforcement Learning from One Example?"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">Prompt engineering alone won\u2019t get us to production. Fine-tuning is expensive. And reinforcement learning? That\u2019s been reserved for well-funded labs with massive datasets, until now.<\/p>\n<p class=\"wp-block-paragraph\">New research from Microsoft and academic collaborators has overturned that assumption. Using <strong>Reinforcement Learning with Verifiable Rewards (RLVR)<\/strong> and just a <strong>single training example<\/strong>, researchers achieved results <em>on par with models trained on over a thousand examples<\/em>, sometimes even better.<\/p>\n<p class=\"wp-block-paragraph\">This isn\u2019t just incremental progress. It\u2019s a rethinking of how we fine-tune large language models (LLMs) for reasoning tasks. 
In this post, we\u2019ll unpack what 1-shot RLVR is, how it works, and what it means for developers building math agents, automated tutors, and reasoning copilots.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"322\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/unnamed-71-1024x322.png?resize=1024%2C322&#038;ssl=1\" alt=\"\" class=\"wp-image-602859\"><figcaption class=\"wp-element-caption\"><em>RLVR with 1 example (green) can perform as well as training on datasets with thousands of examples (blue). <\/em><a data-type=\"link\" data-id=\"https:\/\/arxiv.org\/pdf\/2504.20571\" href=\"https:\/\/arxiv.org\/pdf\/2504.20571\">From the paper<\/a>.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">1-Shot RLVR: What Is It?<\/h2>\n<p class=\"wp-block-paragraph\">RLVR is a flavor of reinforcement learning where the model is trained using <strong>verifiable reward signals,<\/strong> typically 0\/1 based on whether the output is correct. 
In contrast to the learned reward models used in <a href=\"https:\/\/towardsdatascience.com\/tag\/rlhf\/\" title=\"RLHF\">RLHF<\/a>, RLVR scores outputs against hard ground truth.<\/p>\n<p class=\"wp-block-paragraph\">What the authors discovered is that if you apply RLVR to a base model (e.g., Qwen2.5-Math-1.5B) and train it on <strong>just one carefully selected math example<\/strong>, performance on benchmark tasks can <strong>nearly double<\/strong>.<\/p>\n<h2 class=\"wp-block-heading\">The Numbers That Stun<\/h2>\n<p class=\"wp-block-paragraph\">Here\u2019s what happens when you train Qwen2.5-Math-1.5B on just <em>one<\/em> example:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>MATH500 Accuracy:<\/strong> Jumps from <strong>36.0% \u2192 73.6%<\/strong>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Avg. Across 6 Math Benchmarks:<\/strong> Improves from <strong>17.6% \u2192 35.7%<\/strong>\n<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Using <strong>two examples<\/strong> yielded <strong>74.8%<\/strong> on MATH500 and a <strong>36.6%<\/strong> average, slightly <strong>outperforming the full 1.2k dataset<\/strong> the examples were selected from.<\/p>\n<p class=\"wp-block-paragraph\">This result wasn\u2019t a fluke. 
Many different examples produced gains of roughly 30% or more on MATH500 when used individually.<\/p>\n<h2 class=\"wp-block-heading\">Why Does This Approach Work?<\/h2>\n<p class=\"wp-block-paragraph\">The paper reports several findings and hypotheses:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Policy Gradient Loss Does the Heavy Lifting:<\/strong> Removing it from the training pipeline causes gains to disappear, showing it\u2019s the main driver of improvements.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Entropy Loss Encourages Exploration:<\/strong> Adding entropy regularization, even without reward, boosts MATH500 accuracy by over 25 points.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Post-Saturation Generalization:<\/strong> Accuracy on the training example quickly hits 100%, yet generalization on test sets <em>keeps improving<\/em>.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Cross-Domain Effects:<\/strong> A geometry example improved performance on algebra and number theory, too.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Self-Reflection Increases:<\/strong> Models trained via 1-shot RLVR show more frequent use of \u201crethink,\u201d \u201crecheck,\u201d and \u201crecalculate.\u201d<\/li>\n<\/ol>\n<h2 class=\"wp-block-heading\">Implications for Developers<\/h2>\n<p class=\"wp-block-paragraph\">If you\u2019re building LLM-powered reasoning tools, math solvers, science tutors, or data agents, this technique offers enormous leverage:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>You don\u2019t need big data<\/strong>: A single example can go a long way.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>You don\u2019t need OpenAI access<\/strong>: It works with open models like Qwen and LLaMA.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>You don\u2019t need human labels<\/strong>: Many examples already exist in curated math datasets like MATH or DeepScaleR.<\/li>\n<\/ul>\n<p 
class=\"wp-block-paragraph\">Imagine building an AI tutor that learns from a single problem and generalizes across the curriculum. That future just got closer.<\/p>\n<h2 class=\"wp-block-heading\">Beyond Math: Early Signs of Transfer<\/h2>\n<p class=\"wp-block-paragraph\">The authors also evaluated on ARC-Easy and ARC-Challenge, two non-mathematical reasoning benchmarks.<\/p>\n<p class=\"wp-block-paragraph\">Here\u2019s what they found for Qwen2.5-Math-1.5B:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Base model:<\/strong> 48.0 (ARC-E), 30.2 (ARC-C)\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>After 1-shot RLVR (\u03c0<sub>13<\/sub>):<\/strong> 55.8 (ARC-E), 33.4 (ARC-C)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">That gain exceeds even full-dataset RLVR. Training on a math problem helped the model become a better commonsense reasoner.<\/p>\n<h2 class=\"wp-block-heading\">What Makes a Good Example?<\/h2>\n<p class=\"wp-block-paragraph\">Selecting high-impact examples by their historical training variance (\u03c0<sub>1<\/sub> and \u03c0<sub>13<\/sub>) worked well. But surprisingly, <strong>many examples work<\/strong>, even those with low variance.<\/p>\n<p class=\"wp-block-paragraph\">There\u2019s no perfect recipe yet, but the early insight is promising:<\/p>\n<p class=\"wp-block-paragraph\">\u201cAlmost all examples improve performance when used in 1-shot RLVR.\u201d<\/p>\n<h2 class=\"wp-block-heading\">When One Isn\u2019t Enough<\/h2>\n<p class=\"wp-block-paragraph\">For some models, particularly distilled ones like DeepSeek-R1-Distill-Qwen-1.5B, performance gains from 1-shot RLVR were more modest (~6.9%). 
But moving to 4-shot or 16-shot setups showed steady improvement.<\/p>\n<p class=\"wp-block-paragraph\">This implies that <strong>model family and training history matter,<\/strong> but the general trend holds: <em>you need far less data than we thought<\/em>.<\/p>\n<h2 class=\"wp-block-heading\">The Role of Entropy: Why Exploration Matters<\/h2>\n<p class=\"wp-block-paragraph\">One of the paper\u2019s most surprising discoveries is that <strong>entropy loss alone<\/strong>, even without rewards, can yield large gains.<\/p>\n<p class=\"wp-block-paragraph\">Example: Training Qwen2.5-Math-1.5B with only entropy loss improves MATH500 from <strong>36.0% to 63.4%<\/strong> in 20 steps.<\/p>\n<p class=\"wp-block-paragraph\">This reveals a powerful principle:<\/p>\n<p class=\"wp-block-paragraph\">Letting models explore more freely helps them generalize even from one example.<\/p>\n<h2 class=\"wp-block-heading\">1-Shot \u2260 Grokking<\/h2>\n<p class=\"wp-block-paragraph\">Post-saturation generalization may remind some of grokking, where models suddenly generalize after long periods of overfitting.<\/p>\n<p class=\"wp-block-paragraph\">But ablation studies show 1-shot RLVR isn\u2019t the same:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">It doesn\u2019t rely on weight decay.<\/li>\n<li class=\"wp-block-list-item\">Gains are immediate and sustained.<\/li>\n<li class=\"wp-block-list-item\">It appears tied to policy gradients and entropy-driven exploration.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">The Future: Smarter Data, Smaller Footprints<\/h2>\n<p class=\"wp-block-paragraph\">This paper serves as a timely reminder. More data isn\u2019t always the answer. 
Better data, better selection, and reinforcement learning, even from one example, can unlock powerful capabilities in your base models.<\/p>\n<p class=\"wp-block-paragraph\"><strong>For developers<\/strong>, this means:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">You can build performant math agents with minimal compute.<\/li>\n<li class=\"wp-block-list-item\">You can use RLVR to fine-tune open models with cheap, verifiable rewards.<\/li>\n<li class=\"wp-block-list-item\">You can match or beat much larger datasets with a single, well-chosen problem.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">How Adaptive Helps You Go from Prototype to Production<\/h2>\n<p class=\"wp-block-paragraph\">While the results of 1-shot RLVR are impressive in research, applying them at scale requires the right tools and infrastructure. That\u2019s where <strong>Adaptive Engine<\/strong> comes in.<\/p>\n<p class=\"wp-block-paragraph\">Whether you\u2019re fine-tuning models on a single math problem or optimizing agents across business domains, Adaptive gives you the full flywheel:<\/p>\n<h3 class=\"wp-block-heading\">Adapt<\/h3>\n<p class=\"wp-block-paragraph\">Outperform frontier models with <strong>reinforcement fine-tuning<\/strong> that works, even with limited data. Adaptive makes it easy to run GRPO or PPO on open models with just a few examples and verifiable rewards.<\/p>\n<h3 class=\"wp-block-heading\">Evaluate<\/h3>\n<p class=\"wp-block-paragraph\">Before you deploy, you need confidence. Adaptive supports <strong>personalized, production-aligned evaluations<\/strong>, so you can benchmark improvements on your real-world workloads, not just abstract benchmarks.<\/p>\n<h3 class=\"wp-block-heading\">Serve<\/h3>\n<p class=\"wp-block-paragraph\">With <strong>fast, efficient inference<\/strong>, Adaptive lets you host tuned models wherever you need them, on cloud, edge, or hybrid infrastructure. 
High performance, low latency.<\/p>\n<p class=\"wp-block-paragraph\">From day-one experimentation to at-scale deployment, Adaptive helps you:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Identify high-impact examples<\/strong> with variance-based scoring.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Run lightweight RL pipelines<\/strong> without wrangling compute.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Measure what matters<\/strong> for your business use case.\n<\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/rl-from-one-example-why-1-shot-rlvr-might-be-the-breakthrough-weve-been-waiting-for\/\">Reinforcement Learning from One Example?<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Derrick Mwiti<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/rl-from-one-example-why-1-shot-rlvr-might-be-the-breakthrough-weve-been-waiting-for\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reinforcement Learning from One Example? Prompt engineering alone won\u2019t get us to production. Fine-tuning is expensive. And reinforcement learning? That\u2019s been reserved for well-funded labs with massive datasets until now. New research from Microsoft and academic collaborators has overturned that assumption. 
Using Reinforcement Learning with Verifiable Rewards (RLVR) and just a single training example, researchers [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,88,1112,70,1787,1879],"tags":[2522,1217,2523],"class_list":["post-3478","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-deep-learning","category-fine-tuning","category-machine-learning","category-reinforcemect-learning","category-rlhf","tag-example","tag-reinforcement","tag-rlvr"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3478"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3478"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3478\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3478"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3478"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3478"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}