{"id":3311,"date":"2025-04-24T07:03:19","date_gmt":"2025-04-24T07:03:19","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/24\/how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals\/"},"modified":"2025-04-24T07:03:19","modified_gmt":"2025-04-24T07:03:19","slug":"how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/24\/how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals\/","title":{"rendered":"How to Benchmark DeepSeek-R1 Distilled Models on GPQA Using Ollama and OpenAI\u2019s simple-evals"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">The recent launch of the <a href=\"https:\/\/api-docs.deepseek.com\/news\/news250120\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSeek-R1<\/a> model sent ripples across the global AI community. It delivered breakthroughs on par with the reasoning models from Meta and OpenAI, achieving this in a fraction of the time and at a significantly lower cost.<\/p>\n<p class=\"wp-block-paragraph\">Beyond the headlines and online buzz, how can we assess the model\u2019s reasoning abilities using recognized benchmarks?\u00a0<\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/towardsdatascience.com\/tag\/deepseek\/\" title=\"Deepseek\">DeepSeek<\/a>\u2019s <a href=\"http:\/\/chat.deepseek.com\/\" rel=\"noreferrer noopener\" target=\"_blank\">user interface<\/a> makes it easy to explore its capabilities, but using it programmatically offers deeper insights and more seamless integration into real-world applications. 
Understanding how to run such models locally also provides enhanced control and offline access.<\/p>\n<p class=\"wp-block-paragraph\">In this article, we explore how to use <strong>Ollama <\/strong>and <strong>OpenAI\u2019s simple-evals<\/strong> to evaluate the reasoning capabilities of DeepSeek-R1\u2019s distilled models based on the famous <strong>GPQA-Diamond <\/strong>benchmark.<\/p>\n<h2 class=\"wp-block-heading\">Contents<\/h2>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>(1) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-one\">What are Reasoning Models?<\/a><br \/><strong>(2) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-two\">What is DeepSeek-R1?<\/a><br \/><strong>(3) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-three\">Understanding Distillation and DeepSeek-R1 Distilled Models<\/a><br \/><strong>(4) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-four\">Selection of Distilled Model<\/a><br \/><strong>(5) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-five\">Benchmarks for Evaluating Reasoning<\/a><br \/><strong>(6) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-six\">Tools Used<\/a><br \/><strong>(7) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-seven\">Results of Evaluation<\/a><br \/><strong>(8) <\/strong><a href=\"https:\/\/towardsdatascience.com\/#heading-eight\">Step-by-Step Walkthrough<\/a><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">Here is the <a href=\"https:\/\/github.com\/kennethleungty\/DeepSeek-R1-Ollama-Simple-Evals\" rel=\"noreferrer noopener\" target=\"_blank\">link to the accompanying GitHub repo<\/a> for this article.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-one\">(1) What are Reasoning Models?<\/h2>\n<p class=\"wp-block-paragraph\">Reasoning models, 
such as DeepSeek-R1 and OpenAI\u2019s o-series models (e.g., o1, o3), are large language models (LLMs) trained using reinforcement learning to perform reasoning.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Reasoning models think before they answer, producing a long internal chain of thought before responding. They excel in complex problem-solving, coding, scientific reasoning, and multi-step planning for agentic workflows.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-two\">(2) What is DeepSeek-R1?<\/h2>\n<p class=\"wp-block-paragraph\">DeepSeek-R1 is a state-of-the-art open-source LLM designed for <strong>advanced reasoning<\/strong>, introduced in January 2025 in the paper <em>\u201c<\/em><a href=\"https:\/\/arxiv.org\/abs\/2501.12948\" rel=\"noreferrer noopener\" target=\"_blank\"><em>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning<\/em><\/a><em>\u201d<\/em>.<\/p>\n<p class=\"wp-block-paragraph\">The model is a 671-billion-parameter LLM trained with extensive use of reinforcement learning (RL), based on this pipeline:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Two reinforcement stages aimed at discovering improved reasoning patterns and aligning with human preferences<\/li>\n<li class=\"wp-block-list-item\">Two supervised fine-tuning stages serving as the seed for the model\u2019s reasoning and non-reasoning capabilities.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">To be precise, DeepSeek trained two models:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The first model, <strong>DeepSeek-R1-Zero,<\/strong> a reasoning model trained with reinforcement learning, generates data for training the second model, <strong>DeepSeek-R1<\/strong>.\u00a0<\/li>\n<li class=\"wp-block-list-item\">It achieves this by producing reasoning traces, from which only high-quality outputs are retained based on their final 
results.<\/li>\n<li class=\"wp-block-list-item\">This means that, unlike most models, the RL examples in this training pipeline are not curated by humans but generated by the model.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The outcome is that the model achieved performance comparable to leading models like <a href=\"https:\/\/openai.com\/index\/learning-to-reason-with-llms\/\" rel=\"noreferrer noopener\" target=\"_blank\">OpenAI\u2019s o1 model<\/a> across tasks such as mathematics, coding, and complex reasoning.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-three\">(3) Understanding Distillation and DeepSeek-R1\u2019s Distilled Models<\/h2>\n<p class=\"wp-block-paragraph\">Alongside the full model, DeepSeek also open-sourced six smaller dense models (also named DeepSeek-R1) of different sizes (1.5B, 7B, 8B, 14B, 32B, 70B), distilled from DeepSeek-R1 using <a href=\"https:\/\/huggingface.co\/Qwen\" rel=\"noreferrer noopener\" target=\"_blank\">Qwen<\/a> or <a href=\"https:\/\/www.llama.com\/\" rel=\"noreferrer noopener\" target=\"_blank\">Llama<\/a> as the base model.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Distillation <\/strong>is a technique where a smaller model (the \u201cstudent\u201d) is trained to replicate the performance of a larger, more powerful pre-trained model (the \u201cteacher\u201d).\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"510\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Untitled-1024x510.png?resize=1024%2C510&#038;ssl=1\" alt=\"\" class=\"wp-image-601944\"><figcaption class=\"wp-element-caption\">Illustration of DeepSeek-R1 distillation process | Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In this case, the teacher is the 671B DeepSeek-R1 model, and the students are the six models distilled using 
these open-source base models:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-R1-Distill-Qwen-1.5B\" target=\"_blank\" rel=\"noreferrer noopener\">Qwen2.5-Math-1.5B<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-R1-Distill-Qwen-7B\" target=\"_blank\" rel=\"noreferrer noopener\">Qwen2.5-Math-7B<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-R1-Distill-Qwen-14B\" target=\"_blank\" rel=\"noreferrer noopener\">Qwen2.5-14B<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-R1-Distill-Qwen-32B\" target=\"_blank\" rel=\"noreferrer noopener\">Qwen2.5-32B<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-R1-Distill-Llama-8B\" target=\"_blank\" rel=\"noreferrer noopener\">Llama-3.1-8B<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-R1-Distill-Llama-70B\" target=\"_blank\" rel=\"noreferrer noopener\">Llama-3.3-70B-Instruct<\/a><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">DeepSeek-R1 was used as the teacher model to generate 800,000 training samples, a mix of reasoning and non-reasoning samples, for distillation via <strong>supervised fine-tuning<\/strong> of the base models (1.5B, 7B, 8B, 14B, 32B, and 70B).<\/p>\n<p class=\"wp-block-paragraph\"><strong>So why do we do distillation in the first place?\u00a0<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The goal is to transfer the reasoning abilities of larger models, such as DeepSeek-R1 671B, into smaller, more efficient models. 
This empowers the smaller models to handle complex reasoning tasks while being faster and more resource-efficient.<\/p>\n<p class=\"wp-block-paragraph\">Furthermore, DeepSeek-R1 has a massive number of parameters (671 billion)<strong>,<\/strong> making it challenging to run on most consumer-grade machines.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Even the most powerful MacBook Pro, with a maximum of 128GB of unified memory, is inadequate to run a 671-billion-parameter model.<\/p>\n<p class=\"wp-block-paragraph\">As such, distilled models open up the possibility of being deployed on devices with limited computational resources.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em><a href=\"https:\/\/unsloth.ai\/blog\/deepseekr1-dynamic\" target=\"_blank\" rel=\"noreferrer noopener\">Unsloth<\/a> achieved an impressive feat by quantizing the original 671B-parameter DeepSeek-R1 model down to just 131GB\u200a\u2014\u200aa remarkable 80% reduction in size. 
However, a 131GB VRAM requirement remains a significant hurdle.<\/em><\/p>\n<\/blockquote>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-four\">(4) Selection of Distilled Model<\/h2>\n<p class=\"wp-block-paragraph\">With six distilled model sizes to choose from, selecting the right one largely depends on the capabilities of the local device hardware.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">For those with high-performance GPUs or CPUs and a need for maximum performance, the larger DeepSeek-R1 models (32B and up) are ideal\u200a\u2014\u200aeven the quantized 671B version is viable.<\/p>\n<p class=\"wp-block-paragraph\">However, if one has limited resources or prefers quicker generation times (as I do), the smaller distilled variants, such as 8B or 14B, are a better fit.<\/p>\n<p class=\"wp-block-paragraph\"><strong>For this project, I will be using the DeepSeek-R1 distilled <\/strong><a href=\"https:\/\/huggingface.co\/deepseek-ai\/DeepSeek-R1-Distill-Qwen-14B\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>Qwen-14B<\/strong><\/a><strong> model, which aligns with the hardware constraints I faced.<\/strong><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-five\">(5) Benchmarks for Evaluating Reasoning<\/h2>\n<p class=\"wp-block-paragraph\">LLMs are typically evaluated using standardized benchmarks that assess their performance across various tasks, including language understanding, code generation, instruction following, and question answering. 
Common examples include <a href=\"https:\/\/paperswithcode.com\/sota\/multi-task-language-understanding-on-mmlu\" rel=\"noreferrer noopener\" target=\"_blank\">MMLU<\/a>, <a href=\"https:\/\/paperswithcode.com\/sota\/code-generation-on-humaneval\" rel=\"noreferrer noopener\" target=\"_blank\">HumanEval<\/a>, and <a href=\"https:\/\/paperswithcode.com\/dataset\/mgsm\" rel=\"noreferrer noopener\" target=\"_blank\">MGSM<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">To measure an LLM\u2019s capacity for reasoning, we need more challenging, reasoning-heavy benchmarks that go beyond surface-level tasks. Here are some popular examples focused on evaluating advanced reasoning capabilities:<\/p>\n<h3 class=\"wp-block-heading\">(i) AIME 2024\u200a\u2014\u200aCompetition Math<\/h3>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The <a href=\"https:\/\/paperswithcode.com\/sota\/mathematical-reasoning-on-aime24\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>American Invitational Mathematics Examination (AIME) 2024<\/strong><\/a> serves as a strong benchmark for evaluating an LLM\u2019s mathematical reasoning capabilities.\u00a0<\/li>\n<li class=\"wp-block-list-item\">It is a challenging math contest with complex, multi-step problems that test an LLM\u2019s ability to interpret intricate questions, apply advanced reasoning, and perform precise symbolic manipulation.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">(ii) Codeforces\u200a\u2014\u200aCompetition Code<\/h3>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The <a href=\"https:\/\/arxiv.org\/abs\/2501.01257v2\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Codeforces Benchmark<\/strong><\/a> evaluates an LLM\u2019s reasoning ability using real competitive programming problems from Codeforces, a platform known for algorithmic challenges.\u00a0<\/li>\n<li class=\"wp-block-list-item\">These problems test an LLM\u2019s capacity to comprehend complex 
instructions, perform logical and mathematical reasoning, plan multi-step solutions, and generate correct, efficient code.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">(iii) GPQA Diamond\u200a\u2014\u200aPhD-Level Science Questions<\/h3>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">GPQA-Diamond is a curated subset of the <strong>most difficult questions<\/strong> from the broader <a href=\"https:\/\/arxiv.org\/abs\/2311.12022\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>GPQA (Graduate-Level Google-Proof Q&amp;A)<\/strong><\/a> benchmark, specifically designed to push the limits of LLM reasoning in PhD-level biology, physics, and chemistry.<\/li>\n<li class=\"wp-block-list-item\">While GPQA includes a range of conceptual and calculation-heavy graduate questions, GPQA-Diamond isolates only the most challenging and reasoning-intensive ones.<\/li>\n<li class=\"wp-block-list-item\">The questions are considered Google-proof, meaning they are difficult to answer even with unrestricted web access.\u00a0<\/li>\n<li class=\"wp-block-list-item\">Here is an example of a GPQA-Diamond question:<\/li>\n<\/ul>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/kennethleungty\/bbdce4b0c943b531f5d43c28da4935e3.js\"><\/script>\n<\/div>\n<p class=\"wp-block-paragraph\">In this project,<strong> we use GPQA-Diamond as the reasoning benchmark<\/strong>, as <a href=\"https:\/\/openai.com\/index\/learning-to-reason-with-llms\/#:~:text=test-time%20compute-,Evals,-To%20highlight%20the\" rel=\"noreferrer noopener\" target=\"_blank\">OpenAI<\/a> and <a href=\"https:\/\/github.com\/deepseek-ai\/DeepSeek-R1?tab=readme-ov-file#deepseek-r1-evaluation\" rel=\"noreferrer noopener\" target=\"_blank\">DeepSeek<\/a> used it to evaluate their reasoning models.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-six\">(6) Tools Used<\/h2>\n<p class=\"wp-block-paragraph\">For this 
project, we primarily use <a href=\"http:\/\/www.ollama.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Ollama<\/a> and OpenAI\u2019s <a href=\"https:\/\/github.com\/openai\/simple-evals\">simple-evals<\/a>.<\/p>\n<h3 class=\"wp-block-heading\">(i) Ollama<\/h3>\n<p class=\"wp-block-paragraph\"><a href=\"http:\/\/ollama.com\/\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>Ollama<\/strong><\/a> is an open-source tool that simplifies running LLMs on our computer or a local server.<\/p>\n<p class=\"wp-block-paragraph\">It acts as a manager and runtime, handling tasks such as downloads and environment setup. This allows users to interact with these models without requiring a constant internet connection or relying on cloud services.<\/p>\n<p class=\"wp-block-paragraph\">It supports many open-source LLMs, including DeepSeek-R1, and is cross-platform compatible with macOS, Windows, and Linux. Additionally, it offers a straightforward setup with minimal fuss and efficient resource utilization.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em><strong>Important<\/strong>: Ensure your local device has <strong>GPU access<\/strong> for Ollama, as this dramatically accelerates performance and makes subsequent benchmarking exercises much more efficient compared to running on CPU. On machines with an NVIDIA GPU, run <code>nvidia-smi<\/code> in the terminal to check that the GPU is detected.<\/em><\/p>\n<\/blockquote>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">(ii) OpenAI simple-evals<\/h3>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/github.com\/openai\/simple-evals\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>simple-evals<\/strong><\/a> is a lightweight library designed to evaluate language models using a zero-shot, chain-of-thought prompting approach. 
It includes famous benchmarks like MMLU, MATH, GPQA, MGSM, and HumanEval, aiming to reflect realistic usage scenarios.<\/p>\n<p class=\"wp-block-paragraph\">Some of you may know about OpenAI\u2019s more famous and comprehensive evaluation library called <a href=\"https:\/\/github.com\/openai\/evals\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>Evals<\/strong><\/a>, which is distinct from simple-evals.<\/p>\n<p class=\"wp-block-paragraph\">In fact, the <a href=\"https:\/\/github.com\/openai\/simple-evals?tab=readme-ov-file#background\" rel=\"noreferrer noopener\" target=\"_blank\">README<\/a> of simple-evals also specifically indicates that it is not intended to replace the <strong>Evals<\/strong> library.<\/p>\n<p class=\"wp-block-paragraph\"><strong>So why are we using simple-evals?<\/strong>\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The simple answer is that <strong>simple-evals<\/strong> comes with built-in evaluation scripts for the reasoning benchmarks we are targeting (such as GPQA), which are missing in <strong>Evals<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">Additionally, I did not find any other tools or platforms, other than simple-evals, that provide a straightforward, <a href=\"https:\/\/towardsdatascience.com\/tag\/python\/\" title=\"Python\">Python<\/a>-native way to run numerous key benchmarks, such as GPQA, particularly when working with Ollama.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-seven\">(7) Results of Evaluation<\/h2>\n<p class=\"wp-block-paragraph\">As part of the evaluation, I selected <strong>20 random questions<\/strong> from the GPQA-Diamond 198-question set for the <strong>14B distilled model<\/strong> to work on. 
The total time taken was 216 minutes, which is ~11 minutes per question.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The outcome was admittedly disappointing, as it scored only <strong>10%<\/strong>, far below the reported 71.5% score for the 671B DeepSeek-R1 model.<\/p>\n<p class=\"wp-block-paragraph\">The main issue I noticed is that during its intensive internal reasoning,<strong> the model often either failed to produce any answer (e.g., returning reasoning tokens as the final lines of output) or provided a response that did not match the expected multiple-choice format (e.g., Answer: A).<\/strong><\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"508\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/output_screenshot_20250420-1024x508.png?resize=1024%2C508&#038;ssl=1\" alt=\"\" class=\"wp-image-601945\"><figcaption class=\"wp-element-caption\">Evaluation output printout from the 20 examples benchmark run | Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">As shown above, many outputs ended up as <code>None<\/code> because the regex logic in simple-evals could not detect the expected answer pattern in the LLM response.<\/p>\n<p class=\"wp-block-paragraph\">While the <a href=\"https:\/\/github.com\/kennethleungty\/DeepSeek-R1-Ollama-Simple-Evals\/blob\/main\/data\/convos_output_20250313_0105.txt\" rel=\"noreferrer noopener\" target=\"_blank\">human-like reasoning logic<\/a> was interesting to observe, I had expected stronger performance in terms of question-answering accuracy.<\/p>\n<p class=\"wp-block-paragraph\">I have also seen online users mention that even the larger 32B model does not perform as well as o1. 
This has raised doubts about the utility of distilled reasoning models, especially when they struggle to give correct answers despite generating long reasoning.<\/p>\n<p class=\"wp-block-paragraph\">That said, GPQA-Diamond is a highly challenging benchmark, so these models could still be useful for simpler reasoning tasks. Their lower computational demands also make them more accessible.<\/p>\n<p class=\"wp-block-paragraph\">Furthermore, the DeepSeek team recommended conducting multiple tests and averaging the results as part of the benchmarking process\u200a\u2014\u200asomething I omitted due to time constraints.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\" id=\"heading-eight\">(8) Step-by-Step Walkthrough<\/h2>\n<p class=\"wp-block-paragraph\">At this point, we\u2019ve covered the core concepts and key takeaways.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">If you\u2019re ready for a hands-on, technical walkthrough, this section provides a deep dive into the inner workings and step-by-step implementation.\u00a0<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>Check out (or clone) the <a href=\"https:\/\/github.com\/kennethleungty\/DeepSeek-R1-Ollama-Simple-Evals\" target=\"_blank\" rel=\"noreferrer noopener\">accompanying GitHub repo<\/a> to follow along. The requirements for the virtual environment setup can be found <a href=\"https:\/\/github.com\/kennethleungty\/DeepSeek-R1-Ollama-Simple-Evals\/blob\/main\/requirements.txt\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/em><\/p>\n<\/blockquote>\n<h3 class=\"wp-block-heading\">(i) Initial Setup\u200a\u2014\u200aOllama<\/h3>\n<p class=\"wp-block-paragraph\">We begin by downloading Ollama. 
Visit the <a href=\"https:\/\/ollama.com\/download\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>Ollama download page<\/strong><\/a>, select your operating system, and follow the corresponding installation instructions.<\/p>\n<p class=\"wp-block-paragraph\">Once installation is complete, launch Ollama by double-clicking the Ollama app (for Windows and macOS) or running <code>ollama serve<\/code> in the terminal.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">(ii) Initial Setup\u200a\u2014\u200aOpenAI simple-evals<\/h3>\n<p class=\"wp-block-paragraph\">The setup of simple-evals is relatively unique.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">While simple-evals presents itself as a library, <strong>the absence of <\/strong><code>__init__.py<\/code><strong> files in the repository means it is not structured as a proper Python package<\/strong>, leading to import errors after cloning the repo locally.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Since it is also not published to PyPI and lacks standard packaging files like <code>setup.py<\/code> or <code>pyproject.toml<\/code>, it cannot be installed via <code>pip<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">Fortunately, we can utilize <strong>Git submodules<\/strong> as a straightforward workaround.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>A Git submodule lets us include contents of another Git repository inside our own project. 
It pulls the files from an external repo (e.g., simple-evals), but keeps its history separate.<\/em><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">You can choose one of two ways (A or B) to pull the simple-evals contents:<\/p>\n<p class=\"wp-block-paragraph\"><strong><em>(A) If You Cloned My Project Repo<\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">My project repo already includes <code>simple-evals<\/code> as a submodule, so you can just run:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">git submodule update --init --recursive<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><strong><em>(B) If You\u2019re Adding It to a Newly Created Project<\/em><\/strong><br \/>To manually add simple-evals as a submodule, run this:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">git submodule add https:\/\/github.com\/openai\/simple-evals.git simple_evals<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><strong>Note<\/strong>: The <code>simple_evals<\/code> at the end of the command (with an <strong>underscore<\/strong>) is crucial. It sets the folder name, and using a hyphen instead (i.e., simple<strong>-<\/strong>evals) can lead to import issues later.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<p class=\"wp-block-paragraph\"><strong>Final Step (For Both Methods)<\/strong><\/p>\n<p class=\"wp-block-paragraph\">After pulling the repo contents, you must create an empty <code>__init__.py<\/code> in the newly created <code>simple_evals<\/code> folder so that it is importable as a module. 
You can create it manually, or use the following command:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">touch simple_evals\/__init__.py<\/code><\/pre>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">(iii) Pull DeepSeek-R1 model via\u00a0Ollama<\/h3>\n<p class=\"wp-block-paragraph\">The next step is to locally download the distilled model of your choice (e.g., 14B) using this command:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">ollama pull deepseek-r1:14b<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The list of DeepSeek-R1 models available on Ollama can be found <a href=\"https:\/\/ollama.com\/library\/deepseek-r1\" rel=\"noreferrer noopener\" target=\"_blank\">here<\/a>.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">(iv) Define configuration<\/h3>\n<p class=\"wp-block-paragraph\">We define the parameters in a configuration YAML file, as shown below:<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/kennethleungty\/a42fba7cf471abaf6ccf1b9615f3bf85.js\"><\/script>\n<\/div>\n<p class=\"wp-block-paragraph\">The model temperature is set to <strong>0.6<\/strong> (as opposed to the typical default value of 0). 
This follows DeepSeek\u2019s usage recommendations, which suggest a temperature range of 0.5 to 0.7<strong> <\/strong>(0.6 recommended) to <strong>prevent endless repetitions or incoherent outputs.<\/strong><\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>Do check out the interestingly unique <a href=\"https:\/\/github.com\/deepseek-ai\/DeepSeek-R1?tab=readme-ov-file#usage-recommendations\" target=\"_blank\" rel=\"noreferrer noopener\">DeepSeek-R1 usage recommendations\u200a<\/a>\u2014\u200aespecially for benchmarking\u200a\u2014\u200ato ensure optimal performance when using DeepSeek-R1 models.<\/em><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\"><code>EVAL_N_EXAMPLES<\/code> is the parameter for setting the number of questions from the full 198-question set to use for evaluation.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">(v) Set up Sampler\u00a0code<\/h3>\n<p class=\"wp-block-paragraph\">To support Ollama-based language models within the simple-evals framework, we create a custom wrapper class named <code>OllamaSampler<\/code> saved inside <code>utils\/samplers\/ollama_sampler.py<\/code>.<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/kennethleungty\/82ee71b7be485602ff706f254a19392f.js\"><\/script>\n<\/div>\n<p class=\"wp-block-paragraph\">In this context, a <em>sampler<\/em> is a Python class that generates outputs from a language model based on a given prompt.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Since existing samplers in simple-evals only cover providers like OpenAI and Anthropic (Claude), we need a sampler class that provides a compatible interface for Ollama.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The <code>OllamaSampler<\/code> extracts the GPQA question prompt, sends it to the model with a specified temperature, and returns the plain text 
response.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">The <code>_pack_message<\/code> method is included to ensure the output format matches what the evaluation scripts in simple-evals expect.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h3 class=\"wp-block-heading\">(vi) Create evaluation run\u00a0script<\/h3>\n<p class=\"wp-block-paragraph\">The following code sets up the evaluation execution in <code>main.py<\/code>, including the use of the <code>GPQAEval<\/code> class from simple-evals to run GPQA benchmarking.<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/kennethleungty\/47681d08362566feff84072b4bedc2fb.js\"><\/script>\n<\/div>\n<p class=\"wp-block-paragraph\">The <code>run_eval()<\/code> function is a configurable evaluation runner that tests LLMs through Ollama on benchmarks like GPQA.<\/p>\n<p class=\"wp-block-paragraph\">It loads settings from the config file, sets up the appropriate evaluation class from simple-evals, and runs the model through a standardized evaluation process. It is saved in <code>main.py<\/code>, which can be executed with <code>python main.py<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">Following the steps above, we have successfully set up and executed the GPQA-Diamond benchmarking on the DeepSeek-R1 distilled model.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Wrapping It\u00a0Up<\/h2>\n<p class=\"wp-block-paragraph\">In this article, we showcased how we can combine tools like Ollama and OpenAI\u2019s simple-evals to explore and benchmark DeepSeek-R1\u2019s distilled models.<\/p>\n<p class=\"wp-block-paragraph\">The distilled models may not yet rival the 671B parameter original model on challenging reasoning benchmarks like GPQA-Diamond. 
Still, they demonstrate how distillation can expand access to LLM reasoning capabilities.<\/p>\n<p class=\"wp-block-paragraph\">Despite subpar scores in complex PhD-level tasks, these smaller variants may remain viable for less demanding scenarios, paving the way for efficient local deployment on a wider range of hardware.<\/p>\n<h3 class=\"wp-block-heading\">Before you\u00a0go<\/h3>\n<p class=\"wp-block-paragraph\">I welcome you to follow my <a href=\"https:\/\/github.com\/kennethleungty\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a> and <a href=\"https:\/\/www.linkedin.com\/in\/kennethleungty\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a> to stay updated with more engaging and practical content. Meanwhile, have fun benchmarking LLMs with Ollama and simple-evals!<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals\/\">How to Benchmark DeepSeek-R1 Distilled Models on GPQA Using Ollama and OpenAI\u2019s simple-evals<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Kenneth Leung<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/how-to-benchmark-deepseek-r1-distilled-models-on-gpqa-using-ollama-and-openais-simple-evals\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to Benchmark DeepSeek-R1 Distilled Models on GPQA Using Ollama and OpenAI\u2019s simple-evals The recent launch of the DeepSeek-R1 model sent ripples across the global AI community. It delivered breakthroughs on par with the reasoning models from Meta and OpenAI, achieving this in a fraction of the time and at a significantly lower cost. 
Beyond [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,1592,240,71,70,157],"tags":[1593,73,1400],"class_list":["post-3311","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-deepseek","category-editors-pick","category-large-language-models","category-machine-learning","category-python","tag-deepseek","tag-models","tag-reasoning"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3311"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3311"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3311\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3311"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3311"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3311"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}