{"id":3694,"date":"2025-05-09T07:02:23","date_gmt":"2025-05-09T07:02:23","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/09\/model-compression-make-your-machine-learning-models-lighter-and-faster\/"},"modified":"2025-05-09T07:02:23","modified_gmt":"2025-05-09T07:02:23","slug":"model-compression-make-your-machine-learning-models-lighter-and-faster","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/09\/model-compression-make-your-machine-learning-models-lighter-and-faster\/","title":{"rendered":"Model Compression: Make Your Machine Learning Models Lighter and Faster"},"content":{"rendered":"<p>    Model Compression: Make Your Machine Learning Models Lighter and Faster<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h2 class=\"wp-block-heading\"><mdspan datatext=\"el1746560110734\" class=\"mdspan-comment\">Introduction<\/mdspan><\/h2>\n<p class=\"wp-block-paragraph\">Whether you\u2019re preparing for interviews or building Machine Learning systems at your job, model compression has become a must-have skill. In the era of LLMs, where models are getting larger and larger, the challenges around compressing these models to make them more efficient, smaller, and usable on lightweight machines have never been more relevant.<\/p>\n<p class=\"wp-block-paragraph\">In this article, I will go through four fundamental compression techniques that every ML practitioner should understand and master. I explore pruning, quantization, low-rank factorization, and <a href=\"https:\/\/towardsdatascience.com\/tag\/knowledge-distillation\/\" title=\"Knowledge Distillation\">Knowledge Distillation<\/a>, each offering unique advantages. 
I will also add some minimal PyTorch code samples for each of these methods.<\/p>\n<p class=\"wp-block-paragraph\">I hope you enjoy the article!<\/p>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Feel free to connect on\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/maxime-wolf\/\" target=\"_blank\" rel=\"noreferrer noopener\">LinkedIn<\/a>!<\/li>\n<li class=\"wp-block-list-item\">Follow me on\u00a0<a href=\"https:\/\/github.com\/maxime7770\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a>\u00a0and visit <a href=\"http:\/\/maximewolf.com\/\">my website<\/a> for more content.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Model pruning<\/h2>\n<p class=\"wp-block-paragraph\">Pruning is probably the most intuitive compression technique. The idea is very simple: remove some of the weights of the network, either randomly or <strong>remove the \u201cless important\u201d ones<\/strong>. 
Of course, when we talk about \u201cremoving\u201d weights in the context of neural networks, it means <strong>setting the weights to zero<\/strong>.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"557\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/blog_img1-1024x557.png?resize=1024%2C557&#038;ssl=1\" alt=\"\" class=\"wp-image-603632\"><figcaption class=\"wp-element-caption\">Model pruning (Image by the author and ChatGPT | Inspiration: [3])<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">Structured vs unstructured pruning<\/h3>\n<p class=\"wp-block-paragraph\">Let\u2019s start with a <strong>simple heuristic<\/strong>: removing weights smaller than a threshold.<\/p>\n<p class=\"wp-block-paragraph\">\\[ w'_{ij} = \\begin{cases} w_{ij} &amp; \\text{if } |w_{ij}| \\ge \\theta_0 \\\\ 0 &amp; \\text{if } |w_{ij}| &lt; \\theta_0 \\end{cases} \\]<\/p>\n<p class=\"wp-block-paragraph\">Of course, this is not ideal because we would still need to <strong>find the right threshold<\/strong> for our problem! A more practical approach is to remove a specified proportion of the weights with <strong>the smallest magnitudes<\/strong> (norm) within one layer. There are two common ways of implementing pruning in one layer:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Structured pruning<\/strong>: remove entire components of the network (e.g. a random row from the weight tensor, or a random channel in a convolutional layer)<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Unstructured pruning<\/strong>: remove individual weights regardless of their positions and of the structure of the tensor<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">We can also use <strong>global pruning<\/strong> with either of the two above methods. 
This removes the chosen proportion of weights across multiple layers at once, so individual layers can end up with different removal rates.<\/p>\n<p class=\"wp-block-paragraph\">PyTorch makes this pretty straightforward (by the way, you can find all code snippets in <a href=\"https:\/\/github.com\/maxime7770\/Model-Compression\">my GitHub repo<\/a>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch.nn.utils.prune as prune\n\n# 1. Random unstructured pruning (20% of weights at random)\nprune.random_unstructured(model.layer, name=\"weight\", amount=0.2)\n\n# 2. L1\u2011norm unstructured pruning (20% of smallest weights)\nprune.l1_unstructured(model.layer, name=\"weight\", amount=0.2)\n\n# 3. Global unstructured pruning (40% of all weights by L1 norm across layers)\nprune.global_unstructured(\n    [(model.layer1, \"weight\"), (model.layer2, \"weight\")],\n    pruning_method=prune.L1Unstructured,\n    amount=0.4\n)\n\n# 4. Structured pruning (remove 30% of rows with lowest L2 norm)\nprune.ln_structured(model.layer, name=\"weight\", amount=0.3, n=2, dim=0)<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><em>Note: if you have taken statistics classes, you probably learned regularization-induced methods that also implicitly prune some weights during training, by using L0 or L1 norm regularization. Pruning differs from that because it is applied <strong>after training<\/strong>, as a <a href=\"https:\/\/towardsdatascience.com\/tag\/model-compression\/\" title=\"Model Compression\">model compression<\/a> technique.<\/em><\/p>\n<h3 class=\"wp-block-heading\">Why does pruning work? 
The Lottery Ticket Hypothesis<\/h3>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"683\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/lottery-ticket-1-1024x683.png?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-603255\"><figcaption class=\"wp-element-caption\">Image generated by ChatGPT<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I would like to conclude this section with a quick mention of the <strong>Lottery Ticket Hypothesis<\/strong>, which is both an application of pruning and an interesting explanation of how removing weights can often improve a model. I recommend reading the associated paper ([7]) for more details.<\/p>\n<p class=\"wp-block-paragraph\">The authors use the following procedure:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Train the full model to convergence<\/li>\n<li class=\"wp-block-list-item\">Prune the smallest-magnitude weights (say 10%)<\/li>\n<li class=\"wp-block-list-item\">Reset the remaining weights to their original initialization values<\/li>\n<li class=\"wp-block-list-item\">Retrain this pruned network<\/li>\n<li class=\"wp-block-list-item\">Repeat the process multiple times<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">After doing this 30 times, you end up with only 0.9<sup>30<\/sup> \u2248 4% of the original parameters. And surprisingly, <strong>this network can do as well as the original one<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">This suggests that there is substantial parameter redundancy. In other words, there exists a <strong>sub-network<\/strong> (\u201ca lottery ticket\u201d) that actually does most of the work! 
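<\/p>\n<p class=\"wp-block-paragraph\">The procedure above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not the paper\u2019s implementation: the \u201cnetwork\u201d is a flat list of 1,000 weights, <code>train<\/code> is a stand-in that only adds noise instead of really training, and both <code>train<\/code> and <code>magnitude_prune<\/code> are hypothetical helper names introduced here.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import random\n\ndef train(init_weights, mask):\n    # Placeholder for a real training loop (assumption): perturb the surviving weights\n    return [w + random.gauss(0.0, 0.01) if m else 0.0 for w, m in zip(init_weights, mask)]\n\ndef magnitude_prune(weights, mask, rate):\n    # Rank the surviving weights by magnitude and drop the smallest ones\n    alive = sorted((i for i, m in enumerate(mask) if m), key=lambda i: abs(weights[i]))\n    new_mask = list(mask)\n    for i in alive[:int(len(alive) * rate)]:\n        new_mask[i] = 0\n    return new_mask\n\nrandom.seed(0)\ninit_weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # original initialization\nmask = [1] * len(init_weights)  # 1 = kept, 0 = pruned\n\nfor _ in range(30):\n    # Steps 1, 3 and 4: every round restarts from the original initialization,\n    # keeps only the weights allowed by the current mask, and trains again\n    trained = train(init_weights, mask)\n    # Step 2: prune 10% of the weights that are still alive\n    mask = magnitude_prune(trained, mask, rate=0.1)\n\nremaining = sum(mask) \/ len(mask)\nprint(remaining)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Because each round removes 10% of the surviving weights (rounded down), the remaining fraction ends up at roughly 4 to 5%, in line with the 0.9<sup>30<\/sup> estimate. In a real experiment, <code>train<\/code> would be an actual training loop and the mask would be applied to the weight tensors of each layer.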
<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Pruning is one way to unveil this sub-network.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">\n<\/blockquote>\n<\/blockquote>\n<details class=\"wp-block-details is-layout-flow wp-block-details-is-layout-flow\">\n<summary>I recommend this very good video that covers the topic!<\/summary>\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\">\n<div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"THIS is why large language models can understand the world\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/UKcWu1l_UNw?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div>\n<\/figure>\n<\/details>\n<h2 class=\"wp-block-heading\" style=\"border-style:none;border-width:0px\">Quantization<\/h2>\n<p class=\"wp-block-paragraph\">While pruning focuses on removing parameters entirely, <a href=\"https:\/\/towardsdatascience.com\/tag\/quantization\/\" title=\"Quantization\">Quantization<\/a> takes a different approach: <strong>reducing the precision<\/strong> of each parameter.<\/p>\n<p class=\"wp-block-paragraph\">Remember that every number in a computer is stored as a sequence of bits. 
A float32 value uses 32 bits (see example picture below), whereas an 8-bit integer (int8) uses just 8 bits.<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/blog_img2.png?ssl=1\" alt=\"\" class=\"wp-image-603631\"><figcaption class=\"wp-element-caption\">An example of how float32 numbers are represented with 32 bits (Image by the author and ChatGPT | Inspiration: [2])<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Most deep learning models are trained using 32-bit floating-point numbers (FP32). Quantization converts these high-precision values to <strong>lower-precision formats<\/strong> like 16-bit floating-point (FP16), 8-bit integers (INT8), or even 4-bit representations.<\/p>\n<p class=\"wp-block-paragraph\">The savings here are obvious: <strong>INT8 requires 75% less memory than FP32<\/strong>. But how do we actually perform this conversion without destroying our model\u2019s performance?<\/p>\n<h3 class=\"wp-block-heading\">The math behind quantization<\/h3>\n<p class=\"wp-block-paragraph\">To convert from floating-point to integer representation, we need to map the continuous range of values to a <strong>discrete set of integers<\/strong>. 
For INT8 quantization, we\u2019re mapping to 256 possible values (from -128 to 127).<\/p>\n<p class=\"wp-block-paragraph\">Suppose our weights are normalized between -1.0 and 1.0 (common in deep learning):<\/p>\n<p class=\"wp-block-paragraph\">\\[ \\text{scale} = \\frac{\\text{float\\_max} - \\text{float\\_min}}{\\text{int8\\_max} - \\text{int8\\_min}} = \\frac{1.0 - (-1.0)}{127 - (-128)} = \\frac{2.0}{255} \\]<\/p>\n<p class=\"wp-block-paragraph\">Then, the quantized value is given by<\/p>\n<p class=\"wp-block-paragraph\">\\[ \\text{quantized\\_value} = \\text{round}\\left( \\frac{\\text{original\\_value}}{\\text{scale}} + \\text{zero\\_point} \\right) \\]<\/p>\n<p class=\"wp-block-paragraph\">Here, <code>zero_point=0<\/code> because we want 0 to be mapped to 0. Rounding to the nearest integer and clipping to the INT8 range gives integers between -128 and 127.<\/p>\n<p class=\"wp-block-paragraph\">And, you guessed it: to get integers back to float, we can use the inverse operation: \\[ \\text{float\\_value} = (\\text{integer\\_value} - \\text{zero\\_point}) \\times \\text{scale} \\]<\/p>\n<p class=\"wp-block-paragraph\"><em>Note: in practice, the scaling factor is determined from the actual range of the values we quantize.<\/em><\/p>\n<h3 class=\"wp-block-heading\">How to apply quantization?<\/h3>\n<p class=\"wp-block-paragraph\">Quantization can be applied at <strong>different stages<\/strong> and with <strong>different strategies<\/strong>. 
Here are a few techniques worth knowing about: <em>(below, the word \u201cactivation\u201d refers to the output values of each layer)<\/em><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Post-training quantization (PTQ):<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Static Quantization<\/strong>: quantize both weights and activations <strong>offline<\/strong> (after training and before inference)<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Dynamic Quantization<\/strong>: quantize <strong>weights offline<\/strong>, but <strong>activations on-the-fly<\/strong> during inference. This differs from static quantization because the scaling factor for the activations is computed at inference time, from the values actually observed, rather than fixed in advance.<\/li>\n<\/ul>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Quantization-aware training (QAT):<\/strong> simulate quantization during training by rounding values, while the calculations are still done with floating-point numbers. This makes the model learn weights that are more robust to quantization, which will be applied after training. Under the hood, the idea is to <strong>add \u201cfake\u201d operations<\/strong>: <code>x -&gt; dequantize(quantize(x))<\/code>: this new value is close to x, but it still helps the model <strong>tolerate the 8-bit rounding and clipping noise.<\/strong>\n<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch\nimport torch.quantization as tq\n\n# 1. Post\u2011training static quantization (weights + activations offline)\n# (eager mode: this assumes the model wraps its forward pass in tq.QuantStub \/ tq.DeQuantStub)\nmodel.eval()\nmodel.qconfig = tq.get_default_qconfig('fbgemm') # assign a static quantization config\ntq.prepare(model, inplace=True)\n# we need to use a calibration dataset to determine the ranges of values\nwith torch.no_grad():\n    for data, _ in calibration_data:\n        model(data)\ntq.convert(model, inplace=True) # convert to a fully int8 model\n\n# 2. 
Post\u2011training dynamic quantization (weights offline, activations on\u2011the\u2011fly)\ndynamic_model = tq.quantize_dynamic(\n    model,\n    {torch.nn.Linear, torch.nn.LSTM}, # layers to quantize\n    dtype=torch.qint8\n)\n\n# 3. Quantization\u2011Aware Training (QAT)\nmodel.train()\nmodel.qconfig = tq.get_default_qat_qconfig('fbgemm')  # set up QAT config\ntq.prepare_qat(model, inplace=True) # insert fake\u2011quant modules\n# [here, train or fine\u2011tune the model as usual]\nqat_model = tq.convert(model.eval(), inplace=False) # convert to real int8 after QAT<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><strong>Quantization is very flexible!<\/strong> You can apply different precision levels to different parts of the model. For instance, you might quantize most linear layers to 8-bit for maximum speed and memory savings, while leaving critical components (e.g. attention heads, or batch-norm layers) at 16-bit or full-precision.<\/p>\n<h2 class=\"wp-block-heading\">Low-Rank Factorization<\/h2>\n<p class=\"wp-block-paragraph\">Now let\u2019s talk about low-rank factorization \u2014 a method that has been popularized with the rise of LLMs.<\/p>\n<p class=\"wp-block-paragraph\">The key observation: many weight matrices in neural networks have effective ranks much lower than their dimensions suggest. In plain English, that means there is a lot of <strong>redundancy<\/strong> in the parameters.<\/p>\n<p class=\"wp-block-paragraph\"><em>Note: if you have ever used PCA for dimensionality reduction, you have already encountered a form of low-rank approximation. PCA decomposes large matrices into products of smaller, lower-rank factors that retain as much information as possible.<\/em><\/p>\n<h3 class=\"wp-block-heading\">The linear algebra behind low-rank factorization<\/h3>\n<p class=\"wp-block-paragraph\">Take a weight matrix W. 
Every real matrix can be represented using a <strong>Singular Value Decomposition<\/strong> (SVD):<\/p>\n<p class=\"wp-block-paragraph\">\\[ W = U \\Sigma V^T \\]<\/p>\n<p class=\"wp-block-paragraph\">where \u03a3 is a diagonal matrix with singular values in non-increasing order. The number of strictly positive singular values is equal to the rank of the matrix W.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/blog_img3.png?ssl=1\" alt=\"\" class=\"wp-image-603630\"><figcaption class=\"wp-element-caption\">SVD visualized for a matrix of rank r (Image by the author and ChatGPT | Inspiration: [5])<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">To approximate W with a matrix of rank k &lt; r, <strong>we can select the k largest singular values in \u03a3<\/strong>, and the corresponding first k columns of U and first k rows of V<sup>T<\/sup>:<\/p>\n<p class=\"wp-block-paragraph\">\\[ \\begin{aligned} W_k &amp;= U_k\\,\\Sigma_k\\,V_k^T \\\\[6pt] &amp;= \\underbrace{U_k\\,\\Sigma_k^{1\/2}}_{A\\in\\mathbb{R}^{m\\times k}} \\underbrace{\\Sigma_k^{1\/2}\\,V_k^T}_{B\\in\\mathbb{R}^{k\\times n}}. \\end{aligned} \\]<\/p>\n<p class=\"wp-block-paragraph\">See how the new matrix can be <strong>decomposed as the product of A and B<\/strong>, with the total number of parameters now being <code>m * k + k * n = k*(m+n)<\/code> instead of <code>m*n<\/code>! 
This is a <strong>large reduction<\/strong>, especially when k is much smaller than <code>m<\/code> and <code>n<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">In practice, it amounts to replacing a linear layer x \u2192 Wx with two consecutive ones: x \u2192 A(Bx).<\/p>\n<h3 class=\"wp-block-heading\">In PyTorch<\/h3>\n<p class=\"wp-block-paragraph\">We can either apply low-rank factorization <strong>before training<\/strong> (parameterizing each linear layer as <strong>two smaller matrices<\/strong> \u2013 not really a compression method, but a design choice) or <strong>after training<\/strong> (applying a <strong>truncated SVD<\/strong> on weight matrices). The second approach is by far the most common one and is implemented below.<\/p>\n<pre class=\"wp-block-prismatic-blocks\" datatext=\"el1746560007753\"><code class=\"language-python\">import torch\n\n# 1. Extract weight and choose rank\nW = model.layer.weight.data # (m, n) = (out_features, in_features)\nk = 64 # desired rank\n\n# 2. Approximate low-rank SVD\nU, S, V = torch.svd_lowrank(W, q=k) # U: (m, k), S: (k,) vector of singular values, V: (n, k)\n\n# 3. Form factors A and B\nA = U * S.sqrt() # [m, k]\nB = V.t() * S.sqrt().unsqueeze(1) # [k, n]\n\n# 4. 
Replace with two linear layers and insert the matrices A and B\norig = model.layer\nhas_bias = orig.bias is not None\nmodel.layer = torch.nn.Sequential(\n    torch.nn.Linear(orig.in_features, k, bias=False),\n    torch.nn.Linear(k, orig.out_features, bias=has_bias),\n)\nmodel.layer[0].weight.data.copy_(B)\nmodel.layer[1].weight.data.copy_(A)\nif has_bias:\n    model.layer[1].bias.data.copy_(orig.bias.data) # keep the original bias<\/code><\/pre>\n<h3 class=\"wp-block-heading\">LoRA: an application of low-rank approximation<\/h3>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/Screenshot-2025-05-03-at-4.29.38%25E2%2580%25AFPM.png?ssl=1\" alt=\"\" class=\"wp-image-603251\"><figcaption class=\"wp-element-caption\">LoRA fine-tuning: W is fixed, A and B are trained (source: [1])<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">I think it\u2019s crucial to mention <strong>LoRA<\/strong>: you have probably heard of LoRA (Low-Rank Adaptation) if you have been following <strong>LLM fine-tuning developments<\/strong>. Though not strictly a compression technique, LoRA has become extremely popular for adapting large language models and <strong>making fine-tuning much cheaper<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">The idea is simple: during fine-tuning, rather than modifying the original model weights W, LoRA freezes them and <strong>learns trainable low-rank updates<\/strong>:<\/p>\n<p class=\"wp-block-paragraph\">$$W' = W + \\Delta W = W + AB$$<\/p>\n<p class=\"wp-block-paragraph\">where A and B are low-rank matrices. This allows for task-specific adaptation with just a fraction of the parameters. <\/p>\n<p class=\"wp-block-paragraph\">Even better: <strong>QLoRA<\/strong> takes this further by combining quantization with low-rank adaptation!<\/p>\n<p class=\"wp-block-paragraph\">Again, this is a very flexible technique and can be applied at <strong>various stages<\/strong>. 
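<\/p>\n<p class=\"wp-block-paragraph\">To put a number on \u201ca fraction of the parameters\u201d, here is a quick worked count. The sizes are assumptions picked for illustration (they are not taken from the LoRA paper): a square weight matrix with m = n = 4096 and a rank r = 8, so that A has shape m \u00d7 r and B has shape r \u00d7 n.<\/p>\n<p class=\"wp-block-paragraph\">$$ \\underbrace{m \\times n}_{\\text{frozen } W} = 4096 \\times 4096 = 16{,}777{,}216, \\qquad \\underbrace{r\\,(m + n)}_{\\text{trainable } A \\text{ and } B} = 8 \\times 8192 = 65{,}536 $$<\/p>\n<p class=\"wp-block-paragraph\">Under these assumptions, the trainable update holds 1\/256 of the layer\u2019s parameters, i.e. about 0.4%.<\/p>\n<p class=\"wp-block-paragraph\">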
Usually, LoRA is applied only on specific layers (for example, Attention layers\u2019 weights).<\/p>\n<h2 class=\"wp-block-heading\">Knowledge Distillation<\/h2>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"496\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/blog_img4-1-1024x496.png?resize=1024%2C496&#038;ssl=1\" alt=\"\" class=\"wp-image-603629\"><figcaption class=\"wp-element-caption\">Knowledge distillation process (Image by the author and ChatGPT | Inspiration: [4])<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Knowledge distillation takes a fundamentally different approach from what we have seen so far. Instead of modifying an existing model\u2019s parameters, it <strong>transfers the \u201cknowledge\u201d<\/strong> from a <strong>large<\/strong>, complex model (the \u201cteacher\u201d) to a <strong>smaller<\/strong>, more efficient model (the \u201cstudent\u201d). The goal is to train the student model to <strong>mimic the behavior<\/strong> and replicate the performance of the teacher, often an easier task than solving the original problem from scratch.<\/p>\n<h3 class=\"wp-block-heading\">The distillation loss<\/h3>\n<p class=\"wp-block-paragraph\">Let\u2019s explain some concepts in the case of a classification problem:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The <strong>teacher model<\/strong> is usually a large, complex model that achieves high performance on the task at hand<\/li>\n<li class=\"wp-block-list-item\">The <strong>student model<\/strong> is a second, smaller model with a different architecture, but tailored to the same task<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Soft targets<\/strong>: these are the teacher\u2019s model predictions (<strong>probabilities<\/strong>, and not labels!). They will be used by the student model to mimic the teacher\u2019s behaviors. 
Note that we use raw predictions and not labels because they also contain information about the confidence of the predictions<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Temperature<\/strong>: in addition to the teacher\u2019s prediction, we also use a coefficient T (called temperature) in the softmax function to extract more information from the soft targets. Increasing T softens the distribution, so the small probabilities that the teacher assigns to the incorrect classes weigh more in the loss and the student can learn from them.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">In practice, it is pretty straightforward to train the student model. We <strong>combine the usual loss<\/strong> (standard cross-entropy loss based on hard labels) <strong>with the \u201cdistillation\u201d loss<\/strong> (based on the teacher\u2019s soft targets):<\/p>\n<p class=\"wp-block-paragraph\">$$ L_{\\text{total}} = \\alpha L_{\\text{hard}} + (1 - \\alpha) L_{\\text{distill}} $$<\/p>\n<p class=\"wp-block-paragraph\">The distillation loss is nothing but the <strong>KL divergence<\/strong> between the teacher and student distributions (<em>you can see it as a measure of the distance between the two distributions<\/em>).<\/p>\n<p class=\"wp-block-paragraph\">$$ L_{\\text{distill}} = D_{KL}(q_{\\text{teacher}} \\,\\|\\, q_{\\text{student}}) = \\sum_i q_{\\text{teacher}, i} \\log \\left( \\frac{q_{\\text{teacher}, i}}{q_{\\text{student}, i}} \\right) $$<\/p>\n<p class=\"wp-block-paragraph\">As with the other methods, it is possible and encouraged to <strong>adapt this framework depending on the use case<\/strong>: for example, one can also compare logits and activations from <strong>intermediate layers<\/strong> in the network between the student and teacher model, instead of only comparing the final outputs.<\/p>\n<h3 class=\"wp-block-heading\">Knowledge distillation in practice<\/h3>\n<p class=\"wp-block-paragraph\">Similar to the previous techniques, there are two options:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Offline 
distillation<\/strong>: the pre-trained teacher model is fixed, and a separate student model is trained to mimic it. Both models are completely separate, and the <strong>teacher\u2019s weights remain frozen<\/strong> during the distillation process.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Online distillation<\/strong>: both models are trained <strong>simultaneously<\/strong>, with knowledge transfer happening during the joint training process.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">And below, an easy way to apply offline distillation (the last code block of this article <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f642.png?ssl=1\" alt=\"\ud83d\ude42\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\">):<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch\nimport torch.nn.functional as F\n\ndef distillation_loss_fn(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):\n    # Standard Cross-Entropy loss with hard labels\n    student_loss = F.cross_entropy(student_logits, labels)\n\n    # Distillation loss with soft targets (KL Divergence)\n    soft_teacher_probs = F.softmax(teacher_logits \/ temperature, dim=-1)\n    soft_student_log_probs = F.log_softmax(student_logits \/ temperature, dim=-1)\n\n    # kl_div expects log probabilities as input for the first argument!\n    distill_loss = F.kl_div(\n        soft_student_log_probs,\n        soft_teacher_probs.detach(), # don't calculate gradients for teacher\n        reduction='batchmean'\n    ) * (temperature ** 2) # optional, a scaling factor\n\n    # Combine losses according to formula\n    total_loss = alpha * student_loss + (1 - alpha) * distill_loss\n    return total_loss\n\nteacher_model.eval()\nstudent_model.train()\n\n# Teacher forward pass: no gradients needed\nwith torch.no_grad():\n    teacher_logits = teacher_model(inputs)\n\n# Student forward and backward pass: must run outside no_grad so gradients are tracked\nstudent_logits = student_model(inputs)\nloss = distillation_loss_fn(student_logits, teacher_logits, labels, temperature=T, alpha=alpha)\noptimizer.zero_grad()\nloss.backward()\noptimizer.step()<\/code><\/pre>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">Thanks for reading this article! In the era of LLMs, with billions or even trillions of parameters, model compression has become a fundamental concept, essential in almost every scenario to make models more efficient and easily deployable.<\/p>\n<p class=\"wp-block-paragraph\">But as we have seen, model compression isn\u2019t just about reducing the model size \u2013 it\u2019s about making thoughtful <strong>design decisions<\/strong>. Whether choosing between online and offline methods, compressing the entire network, or targeting specific layers or channels, each choice <strong>significantly impacts performance and usability<\/strong>. Many deployed models now <strong>combine<\/strong> several of these techniques (check out <a href=\"https:\/\/huggingface.co\/RedHatAI\/DeepSeek-R1-Distill-Llama-8B-quantized.w4a16\" data-type=\"link\" data-id=\"https:\/\/huggingface.co\/RedHatAI\/DeepSeek-R1-Distill-Llama-8B-quantized.w4a16\">this model<\/a>, for instance). 
<\/p>\n<p class=\"wp-block-paragraph\">Beyond introducing you to the main methods, I hope this article also inspires you to <strong>experiment and develop your own creative solutions<\/strong>!<\/p>\n<p class=\"wp-block-paragraph\">Don\u2019t forget to check out the <a href=\"https:\/\/github.com\/maxime7770\/Model-Compression\">GitHub repository<\/a>, where you\u2019ll find all the code snippets <strong>and a side-by-side comparison of the four compression methods<\/strong> discussed in this article.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">References<\/h2>\n<ul start=\"1\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">[1] <strong>Hu, E., et al.<\/strong> (2021). <a href=\"https:\/\/arxiv.org\/pdf\/2106.09685\">Low-rank Adaptation of Large Language Models<\/a>. <em>arXiv preprint arXiv:2106.09685<\/em>.<\/li>\n<li class=\"wp-block-list-item\">[2] <strong>Lightning AI<\/strong>. <a href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/accelerating-large-language-models-with-mixed-precision-techniques\/\">Accelerating Large Language Models with Mixed Precision Techniques<\/a>. 
<em>Lightning AI Blog<\/em>.<\/li>\n<li class=\"wp-block-list-item\">[3] <strong>TensorFlow Blog<\/strong>. <a href=\"https:\/\/blog.tensorflow.org\/2019\/05\/tf-model-optimization-toolkit-pruning-API.html\">Pruning API in TensorFlow Model Optimization Toolkit<\/a>. <em>TensorFlow Blog<\/em>, May 2019.<\/li>\n<li class=\"wp-block-list-item\">[4] <strong>Towards AI<\/strong>. <a href=\"https:\/\/pub.towardsai.net\/a-gentle-introduction-to-knowledge-distillation-6240bf8eb8ea\">A Gentle Introduction to Knowledge Distillation<\/a>. <em>Towards AI<\/em>, Aug 2022.<\/li>\n<li class=\"wp-block-list-item\">[5] <strong>Ju, A.<\/strong> <a href=\"https:\/\/www.linkedin.com\/pulse\/ml-algorithm-singular-value-decomposition-angela-ju\/\">ML Algorithm: Singular Value Decomposition (SVD)<\/a>. <em>LinkedIn Pulse<\/em>.<\/li>\n<li class=\"wp-block-list-item\">[6] <strong>Algorithmic Simplicity<\/strong>. <a href=\"https:\/\/www.youtube.com\/watch?v=UKcWu1l_UNw\">THIS is why large language models can understand the world<\/a>. <em>YouTube<\/em>, Apr 2023.<\/li>\n<li class=\"wp-block-list-item\">[7] <strong>Frankle, J., &amp; Carbin, M.<\/strong> (2019). <a href=\"https:\/\/arxiv.org\/abs\/1803.03635\">The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks<\/a>. 
<em>arXiv preprint arXiv:1803.03635<\/em>.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/model-compression-make-your-machine-learning-models-lighter-and-faster\/\">Model Compression: Make Your Machine Learning Models Lighter and Faster<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Maxime Wolf<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/model-compression-make-your-machine-learning-models-lighter-and-faster\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Model Compression: Make Your Machine Learning Models Lighter and Faster Introduction Whether you\u2019re preparing for interviews or building Machine Learning systems at your job, model compression has become a must-have skill. In the era of LLMs, where models are getting larger and larger, the challenges around compressing these models to make them more efficient, smaller, 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,83,67,2621,70,2622,1574],"tags":[103,1556,2330],"class_list":["post-3694","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-science","category-deep-dives","category-knowledge-distillation","category-machine-learning","category-model-compression","category-quantization","tag-model","tag-pruning","tag-weights"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3694"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3694"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3694\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3694"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3694"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3694"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}