{"id":2407,"date":"2025-03-14T07:02:20","date_gmt":"2025-03-14T07:02:20","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/03\/14\/are-you-still-using-lora-to-fine-tune-your-llm\/"},"modified":"2025-03-14T07:02:20","modified_gmt":"2025-03-14T07:02:20","slug":"are-you-still-using-lora-to-fine-tune-your-llm","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/03\/14\/are-you-still-using-lora-to-fine-tune-your-llm\/","title":{"rendered":"Are You Still Using LoRA to Fine-Tune Your LLM?"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">LoRA (<a href=\"https:\/\/arxiv.org\/abs\/2106.09685\">Low Rank Adaptation \u2013 arxiv.org\/abs\/2106.09685<\/a>) is a popular technique for fine-tuning Large Language Models (LLMs) on the cheap. But 2024 has seen an explosion of new parameter-efficient fine-tuning techniques, an alphabet soup of LoRA alternatives: SVF, SVFT, MiLoRA, PiSSA, LoRA-XS <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f92f.png?ssl=1\" alt=\"\ud83e\udd2f\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\">\u2026 And most are based on a matrix technique I like a lot: the SVD (Singular Value Decomposition). Let\u2019s dive in.<\/p>\n<h2 class=\"wp-block-heading\">LoRA<\/h2>\n<p class=\"wp-block-paragraph\">The original <a href=\"https:\/\/towardsdatascience.com\/tag\/lora\/\" title=\"Lora\">LoRA<\/a> insight is that fine-tuning all the weights of a model is overkill. Instead, LoRA freezes the model and only trains a small pair of low-rank \u201cadapter\u201d matrices. 
See the illustrations below (where W is any matrix of weights in a transformer LLM).<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-1.jpg?ssl=1\" alt=\"\" class=\"wp-image-599597\"><\/figure>\n<p class=\"wp-block-paragraph\">This saves memory and compute cycles since far fewer gradients have to be computed and stored. For example, <a href=\"https:\/\/colab.research.google.com\/drive\/14X1spciHDxCYd5c_L8ZtRLrwbU0Hh9y9?usp=sharing\">here is a Gemma 8B model<\/a> fine-tuned to speak like a pirate using LoRA: only 22M parameters are trainable, 8.5B parameters remain frozen.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"610\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-2-1024x610.png?resize=1024%2C610&#038;ssl=1\" alt=\"\" class=\"wp-image-599598\"><\/figure>\n<p class=\"wp-block-paragraph\">LoRA is very popular. It has even made it as a single-line API into mainstream ML frameworks like Keras:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">gemma.backbone.enable_lora(rank=8)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">But is LoRA the best? Researchers have been trying hard to improve on the formula. Indeed, there are many ways of selecting smaller \u201cadapter\u201d matrices. And since most of them make clever use of the singular value decomposition (SVD) of a matrix, let\u2019s pause for a bit of <a href=\"https:\/\/towardsdatascience.com\/tag\/math\/\" title=\"Math\">Math<\/a>.<\/p>\n<h2 class=\"wp-block-heading\">SVD: the simple math<\/h2>\n<p class=\"wp-block-paragraph\">The SVD is a great tool for understanding the structure of matrices. 
The technique splits a matrix into three: W = USV<sup>T<\/sup> where U and V are orthogonal (i.e., changes of basis), and S is the diagonal matrix of sorted singular values. This decomposition always exists.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-3.jpg?ssl=1\" alt=\"\" class=\"wp-image-599599\"><\/figure>\n<p class=\"wp-block-paragraph\">In \u201ctextbook\u201d SVD, U and V are square, while S is a rectangle with singular values on the diagonal and a tail of zeros. In practice, you can work with a square S and a rectangular U or V \u2013 see the picture \u2013 the chopped-off pieces are just multiplications by zero. This \u201ceconomy-sized\u201d SVD is available in common libraries, for example, <a href=\"https:\/\/numpy.org\/doc\/2.2\/reference\/generated\/numpy.linalg.svd.html\">numpy.linalg.svd<\/a> with full_matrices=False.<\/p>\n<p class=\"wp-block-paragraph\">So how can we use this to more efficiently select the weights to train? Let\u2019s quickly go through five recent SVD-based low-rank fine-tuning techniques, with commented illustrations.<\/p>\n<h2 class=\"wp-block-heading\">SVF<\/h2>\n<p class=\"wp-block-paragraph\">The simplest alternative to LoRA is to use the SVD on the model\u2019s weight matrices and then fine-tune the singular values directly. Oddly, this is the most recent technique, called SVF, published in the Transformers\u00b2 paper (<a href=\"https:\/\/arxiv.org\/abs\/2501.06252v2\">arxiv.org\/abs\/2501.06252v2<\/a>).<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-4.jpg?ssl=1\" alt=\"\" class=\"wp-image-599600\"><\/figure>\n<p class=\"wp-block-paragraph\">SVF is much more economical in parameters than LoRA. And as a bonus, it makes tuned models composable. 
For more info on that, see my Transformers\u00b2 explainer <a href=\"https:\/\/bsky.app\/profile\/martin-gorner.bsky.social\/post\/3lhu5lkrqvd2s\">here,<\/a> but composing two SVF fine-tuned models is just an addition:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"487\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-5-1024x487.png?resize=1024%2C487&#038;ssl=1\" alt=\"\" class=\"wp-image-599601\"><\/figure>\n<h2 class=\"wp-block-heading\">SVFT<\/h2>\n<p class=\"wp-block-paragraph\">Should you need more trainable parameters, the SVFT paper (<a href=\"https:\/\/arxiv.org\/abs\/2405.19597\">arxiv.org\/abs\/2405.19597<\/a>) explores multiple ways of doing that, starting by adding more trainable weights on the diagonal.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-6.jpg?ssl=1\" alt=\"\" class=\"wp-image-599602\"><\/figure>\n<p class=\"wp-block-paragraph\">It also evaluates multiple alternatives like spreading them randomly through the \u201cM\u201d matrix.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-7.jpg?ssl=1\" alt=\"\" class=\"wp-image-599603\"><\/figure>\n<p class=\"wp-block-paragraph\">More importantly, the SVFT paper confirms that having more trainable values than just the diagonal is useful. 
See their fine-tuning results below.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-8.jpg?ssl=1\" alt=\"\" class=\"wp-image-599604\"><\/figure>\n<p class=\"wp-block-paragraph\">Next come several techniques that split singular values in two sets, \u201clarge\u201d and \u201csmall\u201d. But before we proceed, let\u2019s pause for a bit more SVD math.<\/p>\n<h2 class=\"wp-block-heading\">More SVD math<\/h2>\n<p class=\"wp-block-paragraph\">The SVD is usually seen as a decomposition into three matrices W=USV<sup>T<\/sup> but it can also be thought of as a weighted sum of many rank-1 matrices, weighted by the singular values:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-9.jpg?ssl=1\" alt=\"\" class=\"wp-image-599605\"><\/figure>\n<p class=\"wp-block-paragraph\">Should you want to prove it, express individual matrix elements W<sub>jk<\/sub> using the <strong>USV<sup>T<\/sup><\/strong> form and the formula for matrix multiplication on one hand, the<br \/><strong>\u03a3 s<sub>i<\/sub>u<sub>i<\/sub>v<sub>i<\/sub><sup>T<\/sup><\/strong> form, on the other, simplify using the fact that S is diagonal and notice that it\u2019s the same thing.<\/p>\n<p class=\"wp-block-paragraph\">In this representation, it\u2019s easy to see that you can split the sum in two. 
And as you can always sort the singular values, you can make this a split between \u201clarge\u201d and \u201csmall\u201d singular values.<\/p>\n<p class=\"wp-block-paragraph\">Going back to the three-matrix form <strong>W=USV<sup>T<\/sup><\/strong>, this is what the split looks like:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-10.jpg?ssl=1\" alt=\"\" class=\"wp-image-599606\"><\/figure>\n<p class=\"wp-block-paragraph\">Based on this formula, two papers, PiSSA and MiLoRA, have explored what happens if you tune only the large singular values or only the small ones.<\/p>\n<h2 class=\"wp-block-heading\">PiSSA<\/h2>\n<p class=\"wp-block-paragraph\">PiSSA (Principal Singular values and Singular Vectors Adaptation, <a href=\"https:\/\/arxiv.org\/abs\/2404.02948\">arxiv.org\/abs\/2404.02948<\/a>) claims that you should only tune the large singular values. The mechanism is illustrated below:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-11.jpg?ssl=1\" alt=\"\" class=\"wp-image-599607\"><\/figure>\n<p class=\"wp-block-paragraph\">From the paper: \u201cPiSSA is designed to approximate full finetuning by adapting the principal singular components, which are believed to capture the essence of the weight matrices. In contrast, MiLoRA aims to adapt to new tasks while maximally retaining the base model\u2019s knowledge.\u201d<\/p>\n<p class=\"wp-block-paragraph\">The PiSSA paper also has an interesting finding: full fine-tuning is prone to over-fitting. 
You might get better absolute results with a low-rank fine-tuning technique.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-12.jpg?ssl=1\" alt=\"\" class=\"wp-image-599608\"><\/figure>\n<h2 class=\"wp-block-heading\">MiLoRA<\/h2>\n<p class=\"wp-block-paragraph\">MiLoRA (Minor singular component LoRA, <a href=\"https:\/\/arxiv.org\/abs\/2406.09044\">arxiv.org\/abs\/2406.09044<\/a>), on the other hand, claims that you should only tune the small singular values. It uses a similar mechanism to PiSSA:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-13.jpg?ssl=1\" alt=\"\" class=\"wp-image-599609\"><\/figure>\n<p class=\"wp-block-paragraph\">Surprisingly, MiLoRA seems to have the upper hand, at least when tuning on math datasets, which are probably fairly aligned with the original pre-training. Arguably, PiSSA should be better for bending the behavior of the LLM further from its pre-training.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-14.jpg?ssl=1\" alt=\"\" class=\"wp-image-599610\"><\/figure>\n<h2 class=\"wp-block-heading\">LoRA-XS<\/h2>\n<p class=\"wp-block-paragraph\">Finally, I\u2019d like to mention LoRA-XS (<a href=\"https:\/\/arxiv.org\/abs\/2405.17604\">arxiv.org\/abs\/2405.17604<\/a>). It is very similar to PiSSA but uses a slightly different mechanism. 
It also shows good results with significantly fewer parameters than LoRA.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-15.jpg?ssl=1\" alt=\"\" class=\"wp-image-599611\"><\/figure>\n<p class=\"wp-block-paragraph\">The paper offers a mathematical explanation of why this setup is \u201cideal\u201d under two conditions:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">that truncating the bottom singular values from the SVD still offers a good approximation of the weight matrices<\/li>\n<li class=\"wp-block-list-item\">that the fine-tuning data distribution is close to the pre-training one<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Both are questionable IMHO, so I won\u2019t detail the math. Some results:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-16.jpg?ssl=1\" alt=\"\" class=\"wp-image-599612\"><\/figure>\n<p class=\"wp-block-paragraph\">The underlying assumption seems to be that singular values come in \u201clarge\u201d and \u201csmall\u201d varieties, but is it true? I made a <a href=\"https:\/\/colab.research.google.com\/drive\/1IgzfW7l2P4aytSaNuOa7QIjIVzKg4aIQ?usp=sharing\">quick Colab<\/a> to check this on Gemma2 9B. 
Bottom line: 99% of the singular values are in the 0.1 \u2013 1.1 range.\u00a0 I\u2019m not sure partitioning them into \u201clarge\u201d and \u201csmall\u201d makes that much sense.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/LoRA-17.jpg?ssl=1\" alt=\"\" class=\"wp-image-599613\"><\/figure>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">There are many more parameter-efficient fine-tuning techniques. Worth mentioning:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">DoRA (<a href=\"https:\/\/arxiv.org\/abs\/2402.09353\">arxiv.org\/abs\/2402.09353<\/a>) which splits weights into magnitudes and directions then tunes those.<\/li>\n<li class=\"wp-block-list-item\">AdaLoRA (<a href=\"https:\/\/arxiv.org\/abs\/2303.10512\">arxiv.org\/abs\/2303.10512<\/a>) with a complex mechanism for finding the best tuning rank for a given budget of trainable weights.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">My conclusion: to go beyond the LoRA standard with 10x fewer params, I like the simplicity of Transformers\u00b2\u2019s SVF. And if you need more trainable weights, SVFT is an easy extension. Both use all singular values (full rank, no singular value pruning) and are still cheap <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f601.png?ssl=1\" alt=\"\ud83d\ude01\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\">. 
Happy tuning!<\/p>\n<p class=\"wp-block-paragraph\">Note: All illustrations are either created by the author or extracted from arxiv.org papers for comment and discussion purposes.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/are-you-still-using-lora-to-fine-tune-your-llm\/\">Are You Still Using LoRA to Fine-Tune Your LLM?<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Martin G\u00f6rner<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Are You Still Using LoRA to Fine-Tune Your LLM? LoRA (Low Rank Adaptation \u2013 arxiv.org\/abs\/2106.09685) is a popular technique for fine-tuning Large Language Models (LLMs) on the cheap. But 2024 has seen an explosion of new parameter-efficient fine-tuning techniques, an alphabet soup of LoRA alternatives: SVF, SVFT, MiLoRA, PiSSA, LoRA-XS \u2026 And most are based 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1112,71,87,781,2016,229],"tags":[135,2017,2018],"class_list":["post-2407","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-fine-tuning","category-large-language-models","category-llm","category-lora","category-machine-leaning","category-math","tag-fine","tag-lora","tag-svd"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2407"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2407"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2407\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2407"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2407"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2407"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}