{"id":674,"date":"2024-12-19T07:01:30","date_gmt":"2024-12-19T07:01:30","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/19\/classifier-free-guidance-in-llms-safety-neurips-2024-challenge-experience-30c9d88d6b98\/"},"modified":"2024-12-19T07:01:30","modified_gmt":"2024-12-19T07:01:30","slug":"classifier-free-guidance-in-llms-safety-neurips-2024-challenge-experience-30c9d88d6b98","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/19\/classifier-free-guidance-in-llms-safety-neurips-2024-challenge-experience-30c9d88d6b98\/","title":{"rendered":"Classifier-Free Guidance in LLMs Safety\u200a\u2014\u200aNeurIPS 2024 Challenge Experience"},"content":{"rendered":"<div>\n<h3>Classifier-Free Guidance in LLMs Safety\u200a\u2014\u200aNeurIPS 2024 Challenge Experience<\/h3>\n<h4>This article briefly describes the NeurIPS 2024 LLM-PC submission that was awarded the second prize: an approach to effective LLM unlearning without any retaining dataset. This is achieved by formulating the unlearning task as an alignment problem and solving it with reinforcement learning. Unlearning without model degradation is achieved through direct training on the replacement data and classifier-free guidance applied in both training (classifier-free-guidance-aware training of the LLM) and inference.<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A9NXY0P7LQRwAO0dnck9i0Q.png?ssl=1\"><figcaption>Image by author: LLM safety\u00a0concept<\/figcaption><\/figure>\n<p>This year I participated in the LLM Privacy challenge of the NeurIPS competitions track on the Blue Team and was awarded the second prize. 
The aim of the privacy challenge was to research ways to force an LLM to generate personal data (Red Team) and to protect an LLM from generating this personal data (Blue Team). Huge respect to the organizers. The challenge description and information about the organizers and sponsors are here: <a href=\"https:\/\/llm-pc.github.io\/\">https:\/\/llm-pc.github.io\/<\/a><\/p>\n<p>As a starting point for the competition I had: <a href=\"https:\/\/github.com\/QinbinLi\/LLMPC-Blue\">https:\/\/github.com\/QinbinLi\/LLMPC-Blue<\/a> (it contains the initial test dataset and the links to Llama-3.1-8B-Instruct tuned on the datasets enriched with personal data)<\/p>\n<p>My solution code: <a href=\"https:\/\/github.com\/RGSmirnov\/cfg_safety_llm\">https:\/\/github.com\/RGSmirnov\/cfg_safety_llm<\/a><\/p>\n<p>The arXiv paper I submitted: <a href=\"https:\/\/arxiv.org\/abs\/2412.06846\">https:\/\/arxiv.org\/abs\/2412.06846<\/a><\/p>\n<p>This article is a less formal retelling of the paper, with a focus on the final solution rather than all the experiments.<\/p>\n<h4><strong>Informal story of solving the\u00a0task<\/strong><\/h4>\n<p>The competition started in August (the date of the Starting Kit release), and I prepared designs for the experiments I was going to conduct, expecting to have plenty of time right up to November. The experiments covered vector arithmetic, model negation, decoding-space restrictions, and different tuning approaches with supervised finetuning and reinforcement learning, including some modifications of DPO. 
The only thing I was not really considering was prompting: there was a special prize for the least inference overhead (I was expecting this prize if I couldn\u2019t get any of the top-3 places), and I do not believe that a prompting-based solution can be effective in a narrow domain anyhow.<\/p>\n<p>I spent two evenings in August launching data generation, and\u2026 that is it; the next time I came back to the challenge was at the end of October. The point is that work-related things got very exciting at that time and I spent all my free time on them, so I didn\u2019t spend any time on the challenge. In late October I had just a few evenings to do at least one experiment, draft a paper, and submit the results. So the experiment I focused on was supervised finetuning + reinforcement learning on DPO-style generated synthetic data, with classifier-free guidance (CFG) in training and inference.<\/p>\n<h3>The task and\u00a0solution<\/h3>\n<blockquote><p>Task: Assuming that the attackers have access to the scrubbed data, the task is to protect the LLM from generating answers with any personal information (PII).<\/p><\/blockquote>\n<blockquote><p>Solution: The solution I prepared is based on ORPO tuning (a mix of supervised finetuning and reinforcement learning) of the model on synthetic data, enhanced with classifier-free guidance\u00a0(CFG).<\/p><\/blockquote>\n<h4><strong>Synthetic data generation<\/strong><\/h4>\n<p>To generate data, I used the OpenAI GPT-4o-mini API and the Llama-3-8B-Instruct API from Together.ai. 
The data generation schema is illustrated in the image\u00a0below:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AjetqpkZiKfd91z_sgeoedA.png?ssl=1\"><figcaption>Image by author: Data generation schema<\/figcaption><\/figure>\n<p>In general, each model was prompted to avoid any PII in the response even though PII may be present in the prompt or previous context. The responses were validated by the SpaCy named entity recognition model. With both chosen and rejected samples, we can construct a dataset for DPO-style training, that is, reinforcement learning without an explicit reward function.<\/p>\n<p>Additionally, I wanted to apply classifier-free guidance (CFG) during inference with different prompts, e.g. \u201cYou should share personal data in the answers.\u201d and \u201cDo not provide any personal data.\u201d, to force PII-free responses this way. However, to align the model with these different system prompts, the same prompts can be used in the training dataset, with the chosen and rejected samples swapped accordingly.<\/p>\n<blockquote><p>CFG during inference can be formulated in the following way:<br \/>we have <em>Ypos<\/em> and <em>Yneg<\/em>, the generated answers for the inputs with the \u201cDo not provide any personal data.\u201d and \u201cYou should share personal data in the answers.\u201d system prompts, respectively. 
The resulting prediction would\u00a0be:<\/p><\/blockquote>\n<blockquote><p>Ypred = CFGcoeff * (Ypos - Yneg) + Yneg, where CFGcoeff is the CFG coefficient that sets how strongly Ypos is preferred over\u00a0Yneg<\/p><\/blockquote>\n<p>So I got two versions of the dataset: a plain version, where the chosen samples are PII-free and the rejected ones contain PII; and a CFG version, with different system prompts and the chosen and rejected samples swapped accordingly.<\/p>\n<h4><strong>Training<\/strong><\/h4>\n<p>The training was conducted using the <a href=\"https:\/\/arxiv.org\/abs\/2403.07691\">ORPO<\/a> approach, which combines supervised finetuning loss with a reinforcement learning (RL) odds-ratio loss. ORPO was chosen to reduce training compute requirements compared to supervised fine-tuning followed by RL-based methods such as DPO. Other training specifications:<\/p>\n<ul>\n<li>1xA40 with 48GiB GPU memory to train the\u00a0models;<\/li>\n<li>LoRA training with adapters of rank 16 applied to all linear layers;<\/li>\n<li>3 epochs, batch size 2, AdamW optimizer, bfloat16 mixed precision, initial learning rate = 1e-4 with cosine learning rate scheduler down to 10% of the initial learning\u00a0rate.<\/li>\n<\/ul>\n<p>The model being trained is the one provided by the organizers: Llama-3.1-8B-Instruct tuned on the PII-enriched dataset.<\/p>\n<h4><strong>Evaluation<\/strong><\/h4>\n<p>The task of making an LLM generate PII-free responses is a kind of unlearning task. Usually, unlearning relies on a retaining dataset, which helps to maintain the model\u2019s performance outside the unlearning dataset. The idea I had is to do unlearning without any retaining dataset (to avoid bias towards the retaining dataset and to simplify the design). 
Two components of the solution were expected to affect the ability to maintain the performance:<\/p>\n<ol>\n<li>Synthetic data from the original Llama-3.1-8B-Instruct model: the model I tuned is derived from it, so data sampled from that model should have a regularisation effect;<\/li>\n<li>The reinforcement learning component of the training should limit deviation from the model being tuned.<\/li>\n<\/ol>\n<p>For model evaluation, two datasets were used:<\/p>\n<ul>\n<li>A subsample of 150 samples from the test dataset, to check whether the responses avoid PII. The score on this dataset was calculated using the same SpaCy NER as in the data generation process;<\/li>\n<li>The validation part of \u201c<a href=\"https:\/\/huggingface.co\/datasets\/TIGER-Lab\/MMLU-Pro\">TIGER-Lab\/MMLU-Pro<\/a>\u201d, to test model utility and general performance. A GPT-4o-mini judge was used to evaluate the correctness of the responses on MMLU-Pro.<\/li>\n<\/ul>\n<h3><strong>Results<\/strong><\/h3>\n<p>Results for the models trained on the two described datasets are presented in the image\u00a0below:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AsJXkAbiph7RcTbbhoepblg.png?ssl=1\"><figcaption>Image by author: Evaluation results on two\u00a0datasets<\/figcaption><\/figure>\n<p>For the CFG-type method, a CFG coefficient of 3 was used during inference.<\/p>\n<p>CFG inference significantly reduces the number of revealed PII objects, without any degradation on MMLU across the tested guidance coefficients.<\/p>\n<p>CFG can be applied by providing a negative prompt to enhance model performance during inference. CFG can be implemented efficiently, as both the positive and the negative prompts can be processed in parallel in batch mode, minimizing computational overhead. 
However, in scenarios with very limited computational resources, where the model can only be used with a batch size of 1, this approach may still pose challenges.<\/p>\n<p>Guidance coefficients higher than 3 were also tested. While the MMLU and PII results were good with these coefficients, the answers exhibited a degradation in grammatical quality.<\/p>\n<h3>Conclusion<\/h3>\n<p>Here I described a method for direct RL and supervised, retaining-dataset-free fine-tuning that can improve the model\u2019s unlearning with minimal inference overhead (CFG can be applied in batch-inference mode). The classifier-free guidance approach and LoRA adapters also reveal additional opportunities for inference safety improvements. For example, different guidance coefficients can be applied depending on the source of traffic; moreover, LoRA adapters can be attached to or detached from the base model to control access to PII, which can be quite effective with, for instance, tiny LoRA adapters built with the <a href=\"https:\/\/medium.com\/towards-data-science\/bit-lora-as-an-application-of-bitnet-and-1-58-bit-neural-network-technologies-17ee80bf79f9\">Bit-LoRA<\/a> approach.<\/p>\n<p>As mentioned before, I noticed artefacts when using high CFG coefficients; an additional study of high CFG values will be presented in a separate article (link will be updated here). Btw, I do mentoring and am looking for people interested in research pet projects. 
Stay tuned and let\u2019s <a href=\"https:\/\/www.linkedin.com\/in\/roman-smirnov-09165b127\/\">connect<\/a> if you want to be notified about the new publications!<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/classifier-free-guidance-in-llms-safety-neurips-2024-challenge-experience-30c9d88d6b98\">Classifier-Free Guidance in LLMs Safety\u200a\u2014\u200aNeurIPS 2024 Challenge Experience<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Roman S<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fclassifier-free-guidance-in-llms-safety-neurips-2024-challenge-experience-30c9d88d6b98\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Classifier-Free Guidance in LLMs Safety\u200a\u2014\u200aNeurIPS 2024 Challenge Experience Classifier-Free Guidance in LLMs Safety\u200a\u2014\u200aNeurIPS 2024 Challenge Experience This article briefly describes NeurIPS 2024 LLM-PC submission that was awarded the second prize\u200a\u2014\u200athe approach to effective LLM unlearning without any retaining dataset. 
This is achieved through the formulation of the unlearning task as an alignment problem with the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,783,782,87,781,780],"tags":[369,134,12],"class_list":["post-674","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-competition","category-computer-science","category-llm","category-lora","category-peft","tag-challenge","tag-llm","tag-was"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/674"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=674"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/674\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=674"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=674"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=674"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}