{"id":4272,"date":"2025-06-02T07:02:22","date_gmt":"2025-06-02T07:02:22","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/06\/02\/advice_on_processing_1m_jobsmonth_with_llama_for\/"},"modified":"2025-06-02T07:02:22","modified_gmt":"2025-06-02T07:02:22","slug":"advice_on_processing_1m_jobsmonth_with_llama_for","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/06\/02\/advice_on_processing_1m_jobsmonth_with_llama_for\/","title":{"rendered":"Advice on processing ~1M jobs\/month with LLaMA for cost savings"},"content":{"rendered":"<p>Advice on processing ~1M jobs\/month with LLaMA for cost savings<\/p>\n<div>\n<!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I&#8217;m using GPT-4o-mini to process ~1 million jobs\/month. It&#8217;s doing things like deduplication, classification, title normalization, and enrichment.<\/p>\n<p>This setup is fast and easy, but the cost is starting to hurt. I&#8217;m considering distilling this pipeline into an open-source LLM, like LLaMA 3 or Mistral, to reduce inference costs, most likely self-hosted on GPUs on Google Cloud.<\/p>\n<p>Questions:<\/p>\n<p>* Has anyone done a similar migration? What were your real-world cost savings (e.g., from GPT-4o to self-hosted LLaMA\/Mistral)?<\/p>\n<p>* Any recommended distillation workflows? I&#8217;d be fine using GPT-4o outputs to fine-tune an open model on our own tasks.<\/p>\n<p>* Are there best practices for reducing inference costs even further (e.g., batching, quantization, routing tasks through smaller models first)?<\/p>\n<p>* Is anyone running LLM inference on consumer GPUs for light-to-medium workloads successfully?<\/p>\n<p>Right now, our GPT-4o-mini usage is costing me thousands\/month (I&#8217;m paying for it out of pocket, no investors). 
Would love to hear what\u2019s worked for others!<\/p>\n<\/p><\/div>\n<p><!-- SC_ON --> submitted by <a href=\"https:\/\/www.reddit.com\/user\/hamed_n\">\/u\/hamed_n<\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1l0y4zo\/advice_on_processing_1m_jobsmonth_with_llama_for\/\">[link]<\/a><\/span>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Advice on processing ~1M jobs\/month with LLaMA for cost savings I&#8217;m using GPT-4o-mini to process ~1 million jobs\/month. It&#8217;s doing things like deduplication, classification, title normalization, and enrichment. This setup is fast and easy, but the cost is starting to hurt. 
I&#8217;m considering distilling this pipeline into an open-source LLM, like LLaMA 3 or Mistral, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,99],"tags":[1659,474,2826],"class_list":["post-4272","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-datascience","tag-cost","tag-llama","tag-month"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/4272"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=4272"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/4272\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=4272"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=4272"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=4272"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}