Advice on processing ~1M jobs/month with LLaMA for cost savings
I’m using GPT-4o-mini to process ~1 million jobs/month. It’s doing things like deduplication, classification, title normalization, and enrichment.
This setup is fast and easy, but the cost is starting to hurt. I’m considering distilling this pipeline into an open-source LLM, like LLaMA 3 or Mistral, to reduce inference costs, most likely self-hosted on GPUs on Google Cloud.
Questions:
* Has anyone done a similar migration? What were your real-world cost savings (e.g., from GPT-4o to self-hosted LLaMA/Mistral)?
* Any recommended distillation workflows? I’d be fine using GPT-4o outputs as training data to fine-tune an open model on our own tasks.
* Are there best practices for reducing inference costs even further (e.g., batching, quantization, routing tasks through smaller models first)?
* Is anyone running LLM inference on consumer GPUs for light-to-medium workloads successfully?
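To make the routing idea from the third question concrete, here is a minimal sketch of what I have in mind: a small model answers first, and only low-confidence jobs are escalated to a larger one. Everything here is an assumption for illustration: the `Model` signature, the 0.8 threshold, and the two toy stand-in models are hypothetical, not a real API or a tuned setup.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: a model takes job text and returns
# (label, confidence in [0, 1]). Real models would need a wrapper.
Model = Callable[[str], Tuple[str, float]]


@dataclass
class CascadeResult:
    label: str
    used_large_model: bool


def classify_with_cascade(
    text: str,
    small_model: Model,
    large_model: Model,
    threshold: float = 0.8,  # assumed value; would need tuning on real data
) -> CascadeResult:
    """Try the cheap model first; escalate only when it is unsure."""
    label, confidence = small_model(text)
    if confidence >= threshold:
        return CascadeResult(label, used_large_model=False)
    label, _ = large_model(text)
    return CascadeResult(label, used_large_model=True)


def escalation_rate(results: List[CascadeResult]) -> float:
    """Fraction of jobs that still needed the expensive model."""
    if not results:
        return 0.0
    return sum(r.used_large_model for r in results) / len(results)


# Toy stand-ins so the sketch runs without any model or network access.
def toy_small_model(text: str) -> Tuple[str, float]:
    if "engineer" in text.lower():
        return "engineering", 0.95
    return "other", 0.40


def toy_large_model(text: str) -> Tuple[str, float]:
    if "nurse" in text.lower():
        return "healthcare", 0.99
    return "other", 0.99


jobs = [
    "Senior Software Engineer",
    "Registered Nurse",
    "Data Engineer",
    "Office Manager",
]
results = [classify_with_cascade(j, toy_small_model, toy_large_model) for j in jobs]
print([r.label for r in results], escalation_rate(results))
```

The escalation rate is the number I would watch: the lower it is, the more of the monthly volume stays on the cheap model.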
Right now, our GPT-4o-mini usage is costing me thousands/month (I’m paying for it out of pocket, no investors). Would love to hear what’s worked for others!
submitted by /u/hamed_n