Advice on processing ~1M jobs/month with LLaMA for cost savings

I’m using GPT-4o-mini to process ~1 million jobs/month. It’s doing things like deduplication, classification, title normalization, and enrichment.

This setup is fast and easy, but the cost is starting to hurt. I’m considering distilling this pipeline into an open-source LLM, like LLaMA 3 or Mistral, to reduce inference costs, most likely self-hosted on GPUs on Google Cloud.
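To frame the comparison, here is a rough back-of-the-envelope sketch of API cost versus self-hosted GPU cost. Every number in it is a hypothetical placeholder, not a real price or a measured throughput, and the function names are made up for illustration. It also assumes batch processing where the GPU is only paid for while it is running.

```python
def api_cost_per_month(jobs, in_tokens_per_job, out_tokens_per_job,
                       usd_per_m_in, usd_per_m_out):
    """Monthly API spend given per-job token counts and per-million-token prices."""
    in_cost = jobs * in_tokens_per_job / 1_000_000 * usd_per_m_in
    out_cost = jobs * out_tokens_per_job / 1_000_000 * usd_per_m_out
    return in_cost + out_cost


def gpu_cost_per_month(jobs, jobs_per_gpu_hour, usd_per_gpu_hour):
    """Monthly GPU spend, assuming the GPU is billed only for hours actually used."""
    gpu_hours = jobs / jobs_per_gpu_hour
    return gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    # All values below are placeholders; substitute your own measurements.
    api = api_cost_per_month(
        jobs=1_000_000,
        in_tokens_per_job=2000,
        out_tokens_per_job=500,
        usd_per_m_in=1.0,    # hypothetical price per 1M input tokens
        usd_per_m_out=2.0,   # hypothetical price per 1M output tokens
    )
    gpu = gpu_cost_per_month(
        jobs=1_000_000,
        jobs_per_gpu_hour=5000,  # hypothetical throughput of the distilled model
        usd_per_gpu_hour=1.5,    # hypothetical hourly GPU rate
    )
    print(f"API: ${api:,.2f}/month, GPU: ${gpu:,.2f}/month")
```

The sketch leaves out engineering time, fine-tuning runs, idle GPU hours, and quality regressions, all of which can shift the break-even point.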

Questions:

* Has anyone done a similar migration? What were your real-world cost savings (e.g., from GPT-4o to self-hosted LLaMA/Mistral)?

* Any recommended distillation workflows? I’d be fine using GPT-4o outputs as training data to fine-tune an open model on our own tasks.

* Are there best practices for reducing inference costs even further (e.g., batching, quantization, routing tasks through smaller models first)?

* Is anyone running LLM inference on consumer GPUs for light-to-medium workloads successfully?
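To make the "routing tasks through smaller models first" idea concrete, here is a minimal sketch of a confidence-based cascade. It assumes each model can be wrapped as a function returning a label and a confidence score. The two stub models, the `route` helper, and the 0.8 threshold are all hypothetical stand-ins, not part of any real library.

```python
from typing import Callable, Tuple

# A classifier takes text and returns (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def route(text: str, small_model: Classifier, large_model: Classifier,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when it is not confident.

    Returns (label, name of the tier that produced it).
    """
    label, confidence = small_model(text)
    if confidence >= threshold:
        return label, "small"
    label, _ = large_model(text)
    return label, "large"


# Hypothetical stubs standing in for real model calls.
def small_stub(text: str) -> Tuple[str, float]:
    # Pretend the small model is only confident on short, simple titles.
    if len(text.split()) <= 3:
        return "engineering", 0.95
    return "unknown", 0.40


def large_stub(text: str) -> Tuple[str, float]:
    return "engineering", 0.99


if __name__ == "__main__":
    for title in ["Software Engineer", "Sr. Staff SWE II (Platform, Remote, Contract)"]:
        label, tier = route(title, small_stub, large_stub)
        print(f"{title!r} -> {label} (answered by {tier} model)")
```

The savings depend on what fraction of jobs the small tier can answer confidently, so it is worth measuring that fraction on a labeled sample before committing to the design.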

Right now, our GPT-4o-mini usage is costing me thousands/month (I’m paying for it out of pocket, no investors). Would love to hear what’s worked for others!

submitted by /u/hamed_n