Tag: llama

Advice on processing ~1M jobs/month with LLaMA for cost savings

I’m using GPT-4o-mini to process ~1 million jobs/month. It’s doing things like deduplication, classification, title normalization, and enrichment. This setup is fast and easy, but the cost is starting to hurt. I’m considering distilling this pipeline into an open-source LLM, like LLaMA 3 or Mistral,…

June 2, 2025
llama.cpp: Writing A Simple C++ Inference Program for GGUF LLM Models

Exploring llama.cpp internals and a basic chat program flow. llama.cpp has reshaped LLM inference through its wide adoption and simplicity. It has enabled enterprises and individual developers to deploy LLMs on devices ranging from SBCs…

January 14, 2025
Linearizing Llama

Speeding up Llama: A hybrid approach to attention mechanisms. In this article, we will see how to replace softmax self-attention in Llama-3.2-1B with hybrid attention combining softmax sliding window and linear attention. This implementation will help us better understand the growing interest in linear attention…

January 11, 2025
Chat with Your Images using Multimodal LLMs

Chat with Your Images Using Llama 3.2-Vision Multimodal LLMs. Learn how to build Llama 3.2-Vision locally in a chat-like mode, and explore its multimodal skills in a Colab notebook. The integration of vision capabilities with Large Language Models (LLMs) is revolutionizing…

December 6, 2024