Tag: llama

  • Advice on processing ~1M jobs/month with LLaMA for cost savings

    Advice on processing ~1M jobs/month with LLaMA for cost savings I’m using GPT-4o-mini to process ~1 million jobs/month. It’s doing things like deduplication, classification, title normalization, and enrichment. This setup is fast and easy, but the cost is starting to hurt. I’m considering distilling this pipeline into an open-source LLM, like LLaMA 3 or Mistral,…

  • llama.cpp: Writing A Simple C++ Inference Program for GGUF LLM Models

    llama.cpp: Writing A Simple C++ Inference Program for GGUF LLM Models Exploring llama.cpp internals and a basic chat program flow Photo by Mathew Schwartz on Unsplash llama.cpp has revolutionized the space of LLM inference by the means of wide adoption and simplicity. It has enabled enterprises and individual developers to deploy LLMs on devices ranging from SBCs…

  • Linearizing Llama

    Linearizing Llama Speeding up Llama: A hybrid approach to attention mechanisms Source: Image by Author (Generated using Gemini 1.5 Flash) In this article, we will see how to replace softmax self-attention in Llama-3.2-1B with hybrid attention combining softmax sliding window and linear attention. This implementation will help us better understand the growing interest in linear attention…

  • Chat with Your Images using Multimodal LLMs

    Chat with Your Images using Multimodal LLMs Chat with Your Images Using Llama 3.2-Vision Multimodal LLMs Learn how to build Llama 3.2-Vision locally in a chat-like mode, and explore its Multimodal skills on a Colab notebook Annotated image by author. Original image by Pixabay. Introduction The integration of vision capabilities with Large Language Models (LLMs) is revolutionizing…