{"id":1168,"date":"2025-01-14T07:02:35","date_gmt":"2025-01-14T07:02:35","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/14\/llama-cpp-writing-a-simple-c-inference-program-for-gguf-llm-models-12bc5f58505f\/"},"modified":"2025-01-14T07:02:35","modified_gmt":"2025-01-14T07:02:35","slug":"llama-cpp-writing-a-simple-c-inference-program-for-gguf-llm-models-12bc5f58505f","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/14\/llama-cpp-writing-a-simple-c-inference-program-for-gguf-llm-models-12bc5f58505f\/","title":{"rendered":"llama.cpp: Writing A Simple C++ Inference Program for GGUF LLM Models"},"content":{"rendered":"<div>\n<h4>Exploring llama.cpp internals and a basic chat program\u00a0flow<\/h4>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*9lZIsENXjJ4eiEfb\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@cadop?utm_source=medium&amp;utm_medium=referral\">Mathew Schwartz<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p>llama.cpp has changed the landscape of LLM inference through its simplicity and wide adoption. It has enabled enterprises and individual developers to deploy LLMs on devices ranging from SBCs to multi-GPU clusters. Although its language bindings make llama.cpp easy to use, working directly in C\/C++ can be a better choice for performance-sensitive or resource-constrained scenarios.<\/p>\n<p>This tutorial takes a detailed look at how LLM inference is performed using the low-level functions that llama.cpp exposes directly. 
We walk through the program flow and the main llama.cpp constructs, and finish with a simple chat\u00a0session.<\/p>\n<p>The C++ code that we write in this post is also used in SmolChat, a native Android application that lets users interact with LLMs\/SLMs through a chat interface, completely on-device. Specifically, the LLMInference class we define below is used through a JNI binding to execute GGUF\u00a0models.<\/p>\n<p><a href=\"https:\/\/github.com\/shubham0204\/SmolChat-Android\">GitHub &#8211; shubham0204\/SmolChat-Android: Running any GGUF SLMs\/LLMs locally, on-device in Android<\/a><\/p>\n<p>The code for this tutorial can be found\u00a0here:<\/p>\n<p><a href=\"https:\/\/github.com\/shubham0204\/llama.cpp-simple-chat-interface\">shubham0204\/llama.cpp-simple-chat-interface<\/a><\/p>\n<p>The code is derived from the <a href=\"https:\/\/github.com\/ggerganov\/llama.cpp\/tree\/master\/examples\/simple-chat\">official simple-chat example<\/a> from llama.cpp.<\/p>\n<h3>Contents<\/h3>\n<ol>\n<li><a href=\"https:\/\/towardsdatascience.com\/#81ee\">About llama.cpp<\/a><\/li>\n<li><a href=\"https:\/\/towardsdatascience.com\/#1f3c\">Setup<\/a><\/li>\n<li><a href=\"https:\/\/towardsdatascience.com\/#fe39\">Loading the\u00a0Model<\/a><\/li>\n<li><a href=\"https:\/\/towardsdatascience.com\/#cbe1\">Performing Inference<\/a><\/li>\n<li><a href=\"https:\/\/towardsdatascience.com\/#4279\">Good Habits: Writing a Destructor<\/a><\/li>\n<li><a href=\"https:\/\/towardsdatascience.com\/#1376\">Running the Application<\/a><\/li>\n<\/ol>\n<h3>About llama.cpp<\/h3>\n<p><a href=\"https:\/\/github.com\/ggerganov\/llama.cpp\">llama.cpp<\/a> is a C\/C++ framework for running inference on machine learning models stored in the <a href=\"https:\/\/github.com\/ggerganov\/ggml\/blob\/master\/docs\/gguf.md\">GGUF format<\/a> on multiple <em>execution backends<\/em>. 
It started as a pure C\/C++ implementation of Meta\u2019s Llama series of LLMs that runs on Apple silicon, AVX\/AVX-512, CUDA, and Arm Neon-based environments. It also includes llama-cli, a command-line tool for running GGUF models, and llama-server, an OpenAI-compatible server that executes models via HTTP requests.<\/p>\n<p>llama.cpp uses <a href=\"https:\/\/github.com\/ggerganov\/ggml\">ggml<\/a>, a low-level framework that provides the primitive functions required by deep learning models and abstracts backend implementation details from the user. <a href=\"https:\/\/github.com\/ggerganov\">Georgi Gerganov<\/a> is the creator of ggml and llama.cpp.<\/p>\n<p>The <a href=\"https:\/\/github.com\/ggerganov\/llama.cpp#description\">repository\u2019s README<\/a> also lists wrappers built on top of llama.cpp in other programming languages. Popular tools like <a href=\"https:\/\/github.com\/ollama\/ollama\">Ollama<\/a> and <a href=\"https:\/\/github.com\/lmstudio-ai\">LM Studio<\/a> also build on llama.cpp to offer a friendlier user experience. 
The project has no dependencies on other third-party libraries.<\/p>\n<h4>How is llama.cpp different from PyTorch\/TensorFlow?<\/h4>\n<p>llama.cpp has <strong><em>focused on inference of ML models<\/em><\/strong> since its inception, whereas <a href=\"https:\/\/github.com\/pytorch\/pytorch\">PyTorch<\/a> and <a href=\"https:\/\/github.com\/tensorflow\/tensorflow\">TensorFlow<\/a> are end-to-end solutions offering data processing, model training\/validation, and efficient inference in one\u00a0package.<\/p>\n<blockquote><p>PyTorch and TensorFlow also have lightweight inference-only extensions, namely <a href=\"https:\/\/github.com\/pytorch\/executorch\">ExecuTorch<\/a> and <a href=\"https:\/\/github.com\/tensorflow\/tensorflow\/blob\/master\/tensorflow\/lite\/README.md\">TensorFlow Lite<\/a>.\n<\/p><\/blockquote>\n<p>Considering only the inference phase of a model, <strong><em>llama.cpp is lightweight<\/em><\/strong> in its implementation because it has no third-party dependencies and does not need to support an extensive set of operators or model formats. Also, as the name suggests, the project started as an efficient library for running inference on LLMs (the <a href=\"https:\/\/ai.meta.com\/blog\/large-language-model-llama-meta-ai\/\">Llama model<\/a> from Meta) and continues to <strong><em>support a wide range of open-source LLM architectures<\/em><\/strong>.<\/p>\n<blockquote><p>\n<strong>Analogy<\/strong>: If PyTorch\/TensorFlow are luxurious, power-hungry cruise ships, llama.cpp is a small, speedy motorboat. PyTorch\/TF and llama.cpp each have their own use cases.<\/p><\/blockquote>\n<h3>Setup<\/h3>\n<p>We start our implementation in a Linux-based environment (native or WSL) with CMake and a GNU or clang toolchain installed. 
We\u2019ll compile llama.cpp from source and add it as a shared library to our executable chat\u00a0program.<\/p>\n<p>We create our project directory smol_chat with an externals directory to store the cloned llama.cpp repository.<\/p>\n<pre>mkdir smol_chat<br>cd smol_chat<br><br>mkdir src<br>mkdir externals<br>touch CMakeLists.txt<br><br>cd externals<br>git clone --depth=1 https:\/\/github.com\/ggerganov\/llama.cpp<\/pre>\n<p>CMakeLists.txt is where we define our build, allowing CMake to compile our C\/C++ code with the default toolchain (GNU\/clang) and to pick up the headers and shared libraries from externals\/llama.cpp.<\/p>\n<pre>cmake_minimum_required(VERSION 3.10)<br>project(llama_inference)<br><br>set(CMAKE_CXX_STANDARD 17)<br>set(CMAKE_CXX_STANDARD_REQUIRED ON)<br># also build llama.cpp's `common` helper library (common_tokenize etc.)<br>set(LLAMA_BUILD_COMMON On)<br>add_subdirectory(\"${CMAKE_CURRENT_SOURCE_DIR}\/externals\/llama.cpp\")<br><br>add_executable(<br>    chat<br>    src\/LLMInference.cpp src\/main.cpp<br>)<br>target_link_libraries(<br>    chat<br>    PRIVATE<br>    common llama ggml<br>)<\/pre>\n<h3>Loading the\u00a0Model<\/h3>\n<p>We have now defined how our project should be built by CMake. Next, we create a header file LLMInference.h which declares a class containing high-level functions to interact with the LLM. 
llama.cpp provides a C-style API, so wrapping it in a class helps us hide the inner working\u00a0details.<\/p>\n<pre>#ifndef LLMINFERENCE_H<br>#define LLMINFERENCE_H<br><br>#include \"common.h\"<br>#include \"llama.h\"<br>#include &lt;string&gt;<br>#include &lt;vector&gt;<br><br>class LLMInference {<br><br>    \/\/ llama.cpp-specific types<br>    \/\/ (initialized explicitly so they never hold indeterminate values)<br>    llama_context* _ctx = nullptr;<br>    llama_model* _model = nullptr;<br>    llama_sampler* _sampler = nullptr;<br>    llama_batch _batch = {};<br>    llama_token _currToken = 0;<br>    <br>    \/\/ container to store user\/assistant messages in the chat<br>    std::vector&lt;llama_chat_message&gt; _messages;<br>    \/\/ stores the string generated after applying<br>    \/\/ the chat-template to all messages in `_messages`<br>    std::vector&lt;char&gt; _formattedMessages;<br>    \/\/ stores the tokens for the last query<br>    \/\/ appended to `_messages`<br>    std::vector&lt;llama_token&gt; _promptTokens;<br>    \/\/ length of `_formattedMessages` that has already been tokenized<br>    int _prevLen = 0;<br><br>    \/\/ stores the complete response for the given query<br>    std::string _response = \"\";<br><br>    public:<br><br>    void loadModel(const std::string&amp; modelPath, float minP, float temperature);<br><br>    void addChatMessage(const std::string&amp; message, const std::string&amp; role);<br>    <br>    void startCompletion(const std::string&amp; query);<br><br>    std::string completionLoop();<br><br>    void stopCompletion();<br><br>    ~LLMInference();<br>};<br><br>#endif<\/pre>\n<p>The private members declared in the header above are used in the implementations of the public member functions described in the following sections. 
Let us define each of these member functions in LLMInference.cpp.<\/p>\n<pre>#include \"LLMInference.h\"<br>#include &lt;cstdlib&gt;<br>#include &lt;cstring&gt;<br>#include &lt;iostream&gt;<br>#include &lt;stdexcept&gt;<br><br>void LLMInference::loadModel(const std::string&amp; model_path, float min_p, float temperature) {<br>    \/\/ create an instance of llama_model<br>    llama_model_params model_params = llama_model_default_params();<br>    _model = llama_load_model_from_file(model_path.c_str(), model_params);<br><br>    if (!_model) {<br>        throw std::runtime_error(\"llama_load_model_from_file() returned null\");<br>    }<br><br>    \/\/ create an instance of llama_context<br>    llama_context_params ctx_params = llama_context_default_params();<br>    ctx_params.n_ctx = 0;               \/\/ take context size from the model GGUF file<br>    ctx_params.no_perf = true;          \/\/ disable performance metrics<br>    _ctx = llama_new_context_with_model(_model, ctx_params);<br><br>    if (!_ctx) {<br>        throw std::runtime_error(\"llama_new_context_with_model() returned null\");<br>    }<br><br>    \/\/ initialize sampler chain: min-p filter, then temperature,<br>    \/\/ then a final random draw from the remaining distribution<br>    llama_sampler_chain_params sampler_params = llama_sampler_chain_default_params();<br>    sampler_params.no_perf = true;      \/\/ disable performance metrics<br>    _sampler = llama_sampler_chain_init(sampler_params);<br>    llama_sampler_chain_add(_sampler, llama_sampler_init_min_p(min_p, 1));<br>    llama_sampler_chain_add(_sampler, llama_sampler_init_temp(temperature));<br>    llama_sampler_chain_add(_sampler, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));<br><br>    \/\/ buffer for the chat-template output; resized later if it is too small<br>    _formattedMessages = std::vector&lt;char&gt;(llama_n_ctx(_ctx));<br>    _messages.clear();<br>}<\/pre>\n<p>llama_load_model_from_file reads the model from the file using llama_load_model internally and populates the llama_model instance using the given llama_model_params. 
The user can supply these parameters, but we can also get a pre-initialized <em>default<\/em> struct with llama_model_default_params.<\/p>\n<p>llama_context represents the execution environment for the loaded GGUF model. llama_new_context_with_model instantiates a new llama_context and prepares a backend for execution by either reading the llama_model_params or by automatically detecting the available backends. It also initializes the KV cache, which is important in the decoding or inference step. A backend scheduler that manages computations across multiple backends is also initialized.<\/p>\n<p><a href=\"https:\/\/medium.com\/@aalokpatwa\/optimizing-llm-inference-managing-the-kv-cache-34d961ead936\">Optimizing LLM Inference: Managing the KV Cache<\/a><\/p>\n<p>A llama_sampler determines how we sample\/choose tokens from the probability distribution derived from the outputs (logits) of the model (specifically the decoder of the LLM). LLMs assign a probability to each token present in the vocabulary, representing the chances of the token appearing next in the sequence. The temperature and min-p that we set with llama_sampler_init_temp and llama_sampler_init_min_p are two parameters controlling the token sampling\u00a0process: temperature rescales the logits before sampling (values below 1 sharpen the distribution, values above 1 flatten it), while min-p discards tokens whose probability is below min-p times the probability of the most likely token.<\/p>\n<p><a href=\"https:\/\/rumn.medium.com\/setting-top-k-top-p-and-temperature-in-llms-3da3a8f74832\">Setting Top-K, Top-P and Temperature in LLMs<\/a><\/p>\n<h3>Performing Inference<\/h3>\n<p>There are multiple steps involved in the inference process that takes a text query from the user as input and returns the LLM\u2019s response.<\/p>\n<h4><strong>1. Applying the chat template to the\u00a0queries<\/strong><\/h4>\n<p>For an LLM, the incoming messages are classified as belonging to one of three roles: user, assistant and system. user and assistant messages are given by the user and the LLM, respectively, whereas system denotes a system-wide prompt that is followed across the entire conversation. 
Each message consists of a role and content where content is the actual text and role is any one of the three\u00a0roles.<\/p>\n<pre>&lt;example&gt;<\/pre>\n<p>The system prompt is the first message of the conversation. In our code, the messages are stored as a std::vector&lt;llama_chat_message&gt; named _messages where llama_chat_message is a llama.cpp struct with role and content attributes. We use the llama_chat_apply_template function from llama.cpp to apply the chat template stored in the GGUF file as metadata. We store the string or std::vector&lt;char&gt; obtained after applying the chat template in _formattedMessages\u00a0.<\/p>\n<h4>2. Tokenization<\/h4>\n<p>Tokenization is the process of dividing a given text into smaller parts (tokens). We assign each part\/token a unique integer ID, thus transforming the input text to a sequence of integers that form the input to the LLM. llama.cpp provides the common_tokenize or llama_tokenize functions to perform tokenization, where common_tokenize returns the sequence of tokens as a std::vector&lt;llama_token&gt;\u00a0.<\/p>\n<pre>void LLMInference::startCompletion(const std::string&amp; query) {<br>    addChatMessage(query, \"user\");<br><br>    \/\/ apply the chat-template <br>    int new_len = llama_chat_apply_template(<br>            _model,<br>            nullptr,<br>            _messages.data(),<br>            _messages.size(),<br>            true,<br>            _formattedMessages.data(),<br>            _formattedMessages.size()<br>    );<br>    if (new_len &gt; (int)_formattedMessages.size()) {<br>        \/\/ resize the output buffer `_formattedMessages`<br>        \/\/ and re-apply the chat template<br>        _formattedMessages.resize(new_len);<br>        new_len = llama_chat_apply_template(_model, nullptr, _messages.data(), _messages.size(), true, _formattedMessages.data(), _formattedMessages.size());<br>    }<br>    if (new_len &lt; 0) {<br>        throw std::runtime_error(\"llama_chat_apply_template() 
in LLMInference::startCompletion() failed\");<br>    }<br>    \/\/ only the part of the formatted conversation added since the last turn<br>    std::string prompt(_formattedMessages.begin() + _prevLen, _formattedMessages.begin() + new_len);<br>    <br>    \/\/ tokenization<br>    _promptTokens = common_tokenize(_model, prompt, true, true);<br><br>    \/\/ create a llama_batch containing a single sequence<br>    \/\/ see llama_batch_init for more details<br>    _batch.token = _promptTokens.data();<br>    _batch.n_tokens = static_cast&lt;int32_t&gt;(_promptTokens.size());<br>}<\/pre>\n<p>In the code, we apply the chat template and perform tokenization in the LLMInference::startCompletion method and then create a llama_batch instance holding the final inputs for the\u00a0model.<\/p>\n<h4>3. Decoding, Sampling and the KV\u00a0Cache<\/h4>\n<p>LLMs generate a response by successively predicting the next token in the given sequence. LLMs are also trained to predict a special end-of-generation (EOG) token, indicating the end of the sequence of predicted tokens. The completionLoop method returns the next token in the sequence and keeps getting called until the token it produces is the EOG\u00a0token.<\/p>\n<ul>\n<li>Using llama_n_ctx and llama_get_kv_cache_used_cells, we determine how much of the context has already been used for storing the inputs. Currently, we treat it as an error if the tokenized inputs would exceed the context\u00a0size.<\/li>\n<li>llama_decode performs a forward pass of the model, given the inputs in _batch.<\/li>\n<li>Using the _sampler initialized in LLMInference::loadModel, we sample or choose a token as our prediction and store it in _currToken. We check whether the token is an EOG token and, if so, return the marker string &#8220;[EOG]&#8221;, indicating that the text generation loop calling LLMInference::completionLoop should be terminated. 
On termination, we append a new message to _messages, which is the complete _response given by the LLM with the role assistant.<\/li>\n<li>_currToken is still an integer, which is converted to a string token piece by the common_token_to_piece function. This piece is appended to _response and returned from the completionLoop method.<\/li>\n<li>We need to reinitialize _batch to ensure it now only contains _currToken and not the entire input sequence, i.e. _promptTokens. This is because the \u2018keys\u2019 and \u2018values\u2019 for all previous tokens have been cached. This reduces the inference time by avoiding the recomputation of the \u2018keys\u2019 and \u2018values\u2019 for all tokens in _promptTokens.<\/li>\n<\/ul>\n<pre>std::string LLMInference::completionLoop() {<br>    \/\/ check if the length of the inputs to the model<br>    \/\/ would exceed the context size of the model<br>    int contextSize = llama_n_ctx(_ctx);<br>    int nCtxUsed = llama_get_kv_cache_used_cells(_ctx);<br>    if (nCtxUsed + _batch.n_tokens &gt; contextSize) {<br>        throw std::runtime_error(\"context size exceeded\");<br>    }<br>    \/\/ run the model<br>    if (llama_decode(_ctx, _batch) &lt; 0) {<br>        throw std::runtime_error(\"llama_decode() failed\");<br>    }<br><br>    \/\/ sample a token and check if it is an EOG (end of generation token)<br>    _currToken = llama_sampler_sample(_sampler, _ctx, -1);<br>    if (llama_token_is_eog(_model, _currToken)) {<br>        \/\/ addChatMessage takes a const std::string&amp;, so pass the string<br>        \/\/ directly; a strdup() at the call site would leak its buffer<br>        addChatMessage(_response, \"assistant\");<br>        _response.clear();<br>        return \"[EOG]\";<br>    }<br>    \/\/ convert the integer token to its corresponding word-piece<br>    \/\/ and accumulate it into the complete response<br>    std::string piece = common_token_to_piece(_ctx, _currToken, true);<br>    _response += piece;<br><br>    \/\/ re-init the batch with the newly predicted token<br>    \/\/ key, value pairs of all previous tokens have been cached<br>    \/\/ in the KV cache<br>    _batch.token = &amp;_currToken;<br>    
_batch.n_tokens = 1;<br><br>    return piece;<br>}<\/pre>\n<ul>\n<li>Also, for each query made by the user, the LLM takes the entire tokenized conversation (all messages stored in _messages) as input. If we tokenized the entire conversation on every call to startCompletion, the preprocessing time, and thus the overall inference time, would grow as the conversation gets\u00a0longer.<\/li>\n<li>To avoid this computation, we only tokenize the latest message\/query added to _messages. The length up to which the messages in _formattedMessages have already been tokenized is stored in _prevLen. At the end of response generation, i.e. in LLMInference::stopCompletion, we update _prevLen using the return value of llama_chat_apply_template, which by then accounts for the LLM\u2019s response appended to _messages.<\/li>\n<\/ul>\n<pre>void LLMInference::stopCompletion() {<br>    \/\/ with a null output buffer, only the length of the<br>    \/\/ formatted conversation is computed and returned<br>    _prevLen = llama_chat_apply_template(<br>            _model,<br>            nullptr,<br>            _messages.data(),<br>            _messages.size(),<br>            false,<br>            nullptr,<br>            0<br>    );<br>    if (_prevLen &lt; 0) {<br>        throw std::runtime_error(\"llama_chat_apply_template() in LLMInference::stopCompletion() failed\");<br>    }<br>}<\/pre>\n<h3>Good Habits: Writing a Destructor<\/h3>\n<p>We implement a destructor that frees dynamically allocated objects, both the message strings in _messages and the objects allocated internally by llama.cpp.<\/p>\n<pre>LLMInference::~LLMInference() {<br>    \/\/ free memory held by the message text in messages<br>    \/\/ (strdup() creates a malloc'ed copy, so it must be released<br>    \/\/ with free(), declared in &lt;cstdlib&gt;, and not with delete)<br>    for (llama_chat_message &amp;message: _messages) {<br>        free(const_cast&lt;char*&gt;(message.content));<br>    }<br>    llama_kv_cache_clear(_ctx);<br>    llama_sampler_free(_sampler);<br>    llama_free(_ctx);<br>    llama_free_model(_model);<br>}<\/pre>\n<h3>Writing a Small Command-Line Application<\/h3>\n<p>We create a small interface that allows us to have a conversation with the LLM. This includes instantiating the LLMInference class and calling all the methods that we defined in the previous sections.<\/p>\n<pre>#include \"LLMInference.h\"<br>#include &lt;iostream&gt;<br>#include &lt;memory&gt;<br>#include &lt;string&gt;<br><br>int main(int argc, char* argv[]) {<br><br>    std::string modelPath = \"smollm2-360m-instruct-q8_0.gguf\";<br>    float temperature = 1.0f;<br>    float minP = 0.05f;<br>    std::unique_ptr&lt;LLMInference&gt; llmInference = std::make_unique&lt;LLMInference&gt;();<br>    llmInference-&gt;loadModel(modelPath, minP, temperature);<br><br>    llmInference-&gt;addChatMessage(\"You are a helpful assistant\", \"system\");<br><br>    while (true) {<br>        std::cout &lt;&lt; \"Enter query:\\n\";<br>        std::string query;<br>        \/\/ stop on end-of-input as well as on the \"exit\" command<br>        if (!std::getline(std::cin, query) || query == \"exit\") {<br>            break;<br>        }<br>        llmInference-&gt;startCompletion(query);<br>        std::string predictedToken;<br>        while ((predictedToken = llmInference-&gt;completionLoop()) != \"[EOG]\") {<br>            std::cout &lt;&lt; predictedToken &lt;&lt; std::flush;<br>        }<br>        \/\/ record how much of the conversation has been processed<br>        llmInference-&gt;stopCompletion();<br>        std::cout &lt;&lt; '\\n';<br>    }<br><br>    return 0;<br>}<\/pre>\n<h3>Running the Application<\/h3>\n<p>We use the CMakeLists.txt written in the Setup section to generate a Makefile, which compiles the code and produces an executable ready for\u00a0use.<\/p>\n<pre>mkdir build<br>cd 
build<br>cmake ..<br>make<br>.\/chat<\/pre>\n<p>Here\u2019s how the output\u00a0looks:<\/p>\n<pre>register_backend: registered backend CPU (1 devices)<br>register_device: registered device CPU (11th Gen Intel(R) Core(TM) i3-1115G4 @ 3.00GHz)<br>llama_model_loader: loaded meta data with 33 key-value pairs and 290 tensors from \/home\/shubham\/CPP_Projects\/llama-cpp-inference\/models\/smollm2-360m-instruct-q8_0.gguf (version GGUF V3 (latest))<br>llama_model_loader: Dumping metadata keys\/values. Note: KV overrides do not apply in this output.<br>llama_model_loader: - kv   0:                       general.architecture str              = llama<br>llama_model_loader: - kv   1:                               general.type str              = model<br>llama_model_loader: - kv   2:                               general.name str              = Smollm2 360M 8k Lc100K Mix1 Ep2<br>llama_model_loader: - kv   3:                       general.organization str              = Loubnabnl<br>llama_model_loader: - kv   4:                           general.finetune str              = 8k-lc100k-mix1-ep2<br>llama_model_loader: - kv   5:                           general.basename str              = smollm2<br>llama_model_loader: - kv   6:                         general.size_label str              = 360M<br>llama_model_loader: - kv   7:                            general.license str              = apache-2.0<br>llama_model_loader: - kv   8:                          general.languages arr[str,1]       = [\"en\"]<br>llama_model_loader: - kv   9:                          llama.block_count u32              = 32<br>llama_model_loader: - kv  10:                       llama.context_length u32              = 8192<br>llama_model_loader: - kv  11:                     llama.embedding_length u32              = 960<br>llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 2560<br>llama_model_loader: - kv  13:                 llama.attention.head_count u32              
= 15<br>llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 5<br>llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 100000.000000<br>llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010<br>llama_model_loader: - kv  17:                          general.file_type u32              = 7<br>llama_model_loader: - kv  18:                           llama.vocab_size u32              = 49152<br>llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 64<br>llama_model_loader: - kv  20:            tokenizer.ggml.add_space_prefix bool             = false<br>llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false<br>llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2<br>llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = smollm<br>llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,49152]   = [\"&lt;|endoftext|&gt;\", \"&lt;|im_start|&gt;\", \"&lt;|...<br>llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...<br>llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,48900]   = [\"\u0120 t\", \"\u0120 a\", \"i n\", \"h e\", \"\u0120 \u0120...<br>llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 1<br>llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 2<br>llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 0<br>llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 2<br>llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% for message in messages %}{% if 
lo...<br>llama_model_loader: - kv  32:               general.quantization_version u32              = 2<br>llama_model_loader: - type  f32:   65 tensors<br>llama_model_loader: - type q8_0:  225 tensors<br>llm_load_vocab: control token:      7 '&lt;gh_stars&gt;' is not marked as EOG<br>llm_load_vocab: control token:     13 '&lt;jupyter_code&gt;' is not marked as EOG<br>llm_load_vocab: control token:     16 '&lt;empty_output&gt;' is not marked as EOG<br>llm_load_vocab: control token:     11 '&lt;jupyter_start&gt;' is not marked as EOG<br>llm_load_vocab: control token:     10 '&lt;issue_closed&gt;' is not marked as EOG<br>llm_load_vocab: control token:      6 '&lt;filename&gt;' is not marked as EOG<br>llm_load_vocab: control token:      8 '&lt;issue_start&gt;' is not marked as EOG<br>llm_load_vocab: control token:      3 '&lt;repo_name&gt;' is not marked as EOG<br>llm_load_vocab: control token:     12 '&lt;jupyter_text&gt;' is not marked as EOG<br>llm_load_vocab: control token:     15 '&lt;jupyter_script&gt;' is not marked as EOG<br>llm_load_vocab: control token:      4 '&lt;reponame&gt;' is not marked as EOG<br>llm_load_vocab: control token:      1 '&lt;|im_start|&gt;' is not marked as EOG<br>llm_load_vocab: control token:      9 '&lt;issue_comment&gt;' is not marked as EOG<br>llm_load_vocab: control token:      5 '&lt;file_sep&gt;' is not marked as EOG<br>llm_load_vocab: control token:     14 '&lt;jupyter_output&gt;' is not marked as EOG<br>llm_load_vocab: special tokens cache size = 17<br>llm_load_vocab: token to piece cache size = 0.3170 MB<br>llm_load_print_meta: format           = GGUF V3 (latest)<br>llm_load_print_meta: arch             = llama<br>llm_load_print_meta: vocab type       = BPE<br>llm_load_print_meta: n_vocab          = 49152<br>llm_load_print_meta: n_merges         = 48900<br>llm_load_print_meta: vocab_only       = 0<br>llm_load_print_meta: n_ctx_train      = 8192<br>llm_load_print_meta: n_embd           = 960<br>llm_load_print_meta: n_layer        
  = 32<br>llm_load_print_meta: n_head           = 15<br>llm_load_print_meta: n_head_kv        = 5<br>llm_load_print_meta: n_rot            = 64<br>llm_load_print_meta: n_swa            = 0<br>llm_load_print_meta: n_embd_head_k    = 64<br>llm_load_print_meta: n_embd_head_v    = 64<br>llm_load_print_meta: n_gqa            = 3<br>llm_load_print_meta: n_embd_k_gqa     = 320<br>llm_load_print_meta: n_embd_v_gqa     = 320<br>llm_load_print_meta: f_norm_eps       = 0.0e+00<br>llm_load_print_meta: f_norm_rms_eps   = 1.0e-05<br>llm_load_print_meta: f_clamp_kqv      = 0.0e+00<br>llm_load_print_meta: f_max_alibi_bias = 0.0e+00<br>llm_load_print_meta: f_logit_scale    = 0.0e+00<br>llm_load_print_meta: n_ff             = 2560<br>llm_load_print_meta: n_expert         = 0<br>llm_load_print_meta: n_expert_used    = 0<br>llm_load_print_meta: causal attn      = 1<br>llm_load_print_meta: pooling type     = 0<br>llm_load_print_meta: rope type        = 0<br>llm_load_print_meta: rope scaling     = linear<br>llm_load_print_meta: freq_base_train  = 100000.0<br>llm_load_print_meta: freq_scale_train = 1<br>llm_load_print_meta: n_ctx_orig_yarn  = 8192<br>llm_load_print_meta: rope_finetuned   = unknown<br>llm_load_print_meta: ssm_d_conv       = 0<br>llm_load_print_meta: ssm_d_inner      = 0<br>llm_load_print_meta: ssm_d_state      = 0<br>llm_load_print_meta: ssm_dt_rank      = 0<br>llm_load_print_meta: ssm_dt_b_c_rms   = 0<br>llm_load_print_meta: model type       = 3B<br>llm_load_print_meta: model ftype      = Q8_0<br>llm_load_print_meta: model params     = 361.82 M<br>llm_load_print_meta: model size       = 366.80 MiB (8.50 BPW) <br>llm_load_print_meta: general.name     = Smollm2 360M 8k Lc100K Mix1 Ep2<br>llm_load_print_meta: BOS token        = 1 '&lt;|im_start|&gt;'<br>llm_load_print_meta: EOS token        = 2 '&lt;|im_end|&gt;'<br>llm_load_print_meta: EOT token        = 0 '&lt;|endoftext|&gt;'<br>llm_load_print_meta: UNK token        = 0 '&lt;|endoftext|&gt;'<br>llm_load_print_meta: PAD 
token        = 2 '&lt;|im_end|&gt;'<br>llm_load_print_meta: LF token         = 143 '\u00c4'<br>llm_load_print_meta: EOG token        = 0 '&lt;|endoftext|&gt;'<br>llm_load_print_meta: EOG token        = 2 '&lt;|im_end|&gt;'<br>llm_load_print_meta: max token length = 162<br>llm_load_tensors: ggml ctx size =    0.14 MiB<br>llm_load_tensors:        CPU buffer size =   366.80 MiB<br>...............................................................................<br>llama_new_context_with_model: n_ctx      = 8192<br>llama_new_context_with_model: n_batch    = 2048<br>llama_new_context_with_model: n_ubatch   = 512<br>llama_new_context_with_model: flash_attn = 0<br>llama_new_context_with_model: freq_base  = 100000.0<br>llama_new_context_with_model: freq_scale = 1<br>llama_kv_cache_init:        CPU KV buffer size =   320.00 MiB<br>llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB<br>llama_new_context_with_model:        CPU  output buffer size =     0.19 MiB<br>ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 263.51 MiB<br>llama_new_context_with_model:        CPU compute buffer size =   263.51 MiB<br>llama_new_context_with_model: graph nodes  = 1030<br>llama_new_context_with_model: graph splits = 1<br>Enter query:<br>How are you?<br>I'm a text-based AI assistant. I don't have emotions or personal feelings, but I can understand and respond to your requests accordingly. If you have questions or need help with anything, feel free to ask.<br>Enter query:<br>Write a one line description on the C++ keyword 'new' <br>New C++ keyword represents memory allocation for dynamically allocated memory.<br>Enter query:<br>exit<\/pre>\n<h3>Conclusion<\/h3>\n<p>llama.cpp has simplified the deployment of large language models, making them accessible across a wide range of devices and use cases. 
By understanding its internals and building a simple C++ inference program, we have demonstrated how developers can leverage its low-level functions for performance-critical and resource-constrained applications. This tutorial not only serves as an introduction to llama.cpp\u2019s core constructs but also highlights its practicality in real-world projects, enabling efficient on-device interactions with\u00a0LLMs.<\/p>\n<p>For developers interested in pushing the boundaries of LLM deployment or those aiming to build robust applications, mastering tools like llama.cpp is a solid starting point. As you explore further, remember that this foundational knowledge can be extended to integrate advanced features, optimize performance, and adapt to evolving AI use\u00a0cases.<\/p>\n<p>I hope the tutorial was informative and left you curious about running LLMs directly in C++. Do share your suggestions and questions in the comments below; they are always appreciated. Happy learning and have a wonderful day!<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/llama-cpp-writing-a-simple-c-inference-program-for-gguf-llm-models-12bc5f58505f\">llama.cpp: Writing A Simple C++ Inference Program for GGUF LLM Models<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> 	<BR><br \/>\n <BR><\/BR><br \/>\n    Shubham Panchal<br \/>\n 	<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fllama-cpp-writing-a-simple-c-inference-program-for-gguf-llm-models-12bc5f58505f\">Go to original source<\/a><br \/>\n 	<BR><br \/>\n 
<BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>llama.cpp: Writing A Simple C++ Inference Program for GGUF LLM Models Exploring llama.cpp internals and a basic chat program\u00a0flow Photo by Mathew Schwartz on\u00a0Unsplash llama.cpp has revolutionized the space of LLM inference by the means of wide adoption and simplicity. It has enabled enterprises and individual developers to deploy LLMs on devices ranging from SBCs [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1284,71,70,160,699],"tags":[1285,474,73],"class_list":["post-1168","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-cpp","category-large-language-models","category-machine-learning","category-programming","category-software-development","tag-cpp","tag-llama","tag-models"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1168"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1168"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1168\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1168"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1168"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1168"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}
}