{"id":2799,"date":"2025-04-02T07:03:10","date_gmt":"2025-04-02T07:03:10","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/02\/the-case-for-centralized-ai-model-inference-serving\/"},"modified":"2025-04-02T07:03:10","modified_gmt":"2025-04-02T07:03:10","slug":"the-case-for-centralized-ai-model-inference-serving","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/02\/the-case-for-centralized-ai-model-inference-serving\/","title":{"rendered":"The Case for Centralized AI Model Inference Serving"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\" id=\"44e8\">As AI models continue to increase in scope and accuracy, even tasks once dominated by traditional algorithms are gradually being taken over by <a href=\"https:\/\/towardsdatascience.com\/tag\/deep-learning\/\" title=\"Deep Learning\">Deep Learning<\/a> models. Algorithmic pipelines \u2014 workflows that take an input, process it through a series of algorithms, and produce an output \u2014 increasingly rely on one or more AI-based components. These AI models often have significantly different resource requirements than their classical counterparts, such as higher memory usage, reliance on specialized hardware accelerators, and increased computational demands.<\/p>\n<p class=\"wp-block-paragraph\" id=\"c29f\">In this post, we address a common challenge: efficiently processing large-scale inputs through algorithmic pipelines that include deep learning models. A typical solution is to run multiple independent jobs, each responsible for processing a single input. 
This setup is often managed with job orchestration frameworks (e.g.,\u00a0<a href=\"https:\/\/kubernetes.io\/\" rel=\"noreferrer noopener\" target=\"_blank\">Kubernetes<\/a>). However, when deep learning models are involved, this approach can become inefficient as loading and executing the same model in each individual process can lead to resource contention and scaling limitations. As AI models become increasingly prevalent in algorithmic pipelines, it is crucial that we revisit the design of such solutions.<\/p>\n<p class=\"wp-block-paragraph\" id=\"986b\">In this post we evaluate the benefits of centralized <a href=\"https:\/\/towardsdatascience.com\/tag\/inference\/\" title=\"Inference\">Inference<\/a> serving, where a dedicated inference server handles prediction requests from multiple parallel jobs. We define a toy experiment in which we run an image-processing pipeline based on a\u00a0<a href=\"https:\/\/pytorch.org\/vision\/0.20\/models\/generated\/torchvision.models.resnet152.html\" rel=\"noreferrer noopener\" target=\"_blank\">ResNet-152<\/a>\u00a0image classifier on 1,000 individual images. 
We compare the runtime performance and resource utilization of the following two implementations:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Decentralized inference<\/strong>\u00a0\u2014 each job loads and runs the model independently.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Centralized inference<\/strong>\u00a0\u2014 all jobs send inference requests to a dedicated inference server.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\" id=\"7d75\">To keep the experiment focused, we make several simplifying assumptions:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Instead of using a full-fledged job orchestrator (like\u00a0<a href=\"https:\/\/kubernetes.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Kubernetes<\/a>), we implement parallel process execution using Python\u2019s multiprocessing module.<\/li>\n<li class=\"wp-block-list-item\">While real-world workloads often span multiple nodes, we run everything on a single node.<\/li>\n<li class=\"wp-block-list-item\">Real-world workloads typically include multiple algorithmic components. We limit our experiment to a single component \u2014 a ResNet-152 classifier running on a single input image.<\/li>\n<li class=\"wp-block-list-item\">In a real-world use case, each job would process a unique input image. To simplify our experiment setup, each job will process the same\u00a0<a href=\"https:\/\/github.com\/pytorch\/serve\/blob\/master\/examples\/image_classifier\/kitten.jpg\" target=\"_blank\" rel=\"noreferrer noopener\">kitten.jpg<\/a>\u00a0image.<\/li>\n<li class=\"wp-block-list-item\">We will use a minimal deployment of a\u00a0<a href=\"https:\/\/pytorch.org\/serve\/\" target=\"_blank\" rel=\"noreferrer noopener\">TorchServe<\/a>\u00a0inference server, relying mostly on its default settings. 
Similar results are expected with alternative inference server solutions such as\u00a0<a href=\"https:\/\/developer.nvidia.com\/triton-inference-server\" target=\"_blank\" rel=\"noreferrer noopener\">NVIDIA Triton Inference Server<\/a>\u00a0or\u00a0<a href=\"https:\/\/lightning.ai\/docs\/litserve\/home\" target=\"_blank\" rel=\"noreferrer noopener\">LitServe<\/a>.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\" id=\"3ad8\">The code is shared for demonstrative purposes only. Please do not interpret our choice of TorchServe \u2014 or any other component of our demonstration \u2014 as an endorsement of its use.<\/p>\n<h2 class=\"wp-block-heading\">Toy Experiment<\/h2>\n<p class=\"wp-block-paragraph\">We conduct our experiments on an\u00a0<a href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/c5\/\" target=\"_blank\" rel=\"noreferrer noopener\">Amazon EC2 c5.2xlarge<\/a>\u00a0instance, with 8 vCPUs and 16 GiB of memory, running a\u00a0<a href=\"https:\/\/aws.amazon.com\/releasenotes\/aws-deep-learning-ami-gpu-pytorch-2-6-ubuntu-22-04\/\" target=\"_blank\" rel=\"noreferrer noopener\">PyTorch Deep Learning AMI<\/a>\u00a0(DLAMI). We activate the PyTorch environment using the following command:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">source \/opt\/pytorch\/bin\/activate<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"a47d\">Step 1: Creating a TorchScript Model Checkpoint<\/h3>\n<p class=\"wp-block-paragraph\" id=\"19b0\">We begin by creating a ResNet-152 model checkpoint. 
Using\u00a0<a href=\"https:\/\/pytorch.org\/docs\/stable\/jit.html\" rel=\"noreferrer noopener\" target=\"_blank\">TorchScript<\/a>, we serialize both the model definition and its weights into a single file:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch\nfrom torchvision.models import resnet152, ResNet152_Weights\n\nmodel = resnet152(weights=ResNet152_Weights.DEFAULT)\nmodel = torch.jit.script(model)\nmodel.save(\"resnet-152.pt\")<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"b5be\">Step 2: Model Inference Function<\/h3>\n<p class=\"wp-block-paragraph\" id=\"b5c2\">Our inference function performs the following steps:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Load the ResNet-152 model.<\/li>\n<li class=\"wp-block-list-item\">Load an input image.<\/li>\n<li class=\"wp-block-list-item\">Preprocess the image to match the input format expected by the model, following the implementation defined\u00a0<a href=\"https:\/\/github.com\/pytorch\/serve\/blob\/master\/ts\/torch_handler\/image_classifier.py\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/li>\n<li class=\"wp-block-list-item\">Run inference to classify the image.<\/li>\n<li class=\"wp-block-list-item\">Post-process the model output to return the top five label predictions, following the implementation defined\u00a0<a href=\"https:\/\/github.com\/pytorch\/serve\/blob\/master\/ts\/torch_handler\/image_classifier.py\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\" id=\"5dd4\">We define a constant MAX_THREADS hyperparameter that we use to restrict the number of threads used for model inference in each process. 
This is to prevent resource contention between the multiple jobs.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import os, time, psutil\nimport multiprocessing as mp\nimport torch\nimport torch.nn.functional as F\nimport torchvision.transforms as transforms\nfrom PIL import Image\n\n\ndef predict(image_id):\n    # Limit each process to 1 thread\n    MAX_THREADS = 1\n    os.environ[\"OMP_NUM_THREADS\"] = str(MAX_THREADS)\n    os.environ[\"MKL_NUM_THREADS\"] = str(MAX_THREADS)\n    torch.set_num_threads(MAX_THREADS)\n\n    # load the model\n    model = torch.jit.load('resnet-152.pt').eval()\n\n    # Define image preprocessing steps\n    transform = transforms.Compose([\n        transforms.Resize(256),\n        transforms.CenterCrop(224),\n        transforms.ToTensor(),\n        transforms.Normalize(mean=[0.485, 0.456, 0.406], \n                             std=[0.229, 0.224, 0.225])\n    ])\n\n    # load the image\n    image = Image.open('kitten.jpg').convert(\"RGB\")\n    \n    # preproc\n    image = transform(image).unsqueeze(0)\n\n    # perform inference\n    with torch.no_grad():\n        output = model(image)\n\n    # postproc\n    probabilities = F.softmax(output[0], dim=0)\n    probs, classes = torch.topk(probabilities, 5, dim=0)\n    probs = probs.tolist()\n    classes = classes.tolist()\n\n    return dict(zip(classes, probs))\n<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"382e\">Step 3: Running Parallel Inference Jobs<\/h3>\n<p class=\"wp-block-paragraph\" id=\"2e36\">We define a function that spawns parallel processes, each processing a single image input. 
This function:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Accepts the total number of images to process and the maximum number of concurrent jobs.<\/li>\n<li class=\"wp-block-list-item\">Dynamically launches new processes when slots become available.<\/li>\n<li class=\"wp-block-list-item\">Monitors CPU and memory usage throughout execution.<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def process_image(image_id):\n    print(f\"Processing image {image_id} (PID: {os.getpid()})\")\n    predict(image_id)\n\ndef spawn_jobs(total_images, max_concurrent):\n    start_time = time.time()\n    max_mem_utilization = 0.\n    max_utilization = 0.\n\n    processes = []\n    index = 0\n    while index &lt; total_images or processes:\n\n        while len(processes) &lt; max_concurrent and index &lt; total_images:\n            # Start a new process\n            p = mp.Process(target=process_image, args=(index,))\n            index += 1\n            p.start()\n            processes.append(p)\n\n        # Sample memory and CPU utilization\n        # (cpu_percent blocks for 0.1 s, which also paces this loop)\n        mem_usage = psutil.virtual_memory().percent\n        max_mem_utilization = max(max_mem_utilization, mem_usage)\n        cpu_util = psutil.cpu_percent(interval=0.1)\n        max_utilization = max(max_utilization, cpu_util)\n\n        # Remove completed processes from list\n        processes = [p for p in processes if p.is_alive()]\n\n    total_time = time.time() - start_time\n    print(f\"\\nTotal Processing Time: {total_time:.2f} seconds\")\n    print(f\"Max CPU Utilization: {max_utilization:.2f}%\")\n    print(f\"Max Memory Utilization: {max_mem_utilization:.2f}%\")\n\nspawn_jobs(total_images=1000, max_concurrent=32)<\/code><\/pre>\n<h2 class=\"wp-block-heading\" id=\"4c74\">Estimating the Maximum Number of Processes<\/h2>\n<p class=\"wp-block-paragraph\" id=\"54e0\">While the optimal maximum number of concurrent processes is best determined empirically, we can estimate an upper 
bound based on the 16 GiB of system memory and the size of the resnet-152.pt file, 231 MB: roughly 70 processes, ignoring any per-process runtime overhead.<\/p>\n<p class=\"wp-block-paragraph\" id=\"2ab7\">The table below summarizes the runtime results for several configurations:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1muTsRotRMCuLoHOGk6DYFw.png?ssl=1\" alt=\"\" class=\"wp-image-600964\"><figcaption class=\"wp-element-caption\"><strong>Decentralized Inference Results (by Author)<\/strong><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"3ec4\">Although memory becomes fully saturated at 50 concurrent processes, we observe that maximum throughput is achieved at 8 concurrent jobs\u200a\u2014\u200aone per vCPU. This indicates that beyond this point, resource contention outweighs any potential gains from additional parallelism.<\/p>\n<h2 class=\"wp-block-heading\" id=\"21c0\">The Inefficiencies of Independent Model Execution<\/h2>\n<p class=\"wp-block-paragraph\" id=\"0b62\">Running parallel jobs that each load and execute the model independently introduces significant inefficiencies and waste:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Each process needs to allocate the appropriate memory resources for storing its own copy of the AI model.<\/li>\n<li class=\"wp-block-list-item\">AI models are compute-intensive. Executing them in many processes in parallel can lead to resource contention and reduced throughput.<\/li>\n<li class=\"wp-block-list-item\">Loading the model checkpoint file and initializing the model in each process adds overhead and can further increase latency. In the case of our toy experiment, model initialization accounts for roughly 30%(!!) 
of the overall inference processing time.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\" id=\"4929\">A more efficient alternative is to centralize inference execution using a dedicated model inference server. This approach would eliminate redundant model loading and reduce overall system resource utilization.<\/p>\n<p class=\"wp-block-paragraph\" id=\"e12d\">In the next section we will set up an AI model inference server and assess its impact on resource utilization and runtime performance.<\/p>\n<p class=\"wp-block-paragraph\" id=\"5573\"><strong>Note:<\/strong>\u00a0We could have modified our multiprocessing-based approach to share a single model across processes (e.g., using\u00a0<a href=\"https:\/\/pytorch.org\/docs\/stable\/multiprocessing.html\" rel=\"noreferrer noopener\" target=\"_blank\">torch.multiprocessing<\/a>\u00a0or another solution based on\u00a0<a href=\"https:\/\/docs.python.org\/3\/library\/multiprocessing.shared_memory.html\" rel=\"noreferrer noopener\" target=\"_blank\">shared memory<\/a>). However, the inference server demonstration better aligns with real-world production environments, where jobs often run in isolated containers.<\/p>\n<h2 class=\"wp-block-heading\" id=\"c370\">TorchServe Setup<\/h2>\n<p class=\"wp-block-paragraph\" id=\"4d84\">The TorchServe setup described in this section loosely follows the\u00a0<a href=\"https:\/\/github.com\/pytorch\/serve\/tree\/master\/examples\/image_classifier\/resnet_18\" rel=\"noreferrer noopener\" target=\"_blank\">resnet tutorial<\/a>. Please refer to the official\u00a0<a href=\"https:\/\/pytorch.org\/serve\/\" rel=\"noreferrer noopener\" target=\"_blank\">TorchServe<\/a>\u00a0documentation for more in-depth guidelines.<\/p>\n<h3 class=\"wp-block-heading\" id=\"a587\">Installation<\/h3>\n<p class=\"wp-block-paragraph\" id=\"dc64\">The PyTorch environment of our DLAMI comes preinstalled with TorchServe executables. 
If you are running in a different environment, run the following installation command:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">pip install torchserve torch-model-archiver<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"abde\">Creating a Model Archive<\/h3>\n<p class=\"wp-block-paragraph\" id=\"ce5e\">The TorchServe Model Archiver packages the model and its associated files into a \u201c<em>.mar<\/em>\u201d file archive, the format required for deployment on TorchServe. We create a TorchServe model archive file from our model checkpoint file, using the\u00a0<a href=\"https:\/\/pytorch.org\/serve\/default_handlers.html\" rel=\"noreferrer noopener\" target=\"_blank\">default image_classifier handler<\/a>:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">mkdir model_store\ntorch-model-archiver \\\n    --model-name resnet-152 \\\n    --serialized-file resnet-152.pt \\\n    --handler image_classifier \\\n    --version 1.0 \\\n    --export-path model_store<\/code><\/pre>\n<h3 class=\"wp-block-heading\">TorchServe Configuration<\/h3>\n<p class=\"wp-block-paragraph\">We create a TorchServe config.properties file to define how TorchServe should operate:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model_store=model_store\nload_models=resnet-152.mar\nmodels={\n  \"resnet-152\": {\n    \"1.0\": {\n        \"marName\": \"resnet-152.mar\"\n    }\n  }\n}\n\n# Number of workers per model\ndefault_workers_per_model=1\n\n# Job queue size (default is 100)\njob_queue_size=100<\/code><\/pre>\n<p class=\"wp-block-paragraph\">After completing these steps, our working directory should look like this:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">\u251c\u2500\u2500 config.properties\n\u251c\u2500\u2500 kitten.jpg\n\u251c\u2500\u2500 model_store\n\u2502   \u251c\u2500\u2500 resnet-152.mar\n\u251c\u2500\u2500 multi_job.py<\/code><\/pre>\n<h3 
class=\"wp-block-heading\">Starting TorchServe<\/h3>\n<p class=\"wp-block-paragraph\">In a separate shell we start our TorchServe inference server:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-batch\">source \/opt\/pytorch\/bin\/activate\ntorchserve \\\n    --start \\\n    --disable-token-auth \\\n    --enable-model-api \\\n    --ts-config config.properties<\/code><\/pre>\n<h3 class=\"wp-block-heading\">Inference Request Implementation<\/h3>\n<p class=\"wp-block-paragraph\">We define an alternative prediction function that calls our inference service:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import requests\n\ndef predict_client(image_id):\n    with open('kitten.jpg', 'rb') as f:\n        image = f.read()\n    response = requests.post(\n        \"http:\/\/127.0.0.1:8080\/predictions\/resnet-152\",\n        data=image,\n        headers={'Content-Type': 'application\/octet-stream'}\n    )\n\n    if response.status_code == 200:\n        return response.json()\n    else:\n        print(f\"Error from inference server: {response.text}\")<\/code><\/pre>\n<h3 class=\"wp-block-heading\">Scaling Up the Number of Concurrent Jobs<\/h3>\n<p class=\"wp-block-paragraph\">Now that inference requests are being processed by a central server, we can scale up parallel processing. Unlike the earlier approach, in which each process loaded and executed its own model, we now have sufficient CPU resources to allow for many more concurrent processes. Here we choose 100 processes in accordance with the default\u00a0<em>job_queue_size\u00a0<\/em>capacity of the inference server:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">spawn_jobs(total_images=1000, max_concurrent=100)<\/code><\/pre>\n<h3 class=\"wp-block-heading\">Results<\/h3>\n<p class=\"wp-block-paragraph\" id=\"7a30\">The performance results are captured in the table below. 
Keep in mind that the comparative results can vary greatly based on the details of the AI model and the runtime environment.<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1oNoZtyFhBvv7TPdqtZ19Dg.png?ssl=1\" alt=\"\" class=\"wp-image-600962\"><figcaption class=\"wp-element-caption\"><strong>Inference Server Results (by Author)<\/strong><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"216e\">By using a centralized inference server, not only have we increased overall throughput by more than 2X, but we have also freed significant CPU resources for other computation tasks.<\/p>\n<h2 class=\"wp-block-heading\" id=\"b32a\">Next Steps<\/h2>\n<p class=\"wp-block-paragraph\" id=\"3294\">Now that we have demonstrated the benefits of a centralized inference serving solution, we can explore several ways to enhance and optimize the setup. Recall that our experiment was intentionally simplified to focus on demonstrating the utility of inference serving. 
In real-world deployments, additional enhancements may be required to tailor the solution to your specific needs.<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Custom Inference Handlers<\/strong>: While we used TorchServe\u2019s built-in\u00a0<a href=\"https:\/\/pytorch.org\/serve\/default_handlers.html#image-classifier\" target=\"_blank\" rel=\"noreferrer noopener\">image_classifier<\/a>\u00a0handler, defining a\u00a0<a href=\"https:\/\/pytorch.org\/serve\/custom_service.html#custom-handlers\" target=\"_blank\" rel=\"noreferrer noopener\">custom handler<\/a>\u00a0provides much greater control over the details of the inference implementation.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Advanced Inference Server Configuration<\/strong>: Inference server solutions will typically include many features for tuning the service behavior according to the workload requirements. In the next sections we will explore some of the features supported by TorchServe.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Expanding the Pipeline<\/strong>: Real-world pipelines will typically include more algorithm blocks and more sophisticated AI models than we used in our experiment.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Multi-Node Deployment<\/strong>: While we ran our experiments on a single compute instance, production setups will typically include multiple nodes.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Alternative Inference Servers<\/strong>: While TorchServe is a popular choice and relatively easy to set up, there are many alternative inference server solutions that may provide additional benefits and may better suit your needs. Importantly, it was recently announced that TorchServe would no longer be actively maintained. 
See the\u00a0<a href=\"https:\/\/pytorch.org\/serve\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">documentation<\/a>\u00a0for details.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Alternative Orchestration Frameworks<\/strong>: In our experiment we use Python multiprocessing. Real-world workloads will typically use more advanced orchestration solutions.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Utilizing Inference Accelerators<\/strong>: While we executed our model on a CPU, using an AI accelerator (e.g., an NVIDIA GPU, a Google Cloud TPU, or an AWS Inferentia) can drastically improve throughput.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Model <a href=\"https:\/\/towardsdatascience.com\/tag\/optimization\/\" title=\"Optimization\">Optimization<\/a><\/strong>:\u00a0<a href=\"https:\/\/pytorch.org\/serve\/performance_checklist.html\" target=\"_blank\" rel=\"noreferrer noopener\">Optimizing<\/a>\u00a0your AI models can greatly increase efficiency and throughput.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Auto-Scaling for Inference Load<\/strong>: In some use cases inference traffic will fluctuate, requiring an inference server solution that can scale its capacity accordingly.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\" id=\"b833\">In the next sections we explore two simple ways to enhance our TorchServe-based inference server implementation. We leave the discussion on other enhancements to future posts.<\/p>\n<h2 class=\"wp-block-heading\" id=\"5774\">Batch Inference with TorchServe<\/h2>\n<p class=\"wp-block-paragraph\" id=\"b0bd\">Many model inference service solutions support the option of grouping inference requests into batches. This usually results in increased throughput, especially when the model is running on a GPU.<\/p>\n<p class=\"wp-block-paragraph\" id=\"d11d\">We extend our TorchServe\u00a0<em>config.properties<\/em>\u00a0file to support batch inference with a batch size of up to 8 samples. 
Please see the\u00a0<a href=\"https:\/\/pytorch.org\/serve\/batch_inference_with_ts.html\" rel=\"noreferrer noopener\" target=\"_blank\">official documentation<\/a>\u00a0for details on batch inference with TorchServe.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model_store=model_store\nload_models=resnet-152.mar\nmodels={\n  \"resnet-152\": {\n    \"1.0\": {\n        \"marName\": \"resnet-152.mar\",\n        \"batchSize\": 8,\n        \"maxBatchDelay\": 100,\n        \"responseTimeout\": 200\n    }\n  }\n}\n\n# Number of workers per model\ndefault_workers_per_model=1\n\n# Job queue size (default is 100)\njob_queue_size=100<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"72c6\">Results<\/h3>\n<p class=\"wp-block-paragraph\" id=\"3ea9\">We append the results in the table below:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1xGheBKZ2wFDOi0BgSroiWQ.png?ssl=1\" alt=\"\" class=\"wp-image-600961\"><figcaption class=\"wp-element-caption\"><strong>Batch Inference Server Results (by Author)<\/strong><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"7994\">Enabling batched inference increases the throughput by an additional 26.5%.<\/p>\n<h2 class=\"wp-block-heading\" id=\"1b73\">Multi-Worker Inference with TorchServe<\/h2>\n<p class=\"wp-block-paragraph\" id=\"246b\">Many model inference service solutions will support creating multiple inference workers for each AI model. This enables fine-tuning the number of inference workers based on expected load. 
Some solutions support auto-scaling of the number of inference workers.<\/p>\n<p class=\"wp-block-paragraph\" id=\"a480\">We extend our own TorchServe setup by increasing the\u00a0<code>default_workers_per_model<\/code>\u00a0setting that controls the number of inference workers assigned to our image classification model.<\/p>\n<p class=\"wp-block-paragraph\" id=\"138f\">Importantly, we must limit the number of threads allocated to each worker to prevent resource contention. This is controlled by the\u00a0<code>number_of_netty_threads<\/code><em>\u00a0<\/em>setting and by the\u00a0<code>OMP_NUM_THREADS<\/code>\u00a0and\u00a0<code>MKL_NUM_THREADS<\/code>\u00a0environment variables. Here we have set the number of threads to equal the number of vCPUs (8) divided by the number of workers; with the 2 workers configured below, that is 4 threads each.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">model_store=model_store\nload_models=resnet-152.mar\nmodels={\n  \"resnet-152\": {\n    \"1.0\": {\n        \"marName\": \"resnet-152.mar\",\n        \"batchSize\": 8,\n        \"maxBatchDelay\": 100,\n        \"responseTimeout\": 200\n    }\n  }\n}\n\n# Number of workers per model\ndefault_workers_per_model=2\n\n# Job queue size (default is 100)\njob_queue_size=100\n\n# Number of threads per worker\nnumber_of_netty_threads=4<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The modified TorchServe startup sequence appears below:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">export OMP_NUM_THREADS=4\nexport MKL_NUM_THREADS=4\ntorchserve \\\n    --start \\\n    --disable-token-auth \\\n    --enable-model-api \\\n    --ts-config config.properties<\/code><\/pre>\n<h3 class=\"wp-block-heading\" id=\"5be3\">Results<\/h3>\n<p class=\"wp-block-paragraph\" id=\"5c4a\">In the table below we append the results of running with 2, 4, and 8 inference workers:<\/p>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" decoding=\"async\" 
src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1ZRCedYA53M_YZLpbIBRiig.png?ssl=1\" alt=\"\" class=\"wp-image-600963\"><figcaption class=\"wp-element-caption\"><strong>Multi-Worker Inference Server Results (by Author)<\/strong><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\" id=\"62f0\">By configuring TorchServe to use multiple inference workers, we are able to increase the throughput by an additional 36%. This amounts to a 3.75X improvement over the baseline experiment.<\/p>\n<h2 class=\"wp-block-heading\" id=\"9c32\">Summary<\/h2>\n<p class=\"wp-block-paragraph\" id=\"2a46\">This experiment highlights the potential impact of inference server deployment on multi-job deep learning workloads. Our findings suggest that using an inference server can improve system resource utilization, enable higher concurrency, and significantly increase overall throughput. Keep in mind that the precise benefits will greatly depend on the details of the workload and the runtime environment.<\/p>\n<p class=\"wp-block-paragraph\" id=\"d4e6\">Designing the inference serving architecture is just one part of optimizing AI model execution. 
Please see some of our\u00a0<a href=\"https:\/\/towardsdatascience.com\/author\/chaimrand\/\" rel=\"noreferrer noopener\" target=\"_blank\">many posts<\/a>\u00a0covering a wide range of AI model optimization techniques.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/the-case-for-centralized-ai-model-inference-serving\/\">The Case for Centralized AI Model Inference Serving<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Chaim Rand<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/the-case-for-centralized-ai-model-inference-serving\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Case for Centralized AI Model Inference Serving As AI models continue to increase in scope and accuracy, even tasks once dominated by traditional algorithms are gradually being replaced by Deep Learning models. 
Algorithmic pipelines \u2014 workflows that take an input, process it through a series of algorithms, and produce an output \u2014 increasingly rely [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,88,513,70,402,75],"tags":[98,193,73],"class_list":["post-2799","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-deep-learning","category-inference","category-machine-learning","category-optimization","category-pytorch","tag-ai","tag-inference","tag-models"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2799"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2799"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2799\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2799"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2799"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2799"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}