{"id":35198,"date":"2025-03-20T13:41:02","date_gmt":"2025-03-20T20:41:02","guid":{"rendered":"https:\/\/www.pugetsystems.com\/?post_type=hpc_post&#038;p=35198"},"modified":"2025-03-20T13:41:04","modified_gmt":"2025-03-20T20:41:04","slug":"exploring-hybrid-cpu-gpu-llm-inference","status":"publish","type":"hpc_post","link":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/","title":{"rendered":"Exploring Hybrid CPU\/GPU LLM Inference"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-transparent ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#Introduction\" >Introduction<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#Test_Setup\" >Test Setup<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#Thoughts_on_FP4_Training\" >Thoughts on FP4 Training<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#Hybrid_CPUGPU_Inference_Results\" >Hybrid CPU\/GPU Inference Results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" 
href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#Performance_Considerations\" >Performance Considerations<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n\n<h2 class=\"wp-block-heading\" id=\"h-introduction\"><span class=\"ez-toc-section\" id=\"Introduction\"><\/span>Introduction<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>With releases of very large, capable, open-weight LLMs such as Meta\u2019s Llama-3.1-405B and the more recent DeepSeek models, V3 and R1, there is increasing demand for systems capable of performing inference with these models. However, the size of these models puts pure GPU inference out of reach for many organizations and individual users. For example, even a system with eight 80GB GPUs cannot run Llama-405B at its native precision of BF16; two such nodes would be required. Fully acknowledging the challenges of such a deployment, Meta released a version of the model quantized to FP8, capable of running on a single node equipped with eight 80GB GPUs. Even so, few can claim to have local access to such a system, even here at Puget Systems!<\/p>\n\n\n\n<p>So what is a mere mortal to do? Renting cloud GPU time is one option and is often a great choice for short-term testing. However, it may not be feasible in production due to costs or restrictions on transferring sensitive data. For a purely local solution, one option worth considering is a hybrid approach to inference, in which system RAM is substituted for VRAM. Because system RAM has much lower bandwidth than VRAM, this comes with a sizeable performance penalty, but RAM is considerably less expensive to obtain, which can make it an attractive option. 
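<\/p>\n\n\n\n<p>As a back-of-the-envelope check on those sizes, a model\u2019s weight footprint is roughly its parameter count times the bytes per parameter. A minimal sketch (weights only; a real deployment also needs room for the KV cache, activations, and framework overhead):<\/p>

```python
# Weight-only memory estimate: parameters * bytes per parameter.
# Ignores KV cache, activations, and framework overhead (assumption).
def weights_gb(params_billions, bits_per_param):
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

node_vram_gb = 8 * 80            # one node with eight 80GB GPUs

print(weights_gb(405, 16))       # BF16: 810.0 GB, exceeds 640 GB
print(weights_gb(405, 8))        # FP8:  405.0 GB, fits in 640 GB
```

<p>At BF16 the weights alone exceed a single eight-GPU node\u2019s 640GB of VRAM, while the FP8 release fits, which matches the deployment situation described above.<\/p>\n\n\n\n<p>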
But that raises the question: if you were to go with this approach, just how much slower would it be compared to using an expensive multi-GPU setup?<\/p>\n\n\n\n<div class=\"mod-img wp-block-image aligncenter\">\n<figure class=\"aligncenter\">\n\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/LLM-Inference-on-CPU.png\" alt=\"LLM Inference on CPU\" \/>\n\t\t<\/figure>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-test-setup\"><span class=\"ez-toc-section\" id=\"Test_Setup\"><\/span>Test Setup<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Thanks to our R&amp;D team, I was able to perform some testing on one of our <a href=\"https:\/\/www.pugetsystems.com\/workstations\/epyc\/e210-xl\/\">Puget Systems AMD EPYC Workstations<\/a>, which is equipped with enough system RAM to load even the largest publicly available models.<\/p>\n\n\n\n<p>Motherboard: <a href=\"https:\/\/www.pugetsystems.com\/parts\/Motherboard\/Gigabyte-MZ73-LM0-Rev-2-0-15236\/\">Gigabyte MZ73-LM0 Rev 2.0<\/a><br>CPU: 2x <a href=\"https:\/\/www.pugetsystems.com\/parts\/CPU\/AMD-EPYC-9554-3-1GHz-64-Core-360W-15079\/\">AMD EPYC 9554 64-core Processors<\/a><br>RAM: 24x <a href=\"https:\/\/www.pugetsystems.com\/parts\/Ram\/Kingston-DDR5-5600-ECC-Reg-2R-64GB-14954\/\">Kingston DDR5-5600 ECC Reg. 2R 64GB<\/a><br>GPU: <a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/NVIDIA-RTX-6000-Ada-Generation-48GB-PCI-E-14611\/\">NVIDIA RTX 6000 Ada Generation 48GB<\/a><\/p>\n\n\n\n<p>For the model, I chose to test with Unsloth\u2019s Q4_K_M quantization of DeepSeek-R1. Another exciting option provided by Unsloth is their sub-4-bit <a href=\"https:\/\/unsloth.ai\/blog\/deepseekr1-dynamic\">dynamic quantizations<\/a>. These models\u2019 size on disk ranges from 131GB (1.58-bit) to 212GB (2.51-bit) while offering improved output quality compared to statically quantized versions of the model. 
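<\/p>\n\n\n\n<p>Those sizes are consistent with a simple estimate of average bits per weight times DeepSeek-R1\u2019s roughly 671 billion parameters (an approximation of mine, since the dynamic quantizations mix bit-widths across layers):<\/p>

```python
# Approximate size on disk from average bits per weight.
# 671e9 parameters for DeepSeek-R1 is a rounded figure (assumption).
PARAMS = 671e9

def size_gb(avg_bits_per_weight):
    return PARAMS * avg_bits_per_weight / 8 / 1e9

print(round(size_gb(1.58)))   # ~133 GB, near the reported 131GB
print(round(size_gb(2.51)))   # ~211 GB, near the reported 212GB
print(round(size_gb(8)))      # ~671 GB, near the ~700GB FP8 release
```

<p>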
Compared to the 700GB of the unquantized model, these quantizations allow DeepSeek-R1 to run on a much wider variety of hardware without significantly sacrificing the model\u2019s quality.<\/p>\n\n\n\n<p>For the inference software, I chose to test with <a href=\"https:\/\/github.com\/kvcache-ai\/ktransformers\">KTransformers<\/a>, which advertises up to 28 times faster prefill and 3 times faster decode compared to llama.cpp. The exact gains depend on factors like software versions and hardware (e.g., KTransformers v0.3 includes optimizations for Intel Xeon AMX instructions). However, I did perform a few brief tests with llama.cpp and confirmed that KTransformers was significantly faster with the DeepSeek model chosen for this testing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-thoughts-on-fp4-training\"><span class=\"ez-toc-section\" id=\"Thoughts_on_FP4_Training\"><\/span>Thoughts on FP4 Training<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>With the introduction of FP4 support on NVIDIA\u2019s latest \u201cBlackwell\u201d GPUs, it\u2019s worth looking back to the previous \u201cAda Lovelace\u201d generation, which introduced support for FP8. Ada launched back in October of 2022, and researchers immediately began investigating how FP8 capabilities could be used to make training LLMs faster and more efficient.<\/p>\n\n\n\n<p>Unlike most models, DeepSeek-V3 (which served as the base model for creating R1) was trained with a mixed-precision framework using the FP8 data format. This means that V3 &amp; R1\u2019s native precision is FP8 rather than the more common FP16 or BF16. DeepSeek-V3 wasn\u2019t released until the very end of December 2024. 
I\u2019m not certain whether DeepSeek-V3 was the first publicly released model trained using the newly supported FP8 data format, but it\u2019s clear that it has had the biggest impact. Assuming a similar timeline for FP4 training, we shouldn\u2019t expect to see models with a native precision of FP4 until 2026 at the earliest.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-hybrid-cpu-gpu-inference-results\"><span class=\"ez-toc-section\" id=\"Hybrid_CPUGPU_Inference_Results\"><\/span>Hybrid CPU\/GPU Inference Results<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Although I was only able to scratch the surface of the testing I would ideally like to run, I do have some results to share from my time with this system. I used a series of four prompts of increasing size: one of ~1,000 tokens, one of ~7,500 tokens, and two of ~16,000 tokens. Two series of tests were run using these prompts: one with 126 CPU threads utilized and one with 254.<\/p>\n\n\n\n<p>Before diving into the data, it\u2019s worth noting that because the token count of the output varies between runs, decode-phase durations are less directly comparable between tests than prefill durations.<\/p>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>126 threads<\/th><th>Prefill Performance (Tokens per second)<\/th><th>Decode Performance (Tokens per second)<\/th><th>Prefill Time (seconds)<\/th><th>Decode Time (seconds)<\/th><\/tr><\/thead><tbody><tr><td>Prompt 1<\/td><td>N\/A<\/td><td>13.71<\/td><td>N\/A<\/td><td>64.79<\/td><\/tr><tr><td>Prompt 2<\/td><td>152.19<\/td><td>12.2<\/td><td>50.65<\/td><td>94.93<\/td><\/tr><tr><td>Prompt 3<\/td><td>146.46<\/td><td>10.73<\/td><td>107.59<\/td><td>172.99<\/td><\/tr><tr><td>Prompt 
4<\/td><td>157.94<\/td><td>9.99<\/td><td>107.05<\/td><td>111.20<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Unfortunately, I found that I had recorded an anomalous result for the first prompt during the 126-thread test, so I have omitted it from the table above.<\/p>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>254 threads<\/th><th>Prefill Performance (Tokens per second)<\/th><th>Decode Performance (Tokens per second)<\/th><th>Prefill Time (seconds)<\/th><th>Decode Time (seconds)<\/th><\/tr><\/thead><tbody><tr><td>Prompt 1<\/td><td>172.74<\/td><td>8.42<\/td><td>5.53<\/td><td>116.24<\/td><\/tr><tr><td>Prompt 2<\/td><td>150.43<\/td><td>7.39<\/td><td>51.24<\/td><td>126.55<\/td><\/tr><tr><td>Prompt 3<\/td><td>125.89<\/td><td>7.96<\/td><td>125.03<\/td><td>204.99<\/td><\/tr><tr><td>Prompt 4<\/td><td>126.49<\/td><td>7.89<\/td><td>134.00<\/td><td>160.91<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>When comparing the two series of tests, the difference that immediately stands out is that the 126-thread configuration consistently outperforms the 254-thread configuration during token generation. This makes sense: memory bandwidth is the limiting factor during the decode phase, and using both CPUs forces traffic across the inter-socket interconnect, which not only limits effective memory bandwidth but also introduces additional latency and communication overhead.<\/p>\n\n\n\n<p>Enabling more CPU threads produces an almost 40% drop in token generation speed with the smallest prompt, a gap that shrinks to around 20-25% as the prompt size increases. 
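<\/p>\n\n\n\n<p>As a sanity check on the bandwidth explanation, a rough single-socket roofline can be sketched. The inputs are my assumptions, not measured values: 12 channels of DDR5-5600 per socket, roughly 37 billion active parameters per token for DeepSeek-R1\u2019s MoE architecture, and an average of about 4.5 bits per weight for the Q4_K_M quantization:<\/p>

```python
# Rough upper bound on decode speed for a memory-bound MoE model.
# Assumed: 12 DDR5-5600 channels per socket, ~37e9 active parameters
# per token, ~4.5 bits per weight on average for Q4_K_M.
channels = 12
gb_per_s_per_channel = 5600e6 * 8 / 1e9            # 44.8 GB/s per channel
peak_bw_gb_s = channels * gb_per_s_per_channel     # ~537.6 GB/s peak

active_params = 37e9
gb_read_per_token = active_params * 4.5 / 8 / 1e9  # ~20.8 GB per token

ceiling_tps = peak_bw_gb_s / gb_read_per_token
print(round(ceiling_tps, 1))                       # ~25.8 tokens/s ceiling
```

<p>The measured ~13 tokens per second is roughly half of this theoretical ceiling, which is plausible once real-world memory-controller efficiency and non-weight traffic are accounted for.<\/p>\n\n\n\n<p>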
Considering that the 254-thread configuration almost certainly draws more power while delivering lower performance, its efficiency is poor, and I regret not recording the system\u2019s power consumption for comparison.<\/p>\n\n\n\n<p>We also see that prefill performance is more consistent when using fewer threads, staying at roughly 150 tokens per second. Although the computation during this phase is primarily performed by the GPU, it too appears to be affected by the increased utilization of the CPU interconnect: performance drops to about 125 tokens per second when longer prompts are used with the 254-thread configuration.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-performance-considerations\"><span class=\"ez-toc-section\" id=\"Performance_Considerations\"><\/span>Performance Considerations<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Regardless of the specifics of the software configuration, the end-user experience is considerably different from what we may be used to. Compared to pure-GPU inference, which offers a largely real-time experience, <strong>we can expect to spend several minutes waiting for a response to complete,<\/strong> depending on the size of the prompt and the desired length of the response. Depending on the use case, this may or may not be an acceptable trade-off for the ability to run models that are too large to fit entirely within VRAM.<\/p>\n\n\n\n<p>Based on my limited testing, the best hardware for hybrid inference seems to be a single-socket platform with as many RAM channels as possible to maximize memory bandwidth. 
Although dual-socket motherboards offer higher CPU thread counts and greater total RAM capacity, the cost of communicating across the CPU interconnect translates into decreased performance.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Hardware cost is a major barrier to deploying the largest LLMs available today. Very large models like DeepSeek-R1 or Llama-3.1-405B can require hundreds of thousands of dollars\u2019 worth of hardware to run. Hybrid inference, utilizing CPUs and system memory in addition to GPUs and VRAM, offers a more affordable alternative to pure GPU inference, albeit with slower performance.<\/p>\n\n\n\n<p>A hybrid approach will never be as fast as a pure GPU one (either in the cloud or on a local server), and it can take several minutes to complete a response. However, for those who cannot justify a multi-hundred-thousand-dollar server and are able to trade inference speed for the improved output quality that larger models offer, a hybrid inference solution (based on something like our <a href=\"https:\/\/www.pugetsystems.com\/workstations\/servers\/e140-2u\/\">AMD EPYC Servers<\/a>) is worth considering.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A brief look into using a hybrid GPU\/VRAM + CPU\/RAM approach to LLM inference with the KTransformers inference library.<\/p>\n","protected":false},"author":166,"featured_media":35203,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","meta":{"_acf_changed":false,"content-type":"","classic-editor-remember":"","legacy_id":"","redirect_url":[],"expire_date":"","alert_message":"","alert_link":[],"configure_ids":"","system_grid_title":"","system_grid_ids":"","footnotes":""},"hpc_categories":[8879,8883,8885],"hpc_tags":[8765,8768,8770],"coauthors":[9063],"class_list":["post-35198","hpc_post","type-hpc_post","status-publish","has-post-thumbnail","hentry","hpc_category-hardware","hpc_category-machine-learning","hpc_category-software","hpc_tag-machine-learning","hpc_tag-microsoft","hpc_tag-ml-ai"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.7 (Yoast SEO v26.7) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ 
-->\n<title>Exploring Hybrid CPU\/GPU LLM Inference | Puget Systems<\/title>\n<meta name=\"description\" content=\"A brief look into using a hybrid GPU\/VRAM + CPU\/RAM approach to LLM inference with the KTransformers inference library.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Exploring Hybrid CPU\/GPU LLM Inference\" \/>\n<meta property=\"og:description\" content=\"A brief look into using a hybrid GPU\/VRAM + CPU\/RAM approach to LLM inference with the KTransformers inference library.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/\" \/>\n<meta property=\"og:site_name\" content=\"Puget Systems\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/PugetSystems\" \/>\n<meta property=\"article:modified_time\" content=\"2025-03-20T20:41:04+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/LLM-Inference-on-CPU.png?wsr\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@PugetSystems\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"7 minutes\" \/>\n\t<meta name=\"twitter:label2\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data2\" content=\"Jon Allman\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/\",\"url\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/\",\"name\":\"Exploring Hybrid CPU\/GPU LLM Inference | Puget Systems\",\"isPartOf\":{\"@id\":\"https:\/\/www.pugetsystems.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/LLM-Inference-on-CPU.png\",\"datePublished\":\"2025-03-20T20:41:02+00:00\",\"dateModified\":\"2025-03-20T20:41:04+00:00\",\"description\":\"A brief look into using a hybrid GPU\/VRAM + CPU\/RAM approach to LLM inference with the KTransformers inference 
library.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#primaryimage\",\"url\":\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/LLM-Inference-on-CPU.png\",\"contentUrl\":\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/LLM-Inference-on-CPU.png\",\"width\":1920,\"height\":1080},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.pugetsystems.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"HPC Posts\",\"item\":\"https:\/\/www.pugetsystems.com\/all-hpc\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Exploring Hybrid CPU\/GPU LLM Inference\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.pugetsystems.com\/#website\",\"url\":\"https:\/\/www.pugetsystems.com\/\",\"name\":\"Puget Systems\",\"description\":\"Workstations for creators.\",\"publisher\":{\"@id\":\"https:\/\/www.pugetsystems.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.pugetsystems.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.pugetsystems.com\/#organization\",\"name\":\"Puget 
Systems\",\"url\":\"https:\/\/www.pugetsystems.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png\",\"contentUrl\":\"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png\",\"width\":2560,\"height\":363,\"caption\":\"Puget Systems\"},\"image\":{\"@id\":\"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/PugetSystems\",\"https:\/\/x.com\/PugetSystems\",\"https:\/\/www.instagram.com\/pugetsystems\/\",\"https:\/\/www.linkedin.com\/company\/puget-systems\",\"https:\/\/www.youtube.com\/user\/pugetsys\",\"https:\/\/en.wikipedia.org\/wiki\/Puget_Systems\"],\"telephone\":\"(425) 458-0273\",\"legalName\":\"Puget Sound Systems, Inc.\",\"foundingDate\":\"2000-12-01\",\"duns\":\"128267585\",\"naics\":\"334111\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Exploring Hybrid CPU\/GPU LLM Inference | Puget Systems","description":"A brief look into using a hybrid GPU\/VRAM + CPU\/RAM approach to LLM inference with the KTransformers inference library.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/","og_locale":"en_US","og_type":"article","og_title":"Exploring Hybrid CPU\/GPU LLM Inference","og_description":"A brief look into using a hybrid GPU\/VRAM + CPU\/RAM approach to LLM inference with the KTransformers inference library.","og_url":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/","og_site_name":"Puget Systems","article_publisher":"https:\/\/www.facebook.com\/PugetSystems","article_modified_time":"2025-03-20T20:41:04+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/LLM-Inference-on-CPU.png?wsr","type":"image\/png"}],"twitter_card":"summary_large_image","twitter_site":"@PugetSystems","twitter_misc":{"Est. 
reading time":"7 minutes","Written by":"Jon Allman"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/","url":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/","name":"Exploring Hybrid CPU\/GPU LLM Inference | Puget Systems","isPartOf":{"@id":"https:\/\/www.pugetsystems.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#primaryimage"},"image":{"@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#primaryimage"},"thumbnailUrl":"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/LLM-Inference-on-CPU.png","datePublished":"2025-03-20T20:41:02+00:00","dateModified":"2025-03-20T20:41:04+00:00","description":"A brief look into using a hybrid GPU\/VRAM + CPU\/RAM approach to LLM inference with the KTransformers inference library.","breadcrumb":{"@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#primaryimage","url":"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/LLM-Inference-on-CPU.png","contentUrl":"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/LLM-Inference-on-CPU.png","width":1920,"height":1080},{"@type":"BreadcrumbList","@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.pugetsystems.com\/"},{"@type":"ListItem","position":2,"name":"HPC 
Posts","item":"https:\/\/www.pugetsystems.com\/all-hpc\/"},{"@type":"ListItem","position":3,"name":"Exploring Hybrid CPU\/GPU LLM Inference"}]},{"@type":"WebSite","@id":"https:\/\/www.pugetsystems.com\/#website","url":"https:\/\/www.pugetsystems.com\/","name":"Puget Systems","description":"Workstations for creators.","publisher":{"@id":"https:\/\/www.pugetsystems.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pugetsystems.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.pugetsystems.com\/#organization","name":"Puget Systems","url":"https:\/\/www.pugetsystems.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png","contentUrl":"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png","width":2560,"height":363,"caption":"Puget Systems"},"image":{"@id":"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/PugetSystems","https:\/\/x.com\/PugetSystems","https:\/\/www.instagram.com\/pugetsystems\/","https:\/\/www.linkedin.com\/company\/puget-systems","https:\/\/www.youtube.com\/user\/pugetsys","https:\/\/en.wikipedia.org\/wiki\/Puget_Systems"],"telephone":"(425) 458-0273","legalName":"Puget Sound Systems, 
Inc.","foundingDate":"2000-12-01","duns":"128267585","naics":"334111"}]}},"_links":{"self":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_posts\/35198","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_posts"}],"about":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/types\/hpc_post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/users\/166"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/comments?post=35198"}],"version-history":[{"count":0,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_posts\/35198\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/media\/35203"}],"wp:attachment":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/media?parent=35198"}],"wp:term":[{"taxonomy":"hpc_category","embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_categories?post=35198"},{"taxonomy":"hpc_tag","embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_tags?post=35198"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/coauthors?post=35198"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}