{"id":23187,"date":"2024-02-16T10:06:51","date_gmt":"2024-02-16T18:06:51","guid":{"rendered":"https:\/\/www.pugetsystems.com\/?post_type=hpc_post&#038;p=23187"},"modified":"2024-02-16T10:06:54","modified_gmt":"2024-02-16T18:06:54","slug":"benchmarking-with-tensorrt-llm","status":"publish","type":"hpc_post","link":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/","title":{"rendered":"Benchmarking with TensorRT-LLM"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_80 counter-hierarchy ez-toc-counter ez-toc-transparent ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<span class=\"ez-toc-title-toggle\"><\/span><\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#Introduction\" >Introduction<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#Test_Setup\" >Test Setup<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#Results\" >Results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#Closing_Thoughts\" >Closing Thoughts<\/a><\/li><\/ul><\/nav><\/div>\n\n<h2 class=\"wp-block-heading\" id=\"h-introduction\"><span class=\"ez-toc-section\" id=\"Introduction\"><\/span>Introduction<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Here at Puget Systems, we do a lot of hardware evaluation and testing that we <a 
href=\"https:\/\/www.pugetsystems.com\/all-articles\/\">freely publish<\/a> and make available to the public. At the moment, most of our testing is focused on content creation workflows like video editing, photography, and game development. However, we\u2019re currently evaluating AI\/ML-focused benchmarks to add to our testing suite to better understand how hardware choices affect the performance of these workloads. One of these benchmarks comes from NVIDIA in the form of <a href=\"https:\/\/github.com\/NVIDIA\/TensorRT-LLM\"><strong>TensorRT-LLM<\/strong><\/a>, and in this post, I\u2019d like to talk about TensorRT-LLM and share some preliminary inference results from a selection of NVIDIA GPUs.<\/p>\n\n\n\n<p>Here&#8217;s how NVIDIA describes TensorRT-LLM: \u201cTensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.\u201d<\/p>\n\n\n\n<p>Based on the name alone, it seems reasonable to assume that TensorRT-LLM benchmark performance will scale closely with Tensor Core performance. Since all the GPUs I tested feature 4th-generation Tensor Cores, comparing the Tensor Core count per GPU should give us a reasonable metric to estimate the performance of each card. However, as the results will soon show, there is more to an LLM workload than raw computational power. 
The width of a GPU&#8217;s memory bus, and more holistically, the overall memory bandwidth, is an important variable to consider when selecting GPUs for machine learning tasks.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>GPU<\/th><th>VRAM (GB)<\/th><th>Tensor Cores<\/th><th>Memory Bus Width<\/th><th>Memory Bandwidth<\/th><\/tr><\/thead><tbody><tr><td>NVIDIA GeForce RTX 4090<\/td><td>24<\/td><td>512<\/td><td>384-bit<\/td><td>~1000 GB\/s<\/td><\/tr><tr><td>NVIDIA GeForce RTX 4080 SUPER<\/td><td>16<\/td><td>320<\/td><td>256-bit<\/td><td>~735 GB\/s<\/td><\/tr><tr><td>NVIDIA GeForce RTX 4080<\/td><td>16<\/td><td>304<\/td><td>256-bit<\/td><td>~715 GB\/s<\/td><\/tr><tr><td>NVIDIA GeForce RTX 4070 Ti SUPER<\/td><td>16<\/td><td>264<\/td><td>256-bit<\/td><td>~670 GB\/s<\/td><\/tr><tr><td>NVIDIA GeForce RTX 4070 Ti<\/td><td>12<\/td><td>240<\/td><td>192-bit<\/td><td>~500 GB\/s<\/td><\/tr><tr><td>NVIDIA GeForce RTX 4070 SUPER<\/td><td>12<\/td><td>224<\/td><td>192-bit<\/td><td>~500 GB\/s<\/td><\/tr><tr><td>NVIDIA GeForce RTX 4070<\/td><td>12<\/td><td>184<\/td><td>192-bit<\/td><td>~500 GB\/s<\/td><\/tr><tr><td>NVIDIA GeForce RTX 4060 Ti<\/td><td>8<\/td><td>136<\/td><td>128-bit<\/td><td>~290 GB\/s<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>NVIDIA was kind enough to send us a package for TensorRT-LLM v0.5.0 containing a number of scripts to simplify the installation of the dependencies, create virtual environments, and properly configure the environment variables. This is all incredibly helpful when you expect to run benchmarks on a great number of systems! Additionally, these scripts are intended to set TensorRT-LLM up on Windows, making it much easier for us to implement into our current benchmark suite. 
<\/p>\n\n\n\n<p>However, although TensorRT-LLM supports tensor parallelism and pipeline parallelism, it appears that multi-GPU usage may be restricted to Linux, as the documentation states that \u201cTensorRT-LLM is supported on bare-metal Windows for single-GPU inference.\u201d Another limitation of this tool is that we can only use it to test NVIDIA GPUs, leaving out CPU inference, AMD GPUs, and Intel GPUs. That said, given NVIDIA\u2019s current dominance in this field, there&#8217;s still value in a tool for comparing the capabilities and relative performance of NVIDIA GPUs.<\/p>\n\n\n\n<p>Another consideration is that, like TensorRT for Stable Diffusion, an engine must be generated for each combination of LLM and GPU. However, I was surprised to find that an engine generated for one GPU did not prevent the benchmark from completing when used with another GPU. Using mismatched engines did occasionally impact performance depending on the test variables, so as expected, the best practice is to generate a new engine for each GPU. I also suspect the generated text would be meaningless when an incorrect engine is used, but these benchmarks don&#8217;t display any output.<\/p>\n\n\n\n<p>Despite all these caveats, we look forward to seeing how different GPUs perform with this TensorRT-optimized LLM package. 
We will start by only looking at NVIDIA\u2019s GeForce line, but we hope to expand this testing to include the Professional RTX cards and a range of other LLM packages in the future.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-test-setup\"><span class=\"ez-toc-section\" id=\"Test_Setup\"><\/span>Test Setup<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">CPU: <a href=\"https:\/\/www.pugetsystems.com\/parts\/CPU\/AMD-Ryzen-Threadripper-Pro-5995WX-2-7GHz-64-Core-280W-14500\/\" target=\"_blank\" rel=\"noreferrer noopener\">AMD Threadripper PRO 5995WX 64-Core<\/a><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">CPU Cooler: <a href=\"https:\/\/www.pugetsystems.com\/parts\/CPU-Cooling\/Noctua-NH-U14S-TR4-SP3-13472\/\/\/\" target=\"_blank\" rel=\"noreferrer noopener\">Noctua NH-U14S TR4-SP3 (AMD TR4)<\/a><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Motherboard: <a href=\"https:\/\/www.pugetsystems.com\/parts\/Motherboard\/Asus-Pro-WS-WRX80E-SAGE-SE-WIFI-13932\/\/\" target=\"_blank\" rel=\"noreferrer noopener\">ASUS Pro WS WRX80E-SAGE SE WIFI<\/a><br><em>BIOS Version: 1201<\/em><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">RAM: 8x <a href=\"https:\/\/www.pugetsystems.com\/parts\/Ram\/Micron-DDR4-3200-16GB-ECC-Reg-13638\" target=\"_blank\" rel=\"noreferrer noopener\">Micron DDR4-3200 16GB ECC Reg.<\/a><br>(128GB total)<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">GPUs:<br><a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/NVIDIA-GeForce-RTX-4090-24GB-Founders-Edition-14604\/\">NVIDIA GeForce RTX 4090 24GB Founders Edition<\/a><br><a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/NVIDIA-GeForce-RTX-4080-16GB-Founders-Edition-14610\/\">NVIDIA GeForce RTX 4080 16GB Founders Edition<\/a><br><a 
href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/PNY-GeForce-RTX-4070-Ti-SUPER-Verto-16GB-15378\/\">PNY GeForce RTX 4070 Ti SUPER Verto 16GB<\/a><br><a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/NVIDIA-GeForce-RTX-4070-SUPER-12GB-Founders-Edition-15352\/\">NVIDIA GeForce RTX 4070 SUPER 12GB Founders Edition<\/a><br><a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/Asus-GeForce-RTX-4070-Ti-STRIX-OC-12GB-14748\/\">Asus GeForce RTX 4070 Ti STRIX OC 12GB<\/a><br><a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/NVIDIA-GeForce-RTX-4070-12GB-Founders-Edition-14901\/\">NVIDIA GeForce RTX 4070 12GB Founders Edition<\/a><br><a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/Asus-GeForce-RTX-4060-Ti-TUF-OC-8GB-14946\/\">Asus GeForce RTX 4060 Ti TUF OC 8GB<\/a><br><em>Driver Version: 551.23<\/em><br><a href=\"https:\/\/www.pugetsystems.com\/parts\/Video-Card\/NVIDIA-GeForce-RTX-4080-SUPER-16GB-Founders-Edition-15351\">NVIDIA GeForce RTX 4080 SUPER 16GB Founders Edition<\/a><br><em>Driver Version: 551.31<\/em><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">PSU: <a href=\"https:\/\/www.pugetsystems.com\/parts\/Power-Supply\/Super-Flower-LEADEX-Platinum-1600W-13583\/\" target=\"_blank\" rel=\"noreferrer noopener\">Super Flower LEADEX Platinum 1600W<\/a><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Storage: <a href=\"https:\/\/www.pugetsystems.com\/parts\/Hard-Drive\/Samsung-980-Pro-2TB-Gen4-M-2-SSD-13831\/\/\" target=\"_blank\" rel=\"noreferrer noopener\">Samsung 980 Pro 2TB<\/a><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">OS: <a href=\"https:\/\/www.pugetsystems.com\/parts\/Operating-System\/Windows-11-Pro-64-bit-14215\">Windows 11 Pro 22H2 build 22621.3007<\/a><\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Software: <a href=\"https:\/\/github.com\/NVIDIA\/TensorRT-LLM\">TensorRT-LLM v0.5.0<\/a><br><a 
href=\"https:\/\/github.com\/NVIDIA\/TensorRT\">TensorRT 9.1.0.4<\/a><br><a href=\"https:\/\/developer.nvidia.com\/cudnn\">cuDNN 8.9.5 CUDA 12<\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The TensorRT-LLM package we received was configured to use the Llama-2-7b model, quantized to a 4-bit AWQ format. Although TensorRT-LLM supports a variety of models and quantization methods, I chose to stick with this relatively lightweight model to test a number of GPUs without worrying too much about VRAM limitations.<\/p>\n\n\n\n<p>For each row of variables below, I ran five consecutive tests per GPU and averaged the results.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>input length<\/td><td>output length<\/td><td>batch size<\/td><\/tr><tr><td>100<\/td><td>100<\/td><td>1<\/td><\/tr><tr><td>100<\/td><td>100<\/td><td>8<\/td><\/tr><tr><td>2048<\/td><td>512<\/td><td>1<\/td><\/tr><tr><td>2048<\/td><td>512<\/td><td>8<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-results\"><span class=\"ez-toc-section\" id=\"Results\"><\/span>Results<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<script type=\"text\/javascript\">\n    jQuery(document).ready(function(){\n        var slid_to = 0;\n        jQuery('#image-gallery-12002').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n        });\n        jQuery('#image-gallery-12002LargeCarousel').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n            jQuery('#image-gallery-12002').carousel(slid_to);\n        });\n\n        jQuery('#image-gallery-12002 .carousel-item img').click(function(){\n            jQuery('#image-gallery-12002LargeCarousel').carousel(slid_to);\n        });\n    });\n<\/script>\n\n<div id=\"image-gallery-12002\" class=\"carousel carousel-dark slide gallery\" data-interval=\"false\">\n\t<div class=\"carousel-indicators\">\n\t\t            <div data-target=\"#image-gallery-12002\" data-slide-to=\"0\" 
class=\"active\">\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Tokens-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing tokens per second for 100 input length, 100 output length, and batch size 1.\" \/>            <\/div>\n                        <div data-target=\"#image-gallery-12002\" data-slide-to=\"1\" >\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Latency-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing latency for 100 input length, 100 output length, and batch size 1.\" \/>            <\/div>\n            \t<\/div><!-- .carousel-indicators -->\n\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t<div class=\"carousel-item active\">\n\n                \t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Tokens.png\"\n\t\t\t\t     alt=\"Chart listing tokens per second for 100 input length, 100 output length, and batch size 1.\" class=\"d-block mx-auto h-100\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-12002Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t\t\t<div class=\"carousel-item \">\n\n                \t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Latency.png\"\n\t\t\t\t     alt=\"Chart listing latency for 100 input length, 100 output length, and batch size 1.\" class=\"d-block mx-auto h-100\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-12002Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t<\/div>\n\t<a class=\"carousel-control-prev\" href=\"#image-gallery-12002\" role=\"button\" data-slide=\"prev\">\n\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Previous<\/span>\n\t<\/a>\n\t<a class=\"carousel-control-next\" href=\"#image-gallery-12002\" 
role=\"button\" data-slide=\"next\">\n\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Next<\/span>\n\t<\/a>\n<\/div>\n<div class=\"gallery-caption\"><\/div>\n\n\n\n\n\n<div class=\"modal fade\" id=\"image-gallery-12002Modal\" tabindex=\"-1\" role=\"dialog\">\n\t<div class=\"modal-dialog modal-xl\" role=\"document\">\n\t\t<div class=\"modal-content\">\n\t\t\t<div class=\"modal-header\">\n\t\t\t\t<h5 class=\"modal-title\">System Image<\/h5>\n\t\t\t\t<button type=\"button\" class=\"close\" data-dismiss=\"modal\" aria-label=\"Close\">\n\t\t\t\t\t<span aria-hidden=\"true\">&times;<\/span>\n\t\t\t\t<\/button>\n\t\t\t<\/div>\n\t\t\t<div class=\"modal-body\">\n\t\t\t\t<div id=\"image-gallery-12002LargeCarousel\" class=\"carousel carousel-dark slide modal-gallery\" data-interval=\"false\">\n\t\t\t\t\t\t\t\t\t<ol class=\"carousel-indicators\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-12002LargeCarousel\" data-slide-to=\"0\" class=\"active\"><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-12002LargeCarousel\" data-slide-to=\"1\" ><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/ol>\n\t\t\t\t\t\t\t\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item active\">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Tokens.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing tokens per second for 100 input length, 100 output length, and batch size 1.\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-12002Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Tokens.png\" target=\"_blank\">Open Full Resolution <i class=\"fas 
fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item \">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Latency.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing latency for 100 input length, 100 output length, and batch size 1.\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-12002Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001001-Latency.png\" target=\"_blank\">Open Full Resolution <i class=\"fas fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<a class=\"carousel-control-prev\" href=\"#image-gallery-12002LargeCarousel\" role=\"button\" data-slide=\"prev\">\n\t\t\t\t\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Previous<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t\t<a class=\"carousel-control-next\" href=\"#image-gallery-12002LargeCarousel\" role=\"button\" data-slide=\"next\">\n\t\t\t\t\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Next<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t<\/div><!-- .modal-body -->\n\t\t<\/div>\n\t<\/div>\n<\/div>\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Overall, the performance results fall right in line with expectations based on the Tensor Core count of each card. However, we also see the performance between cards featuring the same memory bus width is quite close, despite the relative differences in Tensor Core count. 
The most notable example of this occurred between the 4070 and 4070 Ti. Although the 4070 Ti has roughly 30% more Tensor Cores than the base 4070, this only translated into a difference of about two tokens per second. <\/p>\n\n\n\n<script type=\"text\/javascript\">\n    jQuery(document).ready(function(){\n        var slid_to = 0;\n        jQuery('#image-gallery-30929').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n        });\n        jQuery('#image-gallery-30929LargeCarousel').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n            jQuery('#image-gallery-30929').carousel(slid_to);\n        });\n\n        jQuery('#image-gallery-30929 .carousel-item img').click(function(){\n            jQuery('#image-gallery-30929LargeCarousel').carousel(slid_to);\n        });\n    });\n<\/script>\n\n<div id=\"image-gallery-30929\" class=\"carousel carousel-dark slide gallery\" data-interval=\"false\">\n\t<div class=\"carousel-indicators\">\n\t\t            <div data-target=\"#image-gallery-30929\" data-slide-to=\"0\" class=\"active\">\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Tokens-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing tokens per second for 100 input length, 100 output length, and batch size 8.\" \/>            <\/div>\n                        <div data-target=\"#image-gallery-30929\" data-slide-to=\"1\" >\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Latency-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing latency for 100 input length, 100 output length, and batch size 8.\" \/>            <\/div>\n            \t<\/div><!-- .carousel-indicators -->\n\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t<div class=\"carousel-item active\">\n\n                \t\t\t\t<img decoding=\"async\" 
src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Tokens.png\"\n\t\t\t\t     alt=\"Chart listing tokens per second for 100 input length, 100 output length, and batch size 8.\" class=\"d-block mx-auto h-100\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-30929Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t\t\t<div class=\"carousel-item \">\n\n                \t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Latency.png\"\n\t\t\t\t     alt=\"Chart listing latency for 100 input length, 100 output length, and batch size 8.\" class=\"d-block mx-auto h-100\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-30929Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t<\/div>\n\t<a class=\"carousel-control-prev\" href=\"#image-gallery-30929\" role=\"button\" data-slide=\"prev\">\n\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Previous<\/span>\n\t<\/a>\n\t<a class=\"carousel-control-next\" href=\"#image-gallery-30929\" role=\"button\" data-slide=\"next\">\n\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Next<\/span>\n\t<\/a>\n<\/div>\n<div class=\"gallery-caption\"><\/div>\n\n\n\n\n\n<div class=\"modal fade\" id=\"image-gallery-30929Modal\" tabindex=\"-1\" role=\"dialog\">\n\t<div class=\"modal-dialog modal-xl\" role=\"document\">\n\t\t<div class=\"modal-content\">\n\t\t\t<div class=\"modal-header\">\n\t\t\t\t<h5 class=\"modal-title\">System Image<\/h5>\n\t\t\t\t<button type=\"button\" class=\"close\" data-dismiss=\"modal\" aria-label=\"Close\">\n\t\t\t\t\t<span aria-hidden=\"true\">&times;<\/span>\n\t\t\t\t<\/button>\n\t\t\t<\/div>\n\t\t\t<div class=\"modal-body\">\n\t\t\t\t<div id=\"image-gallery-30929LargeCarousel\" class=\"carousel carousel-dark slide modal-gallery\" data-interval=\"false\">\n\t\t\t\t\t\t\t\t\t<ol 
class=\"carousel-indicators\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-30929LargeCarousel\" data-slide-to=\"0\" class=\"active\"><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-30929LargeCarousel\" data-slide-to=\"1\" ><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/ol>\n\t\t\t\t\t\t\t\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item active\">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Tokens.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing tokens per second for 100 input length, 100 output length, and batch size 8.\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-30929Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Tokens.png\" target=\"_blank\">Open Full Resolution <i class=\"fas fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item \">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Latency.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing latency for 100 input length, 100 output length, and batch size 8.\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-30929Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/1001008-Latency.png\" target=\"_blank\">Open Full Resolution <i class=\"fas 
fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<a class=\"carousel-control-prev\" href=\"#image-gallery-30929LargeCarousel\" role=\"button\" data-slide=\"prev\">\n\t\t\t\t\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Previous<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t\t<a class=\"carousel-control-next\" href=\"#image-gallery-30929LargeCarousel\" role=\"button\" data-slide=\"next\">\n\t\t\t\t\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Next<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t<\/div><!-- .modal-body -->\n\t\t<\/div>\n\t<\/div>\n<\/div>\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Using the same input and output length of 100, but increasing the batch size to 8, we start to see a slightly wider spread across the cards that were previously grouped by their relative memory bandwidth, causing the results to trend towards tracking more closely with Tensor Cores.<\/p>\n\n\n\n<script type=\"text\/javascript\">\n    jQuery(document).ready(function(){\n        var slid_to = 0;\n        jQuery('#image-gallery-79873').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n        });\n        jQuery('#image-gallery-79873LargeCarousel').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n            jQuery('#image-gallery-79873').carousel(slid_to);\n        });\n\n        jQuery('#image-gallery-79873 .carousel-item img').click(function(){\n            jQuery('#image-gallery-79873LargeCarousel').carousel(slid_to);\n        });\n    });\n<\/script>\n\n<div id=\"image-gallery-79873\" class=\"carousel carousel-dark slide gallery\" data-interval=\"false\">\n\t<div class=\"carousel-indicators\">\n\t\t            <div data-target=\"#image-gallery-79873\" data-slide-to=\"0\" 
class=\"active\">\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Tokens-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing tokens per second for 2048 input length, 512 output length, and batch size 1.\" \/>            <\/div>\n                        <div data-target=\"#image-gallery-79873\" data-slide-to=\"1\" >\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Latency-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing latency for 2048 input length, 512 output length, and batch size 1.\" \/>            <\/div>\n            \t<\/div><!-- .carousel-indicators -->\n\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t<div class=\"carousel-item active\">\n\n                \t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Tokens.png\"\n\t\t\t\t     alt=\"Chart listing tokens per second for 2048 input length, 512 output length, and batch size 1.\" class=\"d-block mx-auto h-100\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-79873Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t\t\t<div class=\"carousel-item \">\n\n                \t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Latency.png\"\n\t\t\t\t     alt=\"Chart listing latency for 2048 input length, 512 output length, and batch size 1.\" class=\"d-block mx-auto h-100\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-79873Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t<\/div>\n\t<a class=\"carousel-control-prev\" href=\"#image-gallery-79873\" role=\"button\" data-slide=\"prev\">\n\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Previous<\/span>\n\t<\/a>\n\t<a class=\"carousel-control-next\" href=\"#image-gallery-79873\" 
role=\"button\" data-slide=\"next\">\n\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Next<\/span>\n\t<\/a>\n<\/div>\n<div class=\"gallery-caption\"><\/div>\n\n\n\n\n\n<div class=\"modal fade\" id=\"image-gallery-79873Modal\" tabindex=\"-1\" role=\"dialog\">\n\t<div class=\"modal-dialog modal-xl\" role=\"document\">\n\t\t<div class=\"modal-content\">\n\t\t\t<div class=\"modal-header\">\n\t\t\t\t<h5 class=\"modal-title\">System Image<\/h5>\n\t\t\t\t<button type=\"button\" class=\"close\" data-dismiss=\"modal\" aria-label=\"Close\">\n\t\t\t\t\t<span aria-hidden=\"true\">&times;<\/span>\n\t\t\t\t<\/button>\n\t\t\t<\/div>\n\t\t\t<div class=\"modal-body\">\n\t\t\t\t<div id=\"image-gallery-79873LargeCarousel\" class=\"carousel carousel-dark slide modal-gallery\" data-interval=\"false\">\n\t\t\t\t\t\t\t\t\t<ol class=\"carousel-indicators\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-79873LargeCarousel\" data-slide-to=\"0\" class=\"active\"><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-79873LargeCarousel\" data-slide-to=\"1\" ><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/ol>\n\t\t\t\t\t\t\t\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item active\">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Tokens.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing tokens per second for 2048 input length, 512 output length, and batch size 1.\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-79873Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Tokens.png\" target=\"_blank\">Open Full Resolution <i class=\"fas 
fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item \">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Latency.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing latency for 2048 input length, 512 output length, and batch size 1.\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-79873Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485121-Latency.png\" target=\"_blank\">Open Full Resolution <i class=\"fas fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<a class=\"carousel-control-prev\" href=\"#image-gallery-79873LargeCarousel\" role=\"button\" data-slide=\"prev\">\n\t\t\t\t\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Previous<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t\t<a class=\"carousel-control-next\" href=\"#image-gallery-79873LargeCarousel\" role=\"button\" data-slide=\"next\">\n\t\t\t\t\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Next<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t<\/div><!-- .modal-body -->\n\t\t<\/div>\n\t<\/div>\n<\/div>\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Increasing the input and output lengths to the maximum recommended by NVIDIA for this particular model, we once again find that with a batch size of one, the results trend closely with the available memory bandwidth. 
It&#8217;s slightly less pronounced than with the smaller &#8220;100,100&#8221; tests, and I suspect that the larger context requires more calculations, providing a greater benefit to having more Tensor Cores available. Despite that, it&#8217;s clear that memory bandwidth still has a major role in this test.<\/p>\n\n\n\n<script type=\"text\/javascript\">\n    jQuery(document).ready(function(){\n        var slid_to = 0;\n        jQuery('#image-gallery-27442').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n        });\n        jQuery('#image-gallery-27442LargeCarousel').on('slid.bs.carousel', function(e){\n            slid_to = e.to;\n            jQuery('#image-gallery-27442').carousel(slid_to);\n        });\n\n        jQuery('#image-gallery-27442 .carousel-item img').click(function(){\n            jQuery('#image-gallery-27442LargeCarousel').carousel(slid_to);\n        });\n    });\n<\/script>\n\n<div id=\"image-gallery-27442\" class=\"carousel carousel-dark slide gallery\" data-interval=\"false\">\n\t<div class=\"carousel-indicators\">\n\t\t            <div data-target=\"#image-gallery-27442\" data-slide-to=\"0\" class=\"active\">\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Tokens-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing tokens per second for 2048 input length, 512 output length, and batch size 8.\" \/>            <\/div>\n                        <div data-target=\"#image-gallery-27442\" data-slide-to=\"1\" >\n                <img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Latency-150x150.png\" class=\"carousel-thumbnail\" alt=\"Chart listing latency for 2048 input length, 512 output length, and batch size 8.\" \/>            <\/div>\n            \t<\/div><!-- .carousel-indicators -->\n\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t<div 
class=\"carousel-item active\">\n\n                \t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Tokens.png\"\n\t\t\t\t     alt=\"Chart listing tokens per second for 2048 input length, 512 output length, and batch size 8.\" class=\"d-block mx-auto h-100\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-27442Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t\t\t<div class=\"carousel-item \">\n\n                \t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Latency.png\"\n\t\t\t\t     alt=\"Chart listing latency for 2048 input length, 512 output length, and batch size 8.\" class=\"d-block mx-auto h-100\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-27442Modal\" \/>\n                \t\t\t<\/div>\n\t\t\t\t<\/div>\n\t<a class=\"carousel-control-prev\" href=\"#image-gallery-27442\" role=\"button\" data-slide=\"prev\">\n\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Previous<\/span>\n\t<\/a>\n\t<a class=\"carousel-control-next\" href=\"#image-gallery-27442\" role=\"button\" data-slide=\"next\">\n\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t<span class=\"sr-only\">Next<\/span>\n\t<\/a>\n<\/div>\n<div class=\"gallery-caption\"><\/div>\n\n\n\n\n\n<div class=\"modal fade\" id=\"image-gallery-27442Modal\" tabindex=\"-1\" role=\"dialog\">\n\t<div class=\"modal-dialog modal-xl\" role=\"document\">\n\t\t<div class=\"modal-content\">\n\t\t\t<div class=\"modal-header\">\n\t\t\t\t<h5 class=\"modal-title\">System Image<\/h5>\n\t\t\t\t<button type=\"button\" class=\"close\" data-dismiss=\"modal\" aria-label=\"Close\">\n\t\t\t\t\t<span aria-hidden=\"true\">&times;<\/span>\n\t\t\t\t<\/button>\n\t\t\t<\/div>\n\t\t\t<div class=\"modal-body\">\n\t\t\t\t<div id=\"image-gallery-27442LargeCarousel\" class=\"carousel carousel-dark slide modal-gallery\" 
data-interval=\"false\">\n\t\t\t\t\t\t\t\t\t<ol class=\"carousel-indicators\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-27442LargeCarousel\" data-slide-to=\"0\" class=\"active\"><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<li data-target=\"#image-gallery-27442LargeCarousel\" data-slide-to=\"1\" ><\/li>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/ol>\n\t\t\t\t\t\t\t\t\t<div class=\"carousel-inner\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item active\">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Tokens.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing tokens per second for 2048 input length, 512 output length, and batch size 8.\" data-id=\"0\" data-toggle=\"modal\" data-target=\"#image-gallery-27442Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Tokens.png\" target=\"_blank\">Open Full Resolution <i class=\"fas fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"carousel-item \">\n                                <img loading=\"lazy\" decoding=\"async\" width=\"1238\" height=\"1277\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Latency.png\" class=\"d-block mx-auto h-100\" alt=\"Chart listing latency for 2048 input length, 512 output length, and batch size 8.\" data-id=\"1\" data-toggle=\"modal\" data-target=\"#image-gallery-27442Modal\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<div class=\"full-res-image-wrapper text-center\">\n\t\t\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/20485128-Latency.png\" target=\"_blank\">Open Full Resolution <i class=\"fas 
fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<a class=\"carousel-control-prev\" href=\"#image-gallery-27442LargeCarousel\" role=\"button\" data-slide=\"prev\">\n\t\t\t\t\t\t<span class=\"carousel-control-prev-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Previous<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t\t<a class=\"carousel-control-next\" href=\"#image-gallery-27442LargeCarousel\" role=\"button\" data-slide=\"next\">\n\t\t\t\t\t\t<span class=\"carousel-control-next-icon\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t<span class=\"sr-only\">Next<\/span>\n\t\t\t\t\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t<\/div><!-- .modal-body -->\n\t\t<\/div>\n\t<\/div>\n<\/div>\n\n\n<div style=\"height:15px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>With this test, had I not finally encountered VRAM limitations, I suspect we would have seen the same pattern emerge where the larger batch size seemingly scales slightly better with Tensor Core count. The VRAM requirements for this test jumped to just over 16GB compared to the roughly 7GB of the previous tests, so the cards with 12GB of VRAM or less struggle.<\/p>\n\n\n\n<p>A notable factor in this test was that the tokens per second metric corresponded closely with the total time spent completing a benchmark run. The RTX 4090 completed this test in about 35 seconds, and the 16GB cards each took a little over 50 seconds. However, without sufficient VRAM, the benchmark completion time jumped to around 260 seconds for the 12GB cards and a whopping 960 seconds for the 8GB RTX 4060 Ti!<\/p>\n\n\n\n<p>Due to changes NVIDIA introduced back in driver version 535.98 <a href=\"https:\/\/www.pugetsystems.com\/support\/guides\/video-memory-bsod\/\">to resolve crashes when VRAM capacity is maxed out<\/a>, the benchmark does not fail when it runs out of VRAM; instead, it overflows into the much slower system memory. 
This drastically reduces performance, a trade-off that may or may not be preferable to a memory allocation error and a failure to run. Many Stable Diffusion users were caught off guard by this change, wondering why their performance would sometimes reach previously unseen lows. NVIDIA did implement <a href=\"https:\/\/nvidia.custhelp.com\/app\/answers\/detail\/a_id\/5490\">a toggle for this behavior in the NVIDIA Control Panel<\/a>, but at least in my testing for TensorRT-LLM, the &#8220;Prefer No Sysmem Fallback&#8221; global setting did not prevent the benchmark from utilizing system memory when the VRAM capacity was reached.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-closing-thoughts\"><span class=\"ez-toc-section\" id=\"Closing_Thoughts\"><\/span>Closing Thoughts<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Looking back on the results of the TensorRT-LLM benchmarks, it becomes clear that in addition to the raw compute power of a GPU, memory bandwidth plays an important role in overall performance. Any time that cores spend waiting for data to be fed to them is time wasted and represents a loss in overall performance. This makes it easier to understand why the GeForce RTX 4090 has been dominant relative to other consumer-oriented GPUs; its high Tensor Core count and wide memory bus are well suited to machine learning workloads.<\/p>\n\n\n\n<p>If I were to perform this testing again (and we likely will do so in the future!), I would like to include more &#8220;middle-ground&#8221; testing, with an input and output length of 512 and batches of 1 and 8. I think this would help establish the emerging performance patterns without exceeding the smaller GPUs&#8217; VRAM budgets. 
In addition, I&#8217;m eager to see how the tool can be utilized under Linux for testing multi-GPU configurations and how NVIDIA\u2019s Professional RTX cards perform.<\/p>\n\n\n\n<p>Despite some limitations, such as only supporting single-GPU inference on Windows and requiring custom engines to be generated for each model and GPU combination, TensorRT-LLM is a flexible tool for deploying and testing various LLMs on NVIDIA GPUs. Whether we will implement it within our standard benchmarking suite remains to be seen, but there are compelling reasons to consider it. Although we prefer to test in a manner that allows for multiple brands and types of GPUs to be used, there is certainly a place for brand-specific optimizations like this, especially in the AI\/ML space, where those optimizations can provide significant benefits.<\/p>\n\n\n\n<div class=\"mod-img wp-block-image aligncenter\" data-target=\"single-image-modal-2811\">\n<figure class=\"aligncenter\">\n\t<!-- If image is not empty, print image, else, print from image URL -->\n\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png\" alt=\"\" ><\/img>\n\t\t<\/figure>\n<\/div>\n\n<!-- Displays caption if entered -->\n\n<!-- Displays modal upon click of an image -->\n<script type=\"text\/javascript\">\n\t\/\/Using unique random generated id\n\tjQuery(document).ready(function(){\n\t\tjQuery('[data-target=\"single-image-modal-2811\"]').click(function(){\n            jQuery('#single-image-modal-2811Modal').modal('show');\n\t\t\t});\n        });\n<\/script>\n\n<div class=\"modal fade popup-image\" id=\"single-image-modal-2811Modal\" tabindex=\"-1\" role=\"dialog\"> \n\t<div class=\"modal-dialog modal-xl\" role=\"document\">\n\t\t<div class=\"modal-content\">\n\t\t\t<div class=\"modal-header\">\n\t\t\t\t<h5 class=\"modal-title\">Image<\/h5>\n\t\t\t\t<button type=\"button\" class=\"close\" data-dismiss=\"modal\" 
aria-label=\"Close\">\n\t\t\t\t\t<span aria-hidden=\"true\">&times;<\/span>\n\t\t\t\t<\/button>\n\t\t\t<\/div> <!-- \/modal-header -->\n\n            <div class=\"modal-body inner-modal\">\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png\" alt=\"\" \/>\n\t\t\t\t\t\t<div class=\"text-center full-res-image-wrapper\">\n\t\t\t\t\t\t\t<a class=\"btn btn-light btn-lg full-res-image-link\" href=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png\" target=\"_blank\">Open Full Resolution <i class=\"fas fa-external-link-alt\"><\/i><\/a>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t<\/div><!-- \/modal-body -->\n    \t<\/div><!-- \/modal-content -->\n    <\/div><!-- \/modal-dialog -->\n<\/div><!-- \/modal fade --> \n\n\n\n<p class=\"has-text-align-center\"><em>Image Credit: NVIDIA Technical Blog &#8211; <a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus\/\">TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs<\/a><\/em><\/p>\n\n\n<div class=\"wp-bootstrap-blocks-row row puget-icon-section\">\n\t\n\n<div class=\"col-12 col-md-6\">\n\t\t\t\n\n<div style=\"height:10px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image size-thumbnail is-resized text-center\"><a href=\"https:\/\/www.pugetsystems.com\/solutions\/scientific-computing-workstations\/machine-learning-ai\/\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/12\/computer-icon.png\" alt=\"Tower Computer Icon in Puget Systems Colors\" class=\"wp-image-12659\" style=\"width:113px;height:113px\" title=\"\"\/><\/a><\/figure>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"h-looking-for-an-ai-workstation-or-server\">Looking for an AI workstation or 
server?<\/h4>\n\n\n\n<p class=\"has-text-align-center\">We build computers tailor-made for your workflow.&nbsp;<\/p>\n\n\n<div class=\"wp-bootstrap-blocks-button text-center\">\n\t<a\n\t\thref=\"https:\/\/www.pugetsystems.com\/solutions\/scientific-computing-workstations\/machine-learning-ai\/\"\n\t\t\t\t\t\tclass=\"btn btn-primary\"\n\t>\n\t\tConfigure a System\t<\/a>\n<\/div>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\t<\/div>\n\n\n\n<div class=\"col-12 col-md-6\">\n\t\t\t\n\n<div style=\"height:10px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image size-thumbnail is-resized text-center\"><a href=\"\/contact-expert\/\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/talking-icon.png\" alt=\"Talking Head Icon in Puget Systems Colors\" class=\"wp-image-12657\" style=\"width:113px;height:113px\" title=\"\"\/><\/a><\/figure>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"h-don-t-know-where-to-start-we-can-help\">Don&#8217;t know where to start?<br>We can help!<\/h4>\n\n\n\n<p class=\"has-text-align-center\">Get in touch with one of our technical consultants today.<\/p>\n\n\n<div class=\"wp-bootstrap-blocks-button text-center\">\n\t<a\n\t\thref=\"\/contact-expert\/\"\n\t\t\t\t\t\tclass=\"btn btn-primary\"\n\t>\n\t\tTalk to an Expert\t<\/a>\n<\/div>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\t<\/div>\n\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"h-related-content\">Related Content<\/h3>\n\n\n \n<div class=\"related-content\">\n\t<ul class=\"related-content-list\">\n\t\t\t\t\t\t<li 
class=\"related-content-list-item\">\n\t\t\t\t\t<a href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/standing-up-ai-development-quickly-for-supercomputing-2025\/\" title=\"Standing Up AI Development Quickly for Supercomputing 2025\">Standing Up AI Development Quickly for Supercomputing 2025<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t\t\t\t<li class=\"related-content-list-item\">\n\t\t\t\t\t<a href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/\" title=\"Exploring Hybrid CPU\/GPU LLM Inference\">Exploring Hybrid CPU\/GPU LLM Inference<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t\t\t\t<li class=\"related-content-list-item\">\n\t\t\t\t\t<a href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/whats-the-deal-with-npus\/\" title=\"What&#8217;s the deal with NPUs?\">What&#8217;s the deal with NPUs?<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t\t\t\t<li class=\"related-content-list-item\">\n\t\t\t\t\t<a href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/local-alternatives-to-cloud-ai-services\/\" title=\"Local alternatives to Cloud AI services\">Local alternatives to Cloud AI services<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t<\/ul>\n\t \n\t<a class=\"view-term-link\" href=\"\/all_articles?filter=machine-learning\">View\n\t\tAll Related Content<\/a>\n\t<\/div><\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<h3 class=\"wp-block-heading has-text-align-center\" id=\"h-latest-content\">Latest Content<\/h3>\n\n\n \n<div class=\"latest-content\">\n\t<ul class=\"latest-content-list\">\n\t\t\t\t\t\t<li class=\"latest-content-list-item\">\n\t\t\t\t\t<a href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/standing-up-ai-development-quickly-for-supercomputing-2025\/\" title=\"Standing Up AI Development Quickly for Supercomputing 2025\">Standing Up AI Development Quickly for Supercomputing 2025<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t\t\t\t<li class=\"latest-content-list-item\">\n\t\t\t\t\t<a 
href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/exploring-hybrid-cpu-gpu-llm-inference\/\" title=\"Exploring Hybrid CPU\/GPU LLM Inference\">Exploring Hybrid CPU\/GPU LLM Inference<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t\t\t\t<li class=\"latest-content-list-item\">\n\t\t\t\t\t<a href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/whats-the-deal-with-npus\/\" title=\"What&#8217;s the deal with NPUs?\">What&#8217;s the deal with NPUs?<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t\t\t\t<li class=\"latest-content-list-item\">\n\t\t\t\t\t<a href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/local-alternatives-to-cloud-ai-services\/\" title=\"Local alternatives to Cloud AI services\">Local alternatives to Cloud AI services<\/a>\n\t\t\t\t<\/li>\n\t\t\t\t<\/ul>\n\t \n\t\t<a href=\"https:\/\/www.pugetsystems.com\/all-hpc\/\" class=\"view-posts-link\">View All<\/a>\n\t<\/div><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Evaluating the speed of GeForce RTX 40-Series GPUs using NVIDIA&#8217;s TensorRT-LLM tool for benchmarking GPU inference performance.<\/p>\n","protected":false},"author":166,"featured_media":23408,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","meta":{"_acf_changed":false,"content-type":"","classic-editor-remember":"","legacy_id":"","redirect_url":[],"expire_date":"","alert_message":"","alert_link":[],"configure_ids":"","system_grid_title":"","system_grid_ids":"","footnotes":""},"hpc_categories":[8879,8883,8885],"hpc_tags":[8765,8779],"coauthors":[9063],"class_list":["post-23187","hpc_post","type-hpc_post","status-publish","has-post-thumbnail","hentry","hpc_category-hardware","hpc_category-machine-learning","hpc_category-software","hpc_tag-machine-learning","hpc_tag-nvidia"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v26.7 (Yoast SEO v26.7) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Benchmarking with TensorRT-LLM | Puget Systems<\/title>\n<meta name=\"description\" 
content=\"Evaluating the speed of GeForce RTX 40-Series GPUs using NVIDIA&#039;s TensorRT-LLM tool for benchmarking GPU inference performance.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Benchmarking with TensorRT-LLM\" \/>\n<meta property=\"og:description\" content=\"Evaluating the speed of GeForce RTX 40-Series GPUs using NVIDIA&#039;s TensorRT-LLM tool for benchmarking GPU inference performance.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/\" \/>\n<meta property=\"og:site_name\" content=\"Puget Systems\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/PugetSystems\" \/>\n<meta property=\"article:modified_time\" content=\"2024-02-16T18:06:54+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/inference-visual-tensor-rt-llm-1024x576.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"576\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@PugetSystems\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"8 minutes\" \/>\n\t<meta name=\"twitter:label2\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data2\" content=\"Jon Allman\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/\",\"url\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/\",\"name\":\"Benchmarking with TensorRT-LLM | Puget Systems\",\"isPartOf\":{\"@id\":\"https:\/\/www.pugetsystems.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png\",\"datePublished\":\"2024-02-16T18:06:51+00:00\",\"dateModified\":\"2024-02-16T18:06:54+00:00\",\"description\":\"Evaluating the speed of GeForce RTX 40-Series GPUs using NVIDIA's TensorRT-LLM tool for benchmarking GPU inference performance.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#primaryimage\",\"url\":\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png\",\"contentUrl\":\"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png\",\"width\":1920,\"height\":1080,\"caption\":\"Image Credit: NVIDIA Technical Blog - TensorRT-LLM 
Supercharges Large Language Model Inference on NVIDIA H100 GPUs\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.pugetsystems.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"HPC Posts\",\"item\":\"https:\/\/www.pugetsystems.com\/all-hpc\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Benchmarking with TensorRT-LLM\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.pugetsystems.com\/#website\",\"url\":\"https:\/\/www.pugetsystems.com\/\",\"name\":\"Puget Systems\",\"description\":\"Workstations for creators.\",\"publisher\":{\"@id\":\"https:\/\/www.pugetsystems.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.pugetsystems.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.pugetsystems.com\/#organization\",\"name\":\"Puget Systems\",\"url\":\"https:\/\/www.pugetsystems.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png\",\"contentUrl\":\"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png\",\"width\":2560,\"height\":363,\"caption\":\"Puget 
Systems\"},\"image\":{\"@id\":\"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/PugetSystems\",\"https:\/\/x.com\/PugetSystems\",\"https:\/\/www.instagram.com\/pugetsystems\/\",\"https:\/\/www.linkedin.com\/company\/puget-systems\",\"https:\/\/www.youtube.com\/user\/pugetsys\",\"https:\/\/en.wikipedia.org\/wiki\/Puget_Systems\"],\"telephone\":\"(425) 458-0273\",\"legalName\":\"Puget Sound Systems, Inc.\",\"foundingDate\":\"2000-12-01\",\"duns\":\"128267585\",\"naics\":\"334111\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Benchmarking with TensorRT-LLM | Puget Systems","description":"Evaluating the speed of GeForce RTX 40-Series GPUs using NVIDIA's TensorRT-LLM tool for benchmarking GPU inference performance.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/","og_locale":"en_US","og_type":"article","og_title":"Benchmarking with TensorRT-LLM","og_description":"Evaluating the speed of GeForce RTX 40-Series GPUs using NVIDIA's TensorRT-LLM tool for benchmarking GPU inference performance.","og_url":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/","og_site_name":"Puget Systems","article_publisher":"https:\/\/www.facebook.com\/PugetSystems","article_modified_time":"2024-02-16T18:06:54+00:00","og_image":[{"width":1024,"height":576,"url":"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/inference-visual-tensor-rt-llm-1024x576.png","type":"image\/png"}],"twitter_card":"summary_large_image","twitter_site":"@PugetSystems","twitter_misc":{"Est. 
reading time":"8 minutes","Written by":"Jon Allman"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/","url":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/","name":"Benchmarking with TensorRT-LLM | Puget Systems","isPartOf":{"@id":"https:\/\/www.pugetsystems.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#primaryimage"},"image":{"@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#primaryimage"},"thumbnailUrl":"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png","datePublished":"2024-02-16T18:06:51+00:00","dateModified":"2024-02-16T18:06:54+00:00","description":"Evaluating the speed of GeForce RTX 40-Series GPUs using NVIDIA's TensorRT-LLM tool for benchmarking GPU inference performance.","breadcrumb":{"@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#primaryimage","url":"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png","contentUrl":"https:\/\/wp-cdn.pugetsystems.com\/2022\/08\/inference-visual-tensor-rt-llm-e1707849286660.png","width":1920,"height":1080,"caption":"Image Credit: NVIDIA Technical Blog - TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 
GPUs"},{"@type":"BreadcrumbList","@id":"https:\/\/www.pugetsystems.com\/labs\/hpc\/benchmarking-with-tensorrt-llm\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.pugetsystems.com\/"},{"@type":"ListItem","position":2,"name":"HPC Posts","item":"https:\/\/www.pugetsystems.com\/all-hpc\/"},{"@type":"ListItem","position":3,"name":"Benchmarking with TensorRT-LLM"}]},{"@type":"WebSite","@id":"https:\/\/www.pugetsystems.com\/#website","url":"https:\/\/www.pugetsystems.com\/","name":"Puget Systems","description":"Workstations for creators.","publisher":{"@id":"https:\/\/www.pugetsystems.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pugetsystems.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.pugetsystems.com\/#organization","name":"Puget Systems","url":"https:\/\/www.pugetsystems.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png","contentUrl":"https:\/\/www.pugetsystems.com\/wp-content\/uploads\/2022\/08\/Puget-Systems-2020-logo-color-full.png","width":2560,"height":363,"caption":"Puget Systems"},"image":{"@id":"https:\/\/www.pugetsystems.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/PugetSystems","https:\/\/x.com\/PugetSystems","https:\/\/www.instagram.com\/pugetsystems\/","https:\/\/www.linkedin.com\/company\/puget-systems","https:\/\/www.youtube.com\/user\/pugetsys","https:\/\/en.wikipedia.org\/wiki\/Puget_Systems"],"telephone":"(425) 458-0273","legalName":"Puget Sound Systems, 
Inc.","foundingDate":"2000-12-01","duns":"128267585","naics":"334111"}]}},"_links":{"self":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_posts\/23187","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_posts"}],"about":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/types\/hpc_post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/users\/166"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/comments?post=23187"}],"version-history":[{"count":0,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_posts\/23187\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/media\/23408"}],"wp:attachment":[{"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/media?parent=23187"}],"wp:term":[{"taxonomy":"hpc_category","embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_categories?post=23187"},{"taxonomy":"hpc_tag","embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/hpc_tags?post=23187"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.pugetsystems.com\/wp-json\/wp\/v2\/coauthors?post=23187"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}