






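So if you want to get TensorRT-LLM running quickly, the simplest route is to upgrade nvidia-driver to 535.xxx and run everything in docker, which spares you from setting up the environment yourself. If you want to customize the source code, you can do that inside docker as well.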
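Before pulling the image, it may be worth checking that the host driver is already on the 535 series. The following is only a minimal sketch: the helper names are my own, and the threshold of 535 simply mirrors the recommendation above rather than an official compatibility matrix.

```python
import subprocess

MIN_DRIVER_MAJOR = 535  # taken from the recommendation above, not from NVIDIA docs


def driver_major(version: str) -> int:
    """Return the major component of a driver version such as '535.113.01'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major version is at least `minimum`."""
    return driver_major(version) >= minimum


def query_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU (needs nvidia-smi on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a machine with a GPU, `driver_is_new_enough(query_driver_version())` tells you whether an upgrade is needed before going any further.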

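In theory, replacing that part of the original code would make other CUDA versions usable (the batch manager is merely closed source and should have nothing to do with the CUDA version; the main issue is the FMA module. In addition, the TensorRT that TensorRT-LLM depends on has CUDA 11.x builds, and the triton-inference-server used together with inflight_batcher_llm has no hard dependency on CUDA 12.x either).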






docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3



docker run -it -d --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --security-opt seccomp=unconfined --gpus=all --shm-size=16g --privileged --ulimit memlock=-1 --name=develop nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash



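First, get the git repository. This image contains only the libraries needed at runtime, and the model still has to be built by yourself (TensorRT-LLM depends on TensorRT, and anyone who has used TRT knows that an engine has to be built), so start by building TensorRT-LLM: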

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull


python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

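Environment problems are unlikely, because this docker image already contains all the required packages. When build_wheel runs, it follows the steps in the script: it pip installs a few required packages, then runs cmake and make to compile: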

adding 'tensorrt_llm/tools/plugin_gen/templates/functional.py.tpl'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin.cpp.tpl'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin.h.tpl'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin_common.cpp'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin_common.h'
adding 'tensorrt_llm/tools/plugin_gen/templates/tritonPlugins.cpp.tpl'
adding 'tensorrt_llm-0.5.0.dist-info/LICENSE'
adding 'tensorrt_llm-0.5.0.dist-info/METADATA'
adding 'tensorrt_llm-0.5.0.dist-info/WHEEL'
adding 'tensorrt_llm-0.5.0.dist-info/top_level.txt'
adding 'tensorrt_llm-0.5.0.dist-info/zip-safe'
adding 'tensorrt_llm-0.5.0.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
Successfully built tensorrt_llm-0.5.0-py3-none-any.whl

Then simply pip install tensorrt_llm-0.5.0-py3-none-any.whl.
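To confirm that the wheel actually landed in the current Python environment, a quick check along these lines can help. It is a minimal sketch using only the standard library; `installed_version` is a helper name I made up, and the distribution name `tensorrt_llm` is taken from the wheel filename above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

After the install step, `installed_version("tensorrt_llm")` should return a version string instead of `None`.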


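First, build the model. I have not downloaded any new models recently, so I will use the old llama as the example. Other LLMs (chatglm, qwen, and so on) work the same way: as long as trt-llm supports them, the build and run steps are identical. Just download the model you want to test from Hugging Face.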


# Replace --model_dir with your own LLM; --use_inflight_batching enables in-flight batching
python /work/code/TensorRT-LLM/examples/llama/build.py \
                --model_dir /work/models/GPT/LLAMA/llama-7b-hf \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --use_inflight_batching \
                --output_dir /work/trtModel/llama/1-gpu
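Once the build finishes, it can be useful to sanity-check the output directory before copying it anywhere. The sketch below assumes the build writes a `config.json` plus one or more `*.engine` files into `--output_dir`; that layout is my assumption about this version, so adjust the check if your output looks different. The helper names are hypothetical.

```python
from pathlib import Path
from typing import List


def find_engines(engine_dir: str) -> List[str]:
    """Return the sorted names of all *.engine files directly inside engine_dir."""
    return sorted(p.name for p in Path(engine_dir).glob("*.engine"))


def looks_like_engine_dir(engine_dir: str) -> bool:
    """True if engine_dir holds a config.json and at least one *.engine file (assumed layout)."""
    has_config = (Path(engine_dir) / "config.json").is_file()
    return has_config and bool(find_engines(engine_dir))
```

For the command above, the directory to check would be `/work/trtModel/llama/1-gpu`.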




cd tensorrtllm_backend
mkdir triton_model_repo

# Copy the template model folder
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# Copy the engine files just generated in `/work/trtModel/llama/1-gpu` into the template model folder
cp /work/trtModel/llama/1-gpu/* triton_model_repo/tensorrt_llm/1
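The template folder copied above carries a `config.pbtxt` for each model. Assuming the templates in your checkout mark values to be filled in with `${name}` style placeholders (this is an assumption about the template format, so check the files themselves), a scan like the following can flag anything left unfilled before the server is launched. The function name is hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, List

# Assumed placeholder syntax: ${some_name}
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unfilled_placeholders(model_repo: str) -> Dict[str, List[str]]:
    """Map each <model>/config.pbtxt under model_repo to the placeholder names it still contains."""
    found: Dict[str, List[str]] = {}
    for cfg in sorted(Path(model_repo).glob("*/config.pbtxt")):
        names = sorted(set(PLACEHOLDER.findall(cfg.read_text(encoding="utf-8"))))
        if names:
            found[str(cfg.relative_to(model_repo))] = names
    return found
```

An empty result for `unfilled_placeholders("triton_model_repo")` means no such placeholders remain; it says nothing about whether the filled-in values are sensible.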



python3 scripts/launch_triton_server.py --world_size=1 --model_repo=triton_model_repo


root@6aaab84e59c0:/work/code/tensorrtllm_backend# I1105 14:16:58.286836 2561098 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7ffb76000000' with size 268435456
I1105 14:16:58.286973 2561098 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1105 14:16:58.288120 2561098 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1105 14:16:58.288135 2561098 model_lifecycle.cc:461] loading: preprocessing:1
I1105 14:16:58.288142 2561098 model_lifecycle.cc:461] loading: postprocessing:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1105 14:16:58.392915 2561098 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1105 14:16:58.392979 2561098 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I1105 14:16:58.732165 2561098 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1105 14:16:59.383255 2561098 model_lifecycle.cc:818] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 12856 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13144, GPU 13111 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 13146, GPU 13121 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13164, GPU 14363 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13164, GPU 14371 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13198, GPU 14391 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 13198, GPU 14401 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Using 2878 tokens in paged KV cache.
I1105 14:17:17.299293 2561098 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I1105 14:17:17.303661 2561098 model_lifecycle.cc:461] loading: ensemble:1
I1105 14:17:17.305897 2561098 model_lifecycle.cc:818] successfully loaded 'ensemble'
I1105 14:17:17.306051 2561098 server.cc:592] 
| Repository Agent | Path |

I1105 14:17:17.306401 2561098 server.cc:619] 
| Backend     | Path                                                            | Config                                                                                               |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-com |
|             |                                                                 | pute-capability":"6.000000","default-max-batch-size":"4"}}                                           |
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-com |
|             |                                                                 | pute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}}       |

I1105 14:17:17.307053 2561098 server.cc:662] 
| Model          | Version | Status |
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |

I1105 14:17:17.393318 2561098 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA RTX A4000
I1105 14:17:17.393534 2561098 metrics.cc:710] Collecting CPU metrics
I1105 14:17:17.394550 2561098 tritonserver.cc:2458] 
| Option                           | Value                                                                                                                                              |
| server_id                        | triton                                                                                                                                             |
| server_version                   | 2.39.0                                                                                                                                             |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_ |
|                                  | memory binary_tensor_data parameters statistics trace logging                                                                                      |
| model_repository_path[0]         | /work/triton_models/inflight_batcher_llm                                                                                                           |
| model_control_mode               | MODE_NONE                                                                                                                                          |
| strict_model_config              | 1                                                                                                                                                  |
| rate_limit                       | OFF                                                                                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                           |
| min_supported_compute_capability | 6.0                                                                                                                                                |
| strict_readiness                 | 1                                                                                                                                                  |
| exit_timeout                     | 30                                                                                                                                                 |
| cache_enabled                    | 0                                                                                                                                                  |

I1105 14:17:17.423479 2561098 grpc_server.cc:2513] Started GRPCInferenceService at
I1105 14:17:17.424418 2561098 http_server.cc:4497] Started HTTPService at
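Rather than watching the log for the READY table, the server can also be polled over HTTP. The sketch below uses the standard `/v2/health/ready` endpoint and assumes the HTTP service listens on `localhost:8000`, the same address the curl request later in this post uses; `server_is_ready` is a helper name of my own.

```python
import urllib.request


def server_is_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP error statuses
        # (urllib's URLError and HTTPError are both OSError subclasses).
        return False
```

Calling it in a loop with a short sleep is a simple way to wait until all four models have loaded.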



Sun Nov  5 14:20:46 2023       
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A4000               Off | 00000000:01:00.0 Off |                  Off |
| 41%   34C    P8              16W / 140W |  15855MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
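The memory figure above can be roughly reconciled with the server log. The following back-of-the-envelope sketch assumes the usual LLaMA-7B shape (32 layers, hidden size 4096) and float16 KV values, and treats the KV cache as one key vector and one value vector per layer per token; it is an estimate, not how the runtime actually accounts for memory.

```python
MIB = 1024 * 1024


def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: one key and one value vector for every layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


# Assumed LLaMA-7B shape with float16 (2-byte) values.
per_token = kv_bytes_per_token(num_layers=32, hidden_size=4096, bytes_per_value=2)

kv_cache_mib = 2878 * per_token / MIB   # "Using 2878 tokens in paged KV cache" from the log
engine_mib = 12852                      # TensorRT-managed allocation from the log

print(per_token / MIB)            # MiB of KV cache per token
print(kv_cache_mib)               # MiB for the whole paged KV cache
print(engine_mib + kv_cache_mib)  # MiB for engine plus KV cache
```

Under these assumptions the engine plus KV cache come to about 14.3 GB of the 15855MiB reported by nvidia-smi; the remainder presumably goes to cuBLAS/cuDNN workspaces, activations and the CUDA context. It also shows why only 2878 tokens of KV cache fit on a 16 GB card next to a float16 7B model.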



# Run
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'

# Response
{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" ⁇  What is machine learning? Machine learning is a subfield of computer science that focuses on the development of algorithms that can learn"}
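The same request can be sent from Python without extra dependencies. This is a minimal sketch that mirrors the curl call above (same URL and same JSON fields); the function names are my own, and error handling is left out.

```python
import json
import urllib.request


def build_generate_request(prompt: str, max_tokens: int,
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the ensemble model's generate endpoint."""
    payload = {"text_input": prompt, "max_tokens": max_tokens,
               "bad_words": "", "stop_words": ""}
    return urllib.request.Request(
        f"{base_url}/v2/models/ensemble/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: str) -> str:
    """Pull the generated text out of the JSON response body."""
    return json.loads(response_body)["text_output"]


def generate(prompt: str, max_tokens: int = 20,
             base_url: str = "http://localhost:8000", timeout: float = 60.0) -> str:
    """Send the request to a running server and return the generated text."""
    request = build_generate_request(prompt, max_tokens, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read().decode("utf-8"))
```

With the server from the previous step running, `generate("What is machine learning?")` corresponds to the curl command above.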


pip install tritonclient[all]


python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir /work/models/GPT/LLAMA/llama-7b-hf --tokenizer_type llama --streaming


output_ids =  [[0, 19298, 297, 6641, 29899, 23027, 3444, 29892, 1105, 7598, 16370, 408, 263, 14547, 297, 3681, 1434, 8401, 304, 4517, 297, 29871, 29896, 29947, 29946, 29955, 29889, 940, 3796, 472, 278, 23933, 5977, 322, 278, 7021, 16923, 297, 29258, 265, 1434, 8718, 670, 1914, 27144, 297, 29871, 29896, 29947, 29945, 29896, 29889, 940, 471, 263, 29323, 261, 310, 278, 671, 310, 21837, 7984, 292, 322, 471, 278, 937, 304, 671, 263, 10489, 380, 994, 29889, 940, 471, 884, 263, 410, 29880, 928, 9227, 322, 670, 8277, 5134, 450, 315, 4664, 457, 310, 3444, 313, 29896, 29947, 29945, 29896, 511, 450, 315, 4664, 457, 310, 12730, 313, 29896, 29947, 29945, 29946, 511, 450, 315, 4664, 457, 310, 13616, 313, 29896, 29947, 29945, 29945, 511, 450, 315, 4664, 457, 310, 9556, 313, 29896, 29947, 29945, 29955, 511, 450, 315, 4664, 457, 310, 17362, 313, 29896, 29947, 29945, 29947, 511, 450, 315, 4664, 457, 310, 12710, 313, 29896, 29947, 29945, 29929, 511, 450, 315, 4664, 457, 310, 14198, 653, 313, 29896, 29947, 29953, 29900, 511, 450, 315, 4664, 457, 310, 28806, 313, 29896, 29947, 29953, 29896, 511, 450, 315, 4664, 457, 310, 27440, 313, 29896, 29947, 29953, 29906, 511, 450, 315, 4664, 457, 310, 24506, 313, 29896, 29947, 29953, 29941, 511, 450, 315, 4664, 457, 310]]
Input: Born in north-east France, Soyer trained as a
Output:  chef in Paris before moving to London in 1 847. He worked at the Reform Club and the Royal Hotel in Brighton before opening his own restaurant in 1 851 . He was a pioneer of the use of steam cooking and was the first to use a gas stove. He was also a prolific writer and his books included The Cuisine of France (1 851 ), The Cuisine of Italy (1 854), The Cuisine of Spain (1 855), The Cuisine of Germany (1 857), The Cuisine of Austria (1 858), The Cuisine of Russia (1 859), The Cuisine of Hungary (1 860), The Cuisine of Switzerland (1 861 ), The Cuisine of Norway (1 862), The Cuisine of Sweden (1863), The Cuisine of

Because in-flight batching is enabled, multiple requests can be sent at the same time, as long as each one uses a different request_id:

# user 1
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir /work/models/GPT/LLAMA/llama-7b-hf --tokenizer_type llama --streaming --request_id 1
# user 2
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir /work/models/GPT/LLAMA/llama-7b-hf --tokenizer_type llama --streaming --request_id 2
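To fire the two client invocations above at the same moment instead of typing them into separate terminals, something like the following could be used. It is only a sketch: it shells out to the same client script with the same flags shown above, and `client_command`, `run_client` and `run_concurrently` are names I made up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

CLIENT = "inflight_batcher_llm/client/inflight_batcher_llm_client.py"
TOKENIZER_DIR = "/work/models/GPT/LLAMA/llama-7b-hf"


def client_command(request_id: int) -> List[str]:
    """Build the client command line used above for one request_id."""
    return ["python3", CLIENT, "--request-output-len", "200",
            "--tokenizer_dir", TOKENIZER_DIR, "--tokenizer_type", "llama",
            "--streaming", "--request_id", str(request_id)]


def run_client(request_id: int) -> int:
    """Run the client script once and return its exit code."""
    return subprocess.run(client_command(request_id)).returncode


def run_concurrently(request_ids: Sequence[int],
                     run: Callable[[int], int] = run_client) -> List[int]:
    """Run one client per request_id in parallel threads; the ids must be unique."""
    if len(set(request_ids)) != len(request_ids):
        raise ValueError("request_ids must be unique")
    with ThreadPoolExecutor(max_workers=max(1, len(request_ids))) as pool:
        return list(pool.map(run, request_ids))
```

Running `run_concurrently([1, 2])` from the tensorrtllm_backend directory starts both users at once. The `run` parameter exists so the launcher can be tried out with a stub before pointing it at the real server.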







