Triton Inference Server
The Triton Inference Server Framework
Overall Architecture
**Server:** model repository -> backend -> hardware
The server loads models from the model repository, selects a specific backend based on the model type, and runs the model on specific hardware.
**Client:** programming language -> query -> server
The client can use libraries for Python, C++, or other languages to send requests over HTTP or gRPC, or call the C API directly. When the server receives a request, the scheduler dispatches it to the model for processing and the inference result is returned.
Inference Framework
- An inference framework generally has a client side and a server side; Triton Server is the server-side part.
- In typical deployments, Kubernetes is used to manage the Triton applications, handling load balancing, dynamic scaling, and similar concerns.
- model repository: the model repository, used to manage model files.
- metrics service: monitors the whole inference service; serves as the dashboard.
- An inference service usually runs multiple deployments to spread the request load. Triton itself is the green part of the architecture diagram.
- Triton supports models exported from many deep learning frameworks; TensorRT is just one of the inference libraries inside Triton.
Basic Features

- Multi-framework support;
- CPU, GPU, and heterogeneous multi-GPU support;
- Concurrent execution, with CPU-level optimizations;
- HTTP/REST and gRPC APIs;
- Monitoring: integrates with orchestration systems and autoscalers via latency and health metrics;
- Model management: loading, unloading, and updating;
- Open source, available from the NGC Docker registry, with monthly releases.
Note: the scheduler refers mainly to the process of scheduling the inference request queues.
Design Foundations of Triton
Lifecycle perspective
- Multi-framework support: each model's inference work is decoupled and handed to a different framework, i.e. a backend;
Functional perspective
- backend management;
- model management;
- concurrent execution (multi-threading): instance management;
- dispatching and scheduling of inference request queues;
- inference service lifecycle management;
- inference request management and pre-processing;
- inference result management and post-processing;
- gRPC service.
Model perspective
- Single, dependency-free models: classification, detection, and similar networks;
- Model ensembles (pipelines), where the next model runs inference on the results of the previous one;
- Stateful models: language models with context.
Examples
A single model running inference with multiple instances (threads):
Multiple models, multi-threaded inference:
Running multiple instances of a single model noticeably improves the service's inference speed.
Model Categories
- Stateless models
- CV models;
- evenly distributed scheduling, dynamic batching;
- Stateful models (inference results depend on earlier requests in the sequence)
- NLP models;
- sequence-based scheduling: requests from the same sequence are bound to the same model instance.
- Direct: requests pass through the queue in their original order;
- Oldest: the internal order of the queue may be rearranged;
- Ensembles
- pipeline: e.g. pre-processing, inference, and post-processing as three models;
- each model can have its own scheduler.
Dynamic batch scheduler
Streaming inference requests
Model ensembles
Framework backends and custom backends
- Framework backends are built around other deep learning frameworks: they are implemented by calling those frameworks' APIs, and all of them drive the underlying GPU hardware directly through C/C++ APIs.
- PyTorch: uses the LibTorch library to execute PyTorch models in both TorchScript and PyTorch 2.0 formats.
- TensorRT: executes TensorRT models.
- TensorFlow: executes models in TensorFlow GraphDef and SavedModel formats.
- ONNX Runtime: executes ONNX models.
- Python: the Python backend lets you write model logic in Python. For example, you can use it to run pre/post-processing code written in Python, or to execute a PyTorch Python script directly (instead of first converting it to TorchScript and then using the PyTorch backend).
- OpenVINO: executes OpenVINO models.
- DALI: DALI is a set of highly optimized building blocks and an execution engine for accelerating the pre-processing of deep learning input data. The DALI backend lets you execute DALI pipelines inside Triton.
- FIL: the FIL (RAPIDS Forest Inference Library) backend executes various tree-based ML models, including XGBoost, LightGBM, Scikit-Learn random forest, and cuML random forest models.
- vLLM: the vLLM backend runs supported models on the vLLM engine. It depends on the python_backend to load and serve models; the vllm_backend repo contains the backend's documentation and source code.
- Custom backends are implemented by the user.
Using Triton Inference Server
The model configuration file config.pbtxt
platform vs. backend
For TensorRT, ONNX Runtime, and PyTorch, specifying either one of the two fields is enough.
For TensorFlow, platform must be specified; backend is optional.
For OpenVINO, Python, and DALI, only backend can be used.
For custom backends, before release 21.05 they could be indicated by setting platform to custom; since then they must be specified through the backend field, whose value is the name of your custom backend library. A minimal sketch follows.
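A minimal sketch of the top of a config.pbtxt, assuming a TensorRT model (the model name is illustrative):

```
name: "yolov8_trt"        # must match the model's directory name
backend: "tensorrt"       # equivalently for TensorRT: platform: "tensorrt_plan"
max_batch_size: 8
```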
max_batch_size & input & output
Case 1:
max_batch_size is a constant greater than 0; input and output specify the tensor name, data type, and shape.
**Note:** in this case dims omit the batch dimension.
Case 2:
max_batch_size equals 0, which means the model's inputs and outputs have no batch dimension.
In that case dims describe the full, real dimensions.
Case 3:
A PyTorch special case: TorchScript models do not store input/output names, so a special naming convention applies, "<string>__<index>" (for example, INPUT__0).
Variable-size dimensions are supported by setting them to -1.
Case 4:
The reshape parameter reshapes an input or output; see the sketch below.
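A sketch of the input/output section of a config.pbtxt; the tensor names, types, and shapes are assumptions for illustration:

```
max_batch_size: 8            # dims below therefore omit the batch dimension
input [
  {
    name: "images"           # a TorchScript model would need e.g. "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]      # -1 marks variable-size dimensions
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 84, 8400 ]
    # reshape: { shape: [ 705600 ] }   # optional: the shape actually seen by the model
  }
]
```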
**Model version management:** version_policy
The version_policy field supports three policies:
all: load every version of the model.
latest: load the newest version(s) (possibly more than one; higher version numbers are newer).
specific: load the listed versions. See the sketch below.
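A version_policy sketch (the version numbers are illustrative):

```
version_policy: { latest: { num_versions: 2 } }         # keep the two newest versions
# version_policy: { all: {} }                           # or load every version
# version_policy: { specific: { versions: [ 1, 3 ] } }  # or load exactly versions 1 and 3
```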
Instance Groups (multiple instances across multiple GPUs)
This maps to Triton's concurrent execution capability: the field configures running multiple model instances on specified devices to raise serving capacity and throughput.
Each instance_group entry configures a group of model instances that run on the same kind of device.
count: the number of instances to run concurrently.
kind: the device type.
gpus: the GPU indices. If this is omitted, Triton runs count instances on every GPU; if it is given, Triton runs count instances on each listed GPU.
Multiple groups can be configured, as in the sketch below.
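An instance_group sketch (the GPU index and counts are illustrative):

```
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] },   # two instances on GPU 0
  { count: 1, kind: KIND_CPU }                 # one instance on the CPU
]
```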
Scheduling and Batching policies
Default Scheduler
- does no batching;
- whatever batch the request carries is what gets executed.
Dynamic Batcher
- merges several small-batch input tensors into one larger-batch input tensor on the server side;
- the key means of increasing throughput;
- only suitable for stateless models.
Sub-parameters:
preferred_batch_size: the batch size(s) the batcher should try to form; multiple values are allowed.
max_queue_delay_microseconds: the time limit, in microseconds, for forming a batch.
Advanced sub-parameters:
preserve_ordering: responses leave in the same order that requests arrived.
priority_levels: defines priority levels that determine the processing order of requests.
queue_policy: configures the behavior of the request waiting queue. A sketch follows.
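A dynamic_batching sketch (the values are illustrative):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
  preserve_ordering: true
}
```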
Sequence Batcher
- a scheduler designed specifically for stateful models;
- it guarantees that inference requests belonging to the same sequence are routed to the same model instance (see the sketch below).
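A minimal sequence_batching sketch showing only the strategy choice; a real stateful model would also declare control inputs such as sequence start/end flags, and the timeout value here is illustrative:

```
sequence_batching {
  max_sequence_idle_microseconds: 5000000
  direct { }        # Direct strategy; use "oldest { ... }" for the Oldest strategy
}
```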
Ensemble Scheduler
- combines different modules into a pipeline.
Optimization Policy
- ONNX model optimization: TensorRT acceleration inside the ONNX Runtime backend;
- TensorFlow model optimization: TF-TRT. A config sketch follows.
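An optimization sketch that asks the ONNX Runtime backend to use TensorRT acceleration; the precision parameter and value are illustrative:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
}
```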
Model Warmup
Parameters that define model warm-up:
- some initialization may be deferred until the first few inference requests arrive;
- Triton only reports the service as Ready after warm-up finishes;
- model loading therefore takes longer (see the sketch below).
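A model_warmup sketch; the input name, type, and shape are assumptions and must match the model's config:

```
model_warmup [
  {
    name: "warmup_sample"
    batch_size: 1
    inputs {
      key: "images"
      value {
        data_type: TYPE_FP32
        dims: [ 3, 640, 640 ]
        zero_data: true        # feed all-zero data during warm-up
      }
    }
  }
]
```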
Ensemble model configuration (ensemble model)
Image -> pre-processing model -> fed into different models -> different outputs.
Define the ensemble's inputs and outputs, then define the individual steps under ensemble_scheduling. In each step's input_map/output_map, the key is the composing model's own input/output tensor name and the value is the tensor name inside the ensemble.
Once the configuration is written, the ensemble model's directory should contain only an empty version directory plus the config file.
Notes:
- If any model in the ensemble is stateful, the whole pipeline becomes stateful and inference requests must follow the stateful-model rules.
- Each composing model keeps its own scheduler.
- If every composing model uses a framework backend, intermediate tensors are passed on the GPU; otherwise they may travel through CPU memory. A sketch follows.
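An ensemble config.pbtxt sketch for a two-step pipeline; the model names and tensor names are assumptions for illustration:

```
platform: "ensemble"
max_batch_size: 8
input  [ { name: "RAW_IMAGE",  data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "DETECTIONS", data_type: TYPE_FP32,  dims: [ 84, 8400 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map  { key: "raw_input",    value: "RAW_IMAGE" }     # key = sub-model tensor, value = ensemble tensor
      output_map { key: "preprocessed", value: "PREPROCESSED" }
    },
    {
      model_name: "yolov8_trt"
      model_version: -1
      input_map  { key: "images",  value: "PREPROCESSED" }
      output_map { key: "output0", value: "DETECTIONS" }
    }
  ]
}
```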
Example 1
Example 2
Sending requests to the Triton service
Synchronous inference:
Asynchronous inference:
Inference through shared memory (suitable when the client and server run on the same machine):
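A Python client sketch for the synchronous and asynchronous paths, using the tritonclient HTTP API (the shared-memory path additionally needs the tritonclient shared-memory utilities and is omitted here). The model name, tensor names, and shape are assumptions and must match your config.pbtxt:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request; names, shape, and dtype are placeholders for this sketch.
data = np.random.rand(1, 3, 640, 640).astype(np.float32)
inp = httpclient.InferInput("images", list(data.shape), "FP32")
inp.set_data_from_numpy(data)
out = httpclient.InferRequestedOutput("output0")

# Synchronous inference: blocks until the response arrives.
result = client.infer(model_name="yolov8_trt", inputs=[inp], outputs=[out])
print(result.as_numpy("output0").shape)

# Asynchronous inference: returns a handle immediately; get_result() blocks until done.
handle = client.async_infer(model_name="yolov8_trt", inputs=[inp], outputs=[out])
print(handle.get_result().as_numpy("output0").shape)
```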
Triton commands
Server startup command
--log-verbose <integer>: enable verbose logging.
--strict-model-config <boolean>: whether every model must provide a config.pbtxt.
--strict-readiness <boolean>: controls when the ready status is reported.
--exit-on-error <boolean>: whether the server still starts when some models fail to load.
--http-port <integer>: HTTP service port, default 8000.
--grpc-port <integer>: gRPC service port, default 8001.
--metrics-port <integer>: metrics reporting port, default 8002.
--model-control-mode <string>: model management mode. The default, none, loads every model in the repository and does not allow dynamic unloading or updates. With explicit, the server loads no models at startup and models are loaded or unloaded through the API. With poll, the repository is polled for changes, so new versions or modified configurations are loaded dynamically.
--repository-poll-secs <integer>: when the control mode is poll, the interval for checking the model repository for changes.
--load-model <string>: when the control mode is explicit, a model to load at startup.
--pinned-memory-pool-byte-size <integer>: the amount of pinned (page-locked) host memory Triton may use; for background on pinned memory see https://cloud.tencent.com/developer/article/2000487.
--cuda-memory-pool-byte-size <integer>:<integer>: the amount of CUDA memory Triton may use, given per GPU as <device id>:<bytes>.
--backend-directory <string>: the backend search path; point it at your own library when using a custom backend.
--repoagent-directory <string>: directory of repository agents that pre-process the model repository, e.g. encrypting models as they are loaded.
# Start with /models as the model repository
tritonserver --model-repository=/models
# Use /models as the model repository with dynamic model loading/unloading, i.e. no models are loaded at startup
tritonserver --model-repository=/models --strict-model-config --model-control-mode explicit
# Show model-loading logs
tritonserver --model-repository=/models --log-verbose 1   # the verbosity level is adjustable
# For TensorRT, TensorFlow saved-model, and ONNX models, this flag controls whether a config.pbtxt file is required
tritonserver --model-repository=/models --strict-model-config=true
# Check that all models are ready
--strict-readiness
# If set to true, the server starts only after all models have loaded; if false, it starts regardless
--exit-on-error
# Specify the listening ports
tritonserver --model-repository=/models --http-port 8003 --grpc-port 8004 --metrics-port 8005
# Interval for automatically checking the repository for updates; only effective when the model control mode is poll
--repository-poll-secs
# Total pinned CPU memory the Triton server may allocate
--pinned-memory-pool-byte-size
# Maximum CUDA memory that can be allocated; the default is 64 MB
--cuda-memory-pool-byte-size
# Specify your own location for backends
--backend-directory
# Programs that pre-process the model repository; e.g. to encrypt models as they are loaded, package the encryption step as a repoagent, put it in this directory, and pass this flag
--repoagent-directory
# Start the server in explicit (dynamic) model-loading mode
docker run --gpus=1 --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3 tritonserver --model-repository=/models --model-control-mode explicit
# Enter the container without starting Triton
docker run --gpus=1 --net=host -v ${PWD}/model_repository:/models -it nvcr.io/nvidia/tritonserver:24.07-py3 bash
Client request commands
docker run -it --net=host -v ${PWD}/server-main:/server-main nvcr.io/nvidia/tritonserver:24.07-py3-sdk
pip install opencv-python -i https://pypi.tuna.tsinghua.edu.cn/simple
# Check whether the server has finished loading
curl http://localhost:8000/v2/health/ready
# Load a specific model
curl -X POST http://localhost:8000/v2/repository/models/yolov8_torch/load
Docker quick start example
# Step 1: Create the example model repository
git clone -b r24.07 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh
# Step 2: Launch triton from the NGC Triton container; once started, tritonserver loads the mounted models by default
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3 tritonserver --model-repository=/models
# Step 3: Sending an Inference Request; the client container is started with --rm so it is removed automatically
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.07-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
# Inference should return the following
Image '/workspace/images/mug.jpg':
15.346230 (504) = COFFEE MUG
13.224326 (968) = CUP
10.422965 (505) = COFFEEPOT
backend
Implementing a backend
- three virtual classes;
- seven interface functions;
- two state classes;
The specific roles of the two state classes:
Creation and invocation flow:
Design rationale
- A unified framework is needed to handle every backend;
- execute() may be called concurrently for different instances, so each call must be independent and thread-safe;
- Backend implementations are decoupled from Triton's core code, so implementing a backend only requires compiling your own backend code.
Using perf_analyzer
For models that support batching, use the -b option to indicate the batch size of the requests Perf Analyzer should send. For models with variable-size inputs, the --shape argument must be provided so that Perf Analyzer knows what shapes to use. For example, for a model with an input named IMAGE of shape [3, N, M], where N and M are variable-size dimensions, the following tells Perf Analyzer to send batch-size-4 requests with shape [3, 224, 224]:
# perf_analyzer runs against a running Triton server
perf_analyzer -m mymodel -b 4 --shape IMAGE:3,224,224
perf_analyzer -m yolov8_trt -b 1 --shape IMAGE:3,640,640
-m specifies the model to profile
-b specifies the batch size of the requests that are sent
--shape provides tensor shapes for models with variable-size inputs
Using model_analyzer
Documentation: https://github.com/triton-inference-server/model_analyzer
1. # Enter the container without starting Triton
docker run --gpus=1 --net=host -v ${PWD}/model_repository:/models -it nvcr.io/nvidia/tritonserver:24.07-py3 bash
2. Run model_analyzer to get the results
root@zhouhaojie-System-Product-Name:/# model-analyzer profile --model-repository /models --profile-models yolov8_torch --output-model-repository-path /yolov8_torch
[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA GeForce RTX 3090 with UUID GPU-e6de677b-484c-ebd5-b4b1-dc435a7f9c1b
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] No checkpoint file found, starting a fresh run.
[Model Analyzer] Profiling server only metrics...
[Model Analyzer]
[Model Analyzer] Starting automatic brute search
[Model Analyzer]
[Model Analyzer] Creating model config: yolov8_torch_config_default
[Model Analyzer]
[Model Analyzer] Profiling yolov8_torch_config_default: concurrency=1
[Model Analyzer] perf_analyzer's request count is too small, increased to 100.
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] Profiling yolov8_torch_config_default: concurrency=2
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] Profiling yolov8_torch_config_default: concurrency=4
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] Profiling yolov8_torch_config_default: concurrency=8
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] Profiling yolov8_torch_config_default: concurrency=16
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] Profiling yolov8_torch_config_default: concurrency=32
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] Profiling yolov8_torch_config_default: concurrency=64
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] No longer increasing concurrency as throughput has plateaued
[Model Analyzer] Creating model config: yolov8_torch_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
.....................
[Model Analyzer] Saved checkpoint to /checkpoints/0.ckpt
[Model Analyzer] Profile complete. Profiled 20 configurations for models: ['yolov8_torch']
[Model Analyzer]
[Model Analyzer] Exporting server only metrics to /results/metrics-server-only.csv
[Model Analyzer] Exporting inference metrics to /results/metrics-model-inference.csv
[Model Analyzer] Exporting GPU metrics to /results/metrics-model-gpu.csv
[Model Analyzer] Warning: html reports are being generated instead of pdf because wkhtmltopdf is not installed. If you want pdf reports, run the following command and then rerun Model Analyzer: "sudo apt-get update && sudo apt-get install wkhtmltopdf"
[Model Analyzer] Exporting Summary Report to /reports/summaries/yolov8_torch/result_summary.html
[Model Analyzer] Generating detailed reports for the best configurations yolov8_torch_config_13,yolov8_torch_config_17,yolov8_torch_config_10:
[Model Analyzer] Exporting Detailed Report to /reports/detailed/yolov8_torch_config_13/detailed_report.html
[Model Analyzer] Exporting Detailed Report to /reports/detailed/yolov8_torch_config_17/detailed_report.html
[Model Analyzer] Exporting Detailed Report to /reports/detailed/yolov8_torch_config_10/detailed_report.html
3. # Install the wkhtmltopdf tool
sudo apt-get install wkhtmltopdf
4. Convert the HTML report to PDF for viewing
wkhtmltopdf result_summary.html result_summary.pdf
This produces results similar to the following:
Deploying yolov8n with Triton Inference Server
yolov8_onnx
git clone https://github.com/wxk-cmd/yolov8_onnx_triton.git
yolov8_trt
git clone https://github.com/wxk-cmd/yolov8_trt_triton.git
yolov8_torch
git clone https://github.com/wxk-cmd/yolov8_torch_trion.git
Testing the three models (gRPC communication, all three model formats loaded)
torch
b1
c1
Request concurrency: 1
Client:
Request count: 3522
Throughput: 195.59 infer/sec
Avg latency: 4862 usec (standard deviation 228 usec)
p50 latency: 4803 usec
p90 latency: 4994 usec
p95 latency: 5434 usec
p99 latency: 5711 usec
Avg HTTP time: 4727 usec (send/recv 1336 usec + response wait 3391 usec)
Server:
Inference count: 3521
Execution count: 3521
Successful request count: 3521
Avg request latency: 2579 usec (overhead 21 usec + queue 51 usec + compute input 367 usec + compute infer 1874 usec + compute output 264 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 195.59 infer/sec, latency 4862 usec
c2
Request concurrency: 1
Client:
Request count: 3169
Throughput: 175.842 infer/sec
Avg latency: 5451 usec (standard deviation 1093 usec)
p50 latency: 4890 usec
p90 latency: 6556 usec
p95 latency: 6938 usec
p99 latency: 10478 usec
Avg HTTP time: 5320 usec (send/recv 1558 usec + response wait 3762 usec)
Server:
Inference count: 3168
Execution count: 3168
Successful request count: 3168
Avg request latency: 2708 usec (overhead 24 usec + queue 53 usec + compute input 395 usec + compute infer 1955 usec + compute output 280 usec)
Request concurrency: 2
Client:
Request count: 4437
Throughput: 246.374 infer/sec
Avg latency: 7825 usec (standard deviation 204 usec)
p50 latency: 7835 usec
p90 latency: 8015 usec
p95 latency: 8065 usec
p99 latency: 8189 usec
Avg HTTP time: 7655 usec (send/recv 1888 usec + response wait 5767 usec)
Server:
Inference count: 4437
Execution count: 4437
Successful request count: 4437
Avg request latency: 3567 usec (overhead 21 usec + queue 32 usec + compute input 659 usec + compute infer 2509 usec + compute output 345 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 175.842 infer/sec, latency 5451 usec
Concurrency: 2, throughput: 246.374 infer/sec, latency 7825 usec
b2
c1
Request concurrency: 1
Client:
Request count: 1919
Throughput: 213.164 infer/sec
Avg latency: 9134 usec (standard deviation 802 usec)
p50 latency: 8881 usec
p90 latency: 9649 usec
p95 latency: 11391 usec
p99 latency: 11814 usec
Avg HTTP time: 8769 usec (send/recv 2649 usec + response wait 6120 usec)
Server:
Inference count: 3836
Execution count: 1918
Successful request count: 1918
Avg request latency: 4383 usec (overhead 27 usec + queue 59 usec + compute input 795 usec + compute infer 3022 usec + compute output 479 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 213.164 infer/sec, latency 9134 usec
c2
Request concurrency: 1
Client:
Request count: 1850
Throughput: 205.49 infer/sec
Avg latency: 9479 usec (standard deviation 739 usec)
p50 latency: 9357 usec
p90 latency: 10262 usec
p95 latency: 11555 usec
p99 latency: 11968 usec
Avg HTTP time: 9118 usec (send/recv 2833 usec + response wait 6285 usec)
Server:
Inference count: 3700
Execution count: 1850
Successful request count: 1850
Avg request latency: 4425 usec (overhead 27 usec + queue 60 usec + compute input 787 usec + compute infer 3027 usec + compute output 522 usec)
Request concurrency: 2
Client:
Request count: 1964
Throughput: 218.13 infer/sec
Avg latency: 18045 usec (standard deviation 609 usec)
p50 latency: 18088 usec
p90 latency: 18466 usec
p95 latency: 18568 usec
p99 latency: 18820 usec
Avg HTTP time: 17675 usec (send/recv 3713 usec + response wait 13962 usec)
Server:
Inference count: 3930
Execution count: 1965
Successful request count: 1965
Avg request latency: 6433 usec (overhead 22 usec + queue 51 usec + compute input 1428 usec + compute infer 4273 usec + compute output 659 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 205.49 infer/sec, latency 9479 usec
Concurrency: 2, throughput: 218.13 infer/sec, latency 18045 usec
b4
c1
Request concurrency: 1
Client:
Request count: 885
Throughput: 196.59 infer/sec
Avg latency: 20128 usec (standard deviation 858 usec)
p50 latency: 19880 usec
p90 latency: 20451 usec
p95 latency: 22474 usec
p99 latency: 23379 usec
Avg HTTP time: 19459 usec (send/recv 7402 usec + response wait 12057 usec)
Server:
Inference count: 3540
Execution count: 885
Successful request count: 885
Avg request latency: 7674 usec (overhead 27 usec + queue 62 usec + compute input 1704 usec + compute infer 4890 usec + compute output 990 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 196.59 infer/sec, latency 20128 usec
c2
Request concurrency: 1
Client:
Request count: 884
Throughput: 196.379 infer/sec
Avg latency: 20145 usec (standard deviation 735 usec)
p50 latency: 19963 usec
p90 latency: 20354 usec
p95 latency: 22392 usec
p99 latency: 23453 usec
Avg HTTP time: 19519 usec (send/recv 7457 usec + response wait 12062 usec)
Server:
Inference count: 3536
Execution count: 884
Successful request count: 884
Avg request latency: 7676 usec (overhead 28 usec + queue 56 usec + compute input 1700 usec + compute infer 4902 usec + compute output 989 usec)
Request concurrency: 2
Client:
Request count: 996
Throughput: 221.265 infer/sec
Avg latency: 35976 usec (standard deviation 734 usec)
p50 latency: 36025 usec
p90 latency: 36607 usec
p95 latency: 36786 usec
p99 latency: 37260 usec
Avg HTTP time: 35355 usec (send/recv 15071 usec + response wait 20284 usec)
Server:
Inference count: 3984
Execution count: 996
Successful request count: 996
Avg request latency: 11740 usec (overhead 23 usec + queue 41 usec + compute input 2743 usec + compute infer 7670 usec + compute output 1263 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 196.379 infer/sec, latency 20145 usec
Concurrency: 2, throughput: 221.265 infer/sec, latency 35976 usec
b8
c1
Request concurrency: 1
Client:
Request count: 412
Throughput: 182.976 infer/sec
Avg latency: 43492 usec (standard deviation 1523 usec)
p50 latency: 43168 usec
p90 latency: 44856 usec
p95 latency: 47426 usec
p99 latency: 49197 usec
Avg HTTP time: 42097 usec (send/recv 19000 usec + response wait 23097 usec)
Server:
Inference count: 3296
Execution count: 412
Successful request count: 412
Avg request latency: 18515 usec (overhead 28 usec + queue 58 usec + compute input 3266 usec + compute infer 8638 usec + compute output 6523 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 182.976 infer/sec, latency 43492 usec
c2
Request concurrency: 1
Client:
Request count: 408
Throughput: 181.182 infer/sec
Avg latency: 43945 usec (standard deviation 1392 usec)
p50 latency: 43671 usec
p90 latency: 44986 usec
p95 latency: 47748 usec
p99 latency: 49460 usec
Avg HTTP time: 42541 usec (send/recv 19310 usec + response wait 23231 usec)
Server:
Inference count: 3264
Execution count: 408
Successful request count: 408
Avg request latency: 18625 usec (overhead 30 usec + queue 51 usec + compute input 3269 usec + compute infer 8657 usec + compute output 6617 usec)
Request concurrency: 2
Client:
Request count: 468
Throughput: 207.889 infer/sec
Avg latency: 76599 usec (standard deviation 2726 usec)
p50 latency: 76549 usec
p90 latency: 80184 usec
p95 latency: 81118 usec
p99 latency: 81621 usec
Avg HTTP time: 75446 usec (send/recv 39362 usec + response wait 36084 usec)
Server:
Inference count: 3760
Execution count: 470
Successful request count: 470
Avg request latency: 27179 usec (overhead 21 usec + queue 54 usec + compute input 5414 usec + compute infer 14454 usec + compute output 7236 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 181.182 infer/sec, latency 43945 usec
Concurrency: 2, throughput: 207.889 infer/sec, latency 76599 usec
onnx
b1
c1
Request concurrency: 1
Client:
Request count: 3335
Throughput: 185.224 infer/sec
Avg latency: 5142 usec (standard deviation 221 usec)
p50 latency: 5093 usec
p90 latency: 5235 usec
p95 latency: 5677 usec
p99 latency: 6034 usec
Avg HTTP time: 5007 usec (send/recv 1315 usec + response wait 3692 usec)
Server:
Inference count: 3335
Execution count: 3335
Successful request count: 3335
Avg request latency: 2848 usec (overhead 18 usec + queue 46 usec + compute input 235 usec + compute infer 2407 usec + compute output 141 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 185.224 infer/sec, latency 5142 usec
c2
Request concurrency: 1
Client:
Request count: 5175
Throughput: 142.777 infer/sec
Avg latency: 6222 usec (standard deviation 2663 usec)
p50 latency: 5918 usec
p90 latency: 6161 usec
p95 latency: 7750 usec
p99 latency: 19400 usec
Avg HTTP time: 5760 usec (send/recv 1475 usec + response wait 4285 usec)
Server:
Inference count: 5175
Execution count: 5175
Successful request count: 5175
Avg request latency: 3305 usec (overhead 20 usec + queue 43 usec + compute input 263 usec + compute infer 2802 usec + compute output 176 usec)
Request concurrency: 2
Client:
Request count: 8174
Throughput: 226.95 infer/sec
Avg latency: 8525 usec (standard deviation 379 usec)
p50 latency: 8510 usec
p90 latency: 8815 usec
p95 latency: 8948 usec
p99 latency: 9317 usec
Avg HTTP time: 8352 usec (send/recv 1967 usec + response wait 6385 usec)
Server:
Inference count: 8174
Execution count: 8174
Successful request count: 8174
Avg request latency: 4131 usec (overhead 17 usec + queue 35 usec + compute input 427 usec + compute infer 3441 usec + compute output 210 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 142.777 infer/sec, latency 6222 usec
Concurrency: 2, throughput: 226.95 infer/sec, latency 8525 usec
b2
c1
Request concurrency: 1
Client:
Request count: 1883
Throughput: 209.169 infer/sec
Avg latency: 9325 usec (standard deviation 879 usec)
p50 latency: 9061 usec
p90 latency: 10078 usec
p95 latency: 11970 usec
p99 latency: 12238 usec
Avg HTTP time: 8942 usec (send/recv 2683 usec + response wait 6259 usec)
Server:
Inference count: 3768
Execution count: 1884
Successful request count: 1884
Avg request latency: 4508 usec (overhead 21 usec + queue 52 usec + compute input 509 usec + compute infer 3647 usec + compute output 277 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 209.169 infer/sec, latency 9325 usec
c2
Request concurrency: 1
Client:
Request count: 3484
Throughput: 193.52 infer/sec
Avg latency: 10073 usec (standard deviation 927 usec)
p50 latency: 10078 usec
p90 latency: 10812 usec
p95 latency: 12007 usec
p99 latency: 12882 usec
Avg HTTP time: 9694 usec (send/recv 3004 usec + response wait 6690 usec)
Server:
Inference count: 6968
Execution count: 3484
Successful request count: 3484
Avg request latency: 4688 usec (overhead 25 usec + queue 59 usec + compute input 531 usec + compute infer 3739 usec + compute output 334 usec)
Request concurrency: 2
Client:
Request count: 3944
Throughput: 218.987 infer/sec
Avg latency: 17989 usec (standard deviation 1699 usec)
p50 latency: 18679 usec
p90 latency: 19064 usec
p95 latency: 19188 usec
p99 latency: 19526 usec
Avg HTTP time: 17637 usec (send/recv 3724 usec + response wait 13913 usec)
Server:
Inference count: 7884
Execution count: 3942
Successful request count: 3942
Avg request latency: 6904 usec (overhead 19 usec + queue 45 usec + compute input 915 usec + compute infer 5511 usec + compute output 413 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 193.52 infer/sec, latency 10073 usec
Concurrency: 2, throughput: 218.987 infer/sec, latency 17989 usec
b4
c1
Request concurrency: 1
Client:
Request count: 892
Throughput: 198.153 infer/sec
Avg latency: 19964 usec (standard deviation 748 usec)
p50 latency: 19797 usec
p90 latency: 20178 usec
p95 latency: 22129 usec
p99 latency: 22984 usec
Avg HTTP time: 19308 usec (send/recv 7404 usec + response wait 11904 usec)
Server:
Inference count: 3568
Execution count: 892
Successful request count: 892
Avg request latency: 7551 usec (overhead 23 usec + queue 50 usec + compute input 1200 usec + compute infer 5736 usec + compute output 541 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 198.153 infer/sec, latency 19964 usec
c2
Request concurrency: 1
Client:
Request count: 1774
Throughput: 197.07 infer/sec
Avg latency: 20076 usec (standard deviation 853 usec)
p50 latency: 19890 usec
p90 latency: 20354 usec
p95 latency: 22358 usec
p99 latency: 23410 usec
Avg HTTP time: 19412 usec (send/recv 7451 usec + response wait 11961 usec)
Server:
Inference count: 7092
Execution count: 1773
Successful request count: 1773
Avg request latency: 7599 usec (overhead 23 usec + queue 56 usec + compute input 1202 usec + compute infer 5766 usec + compute output 551 usec)
Request concurrency: 2
Client:
Request count: 2028
Throughput: 225.281 infer/sec
Avg latency: 35282 usec (standard deviation 924 usec)
p50 latency: 35344 usec
p90 latency: 36142 usec
p95 latency: 36647 usec
p99 latency: 37496 usec
Avg HTTP time: 34659 usec (send/recv 14934 usec + response wait 19725 usec)
Server:
Inference count: 8112
Execution count: 2028
Successful request count: 2028
Avg request latency: 11434 usec (overhead 19 usec + queue 38 usec + compute input 1689 usec + compute infer 8924 usec + compute output 763 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 197.07 infer/sec, latency 20076 usec
Concurrency: 2, throughput: 225.281 infer/sec, latency 35282 usec
b8
c1
Request concurrency: 1
Client:
Request count: 416
Throughput: 184.832 infer/sec
Avg latency: 43102 usec (standard deviation 1572 usec)
p50 latency: 42577 usec
p90 latency: 44510 usec
p95 latency: 47419 usec
p99 latency: 48964 usec
Avg HTTP time: 41664 usec (send/recv 18710 usec + response wait 22954 usec)
Server:
Inference count: 3320
Execution count: 415
Successful request count: 415
Avg request latency: 18408 usec (overhead 22 usec + queue 38 usec + compute input 2548 usec + compute infer 10145 usec + compute output 5654 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 184.832 infer/sec, latency 43102 usec
c2
Request concurrency: 1
Client:
Request count: 822
Throughput: 182.588 infer/sec
Avg latency: 43603 usec (standard deviation 1584 usec)
p50 latency: 43176 usec
p90 latency: 44702 usec
p95 latency: 47756 usec
p99 latency: 49581 usec
Avg HTTP time: 42148 usec (send/recv 18939 usec + response wait 23209 usec)
Server:
Inference count: 6576
Execution count: 822
Successful request count: 822
Avg request latency: 18627 usec (overhead 24 usec + queue 46 usec + compute input 2559 usec + compute infer 10220 usec + compute output 5777 usec)
Request concurrency: 2
Client:
Request count: 947
Throughput: 210.385 infer/sec
Avg latency: 75793 usec (standard deviation 1640 usec)
p50 latency: 75636 usec
p90 latency: 77570 usec
p95 latency: 78188 usec
p99 latency: 79471 usec
Avg HTTP time: 74602 usec (send/recv 39123 usec + response wait 35479 usec)
Server:
Inference count: 7576
Execution count: 947
Successful request count: 947
Avg request latency: 26491 usec (overhead 19 usec + queue 37 usec + compute input 3787 usec + compute infer 16516 usec + compute output 6132 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 182.588 infer/sec, latency 43603 usec
Concurrency: 2, throughput: 210.385 infer/sec, latency 75793 usec
tensorrt_FP16
b1
c1
Request concurrency: 1
Client:
Request count: 4260
Throughput: 236.573 infer/sec
Avg latency: 3984 usec (standard deviation 186 usec)
p50 latency: 3934 usec
p90 latency: 4107 usec
p95 latency: 4519 usec
p99 latency: 4607 usec
Avg HTTP time: 3848 usec (send/recv 1276 usec + response wait 2572 usec)
Server:
Inference count: 4259
Execution count: 4259
Successful request count: 4259
Avg request latency: 1790 usec (overhead 11 usec + queue 21 usec + compute input 379 usec + compute infer 1107 usec + compute output 271 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 236.573 infer/sec, latency 3984 usec
c2
Request concurrency: 1
Client:
Request count: 4298
Throughput: 238.678 infer/sec
Avg latency: 3942 usec (standard deviation 220 usec)
p50 latency: 3885 usec
p90 latency: 4050 usec
p95 latency: 4502 usec
p99 latency: 4667 usec
Avg HTTP time: 3806 usec (send/recv 1301 usec + response wait 2505 usec)
Server:
Inference count: 4298
Execution count: 4298
Successful request count: 4298
Avg request latency: 1709 usec (overhead 10 usec + queue 31 usec + compute input 370 usec + compute infer 1023 usec + compute output 274 usec)
Request concurrency: 2
Client:
Request count: 6379
Throughput: 353.937 infer/sec
Avg latency: 5279 usec (standard deviation 653 usec)
p50 latency: 5337 usec
p90 latency: 5660 usec
p95 latency: 6378 usec
p99 latency: 6958 usec
Avg HTTP time: 5043 usec (send/recv 1414 usec + response wait 3629 usec)
Server:
Inference count: 6380
Execution count: 6348
Successful request count: 6380
Avg request latency: 1934 usec (overhead 12 usec + queue 159 usec + compute input 409 usec + compute infer 1045 usec + compute output 307 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 238.678 infer/sec, latency 3942 usec
Concurrency: 2, throughput: 353.937 infer/sec, latency 5279 usec
b2
c1
Request concurrency: 1
Client:
Request count: 2404
Throughput: 267.045 infer/sec
Avg latency: 7257 usec (standard deviation 382 usec)
p50 latency: 7162 usec
p90 latency: 7310 usec
p95 latency: 8611 usec
p99 latency: 8769 usec
Avg HTTP time: 6955 usec (send/recv 2558 usec + response wait 4397 usec)
Server:
Inference count: 4810
Execution count: 2405
Successful request count: 2405
Avg request latency: 2750 usec (overhead 16 usec + queue 51 usec + compute input 813 usec + compute infer 1355 usec + compute output 514 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 267.045 infer/sec, latency 7257 usec
c2
Request concurrency: 1
Client:
Request count: 2461
Throughput: 273.368 infer/sec
Avg latency: 7085 usec (standard deviation 436 usec)
p50 latency: 6977 usec
p90 latency: 7204 usec
p95 latency: 8503 usec
p99 latency: 8722 usec
Avg HTTP time: 6767 usec (send/recv 2553 usec + response wait 4214 usec)
Server:
Inference count: 4924
Execution count: 2462
Successful request count: 2462
Avg request latency: 2634 usec (overhead 16 usec + queue 52 usec + compute input 798 usec + compute infer 1240 usec + compute output 528 usec)
Request concurrency: 2
Client:
Request count: 3365
Throughput: 373.8 infer/sec
Avg latency: 10484 usec (standard deviation 1881 usec)
p50 latency: 10084 usec
p90 latency: 12727 usec
p95 latency: 15552 usec
p99 latency: 17778 usec
Avg HTTP time: 10001 usec (send/recv 2970 usec + response wait 7031 usec)
Server:
Inference count: 6730
Execution count: 3302
Successful request count: 3365
Avg request latency: 3190 usec (overhead 17 usec + queue 201 usec + compute input 965 usec + compute infer 1410 usec + compute output 597 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 273.368 infer/sec, latency 7085 usec
Concurrency: 2, throughput: 373.8 infer/sec, latency 10484 usec
b4
c1
Request concurrency: 1
Client:
Request count: 1053
Throughput: 233.934 infer/sec
Avg latency: 16870 usec (standard deviation 848 usec)
p50 latency: 16685 usec
p90 latency: 17229 usec
p95 latency: 19008 usec
p99 latency: 19969 usec
Avg HTTP time: 16252 usec (send/recv 7318 usec + response wait 8934 usec)
Server:
Inference count: 4212
Execution count: 1053
Successful request count: 1053
Avg request latency: 4682 usec (overhead 18 usec + queue 69 usec + compute input 1742 usec + compute infer 1837 usec + compute output 1015 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 233.934 infer/sec, latency 16870 usec
c2
Request concurrency: 1
Client:
Request count: 1052
Throughput: 233.688 infer/sec
Avg latency: 16911 usec (standard deviation 842 usec)
p50 latency: 16718 usec
p90 latency: 17235 usec
p95 latency: 19353 usec
p99 latency: 20623 usec
Avg HTTP time: 16270 usec (send/recv 7345 usec + response wait 8925 usec)
Server:
Inference count: 4204
Execution count: 1051
Successful request count: 1051
Avg request latency: 4546 usec (overhead 21 usec + queue 90 usec + compute input 1705 usec + compute infer 1701 usec + compute output 1029 usec)
Request concurrency: 2
Client:
Request count: 1294
Throughput: 287.364 infer/sec
Avg latency: 27631 usec (standard deviation 2043 usec)
p50 latency: 28183 usec
p90 latency: 28992 usec
p95 latency: 30069 usec
p99 latency: 32945 usec
Avg HTTP time: 26650 usec (send/recv 13590 usec + response wait 13060 usec)
Server:
Inference count: 5176
Execution count: 1287
Successful request count: 1294
Avg request latency: 5292 usec (overhead 19 usec + queue 312 usec + compute input 1892 usec + compute infer 1898 usec + compute output 1170 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 233.688 infer/sec, latency 16911 usec
Concurrency: 2, throughput: 287.364 infer/sec, latency 27631 usec
b8
c1
Request concurrency: 1
Client:
Request count: 468
Throughput: 207.851 infer/sec
Avg latency: 38278 usec (standard deviation 1703 usec)
p50 latency: 37936 usec
p90 latency: 39269 usec
p95 latency: 42535 usec
p99 latency: 44197 usec
Avg HTTP time: 36864 usec (send/recv 19105 usec + response wait 17759 usec)
Server:
Inference count: 3744
Execution count: 468
Successful request count: 468
Avg request latency: 13225 usec (overhead 23 usec + queue 84 usec + compute input 3365 usec + compute infer 3165 usec + compute output 6587 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 207.851 infer/sec, latency 38278 usec
c2
Request concurrency: 1
Client:
Request count: 539
Throughput: 239.261 infer/sec
Avg latency: 33185 usec (standard deviation 1699 usec)
p50 latency: 32559 usec
p90 latency: 34941 usec
p95 latency: 37564 usec
p99 latency: 39708 usec
Avg HTTP time: 31713 usec (send/recv 18611 usec + response wait 13102 usec)
Server:
Inference count: 4312
Execution count: 539
Successful request count: 539
Avg request latency: 8506 usec (overhead 24 usec + queue 116 usec + compute input 3280 usec + compute infer 3023 usec + compute output 2062 usec)
Request concurrency: 2
Client:
Request count: 705
Throughput: 313.003 infer/sec
Avg latency: 50868 usec (standard deviation 3212 usec)
p50 latency: 50277 usec
p90 latency: 54319 usec
p95 latency: 58798 usec
p99 latency: 62290 usec
Avg HTTP time: 49461 usec (send/recv 30521 usec + response wait 18940 usec)
Server:
Inference count: 5640
Execution count: 705
Successful request count: 705
Avg request latency: 10355 usec (overhead 24 usec + queue 161 usec + compute input 3607 usec + compute infer 3072 usec + compute output 3490 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 239.261 infer/sec, latency 33185 usec
Concurrency: 2, throughput: 313.003 infer/sec, latency 50868 usec
Throughput (infer/sec)
c1
Model | b=1 | b=2 | b=4 | b=8 |
---|---|---|---|---|
torch | 195.59 | 213.164 | 196.59 | 182.976 |
onnx | 185.224 | 209.169 | 198.153 | 184.832 |
tensorrt | 236.573 | 267.045 | 233.934 | 207.851 |
c2
Model | b=1 | b=2 | b=4 | b=8 |
---|---|---|---|---|
torch | 175.842/246.374 | 205.49/218.13 | 196.379/221.265 | 181.182/207.889 |
onnx | 142.777/226.95? | 193.52/218.987 | 197.07/225.281 | 182.588/210.385 |
tensorrt | 238.678/353.937 | 273.368/373.8 | 233.688/287.364 | 239.261/313.003 |
Latency (average server-side request latency / average client-observed latency)
c1
Model | b=1 | b=2 | b=4 | b=8 |
---|---|---|---|---|
torch | 2579 usec/4862 usec | 4383 usec/9134 usec | 7674 usec/20128 usec | 18515 usec/43492 usec |
onnx | 2848 usec/5142 usec | 4508 usec/9325 usec | 7551 usec/19964 usec | 18408 usec/43102 usec |
tensorrt | 1790 usec/3984 usec | 2750 usec/7257 usec | 4682 usec/16870 usec | 13225 usec/38278 usec |
c2
Model | b=1 | b=2 | b=4 | b=8 |
---|---|---|---|---|
torch | 2708 usec/5451 usec, 3567 usec/7825 usec | 4425 usec/9479 usec, 6433 usec/18045 usec | 7676 usec/20145 usec, 11740 usec/35976 usec | 18625 usec/43945 usec, 27179 usec/76599 usec |
onnx | 3305 usec/6222 usec, 4131 usec/8525 usec | 4688 usec/10073 usec, 6904 usec/17989 usec | 7599 usec/20076 usec, 11434 usec/35282 usec | 18627 usec/43603 usec, 26491 usec/75793 usec |
tensorrt | 1709 usec/3942 usec, 1934 usec/5279 usec | 2634 usec/7085 usec, 3190 usec/10484 usec | 4546 usec/16911 usec, 5292 usec/27631 usec | 8506 usec/33185 usec, 10355 usec/50868 usec |