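1. Introduction to CV-CUDA
NVIDIA CV-CUDA™ is an open-source project for building cloud-scale artificial intelligence (AI) imaging and computer vision (CV) applications. It uses graphics processing unit (GPU) acceleration to help developers build efficient pre- and postprocessing pipelines. It can increase throughput by more than 10x while lowering cloud computing costs.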
Code repository: https://github.com/CVCUDA/CV-CUDA/
Online documentation: https://cvcuda.github.io/
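2. Installing CV-CUDA
Installing CV-CUDA places requirements on the operating system, the CUDA version, and the driver version:
Ubuntu >= 20.04 (22.04 recommended for building the documentation)
CUDA >= 11.7 (CUDA 12 required for samples)
NVIDIA driver r525 or later (r535 required for samples)
There are two ways to install CV-CUDA. Pick whichever suits your environment.
(1) Install from the tar packages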
tar -xvf cvcuda-lib-<x.x.x>-<cu_ver>-<arch>-linux.tar.xz
tar -xvf cvcuda-dev-<x.x.x>-<cu_ver>-<arch>-linux.tar.xz
(2) Install from the deb packages
sudo apt install -y ./cvcuda-lib-<x.x.x>-<cu_ver>-<arch>-linux.deb
sudo apt install -y ./cvcuda-dev-<x.x.x>-<cu_ver>-<arch>-linux.deb
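The default installation directory is /opt/nvidia/cvcuda0/.
3. Using CV-CUDA
Running inference on an image generally involves three stages: preprocessing, inference, and postprocessing. Typically the pre- and postprocessing run on the CPU and the inference runs on the GPU. That works fine for a handful of images, but once several processes are pushing large numbers of images through inference, CPU utilization spikes. The goal of image inference is to be as fast as possible (accuracy is usually a given) while using as few resources as possible. For the inference stage we can use TensorRT for acceleration, and the pre- and postprocessing can likewise be moved onto the GPU, which is where CV-CUDA comes in.
CV-CUDA is mainly used for the preprocessing stage (postprocessing is mostly data manipulation, which can be accelerated with custom CUDA kernels). Below I show how to use CV-CUDA for image preprocessing.
In CV-CUDA, data on the GPU is represented by nvcv::Tensor. Image preprocessing needs two tensors: one for the original input image and one for the model input data. Both can be constructed ahead of time from the original image size and the model input size: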
cudaStream_t stream;
CHECK_CUDA_ERROR(cudaStreamCreate(&stream));
// Allocating memory for input image batch
nvcv::TensorDataStridedCuda::Buffer inBuf;
inBuf.strides[3] = sizeof(uint8_t);
inBuf.strides[2] = maxChannels * inBuf.strides[3];
inBuf.strides[1] = maxImageWidth * inBuf.strides[2];
inBuf.strides[0] = maxImageHeight * inBuf.strides[1];
CHECK_CUDA_ERROR(cudaMallocAsync(&inBuf.basePtr, batchSize * inBuf.strides[0], stream));
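// Describe the whole batch so the tensor shape matches the batchSize-sized
// allocation above and the batched output tensors used by the operators below
nvcv::Tensor::Requirements inReqs
= nvcv::Tensor::CalcRequirements(batchSize, {maxImageWidth, maxImageHeight}, nvcv::FMT_RGB8);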
nvcv::TensorDataStridedCuda inData(nvcv::TensorShape{inReqs.shape, inReqs.rank, inReqs.layout},
nvcv::DataType{inReqs.dtype}, inBuf);
nvcv::Tensor inTensor = TensorWrapData(inData);
// Allocating memory for the model input tensor (planar float32)
nvcv::Tensor::Requirements reqsInputLayer
= nvcv::Tensor::CalcRequirements(batchSize, {inputDims.width, inputDims.height}, nvcv::FMT_RGBf32p);
// Calculates the total buffer size needed based on the requirements
int64_t inputLayerSize = CalcTotalSizeBytes(nvcv::Requirements{reqsInputLayer.mem}.cudaMem());
nvcv::TensorDataStridedCuda::Buffer bufInputLayer;
std::copy(reqsInputLayer.strides, reqsInputLayer.strides + NVCV_TENSOR_MAX_RANK, bufInputLayer.strides);
// Allocate buffer size needed for the tensor
CHECK_CUDA_ERROR(cudaMalloc(&bufInputLayer.basePtr, inputLayerSize));
// Wrap the tensor as a CVCUDA tensor
nvcv::TensorDataStridedCuda inputLayerTensorData(
nvcv::TensorShape{reqsInputLayer.shape, reqsInputLayer.rank, reqsInputLayer.layout},
nvcv::DataType{reqsInputLayer.dtype}, bufInputLayer);
nvcv::Tensor inputLayerTensor = TensorWrapData(inputLayerTensorData);
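The stride values assigned to inBuf above follow the usual rule for a densely packed NHWC buffer: each stride is the byte size of everything nested inside that dimension. As a host-only sketch of that arithmetic (packedNhwcStrides and byteOffset are hypothetical helpers I made up for illustration, they are not part of CV-CUDA), it could look like this:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a CV-CUDA API): byte strides of a densely packed
// NHWC tensor, in the same order as inBuf.strides[0..3] in the code above.
std::array<int64_t, 4> packedNhwcStrides(int64_t height, int64_t width,
                                         int64_t channels, int64_t elemSize)
{
    assert(height > 0 && width > 0 && channels > 0 && elemSize > 0);
    std::array<int64_t, 4> s{};
    s[3] = elemSize;         // step between channels of one pixel
    s[2] = channels * s[3];  // step between neighbouring pixels in a row
    s[1] = width * s[2];     // step between rows
    s[0] = height * s[1];    // step between images in the batch
    return s;
}

// Byte offset of element (n, y, x, c) relative to the buffer's base pointer.
int64_t byteOffset(const std::array<int64_t, 4> &s,
                   int64_t n, int64_t y, int64_t x, int64_t c)
{
    return n * s[0] + y * s[1] + x * s[2] + c * s[3];
}
```

For a 1920x1080 RGB8 image this gives strides of 6220800, 5760, 3 and 1 bytes, so a batch needs batchSize * 6220800 bytes, which is the size passed to cudaMallocAsync above.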
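Once the tensors are defined, copy the image data from host memory into inTensor:
// Copy one image from host memory into the tensor.
// stride(0) is the byte size of one image slot, so this assumes input_image
// is a tightly packed maxImageWidth x maxImageHeight RGB8 image.
auto image_data = inTensor.exportData<nvcv::TensorDataStridedCuda>();
CHECK_CUDA_ERROR(cudaMemcpyAsync(image_data->basePtr(), input_image.data, image_data->stride(0),
cudaMemcpyHostToDevice, stream));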
Resize
Resizing serves as the example of how CV-CUDA operators are used. The operator class for resizing is cvcuda::Resize. Before calling an operator, construct a tensor to hold its output:
nvcv::Tensor resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);
Then create a cvcuda::Resize object, resizeOp, and invoke its () operator:
cvcuda::Resize resizeOp;
resizeOp(stream, inTensor, resizedTensor, NVCV_INTERP_LINEAR);
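That is all there is to it: the resized data ends up in resizedTensor. Readers who are curious about the implementation can read the source; here I only cover basic usage.
Most other CV-CUDA operators follow the same design, so they are used in much the same way.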
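When debugging a GPU pipeline it helps to have a small CPU reference to compare against. The sketch below is my own plain C++ version of bilinear interpolation on an interleaved uint8 image. It assumes the common half-pixel-center coordinate mapping with edge clamping; I have not verified that NVCV_INTERP_LINEAR uses exactly this convention, so small differences near the borders are possible. The function name resizeBilinearHwc is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for bilinear resize of one interleaved (HWC) uint8 image.
// Assumption: half-pixel-center mapping with clamping at the image edges.
std::vector<uint8_t> resizeBilinearHwc(const std::vector<uint8_t> &src, int srcW, int srcH,
                                       int dstW, int dstH, int channels)
{
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0 && channels > 0);
    assert(src.size() == static_cast<size_t>(srcW) * srcH * channels);

    std::vector<uint8_t> dst(static_cast<size_t>(dstW) * dstH * channels);
    const float scaleX = static_cast<float>(srcW) / dstW;
    const float scaleY = static_cast<float>(srcH) / dstH;

    for (int y = 0; y < dstH; ++y)
    {
        // Map the destination pixel center back into source coordinates.
        const float fy = std::max(0.0f, (y + 0.5f) * scaleY - 0.5f);
        const int   y0 = std::min(static_cast<int>(fy), srcH - 1);
        const int   y1 = std::min(y0 + 1, srcH - 1);
        const float wy = fy - y0;

        for (int x = 0; x < dstW; ++x)
        {
            const float fx = std::max(0.0f, (x + 0.5f) * scaleX - 0.5f);
            const int   x0 = std::min(static_cast<int>(fx), srcW - 1);
            const int   x1 = std::min(x0 + 1, srcW - 1);
            const float wx = fx - x0;

            for (int c = 0; c < channels; ++c)
            {
                // Read one source sample as float.
                auto at = [&](int yy, int xx) {
                    return static_cast<float>(
                        src[(static_cast<size_t>(yy) * srcW + xx) * channels + c]);
                };
                // Blend horizontally on the two rows, then vertically.
                const float top    = (1.0f - wx) * at(y0, x0) + wx * at(y0, x1);
                const float bottom = (1.0f - wx) * at(y1, x0) + wx * at(y1, x1);
                const float v      = (1.0f - wy) * top + wy * bottom;
                // Round to nearest; v stays within [0, 255].
                dst[(static_cast<size_t>(y) * dstW + x) * channels + c]
                    = static_cast<uint8_t>(v + 0.5f);
            }
        }
    }
    return dst;
}
```

Copying resizedTensor back to the host and comparing it with this reference, allowing a tolerance of a few gray levels, is a cheap sanity check that the tensor shapes and strides were set up correctly.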
ConvertTo
Next is ConvertTo, an operator that converts the data type. First create a tensor for the output:
nvcv::Tensor floatTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);
Then create the operator and call it:
cvcuda::ConvertTo convertOp;
convertOp(stream, resizedTensor, floatTensor, 1.0f , 0.0f);
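I won't go through the remaining operators one by one. The code below shows the full sequence, including normalization and the change of channel layout.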
// Resize to the dimensions of input layer of network
nvcv::Tensor resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);
cvcuda::Resize resizeOp;
resizeOp(stream, inTensor, resizedTensor, NVCV_INTERP_LINEAR);
// Convert to data format expected by network (F32). Apply scale 1/255f.
nvcv::Tensor floatTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);
cvcuda::ConvertTo convertOp;
convertOp(stream, resizedTensor, floatTensor, 1.0f / 255.f, 0.0f);
// The input to the network needs to be normalized based on the mean and std deviation values
// to standardize the input data.
// Create a Tensor to store the standard deviation values for R,G,B
nvcv::Tensor::Requirements reqsScale = nvcv::Tensor::CalcRequirements(1, {1, 1}, nvcv::FMT_RGBf32);
int64_t scaleBufferSize = CalcTotalSizeBytes(nvcv::Requirements{reqsScale.mem}.cudaMem());
nvcv::TensorDataStridedCuda::Buffer bufScale;
std::copy(reqsScale.strides, reqsScale.strides + NVCV_TENSOR_MAX_RANK, bufScale.strides);
CHECK_CUDA_ERROR(cudaMalloc(&bufScale.basePtr, scaleBufferSize));
nvcv::TensorDataStridedCuda scaleIn(nvcv::TensorShape{reqsScale.shape, reqsScale.rank, reqsScale.layout},
nvcv::DataType{reqsScale.dtype}, bufScale);
nvcv::Tensor stddevTensor = TensorWrapData(scaleIn);
// Create a Tensor to store the mean values for R,G,B
nvcv::TensorDataStridedCuda::Buffer bufBase;
nvcv::Tensor::Requirements reqsBase = nvcv::Tensor::CalcRequirements(1, {1, 1}, nvcv::FMT_RGBf32);
int64_t baseBufferSize = CalcTotalSizeBytes(nvcv::Requirements{reqsBase.mem}.cudaMem());
std::copy(reqsBase.strides, reqsBase.strides + NVCV_TENSOR_MAX_RANK, bufBase.strides);
CHECK_CUDA_ERROR(cudaMalloc(&bufBase.basePtr, baseBufferSize));
nvcv::TensorDataStridedCuda baseIn(nvcv::TensorShape{reqsBase.shape, reqsBase.rank, reqsBase.layout},
nvcv::DataType{reqsBase.dtype}, bufBase);
nvcv::Tensor meanTensor = TensorWrapData(baseIn);
// Copy the values from Host to Device
// The R,G,B scale and mean will be applied to all the pixels across the batch of input images
float stddev[3] = {0.229f, 0.224f, 0.225f};
float mean[3] = {0.485f, 0.456f, 0.406f};
auto meanData = meanTensor.exportData<nvcv::TensorDataStridedCuda>();
auto stddevData = stddevTensor.exportData<nvcv::TensorDataStridedCuda>();
// Flag to set the scale value as standard deviation i.e use 1/scale
uint32_t flags = CVCUDA_NORMALIZE_SCALE_IS_STDDEV;
CHECK_CUDA_ERROR(cudaMemcpyAsync(stddevData->basePtr(), stddev, 3 * sizeof(float), cudaMemcpyHostToDevice, stream));
CHECK_CUDA_ERROR(cudaMemcpyAsync(meanData->basePtr(), mean, 3 * sizeof(float), cudaMemcpyHostToDevice, stream));
nvcv::Tensor normTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);
// Normalize
cvcuda::Normalize normOp;
normOp(stream, floatTensor, meanTensor, stddevTensor, normTensor, 1.0f, 0.0f, 0.0f, flags);
// Convert the data layout from interleaved to planar
cvcuda::Reformat reformatOp;
// Write into the planar model-input tensor (inputLayerTensor) created earlier
reformatOp(stream, normTensor, inputLayerTensor);
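To make the last two steps concrete, here is a host-only sketch of what they compute for a single image. It assumes, as the comment on the flag says, that CVCUDA_NORMALIZE_SCALE_IS_STDDEV makes the operator divide by the supplied standard deviation, and that the global scale, shift and epsilon are 1, 0 and 0 as in the call above. The function name normalizeAndToPlanar is hypothetical and the code is a reference for checking results, not how CV-CUDA implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for one interleaved float image:
//   1. per-channel normalization: (x - mean[c]) / stddev[c]
//   2. layout change from interleaved (HWC) to planar (CHW)
std::vector<float> normalizeAndToPlanar(const std::vector<float> &hwc, int width, int height,
                                        const std::vector<float> &mean,
                                        const std::vector<float> &stddev)
{
    const size_t channels = mean.size();
    assert(width > 0 && height > 0 && channels > 0);
    assert(stddev.size() == channels);

    const size_t plane = static_cast<size_t>(width) * height; // pixels per channel plane
    assert(hwc.size() == plane * channels);

    std::vector<float> chw(hwc.size());
    for (size_t p = 0; p < plane; ++p)
    {
        for (size_t c = 0; c < channels; ++c)
        {
            // Source index walks pixel by pixel; destination index walks plane by plane.
            chw[c * plane + p] = (hwc[p * channels + c] - mean[c]) / stddev[c];
        }
    }
    return chw;
}
```

For two pixels {1, 2, 3} and {4, 5, 6} with zero mean and unit standard deviation, the planar result is {1, 4, 2, 5, 3, 6}: all R values first, then all G values, then all B values.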
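For the details of each function's parameters, please consult the documentation. That wraps up this short introduction to CV-CUDA programming, shared here as a starting point.
See you next time!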