Getting Started with CV-CUDA

Latest recommended article published on 2025-06-24 09:17:13

优雅的潮叭

License: CC 4.0 BY-SA

Tags: c++, image processing, YOLO, artificial intelligence, opencv

Article link: https://blog.csdn.net/Done_for_me/article/details/140958945

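1. Introduction to CV-CUDA

NVIDIA CV-CUDA™ is an open-source project for building cloud-scale artificial intelligence (AI) imaging and computer vision (CV) applications. It uses graphics processing unit (GPU) acceleration to help developers build efficient pre-processing and post-processing pipelines. According to the project, it can raise throughput by more than 10x while lowering cloud computing costs.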
Code repository: https://github.com/CVCUDA/CV-CUDA/
Online documentation: https://cvcuda.github.io/

2. Installing CV-CUDA

Installing CV-CUDA places requirements on the operating system, the CUDA version, and the driver version:

Ubuntu >= 20.04 (22.04 recommended for building the documentation)
CUDA >= 11.7 (cuda 12 required for samples)
NVIDIA driver r525 or later (r535 required for samples)

There are two ways to install CV-CUDA. Pick whichever suits your environment.

(1) Install from the tar archives

tar -xvf cvcuda-lib-<x.x.x>-<cu_ver>-<arch>-linux.tar.xz
tar -xvf cvcuda-dev-<x.x.x>-<cu_ver>-<arch>-linux.tar.xz

(2) Install from the deb packages

sudo apt install -y ./cvcuda-lib-<x.x.x>-<cu_ver>-<arch>-linux.deb
sudo apt install -y ./cvcuda-dev-<x.x.x>-<cu_ver>-<arch>-linux.deb

The default installation directory is /opt/nvidia/cvcuda0/.

3. Using CV-CUDA

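Running inference on an image usually involves three stages: pre-processing, inference, and post-processing. Typically the pre- and post-processing run on the CPU while inference runs on the GPU. With only a few images this works fine, but when many processes push large numbers of images through inference, CPU utilization spikes. The goal of inference is to be as fast as possible (accuracy is usually already assured) while using as few resources as possible. The inference stage itself can be accelerated with TensorRT, and the pre- and post-processing can likewise be moved onto the GPU, which is where CV-CUDA comes in.

CV-CUDA is mainly used here for image pre-processing (post-processing is mostly plain data manipulation and can be accelerated with custom CUDA kernels). The rest of this article shows how to use CV-CUDA for image pre-processing.

In CV-CUDA, data on the GPU is represented by nvcv::Tensor. Pre-processing needs two tensors: one for the original input image and one for the model input data. Both can be built ahead of time from the original image size and the model input size: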

    cudaStream_t stream;
    CHECK_CUDA_ERROR(cudaStreamCreate(&stream));

    // Allocate device memory for the input image batch (packed NHWC, uint8)
    nvcv::TensorDataStridedCuda::Buffer inBuf;
    inBuf.strides[3] = sizeof(uint8_t);                   // bytes per channel
    inBuf.strides[2] = maxChannels * inBuf.strides[3];    // bytes per pixel
    inBuf.strides[1] = maxImageWidth * inBuf.strides[2];  // bytes per row
    inBuf.strides[0] = maxImageHeight * inBuf.strides[1]; // bytes per image
    CHECK_CUDA_ERROR(cudaMallocAsync(&inBuf.basePtr, batchSize * inBuf.strides[0], stream));

    // Describe the tensor shape. The batch dimension must match the allocation above
    // and the output tensors of later operators, so batchSize is used here rather than 1.
    nvcv::Tensor::Requirements inReqs
        = nvcv::Tensor::CalcRequirements(batchSize, {maxImageWidth, maxImageHeight}, nvcv::FMT_RGB8);

    nvcv::TensorDataStridedCuda inData(nvcv::TensorShape{inReqs.shape, inReqs.rank, inReqs.layout},
                                       nvcv::DataType{inReqs.dtype}, inBuf);

    // Wrap the externally allocated buffer as a CV-CUDA tensor (no copy)
    nvcv::Tensor inTensor = TensorWrapData(inData);

    // Tensor that feeds the network input layer (planar float32 RGB)
    nvcv::Tensor::Requirements reqsInputLayer
        = nvcv::Tensor::CalcRequirements(batchSize, {inputDims.width, inputDims.height}, nvcv::FMT_RGBf32p);
    // Calculates the total buffer size needed based on the requirements
    int64_t inputLayerSize = CalcTotalSizeBytes(nvcv::Requirements{reqsInputLayer.mem}.cudaMem());
    nvcv::TensorDataStridedCuda::Buffer bufInputLayer;
    std::copy(reqsInputLayer.strides, reqsInputLayer.strides + NVCV_TENSOR_MAX_RANK, bufInputLayer.strides);
    // Allocate buffer size needed for the tensor
    CHECK_CUDA_ERROR(cudaMalloc(&bufInputLayer.basePtr, inputLayerSize));
    // Wrap the tensor as a CVCUDA tensor
    nvcv::TensorDataStridedCuda inputLayerTensorData(
        nvcv::TensorShape{reqsInputLayer.shape, reqsInputLayer.rank, reqsInputLayer.layout},
        nvcv::DataType{reqsInputLayer.dtype}, bufInputLayer);
    nvcv::Tensor inputLayerTensor = TensorWrapData(inputLayerTensorData);
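
The stride arithmetic used for the input buffer can be checked on the host without a GPU. The following is a minimal sketch, assuming a densely packed NHWC layout with no row padding, as in the hand-written inBuf strides. PackedStrides, packedNhwcStrides, and packedBatchBytes are hypothetical names introduced for illustration and are not part of CV-CUDA. The strides that CalcRequirements reports may include alignment padding, so they are not guaranteed to equal these values.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Byte strides of a densely packed NHWC tensor (hypothetical helper, not a CV-CUDA API).
// Index 0 = sample (N), 1 = row (H), 2 = pixel (W), 3 = channel (C).
using PackedStrides = std::array<int64_t, 4>;

PackedStrides packedNhwcStrides(int64_t height, int64_t width, int64_t channels, int64_t elemBytes)
{
    assert(height > 0 && width > 0 && channels > 0 && elemBytes > 0);
    PackedStrides s{};
    s[3] = elemBytes;       // step to the next channel
    s[2] = channels * s[3]; // step to the next pixel
    s[1] = width * s[2];    // step to the next row
    s[0] = height * s[1];   // step to the next image in the batch
    return s;
}

// Total bytes needed for a batch laid out with the strides above.
int64_t packedBatchBytes(const PackedStrides &s, int64_t batchSize)
{
    assert(batchSize > 0);
    return batchSize * s[0];
}
```

For a 1920 x 1080 RGB8 image, packedNhwcStrides(1080, 1920, 3, 1) gives 5760 bytes per row and 6220800 bytes per image, which is the amount the cudaMallocAsync call reserves per batch entry.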

Once both tensors are defined, copy the image data from host memory into inTensor:

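    // Copy one host image into the first sample of inTensor.
    // Assumes input_image (presumably a cv::Mat) is contiguous 8-bit RGB of exactly
    // maxImageWidth x maxImageHeight, so that it fills one sample (stride(0) bytes).
    auto image_data = inTensor.exportData<nvcv::TensorDataStridedCuda>();
    CHECK_CUDA_ERROR(cudaMemcpyAsync(image_data->basePtr(), input_image.data,
                                     image_data->stride(0), cudaMemcpyHostToDevice, stream));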
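
A single flat copy like the one above is only correct when the host image is stored contiguously. If the rows carry padding (for example an image view whose row step is larger than width times channels), the rows have to be packed first, or copied with a pitched copy such as cudaMemcpy2DAsync. The sketch below shows the host-side packing only. packRows is a hypothetical helper written for this illustration and is not part of CV-CUDA or OpenCV.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `rows` rows of `rowBytes` payload each from a padded source, where
// consecutive rows are `srcStep` bytes apart, into one contiguous buffer.
// Hypothetical helper for illustration only.
std::vector<uint8_t> packRows(const uint8_t *src, std::size_t rows, std::size_t rowBytes, std::size_t srcStep)
{
    assert(src != nullptr && rowBytes > 0 && srcStep >= rowBytes);
    std::vector<uint8_t> packed(rows * rowBytes);
    for (std::size_t r = 0; r < rows; ++r)
    {
        // Skip the padding at the end of each source row.
        std::memcpy(packed.data() + r * rowBytes, src + r * srcStep, rowBytes);
    }
    return packed;
}
```

The packed buffer can then be handed to one flat host-to-device copy of rows * rowBytes bytes.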

Resize
Resizing serves as the example for how CV-CUDA operators are used. The operator class for resizing is cvcuda::Resize. Before calling the operator, create a tensor to hold its output:

nvcv::Tensor   resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);

Then create a cvcuda::Resize object, resizeOp, and invoke its () operator:

    cvcuda::Resize resizeOp;
    resizeOp(stream, inTensor, resizedTensor, NVCV_INTERP_LINEAR);

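That is all there is to it: the resized data ends up in resizedTensor. Readers interested in how it is implemented can read the source code; this article only covers basic usage.
Most other CV-CUDA operators are designed the same way, so they are used in much the same manner.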
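
For readers who want to see what linear interpolation computes, here is a CPU reference sketch. It is not the CV-CUDA implementation: it assumes the common half-pixel-centre coordinate mapping and round-to-nearest output, and CV-CUDA's exact coordinate and rounding rules may differ. resizeBilinear is a hypothetical function name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for bilinear resize of an interleaved (HWC) 8-bit image.
// Illustrative only; not taken from the CV-CUDA source.
std::vector<uint8_t> resizeBilinear(const std::vector<uint8_t> &src, int srcW, int srcH, int channels,
                                    int dstW, int dstH)
{
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0 && channels > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH * channels);

    std::vector<uint8_t> dst(static_cast<std::size_t>(dstW) * dstH * channels);
    const float scaleX = static_cast<float>(srcW) / dstW;
    const float scaleY = static_cast<float>(srcH) / dstH;

    for (int y = 0; y < dstH; ++y)
    {
        // Source row coordinate of this destination pixel centre, clamped to the image.
        const float fy = std::min(std::max((y + 0.5f) * scaleY - 0.5f, 0.0f), srcH - 1.0f);
        const int   y0 = static_cast<int>(fy);
        const int   y1 = std::min(y0 + 1, srcH - 1);
        const float wy = fy - y0;

        for (int x = 0; x < dstW; ++x)
        {
            const float fx = std::min(std::max((x + 0.5f) * scaleX - 0.5f, 0.0f), srcW - 1.0f);
            const int   x0 = static_cast<int>(fx);
            const int   x1 = std::min(x0 + 1, srcW - 1);
            const float wx = fx - x0;

            for (int c = 0; c < channels; ++c)
            {
                auto at = [&](int yy, int xx) {
                    return static_cast<float>(src[(static_cast<std::size_t>(yy) * srcW + xx) * channels + c]);
                };
                // Blend horizontally on the two neighbouring rows, then vertically.
                const float top = at(y0, x0) * (1.0f - wx) + at(y0, x1) * wx;
                const float bot = at(y1, x0) * (1.0f - wx) + at(y1, x1) * wx;
                const float v   = top * (1.0f - wy) + bot * wy;
                dst[(static_cast<std::size_t>(y) * dstW + x) * channels + c]
                    = static_cast<uint8_t>(std::lround(v));
            }
        }
    }
    return dst;
}
```

With this mapping, stretching the one-row, single-channel image {0, 100} to four pixels yields {0, 25, 75, 100}.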
ConvertTo
Next is ConvertTo, an operator that converts the data type. First create a tensor:

nvcv::Tensor      floatTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);

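Then call convertOp():

    // out = in * alpha + beta; here alpha = 1.0f and beta = 0.0f, so values are only cast to float
    cvcuda::ConvertTo convertOp;
    convertOp(stream, resizedTensor, floatTensor, 1.0f, 0.0f);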
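
The effect of the two trailing arguments is easy to check on the CPU. The sketch below assumes the conversion is the element-wise out = in * alpha + beta and, like the CV-CUDA call, writes into an output buffer that the caller has already sized. convertToFloat is a hypothetical name, not the CV-CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for a uint8 -> float conversion with scale (alpha) and offset (beta).
// The output must be pre-allocated with the same element count as the input.
void convertToFloat(const std::vector<uint8_t> &in, std::vector<float> &out, float alpha, float beta)
{
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = static_cast<float>(in[i]) * alpha + beta;
    }
}
```

Passing alpha = 1.0f / 255.0f and beta = 0.0f maps pixel values from [0, 255] to [0, 1], which is what networks trained on normalized images usually expect.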

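The remaining operators are not covered one by one. The listing below shows how they are used, including normalization and the channel layout conversion.

    // Resize to the dimensions of input layer of network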
    nvcv::Tensor   resizedTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGB8);
    cvcuda::Resize resizeOp;
    resizeOp(stream, inTensor, resizedTensor, NVCV_INTERP_LINEAR);

    // Convert to data format expected by network (F32). Apply scale 1/255f.
    nvcv::Tensor      floatTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);
    cvcuda::ConvertTo convertOp;
    convertOp(stream, resizedTensor, floatTensor, 1.0f / 255.f, 0.0f);

    // The input to the network needs to be normalized based on the mean and std deviation values
    // to standardize the input data.

    // Create a Tensor to store the standard deviation values for R,G,B
    nvcv::Tensor::Requirements reqsScale       = nvcv::Tensor::CalcRequirements(1, {1, 1}, nvcv::FMT_RGBf32);
    int64_t                    scaleBufferSize = CalcTotalSizeBytes(nvcv::Requirements{reqsScale.mem}.cudaMem());
    nvcv::TensorDataStridedCuda::Buffer bufScale;
    std::copy(reqsScale.strides, reqsScale.strides + NVCV_TENSOR_MAX_RANK, bufScale.strides);
    CHECK_CUDA_ERROR(cudaMalloc(&bufScale.basePtr, scaleBufferSize));
    nvcv::TensorDataStridedCuda scaleIn(nvcv::TensorShape{reqsScale.shape, reqsScale.rank, reqsScale.layout},
                                        nvcv::DataType{reqsScale.dtype}, bufScale);

    nvcv::Tensor stddevTensor = TensorWrapData(scaleIn);

    // Create a Tensor to store the mean values for R,G,B
    nvcv::TensorDataStridedCuda::Buffer bufBase;
    nvcv::Tensor::Requirements          reqsBase       = nvcv::Tensor::CalcRequirements(1, {1, 1}, nvcv::FMT_RGBf32);
    int64_t                             baseBufferSize = CalcTotalSizeBytes(nvcv::Requirements{reqsBase.mem}.cudaMem());
    std::copy(reqsBase.strides, reqsBase.strides + NVCV_TENSOR_MAX_RANK, bufBase.strides);
    CHECK_CUDA_ERROR(cudaMalloc(&bufBase.basePtr, baseBufferSize));
    nvcv::TensorDataStridedCuda baseIn(nvcv::TensorShape{reqsBase.shape, reqsBase.rank, reqsBase.layout},
                                       nvcv::DataType{reqsBase.dtype}, bufBase);

    nvcv::Tensor meanTensor = TensorWrapData(baseIn);

    // Copy the values from Host to Device
    // The R,G,B scale and mean will be applied to all the pixels across the batch of input images
    float stddev[3]  = {0.229f, 0.224f, 0.225f};
    float mean[3]    = {0.485f, 0.456f, 0.406f};
    auto  meanData   = meanTensor.exportData<nvcv::TensorDataStridedCuda>();
    auto  stddevData = stddevTensor.exportData<nvcv::TensorDataStridedCuda>();

    // Flag to set the scale value as standard deviation i.e use 1/scale
    uint32_t flags = CVCUDA_NORMALIZE_SCALE_IS_STDDEV;
    CHECK_CUDA_ERROR(cudaMemcpyAsync(stddevData->basePtr(), stddev, 3 * sizeof(float), cudaMemcpyHostToDevice, stream));
    CHECK_CUDA_ERROR(cudaMemcpyAsync(meanData->basePtr(), mean, 3 * sizeof(float), cudaMemcpyHostToDevice, stream));

    nvcv::Tensor normTensor(batchSize, {inputLayerWidth, inputLayerHeight}, nvcv::FMT_RGBf32);

    // Normalize
    cvcuda::Normalize normOp;
    normOp(stream, floatTensor, meanTensor, stddevTensor, normTensor, 1.0f, 0.0f, 0.0f, flags);

    // Convert the data layout from interleaved (NHWC) to planar (NCHW),
    // writing into the network input tensor created earlier
    cvcuda::Reformat reformatOp;
    reformatOp(stream, normTensor, inputLayerTensor);
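
To sanity-check GPU results, the last two steps can be reproduced on the CPU. This is a minimal sketch, assuming the normalization reduces to (x - mean) / stddev per channel (global scale 1, shift 0, epsilon 0, with the scale tensor treated as a standard deviation, as in the listing) and that Reformat only rearranges interleaved RGB into three planes. normalizeToPlanar is a hypothetical helper, not part of CV-CUDA, and it handles a single image.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference: per-channel (x - mean) / stddev, then interleaved HWC -> planar CHW.
// Illustrative only; handles one 3-channel float image.
std::vector<float> normalizeToPlanar(const std::vector<float> &hwc, int width, int height,
                                     const std::array<float, 3> &mean, const std::array<float, 3> &stddev)
{
    assert(width > 0 && height > 0);
    const std::size_t pixels = static_cast<std::size_t>(width) * height;
    assert(hwc.size() == pixels * 3);

    std::vector<float> chw(hwc.size());
    for (std::size_t p = 0; p < pixels; ++p)
    {
        for (std::size_t c = 0; c < 3; ++c)
        {
            // Source is pixel-major (R, G, B together); destination holds one full plane per channel.
            chw[c * pixels + p] = (hwc[p * 3 + c] - mean[c]) / stddev[c];
        }
    }
    return chw;
}
```

Comparing this output with inputLayerTensor copied back to the host (after synchronizing the stream) is a quick way to confirm that the mean, stddev, and layout settings are what the model expects.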