I'm trying to get some code working across two GPUs on a single server. I've enabled peer-to-peer access, each GPU has a stream associated with it, and the gc_snd_device / gc_rcv_device pointers are allocated on each GPU with cudaMalloc. My code essentially does:
cudaSetDevice(gpu1);
// fire off several CUDA kernels on gpu1
cudaMemcpyAsync(gc_rcv_device[gpu2], gc_snd_device[gpu1], size_comm3_device,
                cudaMemcpyDeviceToDevice, gpu_stream[gpu1]);
cudaStreamSynchronize(gpu_stream[gpu1]);

cudaSetDevice(gpu2);
// fire off several CUDA kernels on gpu2
cudaMemcpyAsync(gc_rcv_device[gpu1], gc_snd_device[gpu2], size_comm3_device,
                cudaMemcpyDeviceToDevice, gpu_stream[gpu2]);
This second cudaMemcpyAsync returns 'invalid resource handle' before I even get to the following sync call:
cudaStreamSynchronize(gpu_stream[gpu2]);
Why does the code on gpu1 seem to work, while I get 'invalid resource handle' on gpu2? Any advice appreciated. Thanks.
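In case it helps, the setup is roughly the following. This is a sketch rather than my exact code: the loop, the double element type, and the hard-coded device count are simplifications, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define N_GPUS 2

cudaStream_t gpu_stream[N_GPUS];
double *gc_snd_device[N_GPUS], *gc_rcv_device[N_GPUS];
size_t size_comm3_device = /* ... set elsewhere ... */ 0;

void setup(void)
{
    for (int g = 0; g < N_GPUS; g++) {
        cudaSetDevice(g);
        // stream is created while device g is current
        cudaStreamCreate(&gpu_stream[g]);
        cudaMalloc((void **)&gc_snd_device[g], size_comm3_device);
        cudaMalloc((void **)&gc_rcv_device[g], size_comm3_device);
        // enable peer access to the other GPU
        cudaDeviceEnablePeerAccess(1 - g, 0);
    }
}
```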
-Jeff