cudaMemcpyAsync returns 'invalid resource handle'

I’m trying to get code working across two GPUs on a single server. I’ve enabled peer-to-peer access, and each GPU has a stream associated with it. The gc_snd_device and gc_rcv_device pointers are allocated on each GPU with cudaMalloc. My code essentially does:

cudaSetDevice(gpu1);
// fire off several CUDA kernels on gpu1
cudaMemcpyAsync(gc_rcv_device[gpu2], gc_snd_device[gpu1], size_comm3_device,
                cudaMemcpyDeviceToDevice, gpu_stream[gpu1]);
cudaStreamSynchronize(gpu_stream[gpu1]);

cudaSetDevice(gpu2);
// fire off several CUDA kernels on gpu2
cudaMemcpyAsync(gc_rcv_device[gpu1], gc_snd_device[gpu2], size_comm3_device,
                cudaMemcpyDeviceToDevice, gpu_stream[gpu2]);

This returns ‘invalid resource handle’ before I even reach the following sync call:

cudaStreamSynchronize(gpu_stream[gpu2]);

Why does the code on gpu1 seem to work, while I get the ‘invalid resource handle’ on gpu2? Advice appreciated. Thanks.

-Jeff

An ‘invalid resource handle’ error in this setting often means that you are attempting to use a stream that is not associated with the current device. A stream is permanently bound to whichever device was current when cudaStreamCreate was called, and launching work into it while a different device is current produces this error.

Here, that would mean that when you created the stream gpu_stream[gpu2], the device that was current was actually gpu1, not gpu2.
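A minimal sketch of the correct creation pattern (assuming two device IDs and a gpu_stream array as in your code; names are illustrative, not your actual code):

```cuda
// Each stream must be created while its intended device is current;
// the stream is permanently bound to that device.
for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaStreamCreate(&gpu_stream[dev]);
}
```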

I imagine you may think that is not the case. I don’t have any further ideas. If you create a short, complete code that demonstrates the issue, you will probably spot the problem before you even need to show the code to anyone else.
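For reference, here is a short, complete sketch of the two-GPU pattern with error checking. The names (snd, rcv, stream) and the buffer size are my assumptions, not your actual code, but the device/stream discipline is the point:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for this sketch.
#define CHECK(call) do {                                        \
    cudaError_t err = (call);                                   \
    if (err != cudaSuccess) {                                   \
        printf("%s:%d: %s\n", __FILE__, __LINE__,               \
               cudaGetErrorString(err));                        \
        return 1;                                               \
    }                                                           \
} while (0)

int main() {
    const int gpu1 = 0, gpu2 = 1;
    const size_t bytes = 1 << 20;
    float *snd[2], *rcv[2];
    cudaStream_t stream[2];

    // Allocate buffers and create a stream on each device;
    // each stream is bound to the device current at creation.
    for (int d = 0; d < 2; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(&snd[d], bytes));
        CHECK(cudaMalloc(&rcv[d], bytes));
        CHECK(cudaStreamCreate(&stream[d]));
    }

    // Enable peer access in both directions.
    CHECK(cudaSetDevice(gpu1));
    CHECK(cudaDeviceEnablePeerAccess(gpu2, 0));
    CHECK(cudaSetDevice(gpu2));
    CHECK(cudaDeviceEnablePeerAccess(gpu1, 0));

    // gpu1 -> gpu2, issued into gpu1's stream while gpu1 is current.
    CHECK(cudaSetDevice(gpu1));
    CHECK(cudaMemcpyAsync(rcv[gpu2], snd[gpu1], bytes,
                          cudaMemcpyDeviceToDevice, stream[gpu1]));
    CHECK(cudaStreamSynchronize(stream[gpu1]));

    // gpu2 -> gpu1, issued into gpu2's stream while gpu2 is current.
    CHECK(cudaSetDevice(gpu2));
    CHECK(cudaMemcpyAsync(rcv[gpu1], snd[gpu2], bytes,
                          cudaMemcpyDeviceToDevice, stream[gpu2]));
    CHECK(cudaStreamSynchronize(stream[gpu2]));

    printf("done\n");
    return 0;
}
```

If this sketch runs cleanly on your system but your real code does not, comparing the two will usually reveal where a stream and the current device get out of sync.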