map kernel and copy events to logical stream ID instead of HSA queue ID (#1424)

Closed
hannaxu wants to merge 1 commit into pytorch:main from hannaxu:export-D107306915
@hannaxu hannaxu commented Jun 5, 2026

Contributor

Summary:

Previously, resourceId() propagated the HSA queue ID for kernel events rather than the logical stream ID. The AMD default is GPU_MAX_HW_QUEUES=4, which explains why stream interleaving was noticeable only when the total number of workers was >= 5, a common situation on models with additional remote workers. In that case 5 or more streams were mapped onto 4 HSA queues, so events from different streams were interleaved on the same track. The separate backfillAsyncCopyStreams for H2D/D2D copies was also propagating HW queue IDs.
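
The collision can be illustrated with a minimal sketch. The round-robin assignment and the names `queueForStream` and `sharedQueueCount` are assumptions made for illustration only; the runtime's real stream-to-queue scheduling is not described in this PR. Whatever the policy, placing 5 streams on 4 queues forces at least two streams onto one queue.

```cpp
#include <cassert>
#include <map>
#include <set>

// Mirrors the AMD default GPU_MAX_HW_QUEUES=4 mentioned above.
constexpr int kMaxHwQueues = 4;

// Hypothetical policy: assign logical streams to HW queues round-robin.
int queueForStream(int streamIndex) {
  assert(streamIndex >= 0);
  return streamIndex % kMaxHwQueues;
}

// Counts how many HW queues end up carrying more than one logical stream.
int sharedQueueCount(int numStreams) {
  std::map<int, std::set<int>> streamsPerQueue;
  for (int s = 0; s < numStreams; ++s) {
    streamsPerQueue[queueForStream(s)].insert(s);
  }
  int shared = 0;
  for (const auto& entry : streamsPerQueue) {
    if (entry.second.size() > 1) {
      ++shared;
    }
  }
  return shared;
}
```

Under this assumed policy, `sharedQueueCount(4)` finds no shared queue, while `sharedQueueCount(5)` finds one: a trace keyed by queue ID would draw two streams on that single track.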

Switch to int64_t for stream IDs to avoid truncation; HW queue IDs fit fine in int32. Return streamId with a queueId fallback in resourceId(). Async rows carry only the HW queue, so we map correlation --> HIP stream from the runtime rows: the real HIP stream is recovered from the correlated runtime row, and both kernels and copies are resolved to a stream ID via backfillAsyncStreams. Remove the separate backfillAsyncCopyStreams.
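
The correlation-based backfill might look roughly like the following sketch. The row types, field names, the `-1` sentinel, and the function `backfillStreams` are hypothetical stand-ins, not kineto's actual data structures; only the idea (look up the HIP stream by correlation ID from the runtime rows, stored as a 64-bit value) comes from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sentinel meaning "stream not resolved yet".
constexpr int64_t kUnresolvedStream = -1;

// Hypothetical runtime (API call) row: knows the real HIP stream.
struct RuntimeRow {
  uint64_t correlationId;
  int64_t hipStream;  // raw stream handle kept in 64 bits to avoid truncation
};

// Hypothetical async (kernel or copy) row: carries only the HW queue.
struct AsyncRow {
  uint64_t correlationId;
  int32_t queueId;   // HW queue IDs fit in 32 bits
  int64_t streamId;  // filled in by the backfill
};

// Resolves kernels and copies alike through one correlation --> stream map.
void backfillStreams(const std::vector<RuntimeRow>& runtimeRows,
                     std::vector<AsyncRow>& asyncRows) {
  std::unordered_map<uint64_t, int64_t> streamByCorrelation;
  for (const auto& row : runtimeRows) {
    // The sentinel must not collide with a real handle.
    assert(row.hipStream != kUnresolvedStream);
    streamByCorrelation[row.correlationId] = row.hipStream;
  }
  for (auto& row : asyncRows) {
    auto it = streamByCorrelation.find(row.correlationId);
    if (it != streamByCorrelation.end()) {
      row.streamId = it->second;
    }
  }
}
```

Async rows without a matching runtime row keep the sentinel in this sketch, which leaves room for a later fallback to the queue ID.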

Process

  1. Infer stream from queue when a queue maps unambiguously to one stream
  2. Remap raw 64-bit HIP stream pointers to per-device indices, since Perfetto misrenders large ID values onto the same track
  3. Kernels and copies share one per-device index space, so ops on the same HIP stream land on the same track. resourceId() returns the resolved streamId, with the unambiguous queueId <--> stream map as the fallback
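
The three steps above can be sketched as follows. All class and function names here (`QueueStreamInference`, `StreamIndexer`, `resolveResourceId`), the `-1` sentinel, and the first-seen index order are assumptions for illustration; the PR text does not show the actual implementation, and the final fallback to the raw queue ID is one reading of "streamId with queueId fallback".

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <unordered_map>

constexpr int64_t kUnresolvedStream = -1;  // hypothetical sentinel

// Step 1: infer a stream from a queue only when the mapping is unambiguous.
class QueueStreamInference {
 public:
  void observe(int32_t queueId, int64_t stream) {
    seen_[queueId].insert(stream);
  }
  // Returns true and writes *stream if queueId was seen with exactly one stream.
  bool infer(int32_t queueId, int64_t* stream) const {
    assert(stream != nullptr);
    auto it = seen_.find(queueId);
    if (it == seen_.end() || it->second.size() != 1) {
      return false;
    }
    *stream = *it->second.begin();
    return true;
  }

 private:
  std::map<int32_t, std::set<int64_t>> seen_;
};

// Steps 2 and 3: map raw 64-bit handles to small indices, one index space per
// device, shared by kernels and copies.
class StreamIndexer {
 public:
  int64_t indexFor(int32_t deviceId, int64_t rawStream) {
    auto& perDevice = indices_[deviceId];
    auto it = perDevice.find(rawStream);
    if (it != perDevice.end()) {
      return it->second;
    }
    const int64_t next = static_cast<int64_t>(perDevice.size());
    perDevice.emplace(rawStream, next);
    return next;
  }

 private:
  std::map<int32_t, std::unordered_map<int64_t, int64_t>> indices_;
};

// resourceId()-style lookup: resolved stream first, then the unambiguous
// queue --> stream inference, then the queue ID itself as a last resort.
int64_t resolveResourceId(int64_t streamId, int32_t queueId,
                          const QueueStreamInference& inference) {
  if (streamId != kUnresolvedStream) {
    return streamId;
  }
  int64_t inferred = kUnresolvedStream;
  if (inference.infer(queueId, &inferred)) {
    return inferred;
  }
  return queueId;
}
```

Because the indexer is keyed by device and raw handle only, a kernel and a copy issued on the same HIP stream receive the same small index and therefore share a track.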

The next diff, D107568026, adds the HSA queue ID as a field in the trace, but not as the track provenance.

Differential Revision: D107306915

@meta-cla meta-cla Bot added the cla signed label Jun 5, 2026
meta-codesync Bot commented Jun 5, 2026

@hannaxu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107306915.

@meta-codesync meta-codesync Bot changed the title from "map kernel and copy events to logical stream ID instead of HSA queue ID" to "map kernel and copy events to logical stream ID instead of HSA queue ID (#1424)" Jun 5, 2026
@hannaxu hannaxu force-pushed the export-D107306915 branch 2 times, most recently from 1484cd9 to 7275e8a Compare June 5, 2026 18:33
@hannaxu hannaxu force-pushed the export-D107306915 branch from 7275e8a to 237fc78 Compare June 9, 2026 15:27
hannaxu added a commit to hannaxu/kineto that referenced this pull request Jun 9, 2026

@sanrise sanrise left a comment

Contributor

LGTM, thanks for adding this.

@meta-codesync meta-codesync Bot closed this in 25856f7 Jun 10, 2026
@meta-codesync meta-codesync Bot added the Merged label Jun 10, 2026
meta-codesync Bot commented Jun 10, 2026

This pull request has been merged in 25856f7.

scotts added a commit to scotts/pytorch that referenced this pull request Jun 16, 2026
Includes the following commits:

- surface HSA queue ID (pytorch/kineto#1425) 1d43601
- Add TypedMetadata to CUPTI activities (pytorch/kineto#1434) 461a26c
- Move around methods so they are accurately described by the comments (pytorch/kineto#1435) ff95265
- Introduce TypedMetadata structs (pytorch/kineto#1433) 965a14d
- Add unit tests for sync/async interleaving behavior (pytorch/kineto#1431) 67612e6
- Restrict on-demand trace output path to a safe directory (pytorch/kineto#1426) 9ea826e
- Add a cancellation state to the state machine and remove syncTraceActive_ (pytorch/kineto#1355) 3289139
- Move state machine from profiler into async handler (pytorch/kineto#1352) eb578c5
- map kernel and copy events to logical stream ID instead of HSA queue ID (pytorch/kineto#1424) 25856f7
- Speedup and fix PyTorch CI build (pytorch/kineto#1428) e675bbb
- Add logging for async rejections (pytorch/kineto#1422) 14becfb
- Make sure we clear the logger observer before starting a new session (pytorch/kineto#1387) a8cae92
- Add a scope guard for logging to help us capture all exit paths (pytorch/kineto#1383) ce13b13
- Remove roctracer support (pytorch/kineto#1419) 2a3a002
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jun 16, 2026
Pull Request resolved: #187440
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007