map kernel and copy events to logical stream ID instead of HSA queue ID (#1424) by hannaxu · Pull Request #1424 · pytorch/kineto

hannaxu · 2026-06-05T15:42:12Z

Summary:

Previously, `resourceId()` propagated the HSA queue ID for kernel events rather than the logical stream ID. The AMD default is `GPU_MAX_HW_QUEUES=4`, which explains why stream interleaving was noticeable only when the total number of workers was >= 5, which is common for models with additional remote workers. In that case, 5 or more streams were being mapped onto 4 HSA queues, so events from different streams interleaved. The separate `backfillAsyncCopyStreams` for H2D/D2D copies was also propagating HW queue IDs.
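To illustrate the counting argument above, here is a minimal sketch of why sharing becomes unavoidable once streams outnumber hardware queues. The round-robin assignment and the names `assignStreamsToQueues` and `anyQueueShared` are assumptions made for illustration; the actual ROCm runtime policy may differ.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical model: place each logical stream on a hardware queue
// round-robin. This is NOT the real runtime policy; it only shows that
// some queue must carry two streams once streams outnumber queues.
std::map<int, std::vector<int>> assignStreamsToQueues(int numStreams, int numQueues) {
  assert(numQueues > 0);
  std::map<int, std::vector<int>> queueToStreams;
  for (int stream = 0; stream < numStreams; ++stream) {
    queueToStreams[stream % numQueues].push_back(stream);
  }
  return queueToStreams;
}

// Returns true if any hardware queue carries more than one logical stream.
bool anyQueueShared(const std::map<int, std::vector<int>>& queueToStreams) {
  for (const auto& entry : queueToStreams) {
    if (entry.second.size() > 1) {
      return true;
    }
  }
  return false;
}
```

With 4 streams on 4 queues no queue is shared, so a track keyed by queue ID still shows one stream. With 5 streams, at least one queue carries two streams, and a track keyed by queue ID mixes their events.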

Switch to `int64_t` for stream IDs to avoid truncation; HW queue IDs fit in `int32`. Return `streamId`, with `queueId` as a fallback, in `resourceId()`. Async rows carry only the HW queue, so we build a correlation --> HIP stream map from the runtime rows and use it to recover the real HIP stream, resolving both kernels and copies to stream IDs via `backfillAsyncStreams`. Remove the separate `backfillAsyncCopyStreams`.
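The correlation-based recovery described above could look roughly like the sketch below. The row structs, the `kUnresolvedStream` sentinel, and the function name `backfillStreamsSketch` are hypothetical and are not the types or signatures used in kineto; only the idea (look up each async row's correlation ID in a map built from the runtime rows) comes from the summary.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sentinel for "stream not yet resolved".
constexpr int64_t kUnresolvedStream = -1;

// Hypothetical runtime (API call) row: knows the HIP stream it was issued on.
struct RuntimeRow {
  uint64_t correlation;
  int64_t stream;  // 64-bit so a pointer-sized stream handle is not truncated
};

// Hypothetical async (kernel or copy) row: carries only the HW queue.
struct AsyncRow {
  uint64_t correlation;
  int32_t queueId;
  int64_t streamId;
};

// Fill in streamId for every async row whose correlation ID matches a
// runtime row. Returns how many rows were resolved.
int backfillStreamsSketch(const std::vector<RuntimeRow>& runtimeRows,
                          std::vector<AsyncRow>& asyncRows) {
  std::unordered_map<uint64_t, int64_t> correlationToStream;
  for (const RuntimeRow& row : runtimeRows) {
    correlationToStream[row.correlation] = row.stream;
  }
  int resolved = 0;
  for (AsyncRow& row : asyncRows) {
    auto it = correlationToStream.find(row.correlation);
    if (it != correlationToStream.end()) {
      row.streamId = it->second;
      ++resolved;
    }
  }
  return resolved;
}
```

Because kernels and copies go through the same lookup, a single pass can handle both, which is consistent with removing the separate copy-only backfill.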

Process

1. Infer the stream from the queue when a queue maps unambiguously to one stream.
2. Remap raw 64-bit HIP stream pointers to per-device indices, since Perfetto misrenders large ID values onto the same track.
3. Kernels and copies share one per-device index space, so ops on the same HIP stream land on the same track. `resourceId()` returns the resolved `streamId`, falling back to the queue-to-stream map when the queue maps unambiguously to one stream.
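The three steps above can be sketched as follows. Everything here is an illustrative assumption rather than the code in this PR: the class `StreamIndexer`, the functions `inferStreamFromQueue` and `resourceIdSketch`, the `kUnknownStream` sentinel, and the choice to start indices at 0 are all hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Hypothetical sentinel for "no stream known".
constexpr int64_t kUnknownStream = -1;

// Step 2: remap raw 64-bit stream handles to small per-device indices.
// Kernels and copies would share one instance, so the same handle always
// gets the same index (step 3).
class StreamIndexer {
 public:
  int64_t indexFor(int device, uint64_t rawStream) {
    auto& perDevice = indices_[device];
    auto it = perDevice.find(rawStream);
    if (it != perDevice.end()) {
      return it->second;
    }
    const int64_t next = static_cast<int64_t>(perDevice.size());
    perDevice.emplace(rawStream, next);
    return next;
  }

 private:
  std::map<int, std::map<uint64_t, int64_t>> indices_;
};

// Step 1: infer the stream only when the queue was seen with exactly one stream.
int64_t inferStreamFromQueue(const std::map<int32_t, std::set<int64_t>>& queueToStreams,
                             int32_t queueId) {
  auto it = queueToStreams.find(queueId);
  if (it == queueToStreams.end() || it->second.size() != 1) {
    return kUnknownStream;
  }
  return *it->second.begin();
}

// Fallback order assumed here: resolved stream, then unambiguous
// queue-to-stream inference, then the raw queue ID.
int64_t resourceIdSketch(int64_t resolvedStream, int32_t queueId,
                         const std::map<int32_t, std::set<int64_t>>& queueToStreams) {
  if (resolvedStream != kUnknownStream) {
    return resolvedStream;
  }
  const int64_t inferred = inferStreamFromQueue(queueToStreams, queueId);
  return inferred != kUnknownStream ? inferred : static_cast<int64_t>(queueId);
}
```

Refusing to infer a stream from a queue that carried several streams avoids guessing; in that case the sketch falls back to the queue ID, matching the "streamId with queueId fallback" behavior described in the summary.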

The next diff, D107568026, adds the HSA queue ID as a field in the trace, but does not use it as the track provenance.

Differential Revision: D107306915

meta-codesync · 2026-06-05T15:42:20Z

@hannaxu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107306915.


sanrise

LGTM, thanks for adding this -

meta-codesync · 2026-06-10T18:19:36Z

This pull request has been merged in 25856f7.


Includes the following commits:

- surface HSA queue ID (pytorch/kineto#1425) 1d43601
- Add TypedMetadata to CUPTI activities (pytorch/kineto#1434) 461a26c
- Move around methods so they are accurately described by the comments (pytorch/kineto#1435) ff95265
- Introduce TypedMetadata structs (pytorch/kineto#1433) 965a14d
- Add unit tests for sync/async interleaving behavior (pytorch/kineto#1431) 67612e6
- Restrict on-demand trace output path to a safe directory (pytorch/kineto#1426) 9ea826e
- Add a cancellation state to the state machine and remove syncTraceActive_ (pytorch/kineto#1355) 3289139
- Move state machine from profiler into async handler (pytorch/kineto#1352) eb578c5
- map kernel and copy events to logical stream ID instead of HSA queue ID (pytorch/kineto#1424) 25856f7
- Speedup and fix PyTorch CI build (pytorch/kineto#1428) e675bbb
- Add logging for async rejections (pytorch/kineto#1422) 14becfb
- Make sure we clear the logger observer before starting a new session (pytorch/kineto#1387) a8cae92
- Add a scope guard for logging to help us capture all exit paths (pytorch/kineto#1383) ce13b13
- Remove roctracer support (pytorch/kineto#1419) 2a3a002

Pull Request resolved: #187440
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007

meta-cla Bot added the cla signed label Jun 5, 2026

meta-codesync Bot added the meta-exported label Jun 5, 2026

facebook-github-tools Bot added the module: rocm label Jun 5, 2026

meta-codesync Bot changed the title ~~map kernel and copy events to logical stream ID instead of HSA queue ID~~ map kernel and copy events to logical stream ID instead of HSA queue ID (#1424) Jun 5, 2026

hannaxu force-pushed the export-D107306915 branch 2 times, most recently from 1484cd9 to 7275e8a Compare June 5, 2026 18:33

hannaxu force-pushed the export-D107306915 branch from 7275e8a to 237fc78 Compare June 9, 2026 15:27

sanrise approved these changes Jun 9, 2026

meta-codesync Bot closed this in 25856f7 Jun 10, 2026

meta-codesync Bot added the Merged label Jun 10, 2026

scotts mentioned this pull request Jun 16, 2026

Update third_party/kineto submodule to 1d43601 pytorch/pytorch#187438

Closed

scotts mentioned this pull request Jun 16, 2026

Update third_party/kineto submodule to 1d43601 pytorch/pytorch#187440

Closed

hannaxu wants to merge 1 commit into pytorch:main from hannaxu:export-D107306915
