@rillomas
Contributor

@rillomas rillomas commented Nov 27, 2025

We are currently seeing a mismatch in all TOPK_MOE unit tests on our upcoming platform. This seems to be caused by an implicit assumption about how subgroups map onto a workgroup. For example, the current shader assumes that the lanes of each subgroup are contiguous along the x dimension of the workgroup, which is not guaranteed (though this may not be the exact reason for the mismatch we're seeing now).

I am currently experimenting with a fix.

@rillomas changed the title from "vulkan; Fix mismatch in TOPK_MOE unit test" to "vulkan: Fix mismatch in TOPK_MOE unit test" on Nov 27, 2025
@github-actions bot added the labels "Vulkan" (Issues specific to the Vulkan backend) and "ggml" (changes relating to the ggml tensor library for machine learning) on Nov 27, 2025
@jeffbolznv
Collaborator

I agree a change like this is necessary due to this spec language: "There is no direct relationship between SubgroupLocalInvocationId and LocalInvocationId or LocalInvocationIndex". I guess it's a problem specifically for this shader because it (1) uses local_size_y > 1 and (2) uses subgroup instructions?

I wonder if a similar problem could be affecting the shader in #17389, though it doesn't use local_size_y > 1 so maybe not.

@rillomas
Contributor Author

Thanks for your comment. Yes, this may be one of those issues that were hidden due to the driver's default behavior.

@rillomas
Contributor Author

rillomas commented Nov 28, 2025

It turns out that on our failing environment a single subgroup is mapped to a 2D region of the workgroup. For subgroup ID 0 we get gl_LocalInvocationID.x: 0-7 and gl_LocalInvocationID.y: 0-3, whereas previously it was gl_LocalInvocationID.x: 0-31 and gl_LocalInvocationID.y: 0. Commit 62ed513 passes all test-backend-ops tests on the target environment, but I still need to clean it up and test other environments.
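
To make the effect of the two mappings concrete, here is a small CPU-side Python sketch (not shader code and not the change in this PR). It assumes a 32x4 workgroup with a subgroup size of 32, and it assumes a row-major lane order inside the 8x4 tile; neither is stated in this thread. The helper names are hypothetical. It emulates a subgroup max reduction and shows that indexing data by the local invocation ID only yields per-row results when each subgroup happens to span one row in x, while indexing by subgroup ID and lane (one possible alternative) does not depend on the mapping.

```python
WG_X, WG_Y = 32, 4  # assumed workgroup size (local_size_x, local_size_y)

def linear_mapping():
    # (x, y) -> (subgroup_id, lane): each subgroup spans one full row in x.
    return {(x, y): (y, x) for y in range(WG_Y) for x in range(WG_X)}

def tiled_mapping(tile_x=8, tile_y=4):
    # (x, y) -> (subgroup_id, lane): each subgroup covers a tile_x by tile_y tile.
    # The lane order inside the tile is an assumption made for this sketch.
    tiles_per_row = WG_X // tile_x
    return {
        (x, y): ((y // tile_y) * tiles_per_row + x // tile_x,
                 (y % tile_y) * tile_x + x % tile_x)
        for y in range(WG_Y) for x in range(WG_X)
    }

def subgroup_max(values, mapping):
    # Emulates a subgroup max: every invocation receives the max over its subgroup.
    best = {}
    for inv, (sg, _lane) in mapping.items():
        best[sg] = max(best.get(sg, float("-inf")), values[inv])
    return {inv: best[sg] for inv, (sg, _lane) in mapping.items()}

def row_max_by_local_id(data, mapping):
    # Implicit assumption: row = local invocation y, column = local invocation x.
    values = {(x, y): data[y][x] for (x, y) in mapping}
    return subgroup_max(values, mapping)

def row_max_by_subgroup_id(data, mapping):
    # Alternative indexing: row = subgroup ID, column = lane within the subgroup.
    values = {inv: data[sg][lane] for inv, (sg, lane) in mapping.items()}
    return subgroup_max(values, mapping)

# Test data: element (row, col) holds row * 100 + col, so the max of row r is r * 100 + 31.
data = [[row * 100 + col for col in range(WG_X)] for row in range(WG_Y)]

print(row_max_by_local_id(data, linear_mapping())[(0, 0)])    # 31: row 0 max
print(row_max_by_local_id(data, tiled_mapping())[(0, 0)])     # 307: rows 0-3 got mixed
print(row_max_by_subgroup_id(data, tiled_mapping())[(0, 0)])  # 31: row 0 max again
```

With the tiled mapping, the invocation at (0, 0) shares its subgroup with invocations from rows 0-3, so a reduction over values picked by local invocation ID combines several rows, which is consistent with the kind of mismatch described above.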

@rillomas force-pushed the fix-topk-moe-mismatch branch from 42ace01 to 4404be4 on November 28, 2025 07:43