Conversation

@Cunxiao2002 (Contributor) commented Nov 5, 2025

Summary by CodeRabbit

  • New Features

    • Added random number generation support for tensor buffers, enabling 1D and 2D buffer population with Philox-based random values.
  • Tests

    • Added comprehensive test suite validating random number generation across multiple implementations and configurations with numerical accuracy checks.

@coderabbitai bot (Contributor) commented Nov 5, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

The PR introduces Philox-based random number generation support to TileLang, spanning Python API layer, C++ intrinsic registration, CUDA codegen integration, and device-side implementation, alongside comprehensive 1D and 2D test cases validating cross-implementation consistency.

Changes

  • Python API & Tests (tilelang/language/__init__.py, tilelang/language/random.py):
    Added a public rand() function for Philox RNG with shape validation and an internal _rand_parallel_impl() macro wrapper; re-exported rand at the package level.
  • Test Suite (testing/python/language/test_tilelang_language_rand.py):
    New test module with 1D/2D RNG kernels, cross-validation against Triton and Torch, a seeded dropout path, and CUDA-gated test execution.
  • C++ Builtin Infrastructure (src/op/builtin.h, src/op/builtin.cc):
    Added the philox_rand() intrinsic declaration and TL builtin registration with a 6-input signature and an opaque call effect.
  • CUDA Codegen (src/target/codegen_cuda.h, src/target/codegen_cuda.cc):
    Introduced a need_random_h_ flag; integrated philox_rand intrinsic handling into CallNode emission and the Finish phase, including conditional emission of the random.h header.
  • CUDA Device Implementation (src/tl_templates/cuda/random.h):
    Added the Philox RNG kernel (philox_rand) and helper functions (umulhi_uint32, philox_impl_device, uint32_to_uniform_float_device), with per-thread work distribution and bounds checking.
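
For orientation, here is a minimal sketch of the new API in use, adapted from the 1D test kernel quoted later in this thread (illustrative only; the authoritative signatures are in the diffs below):

import tilelang.language as T

M = 1024

@T.prim_func
def rand_kernel(A: T.Tensor((M,), "float32")):
    # A single block fills a fragment with Philox-generated uniforms,
    # then copies the values out to global memory.
    with T.Kernel(1, threads=1024) as bx:
        rand_buffer = T.alloc_fragment((M,), "float32")
        T.rand(rand_buffer, 42)  # seed = 42
        for i in T.Parallel(M):
            A[i] = rand_buffer[i]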

Sequence Diagram

sequenceDiagram
    participant PyUser as Python User
    participant PyAPI as tilelang.language.rand()
    participant Macro as _rand_parallel_impl()
    participant TVMCall as T.call_intrin(philox_rand)
    participant Codegen as CUDA Codegen
    participant DevKernel as Device Kernel (philox_rand)
    
    PyUser->>PyAPI: rand(buffer, seed, n_rounds)
    activate PyAPI
    PyAPI->>PyAPI: Validate shape (1D or 2D)
    PyAPI->>PyAPI: Compute total_elems, block_m, block_n
    PyAPI->>Macro: _rand_parallel_impl(buffer, seed, ...)
    deactivate PyAPI
    
    activate Macro
    Macro->>TVMCall: T.call_intrin(philox_rand, ...)
    deactivate Macro
    
    TVMCall->>Codegen: Emit CUDA kernel call
    activate Codegen
    Codegen->>Codegen: Set need_random_h_ flag
    Codegen->>Codegen: Emit extern call to tl::philox_rand
    Codegen->>Codegen: Include random.h in Finish
    deactivate Codegen
    
    Codegen->>DevKernel: Execute on CUDA device
    activate DevKernel
    DevKernel->>DevKernel: Initialize Philox state from seed
    DevKernel->>DevKernel: Run philox_impl_device (n_rounds)
    DevKernel->>DevKernel: Convert uint32 → uniform float
    DevKernel->>DevKernel: Write to output buffer
    deactivate DevKernel

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • src/tl_templates/cuda/random.h: Device-side Philox implementation with bit-manipulation logic (umulhi_uint32, state transforms) requires verification of statistical soundness and numerical correctness.
  • tilelang/language/random.py: Shape validation logic and parameter derivation (block_m, block_n calculation) merit careful review.
  • src/target/codegen_cuda.cc/h: Integration points with existing codegen patterns; verify flag lifecycle and header inclusion correctness.
  • testing/python/language/test_tilelang_language_rand.py: Cross-implementation validation logic (1D, 2D, seeded dropout) and tolerance thresholds (atol/rtol ~1e-3) warrant scrutiny for floating-point equivalence claims.

Poem

🐰 A Philox of seeds we now shall sow,
With random bits in CUDA's glow,
Across the tensors, fast and free,
From one-dee to two-dee spree,
Round and round the cryptos go! 🎲

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 21.74%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title directly describes the main change: adding random number generation support. It is concise, clear, and accurately reflects the primary purpose of the pull request.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebd64c3 and e381855.

📒 Files selected for processing (3)
  • src/op/builtin.cc (1 hunks)
  • src/tl_templates/cuda/random.h (1 hunks)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_rand.py (6)
src/tl_templates/cuda/random.h (1)
  • tl (5-93)
tilelang/env.py (1)
  • disable_cache (271-272)
tilelang/jit/kernel.py (1)
  • out_idx (461-462)
tilelang/language/allocate.py (1)
  • alloc_fragment (59-70)
tilelang/language/random.py (1)
  • rand (25-40)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

27-27: Unused function argument: dtype

(ARG001)


50-50: Unused function argument: dtype

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)


@github-actions bot commented Nov 5, 2025

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@Cunxiao2002 marked this pull request as draft November 5, 2025 12:59
@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
testing/python/language/test_rand.py (1)

9-21: Simplify indexing since only one block is used.

Line 19 uses bx * M + i, but since the kernel is launched with only 1 block (line 14), bx is always 0. The multiplication is unnecessary and potentially misleading.

Apply this diff to simplify the indexing:

            for i in T.Parallel(M):
-                A[bx * M + i] = rand_buffer[i]
+                A[i] = rand_buffer[i]
tilelang/language/random.py (2)

39-47: Consider simplifying the uint32-to-float conversion.

The current implementation reinterprets as signed int32 and handles negative values, which works but is more complex than necessary. A simpler approach would be to convert the uint32 directly to float and scale by 1/2^32.

Alternative implementation:

 @T.macro
 def uint32_to_uniform_float(x: tir.PrimExpr) -> tir.PrimExpr:
-    assert x.dtype == 'uint32' or x.dtype == "int32", f"x.dtype {x.dtype} is not supported"
-    x_int32 = T.reinterpret('int32', x)
-    scale = tir.const(4.6566127342e-10, "float32")
-
-    x_abs = T.if_then_else(x_int32 < 0, -x_int32 - 1, x_int32)
-
-    return T.Cast("float32", x_abs) * scale
+    assert x.dtype == 'uint32', f"x.dtype {x.dtype} is not supported"
+    # Scale factor: 1 / 2^32 = 2.3283064365e-10
+    scale = tir.const(2.3283064365e-10, "float32")
+    return T.Cast("float32", x) * scale

This avoids the signed interpretation and conditional logic. If the current implementation has specific numerical properties you want to preserve, please document why.
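
As a quick host-side sanity check (a NumPy sketch, not part of this PR), the two mappings can be compared at the extremes of the uint32 range; note that the folded variant maps both halves of the range onto the same interval:

import numpy as np

x = np.array([0, 1, 2**31 - 1, 2**31, 2**32 - 1], dtype=np.uint32)

# Current macro's mapping: reinterpret as int32 and fold negatives.
xi = x.view(np.int32).astype(np.int64)
folded = np.where(xi < 0, -xi - 1, xi)
current = folded.astype(np.float32) * np.float32(4.6566127342e-10)  # just under 1/2^31

# Suggested mapping: scale the raw uint32 by ~1/2^32.
suggested = x.astype(np.float64) * 2.3283064365386963e-10

print(current)    # 0x7FFFFFFF and 0x80000000 both land near the top of [0, 1)
print(suggested)  # monotone in x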


50-62: Efficiency concern: Only 1 of 4 Philox outputs is used.

The Philox algorithm produces four independent random values (c0, c1, c2, c3) per invocation, but only c0 is used. This wastes 75% of the generated entropy and is computationally inefficient.

Consider optimizing by processing 4 elements per Philox call:

@T.macro
def _rand_parallel_impl(buffer: T.Buffer, seed_lo, seed_hi, total_elems, n_rounds):
    # Process 4 elements at a time
    for i in T.Parallel((total_elems + 3) // 4):
        offset = T.Cast("uint32", i)
        offset_lo = offset
        offset_hi = tir.const(0, "uint32")
        
        c0, c1, c2, c3 = philox_impl(offset_lo, offset_hi, tir.const(0, "uint32"),
                                     tir.const(0, "uint32"), seed_lo, seed_hi, n_rounds)
        
        # Store up to 4 values
        for j in T.serial(4):
            idx = i * 4 + j
            if idx < total_elems:
                coords = index_to_coordinates(idx, buffer.shape)
                rand_val = [c0, c1, c2, c3][j]
                buffer[coords] = uint32_to_uniform_float(rand_val)

This would reduce the number of Philox calls by 4×, significantly improving performance for large buffers.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9d823b and a8d70a2.

📒 Files selected for processing (3)
  • testing/python/language/test_rand.py (1 hunks)
  • tilelang/language/__init__.py (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tilelang/language/__init__.py (1)
tilelang/language/random.py (1)
  • rand (65-74)
testing/python/language/test_rand.py (5)
tilelang/jit/kernel.py (1)
  • out_idx (461-462)
tilelang/language/proxy.py (1)
  • Tensor (252-253)
tilelang/language/allocate.py (1)
  • alloc_fragment (59-70)
tilelang/language/random.py (1)
  • rand (65-74)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
tilelang/language/random.py (3)
tilelang/language/utils.py (1)
  • index_to_coordinates (91-110)
tilelang/language/tir/op.py (2)
  • reinterpret (2006-2025)
  • if_then_else (3014-3044)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
🪛 Ruff (0.14.3)
tilelang/language/__init__.py

95-95: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

testing/python/language/test_rand.py

25-25: Unused function argument: dtype

(ARG001)

tilelang/language/random.py

58-58: Unpacked variable c1 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


58-58: Unpacked variable c2 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


58-58: Unpacked variable c3 is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (5)
tilelang/language/__init__.py (1)

95-96: LGTM! Clean re-export of the new rand API.

The import follows the established pattern in this file for exposing module-level APIs. The noqa: F401 directive is appropriate for re-exported symbols.

testing/python/language/test_rand.py (1)

45-47: LGTM!

The test function is properly guarded with @tilelang.testing.requires_cuda and correctly invokes the test harness.

tilelang/language/random.py (3)

6-8: LGTM! Correct implementation of high-bits multiplication.

The function correctly computes the high 32 bits of a 64-bit product using widening multiplication and bit shifting.
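
For reference, the same operation is easy to emulate on the host with Python's arbitrary-precision integers (a hypothetical cross-checking helper, not part of the PR):

def umulhi_uint32(a: int, b: int) -> int:
    # High 32 bits of the 32x32 -> 64-bit unsigned product.
    return ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)) >> 32

assert umulhi_uint32(0xFFFFFFFF, 0xFFFFFFFF) == 0xFFFFFFFE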


11-36: LGTM! Correct Philox-4x32 implementation.

The implementation follows the standard Philox-4x32 algorithm with correct constants and round functions. The key evolution and counter mixing are properly implemented.
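
For anyone cross-checking outputs, here is a pure-Python reference of the Philox-4x32 round structure (a sketch using the standard Philox constants, following Triton's formulation rather than quoting this PR's code):

MASK32 = 0xFFFFFFFF
PHILOX_ROUND_A, PHILOX_ROUND_B = 0xD2511F53, 0xCD9E8D57
PHILOX_KEY_A, PHILOX_KEY_B = 0x9E3779B9, 0xBB67AE85

def philox_4x32(c0, c1, c2, c3, k0, k1, n_rounds=10):
    for _ in range(n_rounds):
        _c0, _c2 = c0, c2
        # Two widening multiplies per round; keys and counters are mixed in via XOR.
        c0 = (((PHILOX_ROUND_B * _c2) >> 32) ^ c1 ^ k0) & MASK32
        c2 = (((PHILOX_ROUND_A * _c0) >> 32) ^ c3 ^ k1) & MASK32
        c1 = (PHILOX_ROUND_B * _c2) & MASK32
        c3 = (PHILOX_ROUND_A * _c0) & MASK32
        # Advance the key (Weyl sequence) between rounds.
        k0 = (k0 + PHILOX_KEY_A) & MASK32
        k1 = (k1 + PHILOX_KEY_B) & MASK32
    return c0, c1, c2, c3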


65-74: LGTM! Clean public API implementation.

The function correctly:

  • Converts the seed to uint64 and splits it into 32-bit components
  • Calculates the total number of elements from the buffer shape
  • Delegates to the parallel implementation

Comment on lines +24 to +29
@triton.jit
def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
pid = tl.program_id(0)
offset = pid * M + tl.arange(0, M)
rand = tl.rand(seed, offset)
tl.store(X + offset, rand, mask=offset < M)

⚠️ Potential issue | 🟠 Major

Remove unused parameter and fix mask logic.

Two issues:

  1. The dtype parameter is declared but never used in the function body.
  2. The mask offset < M is incorrect. Since offset = pid * M + tl.arange(0, M), the values range from pid * M to pid * M + M - 1, which are always >= M when pid > 0. The mask should check against a total size or be removed if the grid guarantees coverage.

Apply this diff:

 @triton.jit
-def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
+def triton_rand_1d(X, M, seed, TOTAL_SIZE):
     pid = tl.program_id(0)
     offset = pid * M + tl.arange(0, M)
     rand = tl.rand(seed, offset)
-    tl.store(X + offset, rand, mask=offset < M)
+    tl.store(X + offset, rand, mask=offset < TOTAL_SIZE)

Note: If the grid is always guaranteed to match the total size exactly, the mask can be omitted entirely.

🧰 Tools
🪛 Ruff (0.14.3)

25-25: Unused function argument: dtype

(ARG001)

🤖 Prompt for AI Agents
In testing/python/language/test_rand.py around lines 24 to 29, the
triton_rand_1d function declares an unused dtype parameter and uses an incorrect
mask; remove the unused dtype parameter from the signature and any callers, and
fix the mask by either (a) replacing "offset < M" with a check against the total
number of elements (e.g., "offset < total_size") after adding a total_size
argument, or (b) if the kernel grid is guaranteed to exactly cover the array,
remove the mask entirely; update any call sites accordingly to pass total_size
if you choose option (a).

@LeiWang1999 marked this pull request as ready for review November 6, 2025 09:40
@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
testing/python/language/test_tilelang_language_rand.py (2)

24-29: Remove unused parameter and fix mask logic.

The issues flagged in the previous review remain:

  1. The dtype parameter is declared but never used
  2. The mask offset < M is incorrect when pid > 0 since offset ranges from pid * M to pid * M + M - 1

32-42: Fix Triton kernel invocation and remove unnecessary tensor allocation.

The issues flagged in the previous review remain:

  1. Line 34 creates tilelang_result unnecessarily before reassignment on line 35
  2. Line 39 uses tl.constexpr(M) incorrectly—constexpr is a type hint, not a runtime wrapper
  3. Line 40 passes unused dtype parameter to the kernel
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8d70a2 and 760536a.

📒 Files selected for processing (2)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tilelang/language/random.py (2)
tilelang/language/__init__.py (1)
  • import_source (98-100)
tilelang/language/tir/op.py (2)
  • call_extern (173-195)
  • address_of (464-480)
testing/python/language/test_tilelang_language_rand.py (3)
tilelang/language/allocate.py (1)
  • alloc_fragment (59-70)
tilelang/language/random.py (1)
  • rand (92-97)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

25-25: Unused function argument: dtype

(ARG001)

🔇 Additional comments (6)
testing/python/language/test_tilelang_language_rand.py (3)

1-6: LGTM!

The imports are appropriate for testing TileLang and Triton random number generation.


45-51: LGTM!

The test structure is appropriate. The @tilelang.testing.requires_cuda decorator correctly guards CUDA-dependent tests.


14-15: Review comment appears speculative without definitive evidence of a problem.

The original comment raises reasonable architectural concerns about single-block kernels and fragment buffer sizes, but the search results do not reveal explicit size constraints in the tilelang codebase. The code under review is a test file with intentional simplifications—allocating a 1D fragment of size M for a random number fill is not unusual for test scenarios. The concern about "large M values" is not substantiated by error handling or documented limits found in the codebase.

Without concrete evidence that fragment buffers have enforced size limits (the allocate.py documentation in the search results does not specify such limits), and given this is a test rather than production code, the review comment represents a best-practice suggestion rather than a critical issue.

tilelang/language/random.py (3)

8-12: LGTM!

The inline PTX assembly for computing the high 32 bits of unsigned multiplication is correct and efficient.


14-44: LGTM!

The Philox algorithm implementation is correct. The constants and key schedule match the standard Philox specification.


92-97: LGTM!

The public API correctly computes the total number of elements and delegates to the implementation macro.

Comment on lines 56 to 75
__device__ void philox_rand_kernel(
float* output,
int total_elems,
uint64_t seed,
int n_rounds
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= total_elems) return;
uint32_t seed_lo = (uint32_t)(seed & 0xFFFFFFFFULL);
uint32_t seed_hi = (uint32_t)((seed >> 32) & 0xFFFFFFFFULL);
uint32_t offset_lo = (uint32_t)idx;
uint32_t offset_hi = 0U;
uint32_t c0 = offset_lo;
uint32_t c1 = offset_hi;
uint32_t c2 = 0U;
uint32_t c3 = 0U;
philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
output[idx] = uint32_to_uniform_float_device(c0);
}

⚠️ Potential issue | 🟡 Minor

Consider efficiency improvement and note dtype limitation.

Two observations:

  1. Efficiency: The Philox algorithm generates 4 random uint32 values (c0, c1, c2, c3) per invocation, but only c0 is used (line 74). For better throughput, consider generating 4 output values per thread.

  2. Hardcoded dtype: The kernel signature uses float* output (line 57), which limits the implementation to float32. This conflicts with the dtype parameter passed to _rand_parallel_impl (line 80), which suggests multi-dtype support.

For the efficiency improvement, you could modify the kernel to write 4 consecutive values:

-    output[idx] = uint32_to_uniform_float_device(c0);
+    int base_idx = idx * 4;
+    if (base_idx < total_elems) output[base_idx] = uint32_to_uniform_float_device(c0);
+    if (base_idx + 1 < total_elems) output[base_idx + 1] = uint32_to_uniform_float_device(c1);
+    if (base_idx + 2 < total_elems) output[base_idx + 2] = uint32_to_uniform_float_device(c2);
+    if (base_idx + 3 < total_elems) output[base_idx + 3] = uint32_to_uniform_float_device(c3);

Note: This would require adjusting the grid/block dimensions accordingly.
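
If this route is taken, the launch math changes correspondingly; a hypothetical host-side sketch (names illustrative, not from the PR):

def launch_dims(total_elems: int, threads_per_block: int = 256, outputs_per_thread: int = 4):
    # Each thread now covers 4 consecutive outputs, so fewer threads are needed.
    num_threads = (total_elems + outputs_per_thread - 1) // outputs_per_thread
    num_blocks = (num_threads + threads_per_block - 1) // threads_per_block
    return num_blocks, threads_per_block

print(launch_dims(1000))  # -> (1, 256): 250 working threads fit in one block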

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tilelang/language/random.py (1)

6-6: Drop or wire through the unused dtype parameter.

dtype never leaves _rand_parallel_impl, so it’s confusing noise and will trigger lint. Either remove it or pass it through to the intrinsic once multi-dtype support lands.

src/tl_templates/cuda/random.h (1)

33-36: Simplify: Remove unnecessary uint64_t casts.

The uint64_t casts in these operations are unnecessary:

  • Lines 33-34: Casting to uint64_t then truncating to uint32_t for the low 32 bits can be replaced with direct multiplication, which naturally truncates.
  • Lines 35-36: Casting to uint64_t for uint32_t addition is unnecessary since uint32_t addition naturally wraps on overflow (which is the intended behavior for Philox).

Apply this diff to simplify the code:

-    c1_val = (uint32_t)((uint64_t)B * (uint64_t)_c2);
-    c3_val = (uint32_t)((uint64_t)A * (uint64_t)_c0);
-    k0_val = (uint32_t)((uint64_t)k0_val + PHILOX_KEY_A);
-    k1_val = (uint32_t)((uint64_t)k1_val + PHILOX_KEY_B);
+    c1_val = B * _c2;
+    c3_val = A * _c0;
+    k0_val += PHILOX_KEY_A;
+    k1_val += PHILOX_KEY_B;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 760536a and d63dfbb.

📒 Files selected for processing (6)
  • src/op/builtin.cc (1 hunks)
  • src/op/builtin.h (1 hunks)
  • src/target/codegen_cuda.cc (2 hunks)
  • src/target/codegen_cuda.h (1 hunks)
  • src/tl_templates/cuda/random.h (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-03T06:24:11.411Z
Learnt from: Rachmanino
Repo: tile-ai/tilelang PR: 1175
File: src/op/math.cc:44-52
Timestamp: 2025-11-03T06:24:11.411Z
Learning: In tilelang's `src/op/math.cc`, the `infinity_op` function uses `std::numeric_limits<float>::infinity()` as a placeholder for all float types (including float64 and bfloat16). The codegen layer (PrintConst:Inf) handles the correct infinity value based on the dtype field of the FloatImm node, so the specific C++ template argument doesn't matter.

Applied to files:

  • src/target/codegen_cuda.cc
🧬 Code graph analysis (2)
src/target/codegen_cuda.cc (1)
src/target/codegen_cuda.h (1)
  • need_random_h_ (126-156)
tilelang/language/random.py (2)
src/op/builtin.h (1)
  • tvm (13-577)
tilelang/language/tir/op.py (2)
  • call_intrin (120-145)
  • address_of (464-480)
🪛 Ruff (0.14.3)
tilelang/language/random.py

6-6: Unused function argument: dtype

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint
🔇 Additional comments (2)
src/tl_templates/cuda/random.h (2)

7-11: LGTM: Standard PTX multiply-high implementation.

The inline PTX assembly correctly computes the high 32 bits of a 32×32→64 multiply, which is the standard approach for CUDA device code.


52-68: Kernel logic is correct, pending float conversion fix.

The kernel implementation correctly:

  • Computes per-thread indices with bounds checking
  • Initializes Philox state using seed and thread index as offset
  • Calls the Philox round function
  • Writes the converted output

However, the output correctness depends on fixing the uint32_to_uniform_float_device function flagged above. Once that conversion is corrected, this kernel will produce properly distributed random floats.

Comment on lines +44 to +50
TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
const float scale = 4.6566127342e-10f;
int32_t x_int32;
memcpy(&x_int32, &x, sizeof(uint32_t));
int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
return (float)x_abs * scale;
}

⚠️ Potential issue | 🔴 Critical

Fix incorrect uint32-to-float conversion logic.

The current implementation has a critical logic error that produces incorrect and biased random distributions:

  1. It reinterprets uint32_t as int32_t and takes an absolute value, which is incorrect for values ≥ 2³¹
  2. The scale factor 4.6566127342e-10f ≈ 1/2³¹ instead of the correct 1/2³² ≈ 2.3283064365e-10f
  3. This causes uint32 values in [2³¹, 2³²) to wrap around and map back to [0, 1), creating significant bias

For example:

  • 0x80000000 → interpreted as -2147483648 → abs gives 2147483647 → ~1.0
  • 0xFFFFFFFF → interpreted as -1 → abs gives 0 → ~0.0

The standard conversion for uint32 to uniform float [0, 1) is simply: x * 2.3283064365386963e-10f

Apply this diff to fix the conversion:

 TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
-  const float scale = 4.6566127342e-10f;
-  int32_t x_int32;
-  memcpy(&x_int32, &x, sizeof(uint32_t));
-  int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
-  return (float)x_abs * scale;
+  return x * 2.3283064365386963e-10f;
 }
🤖 Prompt for AI Agents
In src/tl_templates/cuda/random.h around lines 44 to 50, the uint32-to-float
conversion wrongly reinterprets the uint32 as int32, takes an abs (causing
wrap/bias for values >= 2^31) and uses the wrong scale; replace the body so it
does not cast/reinterpret to signed, remove memcpy and abs, and return the
uint32 multiplied by the correct 1/2^32 scale constant (2.3283064365386963e-10f)
cast to float so the function computes a uniform float in [0,1) without bias.

Comment on lines 7 to 14
T.call_intrin(
"handle",
tir.op.Op.get("tl.philox_rand"),
T.address_of(buffer[0]),
total_elems,
seed,
n_rounds,
)

⚠️ Potential issue | 🔴 Critical

Wrap the intrinsic in T.evaluate.

As written, T.call_intrin just builds a PrimExpr. Without T.evaluate(...) the call never gets emitted into the TIR, so the random kernel is never invoked.

Apply this diff:

-    T.call_intrin(
-        "handle",
-        tir.op.Op.get("tl.philox_rand"),
-        T.address_of(buffer[0]),
-        total_elems,
-        seed,
-        n_rounds,
-    )
+    T.evaluate(
+        T.call_intrin(
+            "handle",
+            tir.op.Op.get("tl.philox_rand"),
+            T.address_of(buffer[0]),
+            total_elems,
+            seed,
+            n_rounds,
+        )
+    )
🤖 Prompt for AI Agents
In tilelang/language/random.py around lines 7 to 14, the T.call_intrin
invocation builds a PrimExpr but is never emitted; wrap the intrinsic call in
T.evaluate so the call becomes a side-effecting statement in the TIR. Replace
the bare T.call_intrin(...) expression with T.evaluate(T.call_intrin(...)) (or
assign the result to a throwaway var via T.evaluate) so the random kernel is
actually invoked at codegen time.

T.call_intrin(
"handle",
tir.op.Op.get("tl.philox_rand"),
T.address_of(buffer[0]),

⚠️ Potential issue | 🔴 Critical

Index all buffer dimensions when taking the address.

buffer[0] only works for 1D buffers; multi-dimensional buffers require one index per dimension, otherwise script lowering fails. Emit a zero-tuple so the API works for general shapes.

Apply this diff after the T.evaluate change:

-            T.address_of(buffer[0]),
+            T.address_of(buffer[tuple(0 for _ in buffer.shape)]),
🤖 Prompt for AI Agents
In tilelang/language/random.py around line 10, the code takes
address_of(buffer[0]) which only indexes 1D buffers; change it to index the
buffer with an empty/zero-length index tuple so all dimensions are handled
(construct an empty index tuple and use it to index the buffer before calling
T.address_of), applying this after the T.evaluate change so lowering works for
multi-dimensional buffers.

- but 2d has some problems
@coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
testing/python/language/test_tilelang_language_rand.py (2)

23-28: Address the unused parameter and mask logic issues identified in the previous review.

As noted in the previous review, this function has two issues:

  1. The dtype parameter is declared but never used
  2. The mask logic offset < M is misleading for multi-block scenarios

Please refer to the previous review comment for the suggested fix.


31-40: Fix dtype inconsistency and address issues from the previous review.

In addition to the issues identified in the previous review (unused tensor allocation, incorrect tl.constexpr usage), there's a dtype inconsistency:

  • Line 33 explicitly uses torch.float32 for the TileLang result (though it's immediately reassigned)
  • Line 36 uses getattr(torch, dtype) for the Triton result

While line 34 reassigns tilelang_result, this inconsistency could be confusing. Ensure both implementations use the same dtype.

Apply this diff to address all issues:

 def run_rand_1d(M=1024, seed=42, dtype="float32", device="cuda"):
     tilelang_kernel = tilelang_rand_1d(M=M, seed=seed, dtype=dtype)
-    tilelang_result = torch.empty(M, dtype=torch.float32, device=device)
     tilelang_result = tilelang_kernel()

     triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
     grid = (1,)
-    BLOCK = tl.constexpr(M)
-    triton_rand_1d[grid](triton_result, BLOCK, seed=seed, dtype=getattr(tl, dtype))
+    triton_rand_1d[grid](triton_result, M, seed=seed, TOTAL_SIZE=M)
     torch.testing.assert_close(tilelang_result, triton_result, atol=1e-3, rtol=1e-3)
🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_rand.py (1)

14-18: Consider simplifying the indexing for the single-block kernel.

Since the kernel uses only 1 block (line 14), bx will always be 0, making the expression bx * M + i equivalent to just i. For clarity, you can simplify the indexing.

     @T.prim_func
     def rand_kernel(A: T.Tensor((M,), dtype)):
         with T.Kernel(1, threads=1024) as bx:
             rand_buffer = T.alloc_fragment((M,), dtype)
             T.rand(rand_buffer, seed)
             for i in T.Parallel(M):
-                A[bx * M + i] = rand_buffer[i]
+                A[i] = rand_buffer[i]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d63dfbb and 2a9dcec.

📒 Files selected for processing (2)
  • src/tl_templates/cuda/random.h (1 hunks)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/tl_templates/cuda/random.h
🧰 Additional context used
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_rand.py (6)
tilelang/language/allocate.py (2)
  • alloc_fragment (59-70)
  • alloc_shared (27-42)
tilelang/language/random.py (1)
  • rand (17-22)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
tilelang/transform/pass_config.py (1)
  • PassConfigKey (6-144)
tilelang/language/copy.py (1)
  • copy (11-87)
tilelang/language/tir/op.py (1)
  • if_then_else (3014-3044)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

24-24: Unused function argument: dtype

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
🔇 Additional comments (2)
testing/python/language/test_tilelang_language_rand.py (2)

96-109: Note: Different block sizes between Triton and TileLang implementations.

The Triton kernel uses BLOCK_SIZE=1024 (line 104), while the TileLang kernel uses block_m=64 (line 71 in tilelang_seeded_dropout). While this shouldn't affect correctness if both implementations are correct, it's worth noting that they process data in different granularities.

Additionally, this test will likely fail due to the seed handling issue identified in the tilelang_seeded_dropout function, where all blocks use the same seed.


112-119: LGTM: Test structure is appropriate.

The test function is properly decorated with @tilelang.testing.requires_cuda and calls both test runners. The main guard follows standard Python conventions.

Note that the test will need the issues identified in earlier comments to be resolved before it can pass reliably.

output_buffer = T.alloc_fragment((block_m,), dtype)

T.copy(A[block_start], A_shared)
T.rand(rand_buffer, seed)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical: Same seed used across all blocks produces identical random sequences.

Line 82 calls T.rand(rand_buffer, seed) with the same seed in every block. Unlike the Triton reference (line 58), which passes element-specific offsets to ensure unique random numbers, this implementation will generate the same random sequence in each block. This breaks the dropout functionality, as elements in different blocks at the same local offset will have identical dropout masks.

Verify whether the TileLang rand implementation automatically accounts for block/thread IDs to generate unique sequences:


🏁 Script executed:

#!/bin/bash
# Check if _rand_parallel_impl uses block/thread context for offset

# Search for the implementation of _rand_parallel_impl
ast-grep --pattern 'def _rand_parallel_impl($$$)'

# Also search for how seed/offset is handled in random.py
rg -n -A 10 '_rand_parallel_impl'

Length of output: 1226



---

🏁 Script executed:

#!/bin/bash
# Find the definition of T.rand function

# Search in language module for rand function
rg -n 'def rand' --type=py

# Also check if it's an alias or imported
rg -n 'rand\s*=' --type=py -A 2 -B 2

# Search for rand_buffer allocation in test file
rg -n 'rand_buffer' --type=py -B 5 -A 5

Length of output: 3787


🏁 Script executed:

#!/bin/bash
# Also search more broadly for rand in the tilelang module
rg -n '\bT\.rand' --type=py -C 3

Length of output: 1327



---

🏁 Script executed:

#!/bin/bash
# Get the full definition of T.rand including docstring
rg -n 'def rand' --type=py -A 15 tilelang/language/random.py

Length of output: 209


Critical: Same seed used across all blocks produces identical random sequences.

Line 82 calls T.rand(rand_buffer, seed) with the same seed in every block. The T.rand function has no offset parameter to differentiate between blocks. Unlike the Triton reference (lines 23-28), which passes element-specific offsets to ensure unique random numbers, all blocks will generate identical random sequences. This breaks dropout functionality, as elements in different blocks at the same local offset will have identical dropout masks.

To fix: Modify T.rand to accept an offset parameter (like Triton), or compute different seeds/offsets per block using block IDs before calling T.rand.

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_rand.py around line 82,
calling T.rand(rand_buffer, seed) with the same seed in every block produces
identical random sequences across blocks; change the call so each block uses a
distinct offset/seed: either update T.rand to accept an additional offset
parameter (like Triton) and pass a per-element/per-block offset computed from
block IDs, or compute a unique seed per block (e.g., seed + block_id or seed ^
block_id) before calling T.rand; ensure the offset/seed computation uses the
thread/block indices available in the kernel so different blocks generate
independent random streams.
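
One possible shape of the per-block fix, sketched against the dropout kernel quoted above (hypothetical: it assumes T.rand accepts an arbitrary integer expression as the seed, which should be verified; M, block_m, and seed are closure variables as in the test):

with T.Kernel(T.ceildiv(M, block_m), threads=128) as bx:
    rand_buffer = T.alloc_fragment((block_m,), "float32")
    # Derive a distinct stream per block so dropout masks do not repeat.
    T.rand(rand_buffer, seed + bx)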

@Cunxiao2002 (Contributor, Author) commented:

It seems there are still some issues when the buffer is 2D...

@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
tilelang/language/random.py (1)

13-21: Emit the intrinsic and supply a writable pointer.

Line 13: Without wrapping the T.call_intrin in T.evaluate, the call is dropped during lowering, so the Philox generator never runs. Line 16: buffer.access_ptr("r") returns a read-only pointer, which cannot legally be passed to a float* parameter and will fail compilation once CUDA codegen instantiates the call. Please wrap the intrinsic in T.evaluate and request a read/write pointer before passing it to the device helper.

-    T.call_intrin(
-        "handle",
-        tir.op.Op.get("tl.philox_rand"),
-        buffer.access_ptr("r"),
-        total_elems,
-        block_m,
-        block_n,
-        seed,
-        n_rounds,
-    )
+    T.evaluate(
+        T.call_intrin(
+            "handle",
+            tir.op.Op.get("tl.philox_rand"),
+            buffer.access_ptr("rw"),
+            total_elems,
+            block_m,
+            block_n,
+            seed,
+            n_rounds,
+        )
+    )
src/tl_templates/cuda/random.h (1)

44-49: Correct the uint32→float conversion.

Line 44: Reinterpreting the value as signed, taking an absolute value, and scaling by 1/2³¹ both introduces UB (on INT32_MIN) and produces a biased distribution. The canonical mapping to [0, 1) is simply x * 2.3283064365386963e-10f.

 TL_DEVICE float uint32_to_uniform_float_device(uint32_t x) {
-  const float scale = 4.6566127342e-10f;
-  int32_t x_int32;
-  memcpy(&x_int32, &x, sizeof(uint32_t));
-  int32_t x_abs = (x_int32 < 0) ? (-x_int32 - 1) : x_int32;
-  return (float)x_abs * scale;
+  return x * 2.3283064365386963e-10f;
 }
testing/python/language/test_tilelang_language_rand.py (2)

24-28: Fix Triton mask and drop the unused dtype parameter.

Lines 24-28: Using mask=offset < M only covers the first block; later blocks never store anything. The dtype kernel argument is also unused. Please replace the mask with a check against the overall element count and remove the dead parameter.

-@triton.jit
-def triton_rand_1d(X, M, seed, dtype: tl.constexpr):
+@triton.jit
+def triton_rand_1d(X, M, seed, total_size):
     pid = tl.program_id(0)
     offset = pid * M + tl.arange(0, M)
     rand = tl.rand(seed, offset)
-    tl.store(X + offset, rand, mask=offset < M)
+    tl.store(X + offset, rand, mask=offset < total_size)

31-39: Update the Triton call site to pass the total size.

Lines 31-39: After fixing the Triton kernel signature, the launch still passes the old arguments. Pass the total element count and drop the redundant tl.constexpr wrapper.

-    triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
-    grid = (1,)
-    BLOCK = tl.constexpr(M)
-    triton_rand_1d[grid](triton_result, BLOCK, seed=seed, dtype=getattr(tl, dtype))
+    triton_result = torch.empty(M, dtype=getattr(torch, dtype), device=device)
+    grid = (1,)
+    triton_rand_1d[grid](triton_result, M, seed=seed, total_size=M)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a9dcec and ebd64c3.

📒 Files selected for processing (3)
  • src/tl_templates/cuda/random.h (1 hunks)
  • testing/python/language/test_tilelang_language_rand.py (1 hunks)
  • tilelang/language/random.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
testing/python/language/test_tilelang_language_rand.py (4)
src/tl_templates/cuda/random.h (1)
  • tl (5-86)
tilelang/language/allocate.py (2)
  • alloc_fragment (59-70)
  • alloc_shared (27-42)
tilelang/language/random.py (1)
  • rand (25-40)
tilelang/language/tir/op.py (1)
  • if_then_else (3014-3044)
tilelang/language/random.py (2)
src/op/builtin.h (1)
  • tvm (13-577)
tilelang/language/tir/op.py (1)
  • call_intrin (120-145)
🪛 Ruff (0.14.3)
testing/python/language/test_tilelang_language_rand.py

24-24: Unused function argument: dtype

(ARG001)


114-114: Unused function argument: dtype

(ARG001)

tilelang/language/random.py

12-12: Unused function argument: dtype

(ARG001)


37-38: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)

Comment on lines 65 to 83
int elems_per_thread = (total_elems + num_threads - 1) / num_threads;

for (int i = 0; i < elems_per_thread; i++) {
int local_linear_idx = tid * elems_per_thread + i;
int local_row = local_linear_idx / block_n;
int local_col = local_linear_idx % block_n;

int global_row = block_row_offset + local_row;
int global_col = block_col_offset + local_col;
uint64_t global_idx = (uint64_t)global_row * N + global_col;

uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
uint32_t c2 = 0U;
uint32_t c3 = 0U;

philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
output[i] = uint32_to_uniform_float_device(c0);
}

⚠️ Potential issue | 🔴 Critical

Fix overlapping writes and guard the tail.

Lines 65-83: Every thread writes to output[i], so threads clobber each other’s results. When total_elems < blockDim.x * elems_per_thread, local_linear_idx also exceeds the valid range, causing out-of-bounds writes. Guard the loop with local_linear_idx < total_elems and index the buffer with local_linear_idx.

-  for (int i = 0; i < elems_per_thread; i++) {
-    int local_linear_idx = tid * elems_per_thread + i;
+  for (int i = 0; i < elems_per_thread; i++) {
+    int local_linear_idx = tid * elems_per_thread + i;
+    if (local_linear_idx >= total_elems) {
+      break;
+    }
     int local_row = local_linear_idx / block_n;
     int local_col = local_linear_idx % block_n;
 
     int global_row = block_row_offset + local_row;
     int global_col = block_col_offset + local_col;
     uint64_t global_idx = (uint64_t)global_row * N + global_col;
 
     uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
     uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
     uint32_t c2 = 0U;
     uint32_t c3 = 0U;
 
     philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
-    output[i] = uint32_to_uniform_float_device(c0);
+    output[local_linear_idx] = uint32_to_uniform_float_device(c0);
   }
🤖 Prompt for AI Agents
In src/tl_templates/cuda/random.h around lines 65 to 83, threads currently write
to output[i] unguarded which causes overlapping writes and potential
out-of-bounds when local_linear_idx >= total_elems; modify the loop to check
that local_linear_idx < total_elems before doing work and write into
output[local_linear_idx] (not output[i]) so each thread writes to the correct
global slot and the tail is safely guarded.

Comment on lines 81 to 91
T.copy(A[block_start], A_shared)
T.rand(rand_buffer, seed)

for i in T.Parallel(block_m):
output_buffer[i] = T.if_then_else(
rand_buffer[i] > p,
A_shared[i] / (1.0 - p),
0.0,
)

T.copy(output_buffer, B[block_start])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Guard loads/stores in the last dropout block.

Lines 81-91: T.copy(A[block_start], A_shared) and T.copy(output_buffer, B[block_start]) always move block_m elements. When M % block_m != 0, the final block reads and writes past the end of A/B, leading to undefined behaviour. Please compute the remaining element count (e.g. valid = T.min(block_m, M - block_start)) and guard both copies and the parallel loop so the tail block only touches valid indices.

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_rand.py around lines 81 to 91,
the copies and the parallel loop always move block_m elements which overruns A/B
when M % block_m != 0; compute valid = T.min(block_m, M - block_start) and use
that to bound all tail operations: copy only valid elements from A to A_shared
and from output_buffer back to B, and change the T.Parallel loop to iterate over
valid instead of block_m so rand_buffer/output_buffer and the conditional write
only touch indices < valid.

Comment on lines +138 to +142
with T.Kernel(T.ceildiv(M, block_m), T.ceildiv(N, block_n), threads=64) as (bx, by):
rand_buffer = T.alloc_fragment((block_m, block_n), dtype)
T.rand(rand_buffer, seed)
T.copy(rand_buffer, A[bx * block_m, by * block_n])


⚠️ Potential issue | 🔴 Critical

Handle partial tiles in the 2D generator.

Lines 138-142: T.copy(rand_buffer, A[bx * block_m, by * block_n]) blindly writes block_m * block_n elements. For matrices whose dimensions aren’t multiples of block_m/block_n, the last tiles run out of bounds. Add per-dimension guards (similar to the Triton mask) or clamp the copy extents to the remaining rows/columns before issuing the copy.
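
A guarded elementwise store is one way to express that (a sketch only, replacing the final T.copy in the kernel quoted above; it assumes per-element conditionals are supported inside T.Parallel):

for i, j in T.Parallel(block_m, block_n):
    gi = bx * block_m + i
    gj = by * block_n + j
    if gi < M and gj < N:  # clamp the tail tiles
        A[gi, gj] = rand_buffer[i, j]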

@LeiWang1999 (Member) commented:

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 344 to 346
TIR_DEFINE_TL_BUILTIN(philox_rand)
.set_num_inputs(4)
.set_attr<TCallEffectKind>("TCallEffectKind",


P1: Register philox_rand with correct input count

The new intrinsic is registered with set_num_inputs(4) even though every caller passes six arguments (buffer pointer, total_elems, block_m, block_n, seed, n_rounds) and the CUDA helper tl::philox_rand also takes six parameters. TVM’s intrinsic registry enforces num_inputs, so this mismatch will trigger a check failure when T.rand is lowered, making the feature unusable. The intrinsic should accept six inputs (or -1 for variadic) to match the call sites.


Comment on lines 65 to 82
int elems_per_thread = (total_elems + num_threads - 1) / num_threads;

for (int i = 0; i < elems_per_thread; i++) {
int local_linear_idx = tid * elems_per_thread + i;
int local_row = local_linear_idx / block_n;
int local_col = local_linear_idx % block_n;

int global_row = block_row_offset + local_row;
int global_col = block_col_offset + local_col;
uint64_t global_idx = (uint64_t)global_row * N + global_col;

uint32_t c0 = (uint32_t)(global_idx & 0xFFFFFFFFULL);
uint32_t c1 = (uint32_t)((global_idx >> 32) & 0xFFFFFFFFULL);
uint32_t c2 = 0U;
uint32_t c3 = 0U;

philox_impl_device(&c0, &c1, &c2, &c3, seed_lo, seed_hi, n_rounds);
output[i] = uint32_to_uniform_float_device(c0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Store Philox results into unique buffer indices

Inside the CUDA helper philox_rand each thread computes local_linear_idx for the element it owns, but the generated random value is written to output[i] instead of output[local_linear_idx]. When more than one thread participates this causes every thread to overwrite the same first few positions, leaving most of the buffer uninitialized and yielding repeated numbers when copied back to global memory. It can also write past the end of the buffer when total_elems is not a multiple of the thread count. The output pointer should be indexed by local_linear_idx (with a bounds check) so each thread writes to its own slice.


@Cunxiao2002 (Contributor, Author) replied:

Thanks! I will take a look.
