Refactor gradscaler #99301


Closed

Conversation

@heidongxianhua (Contributor) commented Apr 17, 2023

Fixes #ISSUE_NUMBER
Currently, the GradScaler-related code lives at torch/cuda/amp, but we think the strategy for updating the scale should be basically the same across devices (cuda/xpu, etc.). So it may be better to move this code to torch/amp, so that the GradScaler defined in torch/amp can be inherited for other devices (cuda/xla/mps, ..., and custom devices).
And most importantly, this will not break backward compatibility. @bdhirsh @albanD

As we discussed here: https://dev-discuss.pytorch.org/t/improve-the-extension-with-privateuse1-for-custom-device/1196/7
cc @mcarilli @ptrblck @leslie-fang-intel @jgong5
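
To illustrate the intent, here is a minimal sketch (not part of this PR's diff) of how a custom backend could reuse a device-generic scaler; the import path and the `device` argument are assumptions about the proposed refactor, not the final API:

from torch.amp.grad_scaler import GradScaler  # assumed location after the refactor


class CustomDeviceGradScaler(GradScaler):
    """Scaler for a custom backend that reuses the generic scale-update logic."""

    def __init__(self, **kwargs) -> None:
        # Only the device string differs; the growth/backoff bookkeeping is
        # inherited from the device-generic base class. "privateuseone" is a
        # placeholder name for a custom backend's device type.
        super().__init__(device="privateuseone", **kwargs)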

@pytorch-bot bot commented Apr 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99301

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 94f5fee:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the release notes: distributed (sharded) label Apr 17, 2023
@heidongxianhua force-pushed the refactor_gradscaler branch 2 times, most recently from b2a76fe to 8b1f176 on April 17, 2023 11:08
@albanD (Collaborator) commented Apr 18, 2023

I'll let @janeyx99 take a first stab at reviewing this.

@janeyx99 self-requested a review April 18, 2023 17:21
@janeyx99 (Contributor) left a comment


Thanks for working on this! Left several comments

Comment on lines +13 to +14
"is deprecated. It will be removed in the future and "
"use torch.amp.grad_scaler._refresh_per_optimizer_state()")

Suggested change
"is deprecated. It will be removed in the future and "
"use torch.amp.grad_scaler._refresh_per_optimizer_state()")
"is deprecated. It will be removed in the future. Instead "
"use torch.amp.grad_scaler._refresh_per_optimizer_state()")

nit for clarity

@@ -170,7 +48,8 @@ def scale(self, outputs):
return outputs * self._scale.to(device=outputs.device, non_blocking=True)

# Invoke the more complex machinery only if we're treating multiple outputs.
stash: List[_MultiDeviceReplicator] = [] # holds a reference that can be overwritten by apply_scale
# holds a reference that can be overwritten by apply_scale
stash: List[torch.amp.grad_scaler._MultiDeviceReplicator] = []

Instead of typing out the full torch.amp.grad_scaler. every time in the file, can these be imported at the top for clarity + ease of reading?
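
For example, something like this at the top of torch/cuda/amp/grad_scaler.py (a sketch of the reviewer's suggestion, assuming the refactored torch/amp/grad_scaler.py exposes these names):

from typing import List

# Assumed exports of the refactored module; imported once at module level.
from torch.amp.grad_scaler import GradScaler, _MultiDeviceReplicator

# ...so the annotations later in the file stay short:
stash: List[_MultiDeviceReplicator] = []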

Lazily serves copies of a tensor to requested devices. Copies are cached per-device.
"""
def __init__(self, master_tensor: torch.Tensor) -> None:
self.master = master_tensor

I see you removed the assertion for the device check (CUDA and XLA). We will still want to keep these assertions during the refactoring until we can confidently say all devices are supported.
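
For reference, a trimmed sketch of the check being discussed, mirroring the pre-refactor CUDA/XLA assertion (the surrounding class is abbreviated here):

from typing import Dict

import torch


class _MultiDeviceReplicator:
    """Lazily serves copies of a tensor to requested devices (trimmed sketch)."""

    def __init__(self, master_tensor: torch.Tensor) -> None:
        # The assertion the review asks to keep until other devices are verified.
        assert master_tensor.is_cuda or master_tensor.device.type == "xla"
        self.master = master_tensor
        self._per_device_tensors: Dict[torch.device, torch.Tensor] = {}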

iterations. After that, step skipping should occur rarely (once every few hundred or thousand iterations).

Args:
device

Please add documentation for this new variable so people know what values it can take and what it's used for!

Default: ``True``
"""
def __init__(self,
device="cuda",

Would prefer not defaulting to CUDA here but passing in the right value in the cuda/amp/grad_scaler for device.
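
In code, the suggestion would look roughly like this in torch/cuda/amp/grad_scaler.py (a sketch; the generic base class and its device parameter are assumptions about the refactor):

from torch.amp.grad_scaler import GradScaler as _GradScaler  # assumed generic base


class GradScaler(_GradScaler):
    """CUDA-facing wrapper that passes the device explicitly."""

    def __init__(self, **kwargs) -> None:
        # The CUDA subclass pins device="cuda", so the generic base class does
        # not need to default to CUDA.
        super().__init__(device="cuda", **kwargs)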


Also, we may want to assert that the device is either cuda or xla here
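
For example, the constructor could guard the accepted values until broader device support is verified (a sketch; the accepted set is an assumption, not settled API):

class GradScaler:  # trimmed sketch, not the full class
    def __init__(self, device: str = "cuda", enabled: bool = True) -> None:
        # Reject devices other than cuda/xla until they are known to work.
        assert device in ("cuda", "xla"), (
            f"GradScaler currently supports 'cuda' and 'xla', got {device!r}"
        )
        self._device = device
        self._enabled = enabled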

return outputs

# Short-circuit for the common case.
if isinstance(outputs, torch.Tensor):

Also, I see the assert statement is no longer there; same comment as above about keeping the assertions until we're confident other devices work.


for device, per_dtype_grads in per_device_and_dtype_grads.items():
for grads in per_dtype_grads.values():
torch._amp_foreach_non_finite_check_and_unscale_(grads,

To give more context about device support, this call, for example, only exists for CUDA.

"""
return self._enabled

def state_dict(self):

This should also include the new device attr, and the same applies to load_state_dict and the other state-related functions below.
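
For instance, something along these lines (a sketch of the suggested change to the method; the "device" key and its handling in load_state_dict are assumptions):

def state_dict(self):
    # Sketch only: persist the device alongside the existing scaler state so
    # that load_state_dict can restore it; the "device" key is an assumption.
    if not self._enabled:
        return {}
    return {
        "device": self._device,
        "scale": self.get_scale(),
        "growth_factor": self._growth_factor,
        "backoff_factor": self._backoff_factor,
        "growth_interval": self._growth_interval,
        "_growth_tracker": self._get_growth_tracker(),
    }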

@@ -83,6 +83,7 @@ class ShardedGradScaler(GradScaler):

def __init__(
self,
device: str = "cuda",

If this is CUDA only, then shouldn't it inherit from the GradScaler from torch.cuda.amp.grad_scaler?
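
In code, that alternative would look roughly like this (a sketch; it assumes the CUDA-specific scaler stays at torch/cuda/amp/grad_scaler.py):

from torch.cuda.amp.grad_scaler import GradScaler


class ShardedGradScaler(GradScaler):
    # Keep the CUDA-only sharded scaler on the CUDA-specific base rather than
    # the new device-generic one.
    def __init__(self, init_scale: float = 2.0**16, **kwargs) -> None:
        super().__init__(init_scale=init_scale, **kwargs)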

@mikaylagawarecki added the triaged label Apr 25, 2023
@github-actions bot commented

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the Stale label Jun 24, 2023
@github-actions bot closed this Jul 24, 2023
@heidongxianhua deleted the refactor_gradscaler branch January 8, 2025 03:50
Labels
module: amp (automated mixed precision), open source, release notes: distributed (sharded), Stale, triaged

5 participants