Conversation

ezyang commented Apr 5, 2021

Rendered

This is not a complete RFC in the traditional sense, but it reflects our current thinking related to the work @asuhan has been doing in torch_xla and how this should affect PyTorch core proper. This doc predates pytorch/xla#2854, but seeing mlir-npcomp pursuing a similar avenue, I wanted to get this RFC out there so people outside of FB can have more context about what is going on in XLA internally.

ezyang commented Apr 5, 2021

also cc @wconstab

byronyi commented Apr 12, 2021

It seems that torch-xla relies on XLA for shape inference, and PyTorch accesses sizes/numel/dim after each op. If we strip out XLA, this probably needs to come from somewhere else.

I also noticed that you added the Meta device for shape inference. Do lazy tensors depend on shape inference, or in particular on meta tensors implemented in-tree?

asuhan commented Apr 13, 2021

@byronyi If we strip out XLA, shape inference can come from core PyTorch. Some of the helpers might need to be exposed, but they're already written. The XLA inference was just more convenient since we already had to lower to it and could avoid making changes to the core, but the shape inference results are exactly the same and we can substitute the one in core for the one provided via XLA.

In terms of access to sizes/numel/dim, that works exactly the same with or without XLA, not sure what you mean. What's your concern there?

Finally, lazy tensors don't depend deeply on shape inference or even meta tensors. There's a moderate amount of work to be done for a few operators which currently use static sizes, but nothing fundamental in the design assumes or requires static shapes. We'll still require static ranks, but that should be flexible enough for all practical applications of this.

byronyi commented Apr 13, 2021

Some of the helpers might need to be exposed, but they're already written.

Would appreciate if you could give me a pointer here to core :)

In terms of access to sizes/numel/dim, that works exactly the same with or without XLA, not sure what you mean. What's your concern there?

I am just wondering if it is possible to lazily evaluate shapes of lazy tensors as well.

Finally, lazy tensors don't depend deeply on shape inference or even meta tensors. There's a moderate amount of work to be done for a few operators which currently use static sizes, but nothing fundamental in the design assumes or requires static shapes. We'll still require static ranks, but that should be flexible enough for all practical applications of this.

If shapes must be materialized before actually computing the value of lazy tensors, then shape inference seems to be a must to me.

asuhan commented Apr 13, 2021

Some of the helpers might need to be exposed, but they're already written.

Would appreciate if you could give me a pointer here to core :)

One such example: https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/aten/src/ATen/native/ConvUtils.h#L25. They're not centralized, but they're quite easy to find and most interesting / difficult ones tend to be exposed already.
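For what it's worth, that helper is just the standard convolution output-size arithmetic. A standalone sketch of the calculation (simplified signature; not the actual ConvUtils.h code) looks roughly like this:

// Sketch of the standard convolution output-size arithmetic, along the lines of
// the ConvUtils.h helper linked above (simplified; not the actual core code).
#include <cstdint>
#include <vector>

std::vector<int64_t> conv_output_size(
    const std::vector<int64_t>& input,    // e.g. {N, C_in, H, W}
    const std::vector<int64_t>& weight,   // e.g. {C_out, C_in / groups, kH, kW}
    const std::vector<int64_t>& padding,
    const std::vector<int64_t>& stride,
    const std::vector<int64_t>& dilation) {
  std::vector<int64_t> out(input.size());
  out[0] = input[0];   // batch dimension is unchanged
  out[1] = weight[0];  // output channels come from the weight
  for (size_t d = 2; d < input.size(); ++d) {
    const int64_t kernel = dilation[d - 2] * (weight[d] - 1) + 1;
    out[d] = (input[d] + 2 * padding[d - 2] - kernel) / stride[d - 2] + 1;
  }
  return out;
}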

In terms of access to sizes/numel/dim, that works exactly the same with or without XLA, not sure what you mean. What's your concern there?

I am just wondering if it is possible to lazily evaluate shapes of lazy tensors as well.

I think you're conflating two separate aspects here:

  1. Shape inference used to validate operations in core PyTorch. That doesn't change in any way, because lazy tensors are still ATen tensors and that part still runs before we even hit the vendor back-end. At that stage, sizes are queried directly, just as they are for CPU or CUDA devices.
  2. How to generate code for a graph when shapes aren't known. That's on the vendor to implement in the compiler back-end; in this tracing infrastructure we'll just have a mode which erases concrete shapes before passing them to the vendor-specific back-end. Once we know that the concrete shapes were valid, we erase them so that we don't trigger re-compilations for the same graph with different concrete sizes, as long as it's structurally the same otherwise. It's then up to the vendor to build a runtime which includes a way to query shapes, hoist / elide bounds checking when redundant, etc. For example, for the computation x + y, once we know from the higher layer that the shapes were compatible, we just tell the vendor we have aten::add(x, y) and the rank of the two tensors. The vendor will pick a representation which includes the shape information in their compiler back-end and add runtime calls to query it in order to know the upper bounds of the loops, etc. (A rough sketch of such a shape-erased representation follows this list.)
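To make the erasure concrete, here's a rough sketch (hypothetical types, not the actual lazy tensor IR) of a node representation that keeps only ranks after validation:

// Hypothetical shape-erased IR node (illustrative only): after the core shape
// checks pass, only the rank of each operand is kept, so structurally identical
// graphs with different concrete sizes map to the same compiled artifact.
#include <cstdint>
#include <string>
#include <vector>

struct ErasedOperand {
  int64_t rank;  // static rank is preserved; concrete sizes are dropped
};

struct ErasedNode {
  std::string op;                       // e.g. "aten::add"
  std::vector<ErasedOperand> operands;  // only ranks survive erasure
};

// Erase concrete sizes once the higher-layer shape checks have passed.
inline ErasedNode erase_shapes(
    const std::string& op,
    const std::vector<std::vector<int64_t>>& concrete_sizes) {
  ErasedNode node{op, {}};
  for (const auto& sizes : concrete_sizes) {
    node.operands.push_back({static_cast<int64_t>(sizes.size())});
  }
  return node;
}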

Finally, lazy tensors don't depend deeply on shape inference or even meta tensors. There's a moderate amount of work to be done for a few operators which currently use static sizes, but nothing fundamental in the design assumes or requires static shapes. We'll still require static ranks, but that should be flexible enough for all practical applications of this.

If shapes must be materialized before actually computing the value of lazy tensors, then shape inference seems to be a must to me.

I think I've addressed this above. Shapes don't need to be materialized, but they need to be queried with a runtime mechanism which relies on the representation the vendor picks for an operator. How the vendor does that is entirely outside of our control, since we don't even know how the vendor will represent runtime values. We do know, though, that any operand in the graph will be a pair of value and shape, that there'll be a way to associate concrete sizes from the inputs with symbolic shapes in the graph at launch time, and that there'll be a way to query those, so we're certain it can be done. In our x + y example, it'll go like this (a rough sketch of the launch-time binding follows the list):

  1. We validate x + y in higher layers with concrete shapes. We erase everything but the rank and hand the aten::add(x, y) graph to the vendor.
  2. At launch time, the vendor associates concrete sizes of x and y with symbolic variables in the generated code for the graph which isn't specific to a concrete size.
  3. These symbolic variables are used in the for loop generated for the aten::add operation and are queried at runtime from the representation the vendor picked. That representation is filled in for x and y at launch time with the concrete size values for that launch.
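A rough sketch of that launch-time binding (hypothetical names; the actual mechanism is entirely up to the vendor):

// Hypothetical vendor-side runtime representation: each operand is a
// (value, shape) pair, and concrete sizes are bound to the graph's symbolic
// dimensions at launch time so generated loops can query their upper bounds.
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RuntimeOperand {
  void* device_buffer;          // opaque device value
  std::vector<int64_t> sizes;   // concrete sizes known only at launch time
};

using SymbolicDim = int;  // index of a symbolic dimension in the compiled graph
using DimBindings = std::unordered_map<SymbolicDim, int64_t>;

// Associate concrete input sizes with the symbolic dims used in the compiled graph.
inline DimBindings bind_sizes(
    const std::vector<RuntimeOperand>& inputs,
    const std::vector<std::vector<SymbolicDim>>& symbolic_shapes) {
  DimBindings bindings;
  for (size_t i = 0; i < inputs.size(); ++i) {
    assert(inputs[i].sizes.size() == symbolic_shapes[i].size());  // rank is static
    for (size_t d = 0; d < inputs[i].sizes.size(); ++d) {
      bindings[symbolic_shapes[i][d]] = inputs[i].sizes[d];
    }
  }
  return bindings;
}
// The generated aten::add loop then reads its trip counts from the bindings
// instead of from sizes baked into the code.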

ezyang commented Apr 13, 2021

To add some of my own perspective on top of @asuhan's:

  • For size specialized traces (the easy case), asuhan advocates for directly using the size helper functions. This is a good super short term fix. A more mid-term fix is to indeed use meta tensors, which offer a uniform API for doing shape inference (just convert both input tensors to meta, run the operation on them, and you get a meta tensor that tells you what the output shape should be; or, once "Using a PyTorch-core codegen API" (xla#2871) is a thing, we can support codegen facilities to hook into meta computation in a more efficient way). This doesn't save you too much, since you have to know a bit about the operators anyway to lower them in a backend, but it is something.
  • Based on what @jamesr66a tells me about lazy shape evaluation in FX, you are going to hit some hard walls when a model takes a size somewhere, reads it out, and passes it as an integer to some other operation. No matter what you do, XLA-style functional traces in the dispatcher will give you an int, and not a symbolic int (because we have C++ code that literally takes int64_t). We're talking about some long term strategies for solving this, but it won't be for a while. (A short illustration of where the concrete int shows up follows this list.)
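A short illustration of the second point in ATen C++ terms (a made-up model fragment, not code from any particular model):

// When a model reads a size and feeds it back into another op, the dispatcher
// trace only ever sees the concrete int64_t, never a symbolic size.
#include <ATen/ATen.h>

at::Tensor flatten_batch(const at::Tensor& x) {
  int64_t batch = x.size(0);        // comes back as a plain int64_t
  return x.reshape({batch, -1});    // the trace records the concrete value of batch
}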

asuhan commented Apr 13, 2021

  • Based on what @jamesr66a tells me about lazy shape evaluation in FX, you are going to hit some hard walls when a model takes a size somewhere and reads it out and passes it as an integer to some other operation. No matter what you do, XLA-style functional traces in the dispatcher will give you int, and not a symbolic int (because we have C++ code that literally takes int64_t). We're talking about some long term strategies for solving this, but it won't be for a while.

In the model I'm describing, this would translate to an unavoidable (even assuming an infinitely smart compiler) size-query runtime call in the vendor back-end. By construction it cannot propagate to any dimension size, since we erase those (if the vendor chooses the dynamic mode). In other words, the concrete value would only be present during shape checks in higher layers, but once we know those pass, we erase the sizes and the vendor back-end is responsible for emitting the proper runtime queries. It's possible the graph would also change, but I'm not sure how frequent that would be in practice.

Tracing is part of the solution and not a full solution. It'll work very well for some models by itself, but we'll definitely need to complement it with TorchScript annotations in the parts of models which are vulnerable to the worst aspects of tracing. However, in terms of correctness, I'm pretty sure what I'm describing is sound.

byronyi commented Apr 13, 2021

For size specialized traces (the easy case), asuhan advocates for directly using the size helper functions.

So these size helper functions (for each and every operator in PT core) will be extracted into a central place in a new repo (for lazy tensors)? I am asking this because they seem pretty sparse to me, but maybe that is only because I am not that familiar with the codebase.

you are going to hit some hard walls when a model takes a size somewhere and reads it out and passes it as an integer to some other operation.

I am actually okay with occasional dynamically shaped/ranked ops, as they are probably going to be handled by some fallback logic (i.e. host CPU in the XLA case) anyway. But I do hope we could get a static/flat trace for a spanning list of uninteresting ops (element-wise, mm, etc.), so vendor backends need not be queried for the output shape of each op in the functional lazy trace.

Just as an illustration, consider the following wrapper TensorImpl:

#include <iostream>

#include <ATen/ATen.h>
#include <c10/core/TensorImpl.h>

// Wrapper TensorImpl that forwards all metadata queries (sizes/dim/numel) to an
// underlying CPU tensor and logs every access.
struct XLATensorImpl : public c10::TensorImpl {
  explicit XLATensorImpl(at::Tensor rep)
      : TensorImpl(c10::DispatchKeySet(c10::DispatchKey::XLA), rep.dtype(),
                   c10::Device("xla:0")),
        rep_(std::move(rep)) {
    if (rep_.defined()) {
      set_sizes_and_strides(rep_.sizes(), rep_.strides());
    }
    set_storage_access_should_throw();
  }

  at::IntArrayRef sizes() const override {
    std::cout << "Accessing sizes " << rep_.sizes() << std::endl;
    return rep_.sizes();
  }

  int64_t dim() const override {
    std::cout << "Accessing dim " << rep_.dim() << std::endl;
    return rep_.dim();
  }

  int64_t numel() const override {
    std::cout << "Accessing numel " << rep_.numel() << std::endl;
    return rep_.numel();
  }

 private:
  at::Tensor rep_;  // the CPU tensor backing this wrapper
};

A wrapper-based backend fallback to CPU gives us the following traces:

aten::convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups) -> (Tensor)
Accessing sizes [4, 10, 24, 24]
Accessing numel 23040
Accessing dim 4
Accessing dim 4
Accessing sizes [4, 10, 24, 24]
Accessing sizes [4, 10, 24, 24]
Accessing numel 23040
Accessing dim 4
Accessing dim 4
aten::max_pool2d_with_indices(Tensor self, int[2] kernel_size, int[2] stride=[], int[2] padding=[0, 0], int[2] dilation=[1, 1], bool ceil_mode=False) -> (Tensor, Tensor)
Accessing sizes [4, 10, 12, 12]
Accessing numel 5760
Accessing dim 4
Accessing dim 4
Accessing sizes [4, 10, 12, 12]
Accessing numel 5760
Accessing dim 4
Accessing dim 4
Accessing sizes [4, 10, 12, 12]
Accessing sizes [4, 10, 12, 12]
Accessing numel 5760
Accessing dim 4
Accessing dim 4
Accessing sizes [4, 10, 12, 12]
Accessing numel 5760
Accessing dim 4
Accessing dim 4
aten::relu(Tensor self) -> (Tensor)
Accessing sizes [4, 10, 12, 12]
Accessing numel 5760
Accessing dim 4
Accessing dim 4
Accessing sizes [4, 10, 12, 12]
Accessing dim 4

Ideally the functional lazy traces need not run the operator on CPU to get sizes and strides. Right now in torch-xla it is supplied by XLA shape inference.

asuhan commented Apr 13, 2021

For size specialized traces (the easy case), asuhan advocates for directly using the size helper functions.

So these size helper functions (for each and every operator in PT core) will be extracted into a central place in a new repo (for lazy tensors)? I am asking this because they seem pretty sparse to me, but maybe that is only because I am not that familiar with the codebase.

PT/XLA already handles a fraction of operators, not the full set - not all operators occur in real models, especially if we're talking about performance-sensitive paths. Further, only some of those have non-trivial shape equations. I've checked most in the past and I remember finding usable helpers for nearly all of them. I agree with @ezyang that meta tensors are probably the way to go about this given a longer time horizon, but the situation isn't dire in the short term either. I wouldn't bother to centralize the helpers, but rather bring up meta tensors.

I am actually okay with occasional dynamically shaped/ranked ops, as they are probably going to be handled by some fallback logic (i.e. host CPU in the XLA case) anyway. But I do hope we could get a static/flat trace for a spanning list of uninteresting ops (element-wise, mm, etc.), so vendor backends need not be queried for the output shape of each op in the functional lazy trace.

The vendor can hoist / elide unnecessary checks for sequences of element-wise operations, for example; it's just a compiler optimization pass in their back-end. We could probably offer some basic, generic tools to assist with some of that, but this competes against everything else we want to get done and isn't particularly hard to implement on the vendor side either. But you're right that it'd be nice to have, and I'm open-minded about it.

Ideally the functional lazy traces need not run the operator on CPU to get sizes and strides. Right now in torch-xla it is supplied by XLA shape inference.

That would be the idea: you'd never run an operator on CPU just to get sizes and strides. A vendor could choose to punt to CPU whenever a runtime size query would be needed, but that's an implementation quality issue - a good implementation would come with a runtime (native to the accelerator) to support such queries on the value representation chosen by the vendor. Think metadata similar to that of CPU or GPU tensors, but stored in accelerator memory; nothing prevents that. Taken to the extreme, a pure interpreter for the tensor computation graph can run on the accelerator itself (not saying that'd be a truly good option, just making an informal argument that what I'm describing is feasible). I'm assuming a general purpose core (probably quite slow in absolute terms) which can execute shape computation and run a simple memory allocator is present on the accelerator, which I think is a fairly safe assumption for training accelerators.

ezyang commented Apr 14, 2021

So these size helper functions (for each and every operator in PT core) will be extracted into a central place in a new repo (for lazy tensors)? I am asking this because they seem pretty sparse to me, but maybe that is only because I am not that familiar with the codebase.

So, there are two main APIs we're envisioning for accessing sizes as we add more support for structured kernels.

The first is the direct (but somewhat inefficient) API. Suppose you are implementing "add" in XLA and you want to compute what the output shape should be; you can write this:

#include <ATen/ATen.h>

at::Tensor add(const at::Tensor& self, const at::Tensor& other) {
    // Dispatch on the meta device: checks the inputs, computes no data.
    auto out_meta = at::add(self.to(at::kMeta), other.to(at::kMeta));
    // ... out_meta.sizes() ...
}

The add call will do all the error checking you need on the inputs, and give you an output that says what the output sizes, dtype, etc. should be. You can then make use of this information as necessary for your error checking. (NB: the output size isn't that useful, because if you are actually implementing a lowering you will have a far more detailed understanding of what the operator does, but it might still be helpful.)
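To make the usage concrete, here is a hedged sketch of what a lowering might then do with the meta result (the helper name below is made up for illustration):

// Hypothetical vendor-side use of the meta result (names made up).
#include <vector>
#include <ATen/ATen.h>

void describe_add_output(const at::Tensor& self, const at::Tensor& other) {
  auto out_meta = at::add(self.to(at::kMeta), other.to(at::kMeta));
  std::vector<int64_t> out_sizes = out_meta.sizes().vec();  // inferred output sizes
  at::ScalarType out_dtype = out_meta.scalar_type();        // inferred output dtype
  // ... hand out_sizes / out_dtype to the backend's own shape representation ...
}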

The second is a more fleshed-out version of @bdhirsh's proposal at pytorch/xla#2871, where we generate the scaffolding that calls into meta, so by the time you are writing lowerings you can assume all error checking has already happened. (We haven't really gotten that far in the design process here.)

Ideally the functional lazy traces need not run the operator on CPU to get sizes and strides. Right now in torch-xla it is supplied by XLA shape inference.

Yeah, so structured kernels are trying to make it so that you can easily get the sizes and strides from PyTorch directly, rather than assuming you already have an accurate lowering. Coverage is not very high right now but getting better!

byronyi commented Apr 14, 2021

Yeah, so structured kernels are trying to make it so that you can easily get the sizes and strides from PyTorch directly, rather than assuming you already have an accurate lowering. Coverage is not very high right now but getting better!

Thanks, I was just reading through https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md#goals and that is exactly what I am asking for. We'd love to help on this effort!

ezyang commented Apr 14, 2021

There's a nice tracker issue on the subject :) pytorch/pytorch#55070

byronyi commented Apr 21, 2021

I see @asuhan pushed pytorch/pytorch@fafb8ab and the corresponding changes on the torch-xla side: https://github.com/pytorch/xla/tree/asuhan/xla_ltc_plugin. Nice one!

I do see a TODO on shape inference: https://github.com/pytorch/pytorch/blob/lazy_tensor_staging/lazy_tensor_core/lazy_tensor_core/csrc/compiler/node_lowering.h#L19-L20. Do you plan to provide such facilities from core using meta tensors?

asuhan commented Apr 21, 2021

@byronyi In the short term I'll just expose / extract the required helpers from core and use them to provide shape inference. Longer term, yes, I really hope we'll be doing that with meta tensors instead - it'd be much nicer.

wconstab and others added 3 commits May 10, 2021 11:09
Generalized a bit (from xla-specific) and added some context around backend integration API.

yzhliu commented May 24, 2021

@ezyang @asuhan We are working on a similar approach to integrate our compiler backend with PyTorch. It's great to see the work you are doing. Do you have a timeline for when and which parts will be moved into PyTorch core? Also, feel free to let us know if there's anything we can help with.

Moreover, do you have plans to integrate LazyTensor with TorchScript so we can add an extra layer to scripted modules, as @hzfan mentioned in pytorch/xla#2957 (comment)?

cc @mli

asuhan commented May 24, 2021

@yzhliu We intend to put this part into core, in the lazy_tensor_core sub-folder: https://github.com/pytorch/pytorch/tree/lazy_tensor_staging. This was derived from the "upper half" of pytorch/xla - it offers lazy tensor infrastructure which is independent of the actual backend.

We have a working XLA backend: https://github.com/pytorch/xla/tree/asuhan/xla_ltc_plugin, derived from the "lower half" of pytorch/xla and I'm working on a TorchScript backend too - that'll allow us to reuse nvFuser and other TorchScript backends more easily, while providing the graph capture mechanisms and infrastructure.

We have ideas to use TorchScript together with lazy tensors (NB: this is different from TorchScript as a backend mentioned above) to address the caveats - undesired loop unrolling and other similar issues. The idea would be to use it sparingly in problem areas (which have control flow) and still lean on lazy tensors for most of the model, which would maintain usability.

The plan is to be ready to merge lazy_tensor_core into master by July-August, which might be ambitious but should at least be doable. There are several things which need to happen: code base reduction (we can autogenerate most nodes in csrc/ops), removing the last dependencies on absl (we have a handful of uses of StrCat and Span, which we can re-implement), etc. That being said, I'm trying to keep that branch reasonably fresh, 1-2 weeks away from master, until we can finally merge it.

yzhliu commented May 26, 2021

@asuhan Thanks for sharing the details and the estimated timeline. We'll keep an eye on the project and might bother you folks later as we start to make progress :)

@nunoplopes

Only saw this RFC now, sorry.
I've been working on a lazy tensor implementation as well. It's fairly small (around 1k LoC plus autogenerated code), and I've managed to run a few models from TorchVision & HuggingFace already.
The IR is currently 1-to-1 with ATen ops; the backend just redispatches op-by-op back to PyTorch. I will hook it up with TorchScript next.

The complications I've hit so far:

  • The dtype must be inferred, as the dtype getter is not virtual.
  • It's nice to infer the shape as well, as otherwise traces are too small in some models.
  • Not all accesses to shape information in PyTorch go through the virtual methods (bugs).

For the dtype & shape inference, I wrote a program that executes all ATen ops with different combinations of shapes/dtypes and checks which of my hand-written rules applies. The dispatch wrappers are generated automatically from this information.

My code is available here: https://github.com/nunoplopes/torchy

Would you guys (FB, others?) be interested in chatting about integration and directions, sharing lessons learned, sharing data from models already running, etc.?

wconstab commented Jul 6, 2021

@nunoplopes Thanks for reaching out! We are also continuing on the branch lazy_tensor_staging in the subfolder lazy_tensor_core, where you can see README and API_GUIDE for more info about the current work. It would be good to compare notes on the two systems and see if we can combine the best of both. Also, just curious, what is your use case for lazy tensors?

@nunoplopes

@nunoplopes Thanks for reaching out! We are also continuing on the branch lazy_tensor_staging in the subfolder lazy_tensor_core, where you can see README and API_GUIDE for more info about the current work. It would be good to compare notes on the two systems and see if we can combine the best of both. Also, just curious, what is your use case for lazy tensors?

I'm building a JIT compiler for PyTorch. I use lazy tensors to assemble traces and then ship those traces to an off-the-shelf compiler, like TorchScript, for optimization. Traces are cached and reused when seen again.
The goal is to offer eager-mode execution semantics, but with performance close to old-school data-flow systems (XLA & friends).
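As a rough illustration of the caching idea (hypothetical types, not Torchy's actual code): a captured trace is keyed by its structure and the compiled artifact is reused on a hit.

// Minimal trace-cache sketch: compile once per structural trace key, reuse after.
#include <functional>
#include <string>
#include <unordered_map>

struct CompiledTrace { /* backend-specific executable */ };

class TraceCache {
 public:
  // `key` would encode the ops, ranks, and dtypes of the captured trace.
  CompiledTrace& get_or_compile(const std::string& key,
                                const std::function<CompiledTrace()>& compile) {
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, compile()).first;  // compile only on a miss
    }
    return it->second;                            // reuse on every later hit
  }

 private:
  std::unordered_map<std::string, CompiledTrace> cache_;
};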

hzfan commented Nov 1, 2021

Do we have a timeline to merge lazy_tensor_staging into master?

yzhliu commented Nov 8, 2021

Hi @ezyang, as mentioned before, we are building features on top of the lazy tensor support, and we noticed the branch is being actively updated. Would you mind sharing the plan for making the branch officially available to PyTorch users?

wconstab commented Nov 8, 2021

We are in the process of merging lazy tensor support into pytorch master right now. (You can already see some files landing in torch/csrc/lazy/....) It will probably take us the rest of the year, possibly a little longer, to land the whole thing, including our TorchScript backend. The core interfaces should land sooner, although we won't consider them 'stable' right away, as we are probably going to have to make changes to add support for additional features like distributed training, improve performance, etc.

The actual functionality of lazy tensors won't be enabled 'by default' for users (of the cpu/cuda devices) this year either. Right now, lazy is a 'virtual device', meaning you have to move your tensors to the 'lazy' (or 'xla') device explicitly and use other flags to configure the hardware used by the backend. In the future we'd like to explore making a 'lazy mode' available to existing devices (e.g. cpu, cuda).

hzfan commented Nov 9, 2021

@wconstab Could you elaborate a bit more on the TorchScript backend? How do users use TorchScript together with lazy tensor core? Previously I extended LTC to support TorchScript for control flow, and I am super excited to hear that we have official support for it now.

wconstab commented Nov 9, 2021

elaborate a bit more about the TorchScript backend

It's still WIP, but the idea is to construct a lazy trace of 'functional ATen ops', which are then trivially convertible to TorchScript IR. The lazy-traced TorchScript IR is a subset of overall TS IR capabilities, since it is functional, has no control flow, and has no Python classes. The IR is then fed to a TS GraphExecutor and used with existing TS passes/backends.
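A rough sketch of what that conversion boils down to for a trivial functional trace, using the torch::jit C++ API (illustrative only, not the actual backend code; the real backend converts whole lazy traces):

// Build a TS graph for a single functional aten::add and run it with a
// GraphExecutor.
#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/runtime/graph_executor.h>
#include <ATen/ATen.h>

torch::jit::Stack run_traced_add(const at::Tensor& x, const at::Tensor& y) {
  auto graph = std::make_shared<torch::jit::Graph>();
  auto* vx = graph->addInput("x")->setType(c10::TensorType::get());
  auto* vy = graph->addInput("y")->setType(c10::TensorType::get());
  // Schema matching fills in the default alpha=1 argument of aten::add.Tensor.
  auto* out = graph->insert(c10::Symbol::fromQualString("aten::add"), {vx, vy});
  graph->registerOutput(out);

  torch::jit::GraphExecutor executor(graph, "lazy_add");
  torch::jit::Stack stack{x, y};
  executor.run(stack);  // the result tensor is left on the stack
  return stack;
}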

I extended ltc to support TorchScript for control flow,

Not sure what this means. We don't support control flow in lazy tracing currently. (No specific plans to support it either, although there is some ongoing discussion on the topic.)
