
Any recommendations on how to use a PyTorch-trained model for inference? Is it best to load it with PyTorch directly, or to convert it to ONNX and use ONNX Runtime [1] instead? The ONNX route seems to be required, at least if you want to run the model through TensorRT. I appreciate this is a very general question.

[1] https://github.com/microsoft/onnxruntime



I think using TRTorch[1] can be a quick way to generate inference models from PyTorch that are both easy to use and fast.

It compiles your model ahead of time using TensorRT and lets you use the compiled model in your application through torch.jit.load("your_trtorch_model.ts"). Once compiled, you no longer need to keep your model's code in the application (just as with regular jit models).

The inference time is on par with TensorRT, and it does the optimizations for you. You can also quantize your model to FP16 or Int8 using PTQ, which should give you an additional inference speed-up.

Here is a tutorial[2] to leverage TRTorch.

[1] https://github.com/NVIDIA/TRTorch/tree/master/core [2] https://www.photoroom.com/tech/faster-image-segmentation-trt...
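
Roughly, the ahead-of-time flow looks like this (a minimal sketch assuming the TRTorch 0.x Python API; the compile-spec keys like "input_shapes" and "op_precision" changed in later releases, and the torchvision model is just a stand-in for your own):

    import torch
    import torchvision
    import trtorch  # later renamed Torch-TensorRT

    model = torchvision.models.resnet18(pretrained=True)  # stand-in for your model
    scripted = torch.jit.script(model.eval().cuda())       # or torch.jit.trace

    compile_settings = {
        "input_shapes": [[1, 3, 224, 224]],  # static shape for input #0
        "op_precision": torch.half,          # FP16; Int8 PTQ needs a calibrator
    }
    trt_module = trtorch.compile(scripted, compile_settings)
    torch.jit.save(trt_module, "your_trtorch_model.ts")

    # In the serving application, no model code is needed anymore:
    runtime_module = torch.jit.load("your_trtorch_model.ts")
    out = runtime_module(torch.randn(1, 3, 224, 224).half().cuda())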


There's another level of speed you can unlock by combining this with https://pytorch.org/docs/master/notes/cuda.html#cuda-graphs. I got (I kid you not) 20x speed on batch size = 1 inference by first using TensorRT to fuse kernels and then "graphing". And even for larger batch sizes it's just free perf gains.

https://imgur.com/OKRbUNw
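
For reference, the capture/replay pattern from the linked CUDA Graphs docs looks roughly like this (a sketch for PyTorch 1.10+; the tiny conv/relu model here is only a stand-in):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1),
        torch.nn.ReLU(),
        torch.nn.Conv2d(16, 1, 3, padding=1),
    ).cuda().eval()

    static_input = torch.randn(1, 3, 64, 64, device="cuda")

    # Warm up on a side stream so lazy init / autotuning happens outside the graph.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            _ = model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one inference pass into a graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g), torch.no_grad():
        static_output = model(static_input)

    # Replay: copy new data into the captured input tensor, then launch all
    # kernels with a single call -- no per-kernel launch or alloc overhead.
    static_input.copy_(torch.randn(1, 3, 64, 64, device="cuda"))
    g.replay()
    result = static_output.clone()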


Holy crap that’s amazing! How complex is your model? And are there lots of parallelizable parts like filters or is it recurrent?


The model that I got 20x on is very simple - just a couple of convs and relus - it's for edge detection on a pseudo-embedded platform (Jetson). But the wins from CUDA graphs come from two things: complete elimination of individual kernel launch times and complete elimination of allocations for intermediate tensors, which dominate runtime for small workloads (e.g. batch size = 1).


That is so cool! May I ask at which resolution you got those results?

We managed to get up to 10x for very low resolutions (160) on a ResNet-101, but it usually plateaus at a 1.7~1.9x speed-up for high resolutions (above 896x896). Using Int8 gives even higher speed-ups (~3.6x for 896x896 input), although for some tasks it degrades accuracy too much.

I will definitely try your setup :)


Indeed, small resolutions (64x64), but I mean 2x speed is still nothing to sneeze at.


I agree, especially when it is free accuracy-wise :)


As someone who's been vaguely interested in PyTorch inference optimization but has never had a clear jumping-in point, thank you for this comment! It's nice to see a clear two-sentence explanation that actually makes sense to me; it makes me really want to try out TRTorch and TensorRT!

Have a nice day internet stranger.


As far as I know, the ONNX format won't give you a performance boost on its own. However, there are ONNX optimizers for the ONNX runtime which will speed up your inference.

But if you are using Nvidia hardware, then TensorRT should give you the best performance possible, especially if you change the precision level. Don't forget to simplify your ONNX model before converting it to TensorRT though: https://github.com/daquexian/onnx-simplifier
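
For example, the simplify step can look like this (a small sketch assuming the onnx and onnx-simplifier Python packages; file names are placeholders):

    import onnx
    from onnxsim import simplify

    model = onnx.load("model.onnx")
    model_simplified, check = simplify(model)
    assert check, "simplified ONNX model could not be validated"
    onnx.save(model_simplified, "model_simplified.onnx")
    # model_simplified.onnx is what you then hand to the TensorRT converter,
    # e.g. trtexec --onnx=model_simplified.onnx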


> However, there are ONNX optimizers for the ONNX runtime

I think you meant that there are optimizers for the ONNX format, ONNX Runtime being one of them.


You are right, thanks. Mixed those up.


The main thing you want for server inference is auto-batching. It's a feature that's included in onnxruntime, TorchServe, NVIDIA Triton Inference Server, and Ray Serve.

If you have a lot of preprocessing and post-processing logic around your model, it can be hard to export it for onnxruntime or Triton, so I usually recommend starting with Ray Serve (https://docs.ray.io/en/latest/serve/index.html) and using an actor that runs inference with a quantized model, or one optimized with TensorRT (https://github.com/NVIDIA-AI-IOT/torch2trt).
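
A rough sketch of that setup, assuming Ray Serve's @serve.deployment / @serve.batch API (details differ between Ray versions) and a model that was already quantized or compiled offline and saved as TorchScript ("optimized_model.ts" is a placeholder):

    import torch
    from ray import serve

    @serve.deployment
    class InferenceActor:
        def __init__(self):
            # Model already quantized or compiled with torch2trt / TensorRT.
            self.model = torch.jit.load("optimized_model.ts")
            self.model.eval()

        @serve.batch(max_batch_size=8)
        async def handle_batch(self, inputs):
            # Serve's auto-batching collects individual requests into this list.
            batch = torch.stack(inputs)
            with torch.no_grad():
                out = self.model(batch)
            return [o.tolist() for o in out]

        async def __call__(self, request):
            payload = await request.json()
            x = torch.tensor(payload["inputs"], dtype=torch.float32)
            # Pre/post-processing can stay here in plain Python.
            return {"outputs": await self.handle_batch(x)}

    app = InferenceActor.bind()  # then e.g. `serve run my_module:app`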


Related to this question, can someone explain the design goal of torch.jit to me? Is it supposed to boost performance or just provide a means to export models? I found my jitted code ran slower than interpreted PyTorch, and the latter, despite its asynchronous nature, spent most of its time waiting for the next GPU kernel to start.

Having got a working torch model on CPU, what's the best path to actually making it run as fast as I feel it has the potential to?


It’s both. torch.jit started life as an optimizer. I think fusion of pointwise kernels on GPU - which we finally extended to CPU in this release - was one of the early wins via jit.

But at some point it became a model export format for production environments that can’t use CPython for performance reasons.

I’m surprised that you’re seeing worse performance with jit. It sometimes takes 20-ish iterations for the jit to “settle down” but I’d expect roughly equal performance at worst. If you can share a repro, I’d be happy to take a quick look if you file an issue on GitHub. (I’m @bertmaher there)


You know what I still don't understand? What's taking so long to warm up? I see that there are graph passes that run to do various things at the TS IR level, but I don't see any stats being collected (on shapes) or anything like that that then informs further optimization.


There’s a “profiling graph executor” that records shapes and then hands them off to a fusion compiler. The profiling executor re-specializes on every new shape it sees, but stops after 20 re-specializations.

We’re working on eliminating the dependence on shape specialization right now, since it’s kind of an unfortunate limitation for some workloads.


Oh also to answer your “as fast as possible” question: usually you’ll get the best performance by exporting your model to a perf-tuned runtime. We’ve seen really good results with TensorRT and (for transformers) FasterTransformer. I’ve also seen good results with ONNX runtime.

Staying within PyTorch, we recently added torch.jit.optimize_for_inference (I think it's in 1.10, though not entirely sure), which can apply a bunch of standard optimizations to a model and often provides some nice wins.
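
A minimal sketch of that path (assuming a PyTorch version that ships torch.jit.optimize_for_inference; the torchvision model is just a placeholder):

    import torch
    import torchvision

    model = torchvision.models.resnet50(pretrained=True).eval()
    scripted = torch.jit.script(model)
    optimized = torch.jit.optimize_for_inference(scripted)  # freezes + applies standard passes

    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        y = optimized(x)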


Torch.jit shouldn't impact your performance positively or negatively, in my experience, although I've only used it on CPU. As far as I know, it's just used for model exports.

The nice thing about it, though, is that you can embed native Python code (compiled to TorchScript and executed by the C++ runtime) into the model artifact. It's allowed us to write almost all of the serving logic of our models very close to the model code itself, giving a better overview than having the server logic live in a separate repo.

The server we use on top of this can be pretty "dumb" and just funnel all inputs to the model; the embedded Python code decides what to do with them.

As for model speed-ups, maybe you should look into quantization? I also find that there's usually a lot of low-hanging fruit if you go over the code and rewrite it to use quicker ops that are mathematically equivalent but allocate less memory or do fewer ops.
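
As one example of the quantization route, dynamic quantization on CPU is only a couple of lines (a sketch; the tiny model is a stand-in, and this mostly helps Linear/LSTM-heavy models):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).eval()

    # Weights stored as int8, activations quantized on the fly at runtime.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 256)
    with torch.no_grad():
        y = quantized(x)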


It makes it possible to lift your model out of Python while handling programming constructs like loops and if statements (see torch.jit.script).

It also makes it possible to sidestep the GIL and remove the overhead of launching kernels from python, which only really makes a noticeable difference with models that queue up a lot of small operations on the GPU. (LSTMs are an example of where this would make a difference https://pytorch.org/blog/optimizing-cuda-rnn-with-torchscrip...)
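
As a tiny illustration of the control-flow point (a sketch; the module here is made up):

    import torch

    class Gate(torch.nn.Module):
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Data-dependent branch: torch.jit.trace would bake in one path,
            # torch.jit.script preserves the if/else.
            if x.sum() > 0:
                return x * 2
            return -x

    scripted = torch.jit.script(Gate())
    scripted.save("gate.pt")  # loadable later without the Python class
    print(scripted.code)      # inspect the generated TorchScript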


Have you looked at TorchServe?

https://pytorch.org/serve/


PyTorch's serialized models are just Python pickle files. So to load those you need the original classes that were used to build the model. By converting to ONNX you get rid of those dependencies.
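
The export itself is a single call (a minimal sketch; the torchvision model and the tensor names are placeholders):

    import torch
    import torchvision

    model = torchvision.models.resnet18(pretrained=True).eval()
    dummy = torch.randn(1, 3, 224, 224)

    torch.onnx.export(
        model, dummy, "model.onnx",
        input_names=["input"], output_names=["output"],
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    )
    # model.onnx can now be loaded by any ONNX-compatible runtime without the
    # original Python class definitions.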


Your mileage will certainly vary, but I was able to eke out a lot more inference performance by exporting to ONNX, using a high-speed serving framework in that ecosystem, and applying some computation graph optimizations to the ONNX version using available community tools, versus serving from PyTorch directly.

We were doing millions of inferences, with a specific target of a couple thousand per second, so a specific case for sure, but that's my two cents.
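
For what it's worth, the serving side with ONNX Runtime and its built-in graph optimizations can be as small as this (a sketch using the onnxruntime Python API; the input name should match whatever you used at export time):

    import numpy as np
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    session = ort.InferenceSession("model.onnx", sess_options=opts)
    # For GPU builds you can also pass providers=["CUDAExecutionProvider"].

    x = np.random.randn(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {"input": x})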



