
The main thing you want for server inference is automatic (dynamic) batching. It's a feature that's built into ONNX Runtime, TorchServe, NVIDIA Triton Inference Server, and Ray Serve.
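Roughly, automatic batching means individual requests get queued and flushed as one forward pass once a batch fills up or a short timeout expires. A minimal, framework-agnostic sketch of that idea (not any of those servers' actual internals; `queue` is assumed to be an asyncio.Queue of (input, Future) pairs and `model` a callable that maps a list of inputs to a list of outputs):

    import asyncio

    MAX_BATCH_SIZE = 8
    MAX_WAIT_S = 0.005  # flush after 5 ms even if the batch isn't full

    async def batching_loop(queue, model):
        while True:
            # Block for the first request, then greedily collect more
            # until the batch is full or the wait deadline passes.
            item = await queue.get()
            inputs, futures = [item[0]], [item[1]]
            deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
            while len(inputs) < MAX_BATCH_SIZE:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                inputs.append(item[0])
                futures.append(item[1])
            # One forward pass for the whole batch, then fan results back out.
            outputs = model(inputs)
            for fut, out in zip(futures, outputs):
                fut.set_result(out)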

If you have a lot of preprocessing and post-processing logic around your model, it can be hard to export the whole pipeline for ONNX Runtime or Triton, so I usually recommend starting with Ray Serve (https://docs.ray.io/en/latest/serve/index.html) and using an actor that runs inference with a quantized model or one optimized with TensorRT via torch2trt (https://github.com/NVIDIA-AI-IOT/torch2trt).
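As a sketch (assumptions: a toy dynamically quantized PyTorch model standing in for the real one, and placeholder pre/post-processing), a Ray Serve deployment can keep that Python logic in the actor and still get automatic batching via serve.batch:

    import torch
    from ray import serve

    @serve.deployment(num_replicas=1)
    class Classifier:
        def __init__(self):
            base = torch.nn.Sequential(
                torch.nn.Linear(128, 64),
                torch.nn.ReLU(),
                torch.nn.Linear(64, 10),
            )
            # Dynamic int8 quantization of the linear layers; a
            # torch2trt-converted module could be swapped in here on GPU.
            self.model = torch.quantization.quantize_dynamic(
                base, {torch.nn.Linear}, dtype=torch.qint8)

        @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.01)
        async def handle_batch(self, tensors):
            # tensors is a list of single inputs collected by serve.batch;
            # stack them and run one forward pass for the whole batch.
            batch = torch.stack(tensors)
            with torch.no_grad():
                out = self.model(batch)
            return [row for row in out]

        async def __call__(self, request):
            # Pre/post-processing stays in ordinary Python here, which is
            # the point of using an actor instead of exporting the pipeline.
            payload = await request.json()
            x = torch.tensor(payload["features"], dtype=torch.float32)
            logits = await self.handle_batch(x)
            return {"logits": logits.tolist()}

    app = Classifier.bind()
    # serve.run(app)  # then POST JSON to http://127.0.0.1:8000/

Callers hit the deployment one request at a time; serve.batch transparently groups concurrent calls into a single forward pass and returns each caller its own row.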


