This should, in theory, work with CUDA; my GPU just doesn't have enough RAM for it. It runs out at 2.9GiB allocated. I have 4GiB, but I'm running a compositing desktop, which chews up about 600MiB (not sure where the other ~400MiB went).
[edit]
I confirmed CUDA worked with the "small" model, which used 3.3GB of GPU RAM. Recognition was much poorer than with the "medium" model on my CPU, but it ran at least two orders of magnitude faster.
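For anyone wanting to reproduce this: assuming this is the openai-whisper Python package (the "small"/"medium"/"large" names match its model sizes), a minimal sketch for picking the device and dropping to a smaller model when VRAM is tight looks like:

    import torch
    import whisper

    # Use the GPU when available; fall back to CPU otherwise.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Smaller models need less VRAM ("small" fit in ~3.3GB here);
    # "medium" and "large" need considerably more.
    model = whisper.load_model("small", device=device)

    result = model.transcribe("speech.wav")  # hypothetical input file
    print(result["text"])

The file name is just a placeholder; the trade-off is exactly what's described above: smaller model fits in memory and runs fast, at the cost of accuracy.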
CUDA worked fine with "large" on my 2080 Ti, FWIW. The speedup is ridiculous, as expected: my Ryzen 3800X took almost an hour to transcribe a minute's worth of speech, while the 2080 Ti does it in 10-20 seconds.
I'm on Windows. According to Task Manager, dedicated GPU memory went from 1GB before the run to about 9.8GB for most of the run, peaking at 10.2GB, so pretty close to the 11GB limit of my 2080 Ti.
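If you'd rather get a number from inside the process than from Task Manager, PyTorch tracks its own peak allocations. A sketch (note this only counts memory allocated through PyTorch, not the process's total dedicated GPU memory, so it will read lower than Task Manager):

    import torch

    torch.cuda.reset_peak_memory_stats()

    # ... run the transcription here ...

    peak = torch.cuda.max_memory_allocated()  # bytes, PyTorch allocations only
    print(f"peak GPU memory: {peak / 2**30:.2f} GiB")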