There seems to be a limit on the size of model you can load before CoreML decides it has to run on the CPU instead (see the second link in my previous comment)
If it could use the full 'unified' memory that would be a big step towards getting these models running on it
I'm unsure how the performance compares to a beefy Intel CPU, but there are some numbers here [1] for running a variant of the small distilbert-base model on the Neural Engine... it's ~10x faster than running on the M1 CPU
[1] https://github.com/anentropic/experiments-coreml-ane-distilb...