I wonder how much it is even possible to train cutting-edge models. I am sure there are still tasks where a simple feed-forward network or RNN is enough.
However, you can only just barely finetune the base pretrained transformer models (e.g. BERT base or XLM-R base) with 8GB of VRAM, and you need 12GB or 16GB of VRAM to finetune larger models. Given that M1 Macs are currently limited to 16GB of shared RAM, I think training competitive models on them is severely constrained by memory.
I guess the real fun only starts when Apple releases higher-end machines with 32 or 64GB of RAM.
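To make the memory point concrete, here's a minimal sketch of the kind of memory-conscious finetuning setup implied above, assuming the HuggingFace transformers and datasets libraries and a CUDA-capable GPU; the model name, dataset, and hyperparameters are illustrative placeholders, not anything specific from this thread.

```python
# Sketch: finetuning a "base"-sized transformer under a tight VRAM budget.
# Assumes HuggingFace transformers/datasets; model, dataset and hyperparameters
# are placeholder choices for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-cased"  # a base model that can fit in ~8GB with care
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    # Short sequences keep activation memory down
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,   # small batch to stay within VRAM
    gradient_accumulation_steps=4,   # simulate a larger effective batch size
    fp16=True,                       # half precision roughly halves activation memory
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,             # enables dynamic padding per batch
)
trainer.train()
```

Even with tricks like this (small batches, gradient accumulation, fp16), larger models simply don't fit in 8GB, which is why the 16GB shared-memory ceiling matters.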
Well, unless they also support acceleration on AMD GPUs, this is not so interesting. Training on x86_64 CPU cores or integrated Intel GPUs is really slow compared to training on modern NVIDIA GPUs with Tensor Cores (or AMD GPUs with ROCm, if you can get the ROCm stack running without hitting obscure bugs).
The M1 blows away Intel CPUs with integrated GPUs (and modern NVIDIA GPUs will probably blow away the M1 results, otherwise they'd show the competition ;)).