From the numbers it looks like performance is similar to TensorFlow-DirectML from Microsoft. Which, unfortunately, also means it is ~6 times slower than CUDA on the same hardware.
Evidence so far would suggest that it's closed source.
The archive downloaded by the installer script does contain some source code, but it's mostly "generic" TensorFlow code, with some Python stubs that call off to native libraries (as you'd expect). It seems like all of the ML Compute stuff is contained within pre-compiled libraries (with some header files provided), but no source code.
I could be wrong here, and it might be that the intention is to open source the ML Compute components, but I don't think that's been done yet.
Weeks later and we still don't know anything. How do the M1's dedicated ML cores (the so-called "AI engine" / Neural Engine, not the M1's GPU!) compare to current consumer Nvidia Tensor Cores, e.g. those in the 3080 or 3090 (not Nvidia's general GPU cores, and please, no CPUs and no random Intel GPUs!) for model training (not inference!)?
If there's someone who has an educated guess of the performance difference between...
Apple claims 11 trillion OPS (11 TOPS), NVIDIA claims 285 TOPS; my educated guess is that real-life performance is off by the same factor on both, leaving the 3090 with a ~25x advantage over the M1 (285 / 11 ≈ 26).
For this kind of computation the real bottleneck is memory, both size and bandwidth. You have to read the data and write it back, so you simply cannot go faster than your memory bandwidth.
For example: if you have a 2 GB model (1G float16 parameters) that does not fit in any cache, and you have to touch all of the data, then a single pass over it requires 4 GB of bandwidth (read + write). The M1 can do that ~17 times/s (assuming zero compute, so an impossible upper bound); the 3090 ~233 times/s (also not realistic). So "the 3090 is ~10x the M1" is a lower estimate.
3090: 24 GB memory, 936.2 GB/s bandwidth
M1: 16 GB memory max, 68 GB/s bandwidth (shared with the rest of the system)
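A quick sketch of that bound, using the numbers above and assuming each pass reads and writes every parameter once while compute is free:

```python
# Rough memory-bandwidth ceiling: how many full passes over a model's
# weights per second, assuming each pass reads and writes every byte once
# (2 GB of fp16 weights -> 4 GB of traffic) and compute costs nothing.
def max_passes_per_second(bandwidth_gb_s, model_gb, rw_factor=2):
    return bandwidth_gb_s / (model_gb * rw_factor)

print("M1  :", round(max_passes_per_second(68.0, 2.0)))    # ~17 passes/s
print("3090:", round(max_passes_per_second(936.2, 2.0)))   # ~234 passes/s
```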
The 30 TFLOPS figure is for the general-purpose cores; for the tensor cores it can be 71 TFLOPS (FP32 accumulator), 142 TFLOPS (FP16 accumulator), or 282 TFLOPS (sparse tensors), depending on what exactly you count.
That depends on how Nvidia's Tensor Core performance compares to the M1 Neural Engine's core performance, and on the type and amount of memory.
According to these notes, it performs roughly 4x as fast as an Intel MacBook running on CPU only.
1.7GHz quad-core Intel Core i7-based 13-inch MacBook Pro system with Intel Iris Plus Graphics 645, 16GB of RAM,
https://blog.tensorflow.org/2020/11/accelerating-tensorflow-...
The M1 GPU has 2.6 TFLOPS (FP32). A 1080 Ti has 11.3 TFLOPS.
The M1 GPU isn't really that powerful; it's comparable to an Nvidia GTX 760 (from 2013). The M1's Neural Engine does have more kick, of course, but the GPU otherwise is nothing superb (marketing aside).
While I agree in general, I do want to point out that this is still a lightweight, entry-level laptop SoC compared to the desktop GPUs you've mentioned. It can also run fanless, as in the M1 MacBook Air.
Even so, a 1050 Ti uses around 75 W of power (2016) and the 760 has a TDP of 170 W (2013), while the M1 GPU uses well under 10 W (at full load the Mac mini peaks at about 16 W for the entire SoC, not just the GPU).
It would be interesting to see what Apple does when it scales this up to 75 W or more for its own custom desktop GPUs, which are rumored to be in development. However, a separate desktop GPU does lose the benefits of the UMA that makes the M1 fast.
But they built their GPU on world-leading TSMC 5nm, and I wonder how much of the M1 GPU's perf/watt comes from that process advantage. I'd like to see a comparison with the Kirin 9000 (maybe AnandTech will publish one).
I mean, my entry deep learning machine was a refurbished Linux box with a pile of gpus I found in the electronics recycling bin at work. Later upgraded to a 1080 and then a 2080, as one does.
The nice thing about the desktop is that it can just train for days and I don't need to worry about using it for other things, losing time moving between locations, etc. It's also still probably cheaper than the M1 laptop, even with a nicer GPU than what you can find in the trash.
I did the same (well, used a discarded GPU that my kids didn't need any more) and I have one caution: If you're using PyTorch, you'll want to have a CPU that at least supports AVX. The C++ libraries that ship with PyTorch assume AVX at compile time and don't have an option to disable at runtime. It's a PITA to recompile the entire stack.
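For anyone going the same route, a quick way to check before you hit that wall (Linux only; it just reads the CPU flags the kernel reports):

```python
# Check whether the CPU advertises AVX/AVX2 before installing the prebuilt
# PyTorch wheels, by parsing the "flags" lines from /proc/cpuinfo (Linux).
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())

print("AVX :", "avx" in flags)
print("AVX2:", "avx2" in flags)
```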
When it comes to Apple's mobile GPUs (and now M1) their main advantage is that they have a custom architecture and compiler and they're both the result of lots of experts working hard on both and leveraging existing work (in the architecture case, the fact that they started from PowerVR's world-class design, and in the compiler case, the fact that their shader compiler is based on LLVM). They're really well-designed chips. Unfortunately there's no particular reason that they would be better at ML than any other GPU, since they had no reason to consider it in the design process until somewhat recently.
Also, besides memory speed, which will certainly be an important factor, it might depend on the model architecture, fp32/fp16, etc. Much like you cannot say CPU 1 is faster than CPU 2 (it totally depends on the type of workload), the same goes for deep learning benchmarks.
Performance in games typically doesn't have much correspondence with how good a GPU will be at compute. Rendering games, videos, etc is largely constrained by things like memory bandwidth, culling, image decoding and blending - all stuff typically done with special-purpose hardware that won't be very useful for compute or neural nets. Sometimes games or media have complex shaders that are entirely limited by compute, but it's not terribly common.
Apple's dedicated ML hardware is probably quite good, but we don't have any way to know how good without doing math on die size + power draw and running benchmarks.
The price doesn't mean much, the 1080 is pretty old at this point and the M1 has considerably smaller transistors. Considering how expensive Apple products are it's quite possible the cost of an M1 chip isn't much lower than that of the core shipped inside the 1080. A lot of what you pay for on a 1080 is cooling, display output, and power delivery.
Indeed, some of the cost of a 1080 at this point may be that the demand is constrained to people that want replacement parts for uniform deployments...
I would love to see a performance comparison with other Nvidia GPUs too. That said, I thought TFLOPS isn't a good way to gauge performance when the architectures are different? You can't even compare Nvidia Pascal and Turing meaningfully with TFLOPS (?).
Google colab is a much better tool than a new computer for getting into ML. It's free, requires no dependency management, easy to share, and has tons of example notebooks you can reproduce as easily as duplicating a google doc.
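And checking what accelerator Colab actually handed you is a couple of lines:

```python
# In a Colab notebook: confirm a GPU runtime is attached and see which device.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # empty list -> CPU-only runtime
print(tf.test.gpu_device_name())               # e.g. "/device:GPU:0"
```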
Not to mention a Linux based workstation in the limit will have fewer headaches than mac these days. Package management doesn't require homebrew or dockerized everything, selinux is surprisingly easier to configure than the Mac security subsystems, etc.
I don’t know. As someone who recently built a Linux workstation but whose computing is generally done on a Mac, I had all sorts of trouble with surprising things. Like my mouse stopped working one day ... no idea what happened, tried googling, tried asking, but nothing I did fixed it. Ended up just reinstalling the OS.
Having had a similar experience, my new strategy is to set up a server and just ssh in to do ML development. That way I don't have to worry about mouse/display/wifi drivers.
It doesn't really prove much beyond that you can probably get enough speed on an M1 to debug your training loop. That's impressive but we need to see more to see if the "only 4x slower than a colab GPU (ie at least a K80)" numbers hold up.
Was the plus sign after 11.0 dropped by HN software? “Hardware-Accelerated TensorFlow for macOS 11.0+” reads very differently without the last character.
As far as I understand, Apple's ML Compute framework cannot use the neural cores in eager mode (only in graph mode). So, the neural cores would probably only be used with TorchScript.
(The TensorFlow implementation has the same limitation, but using graph execution was traditionally more popular in TensorFlow, since it didn't initially have an eager mode.)
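For what it's worth, "graph mode" in TF 2 mostly just means wrapping the step in a tf.function (or disabling eager execution outright), so the backend sees a whole graph at once rather than op-by-op eager calls. A minimal sketch with a placeholder model:

```python
# Minimal sketch: tf.function traces the step into a graph once, then runs
# it as a graph. The tiny model here is just a placeholder for illustration.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function  # graph execution instead of eager, op-by-op execution
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((32, 4))
y = tf.random.normal((32, 10))
print(float(train_step(x, y)))
```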
It might be that no one has written the code to make it work in eager mode, but I'm trying and failing to think of a hardware reason this could be the case.
At NeurIPS (today, actually) they showed, in a short 15-minute talk, that it works in both graph and eager mode. There was no discussion, however, of any potential difference in performance.
I think that those can both be correct, and they were just being a little slippery with their language. During the talk they tried pretty hard to focus on saying their compute engine would pick the best hardware to run on... and I suppose if the GPU has terrible performance for eager mode, then the CPU would be the best.
However, I find that quite disappointing. I hope there are significant upgrades coming down the pipeline for when they start releasing more powerful Apple silicon based devices.
As an aside to this discussion: which do we think will prove to work better, using all the available transistors for a GPU, or specialising and having a separate "neural engine" and GPU? My guess is eventually the latter, as hardware specialised for each case can usually be made better at both. Is controlling the CUDA stuff going to keep Nvidia ahead, or compensate for their more generalised hardware approach?
I wonder how feasible it even is to train cutting-edge models. I am sure there are still tasks where a simple feed-forward network or RNN is all you need.
However, you can only just barely finetune the base pretrained transformer models (e.g. BERT base or XLM-R base) with 8GB of VRAM, and you need 12GB or 16GB of VRAM to finetune larger models. Given that M1 Macs are currently limited to 16GB of shared RAM, I think training competitive models is severely constrained by memory.
I guess the real fun only starts when Apple releases higher-end machines with 32 or 64GB of RAM.
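A rough back-of-the-envelope for why those numbers get tight (parameter counts are approximate, and this ignores activations and framework overhead, which add a lot on top):

```python
# Plain fp32 Adam keeps weights + gradients + two moment buffers per
# parameter, i.e. roughly 16 bytes/param, before any activations at all.
def adam_state_gb(n_params, bytes_per_param=16):
    return n_params * bytes_per_param / 1e9

for name, n_params in [("BERT base", 110e6), ("XLM-R base", 270e6), ("BERT large", 340e6)]:
    print(f"{name}: ~{adam_state_gb(n_params):.1f} GB before activations")
```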
Well unless they also support acceleration on AMD GPUs, this is not so interesting. Training on x86_64 CPU cores or integrated Intel GPUs is really slow compared to training on modern NVIDIA GPUs with Tensor Cores (or AMD GPUs with ROCm, if you can get the ROCm stack running without crashing with obscure bugs).
The M1 blows away Intel CPUs with integrated GPUs (and modern NVIDIA GPUs will probably blow away the M1 results, otherwise they'd show the competition ;)).
Is there a list of commonly used software and packages that are incompatible with M1 currently, and the efforts being done to address that (or alternatives)?
> Native hardware acceleration is supported on Macs with M1 and Intel-based Macs through Apple’s ML Compute framework.
This led me to believe that it supported hardware acceleration for the architecture (because it lists Intel too), which made it seem ambiguous whether it supports the M1's specific accelerator capabilities.
What "native hardware acceleration" is available on Intel-based macs?
ML Compute is GPU-accelerated on both Intel and M1, as well as natively supporting each CPU's respective vector instructions via the BNNS API in Accelerate.framework:
The MLCompute API supports three options "CPU", "GPU" or "any", which is GPU sometimes and CPU sometimes. Not the neural engine.
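If I remember the tensorflow_macos fork's README correctly, selecting among those options looks roughly like this (module path and function name from memory, so treat it as a sketch):

```python
# Sketch of device selection in Apple's tensorflow_macos fork, as I recall
# it from the README: 'cpu', 'gpu', or 'any' (ML Compute picks for you).
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

tf.compat.v1.disable_eager_execution()       # graph mode, as discussed above
mlcompute.set_mlc_device(device_name="any")  # or "cpu" / "gpu"
```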
There's some MLComputeANE code in the framework, but afaik there's no way to use it yet. Also, from looking at it, I believe the neural engine only supports float16.
Core ML on the other hand is designed to use trained models and runs automatically on CPU, GPU or ANE depending on what fits the currently executed model layer best: https://developer.apple.com/documentation/coreml
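For comparison, recent coremltools versions let you see and constrain that choice when loading a model; the parameter names here are from memory and version-dependent, and the path is a placeholder, so treat this as a sketch rather than the definitive API:

```python
# Sketch (recent coremltools, from memory): load a converted model and
# restrict which compute units Core ML may schedule layers onto.
# "model.mlmodel" is a placeholder path, not a real file.
import coremltools as ct

model_all = ct.models.MLModel("model.mlmodel", compute_units=ct.ComputeUnit.ALL)       # CPU/GPU/ANE
model_cpu = ct.models.MLModel("model.mlmodel", compute_units=ct.ComputeUnit.CPU_ONLY)  # CPU only
```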
What you're really asking is if this supports the neural engine?
GPU support is of course 'native' to the M1 also and seems to be the limit of the support here.
I know several data scientists (the whole team in my company and some people in others) who just use a laptop (and a Mac at that).
>Maybe I'm in my own bubble but does TF have small scale uses where you would run it on a laptop?
That, plus huge scale uses where it doesn't make sense to run on a beefy desktop even, so you just use whatever to deploy in a remote cloud/cluster/etc. A laptop means you can do it from wherever, and a laptop with good specs/battery means you can do all other stuff, for longer, with it...