From the numbers it looks like performance is similar to TensorFlow-DirectML from Microsoft. Which, unfortunately, also means it is ~6 times slower than CUDA on the same hardware.
Evidence so far would suggest that it's closed source.
The archive downloaded by the installer script does contain some source code, but it's mostly "generic" TensorFlow code, with some Python stubs that call off to native libraries (as you'd expect). It seems like all of the ML Compute stuff is contained within pre-compiled libraries (with some header files provided), but no source code.
I could be wrong here, and it might be that the intention is to open source the ML Compute components, but I don't think that's been done yet.
Weeks later and we still don't know anything. How do the M1's dedicated ML cores (the so-called "AI engine" / Neural Engine, not the M1's GPU!) compare to current consumer Nvidia Tensor Cores, e.g. those in the 3080 or 3090 (not Nvidia's general GPU cores, and please, no CPUs and no random Intel GPUs!) for model training (not inference!)?
If there's someone who has an educated guess of the performance difference between...
Apple claims 11 trillion OPS (11 TOPS), NVIDIA claims 285 TOPS; my educated guess is that real-life performance is off by the same factor on both, leaving the 3090 with a ~25x advantage over the M1 (285 / 11 ≈ 26).
For this kind of computation the real bottleneck is memory, both size and bandwidth. You have to read the data and write it back, so you simply cannot go faster than your memory bandwidth.
For example: if you have a 2 GB model (1G float16 parameters) that does not fit in any cache, and you have to touch all of the data, then a single pass over it requires 4 GB of bandwidth (read + write). The M1 can do that ~17 times/s (assuming zero compute, so an impossible upper bound); the 3090 ~233 times/s (also not realistic). So "the 3090 is ~10x the M1" is a lower estimate.
3090: 24 GB memory, 936.2 GB/s bandwidth
M1: 16 GB memory max, 68 GB/s bandwidth (shared with the rest of the system)
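A quick sketch of that bound, using the numbers above and assuming each pass reads and writes every parameter once while compute is free:

```python
# Rough memory-bandwidth ceiling: how many full passes over a model's
# weights per second, assuming each pass reads and writes every byte once
# (2 GB of fp16 weights -> 4 GB of traffic) and compute costs nothing.
def max_passes_per_second(bandwidth_gb_s, model_gb, rw_factor=2):
    return bandwidth_gb_s / (model_gb * rw_factor)

print("M1  :", round(max_passes_per_second(68.0, 2.0)))    # ~17 passes/s
print("3090:", round(max_passes_per_second(936.2, 2.0)))   # ~234 passes/s
```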
The 30 TFLOPS figure is for the general-purpose cores; for the tensor cores it can be 71 TFLOPS (FP32 accumulator), 142 TFLOPS (FP16 accumulator), or 282 TFLOPS (sparse tensors), depending on what exactly you count.
That depends on how Nvidia's Tensor Core performance compares to the M1 Neural Engine's core performance, and on the type and amount of memory.
According to these notes, it performs roughly 4x as fast as an Intel MacBook running on CPU only.
1.7GHz quad-core Intel Core i7-based 13-inch MacBook Pro system with Intel Iris Plus Graphics 645, 16GB of RAM,
https://blog.tensorflow.org/2020/11/accelerating-tensorflow-...
The M1 GPU has 2.6 TFLOPS (FP32). A 1080 Ti has 11.3 TFLOPS.
The M1 GPU isn't really that powerful; it's comparable to an Nvidia GTX 760 (from 2013). The M1's Neural Engine does have more kick, of course, but the GPU otherwise is nothing superb (marketing aside).
While I agree in general, I do want to point out that this is still a lightweight, entry-level laptop SoC compared to the desktop GPUs you've mentioned. It can also run fanless, as in the M1 MacBook Air.
Even so, a 1050 Ti uses around 75 W of power (2016) and the 760 has a TDP of 170 W (2013), while the M1 GPU uses well under 10 W (at full load the Mac mini peaks at about 16 W for the entire SoC, not just the GPU).
It would be interesting to see what Apple does when it scales this up to 75 W or more for its own custom desktop GPUs, which are rumored to be in development. However, a separate desktop GPU does lose the benefits of the UMA that makes the M1 fast.
But they built their GPU on world-leading TSMC 5nm, and I wonder how much of the M1 GPU's perf/watt comes from that process advantage. I'd like to see a comparison with the Kirin 9000 (maybe AnandTech will publish one).
I mean, my entry deep learning machine was a refurbished Linux box with a pile of gpus I found in the electronics recycling bin at work. Later upgraded to a 1080 and then a 2080, as one does.
The nice thing about the desktop is that it can just train for days and I don't need to worry about using it for other things, losing time moving between locations, etc. It's also still probably cheaper than the M1 laptop, even with a nicer GPU than what you can find in the trash.
I did the same (well, used a discarded GPU that my kids didn't need any more) and I have one caution: If you're using PyTorch, you'll want to have a CPU that at least supports AVX. The C++ libraries that ship with PyTorch assume AVX at compile time and don't have an option to disable at runtime. It's a PITA to recompile the entire stack.
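For anyone going the same route, a quick way to check before you hit that wall (Linux only; it just reads the CPU flags the kernel reports):

```python
# Check whether the CPU advertises AVX/AVX2 before installing the prebuilt
# PyTorch wheels, by parsing the "flags" lines from /proc/cpuinfo (Linux).
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())

print("AVX :", "avx" in flags)
print("AVX2:", "avx2" in flags)
```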
When it comes to Apple's mobile GPUs (and now M1) their main advantage is that they have a custom architecture and compiler and they're both the result of lots of experts working hard on both and leveraging existing work (in the architecture case, the fact that they started from PowerVR's world-class design, and in the compiler case, the fact that their shader compiler is based on LLVM). They're really well-designed chips. Unfortunately there's no particular reason that they would be better at ML than any other GPU, since they had no reason to consider it in the design process until somewhat recently.
Also, besides memory speed, which will certainly be an important factor, it might depend on the model architecture, fp32/fp16, etc. Much like you cannot say CPU 1 is faster than CPU 2 (it totally depends on the type of workload), the same goes for deep learning benchmarks.
Performance in games typically doesn't have much correspondence with how good a GPU will be at compute. Rendering games, videos, etc is largely constrained by things like memory bandwidth, culling, image decoding and blending - all stuff typically done with special-purpose hardware that won't be very useful for compute or neural nets. Sometimes games or media have complex shaders that are entirely limited by compute, but it's not terribly common.
Apple's dedicated ML hardware is probably quite good, but we don't have any way to know how good without doing math on die size + power draw and running benchmarks.
The price doesn't mean much, the 1080 is pretty old at this point and the M1 has considerably smaller transistors. Considering how expensive Apple products are it's quite possible the cost of an M1 chip isn't much lower than that of the core shipped inside the 1080. A lot of what you pay for on a 1080 is cooling, display output, and power delivery.
Indeed, some of the cost of a 1080 at this point may be that the demand is constrained to people that want replacement parts for uniform deployments...
I would love to see a performance comparison with other Nvidia GPUs too. That said, I thought TFLOPS isn't a good way to gauge performance when the architectures are different? You can't even compare Nvidia Pascal and Turing meaningfully with TFLOPS (?).
Google colab is a much better tool than a new computer for getting into ML. It's free, requires no dependency management, easy to share, and has tons of example notebooks you can reproduce as easily as duplicating a google doc.
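And checking what accelerator Colab actually handed you is a couple of lines:

```python
# In a Colab notebook: confirm a GPU runtime is attached and see which device.
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))  # empty list -> CPU-only runtime
print(tf.test.gpu_device_name())               # e.g. "/device:GPU:0"
```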
Not to mention a Linux based workstation in the limit will have fewer headaches than mac these days. Package management doesn't require homebrew or dockerized everything, selinux is surprisingly easier to configure than the Mac security subsystems, etc.
I don’t know. As someone who recently built a Linux workstation but whose computing is generally done on a Mac, I had all sorts of trouble with surprising things. Like my mouse stopped working one day ... no idea what happened, tried googling, tried asking, but nothing I did fixed it. Ended up just reinstalling the OS.
Having had a similar experience, my new strategy is to set up a server and just ssh in to do ML development. That way I don't have to worry about mouse/display/wifi drivers.
It doesn't really prove much beyond that you can probably get enough speed on an M1 to debug your training loop. That's impressive but we need to see more to see if the "only 4x slower than a colab GPU (ie at least a K80)" numbers hold up.
Was the plus sign after 11.0 dropped by HN software? “Hardware-Accelerated TensorFlow for macOS 11.0+” reads very differently without the last character.
As far as I understand, Apple's ML Compute framework cannot use the neural cores in eager mode (only in graph mode). So, the neural cores would probably only be used with TorchScript.
(The TensorFlow implementation has the same limitation, but using graph execution was traditionally more popular in TensorFlow, since it didn't initially have an eager mode.)
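For what it's worth, "graph mode" in TF 2 mostly just means wrapping the step in a tf.function (or disabling eager execution outright), so the backend sees a whole graph at once rather than op-by-op eager calls. A minimal sketch with a placeholder model:

```python
# Minimal sketch: tf.function traces the step into a graph once, then runs
# it as a graph. The tiny model here is just a placeholder for illustration.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function  # graph execution instead of eager, op-by-op execution
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal((32, 4))
y = tf.random.normal((32, 10))
print(float(train_step(x, y)))
```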
It might be that no one has written the code to make it work in eager mode, but I'm trying and failing to think of a hardware reason this could be the case.
At NeurIPS (today, actually) they showed, in a short 15-minute talk, that it works in both graph and eager mode. There was no discussion, however, of any potential difference in performance.
I think that those can both be correct, and they were just being a little slippery with their language. During the talk they tried pretty hard to focus on saying their compute engine would pick the best hardware to run on... and I suppose if the GPU has terrible performance for eager mode, then the CPU would be the best.
However, I find that quite disappointing. I hope there are significant upgrades coming down the pipeline for when they start releasing more powerful Apple silicon based devices.
As an aside to this discussion: which do we think will prove to work better, using all the available transistors for a GPU, or specialising and having a separate "neural engine" and GPU? My guess is eventually the latter, as hardware specialised for each case can usually be made better at both. Is controlling the CUDA stuff going to keep Nvidia ahead, or compensate for their more generalised hardware approach?
I wonder how feasible it even is to train cutting-edge models. I am sure there are still tasks where a simple feed-forward network or RNN is all you need.
However, you can only just barely finetune the base pretrained transformer models (e.g. BERT base or XLM-R base) with 8GB of VRAM, and you need 12GB or 16GB of VRAM to finetune larger models. Given that M1 Macs are currently limited to 16GB of shared RAM, I think training competitive models is severely constrained by memory.
I guess the real fun only starts when Apple releases higher-end machines with 32 or 64GB of RAM.
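A rough back-of-the-envelope for why those numbers get tight (parameter counts are approximate, and this ignores activations and framework overhead, which add a lot on top):

```python
# Plain fp32 Adam keeps weights + gradients + two moment buffers per
# parameter, i.e. roughly 16 bytes/param, before any activations at all.
def adam_state_gb(n_params, bytes_per_param=16):
    return n_params * bytes_per_param / 1e9

for name, n_params in [("BERT base", 110e6), ("XLM-R base", 270e6), ("BERT large", 340e6)]:
    print(f"{name}: ~{adam_state_gb(n_params):.1f} GB before activations")
```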
Well unless they also support acceleration on AMD GPUs, this is not so interesting. Training on x86_64 CPU cores or integrated Intel GPUs is really slow compared to training on modern NVIDIA GPUs with Tensor Cores (or AMD GPUs with ROCm, if you can get the ROCm stack running without crashing with obscure bugs).
The M1 blows away Intel CPUs with integrated GPUs (and modern NVIDIA GPUs will probably blow away the M1 results, otherwise they'd show the competition ;)).
Is there a list of commonly used software and packages that are incompatible with M1 currently, and the efforts being done to address that (or alternatives)?
> Native hardware acceleration is supported on Macs with M1 and Intel-based Macs through Apple’s ML Compute framework.
This led me to believe that it supported hardware acceleration for the architecture (because it lists Intel too), which made it seem ambiguous whether it supports the M1's specific accelerator capabilities.
What "native hardware acceleration" is available on Intel-based macs?
ML Compute is GPU-accelerated on both Intel and M1, as well as natively supporting each CPU's respective vector instructions via the BNNS API in Accelerate.framework:
The MLCompute API supports three options "CPU", "GPU" or "any", which is GPU sometimes and CPU sometimes. Not the neural engine.
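If I remember the tensorflow_macos fork's README correctly, selecting among those options looks roughly like this (module path and function name from memory, so treat it as a sketch):

```python
# Sketch of device selection in Apple's tensorflow_macos fork, as I recall
# it from the README: 'cpu', 'gpu', or 'any' (ML Compute picks for you).
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

tf.compat.v1.disable_eager_execution()       # graph mode, as discussed above
mlcompute.set_mlc_device(device_name="any")  # or "cpu" / "gpu"
```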
There's some MLComputeANE code in the framework, but afaik there's no way to use it yet. Also, from looking at it, I believe the neural engine only supports float16.
Core ML on the other hand is designed to use trained models and runs automatically on CPU, GPU or ANE depending on what fits the currently executed model layer best: https://developer.apple.com/documentation/coreml
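For comparison, recent coremltools versions let you see and constrain that choice when loading a model; the parameter names here are from memory and version-dependent, and the path is a placeholder, so treat this as a sketch rather than the definitive API:

```python
# Sketch (recent coremltools, from memory): load a converted model and
# restrict which compute units Core ML may schedule layers onto.
# "model.mlmodel" is a placeholder path, not a real file.
import coremltools as ct

model_all = ct.models.MLModel("model.mlmodel", compute_units=ct.ComputeUnit.ALL)       # CPU/GPU/ANE
model_cpu = ct.models.MLModel("model.mlmodel", compute_units=ct.ComputeUnit.CPU_ONLY)  # CPU only
```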
What you're really asking is if this supports the neural engine?
GPU support is of course 'native' to the M1 also and seems to be the limit of the support here.
I know several data scientists (the whole team in my company and some people in others) who just use a laptop (and a Mac at that).
>Maybe I'm in my own bubble but does TF have small scale uses where you would run it on a laptop?
That, plus huge scale uses where it doesn't make sense to run on a beefy desktop even, so you just use whatever to deploy in a remote cloud/cluster/etc. A laptop means you can do it from wherever, and a laptop with good specs/battery means you can do all other stuff, for longer, with it...