This is interesting to me: lots of noise around "machine learning" on the GPU rather than graphics, which kind of validates Google's TPU, Apple's AI co-processors, and Nvidia's Jetson. I saw a system at Maker Faire that does voice recognition without the cloud (MOVI), and I keep thinking that having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.
GPUs are 1980s-style SIMD compute systems. A GPU is designed to run many, many, many threads. A Vega 64, for example, can run 163,840 threads at the same time (4096 shaders x 10-way SMT x 4 threads per unit)... while a Threadripper 2950X CPU can "only" run 32 threads at once (16 cores x 2-way SMT).
However, a GPU shader is far more constrained than a CPU thread. A GPU shader is "locked" into executing the same instruction stream as the other shaders in its group. That is to say: 64 "GPU threads" on AMD systems (a wavefront) share the same program counter / instruction pointer.
32 "GPU threads" on NVidia systems (a warp) share the same program counter.
This means that if one thread loops 10,000 times in a program, it forces the other 31 threads in its warp (NVidia) or the other 63 in its wavefront (AMD) to step through 10,000 iterations as well, even if those threads individually only need to loop 500 times.
This restriction doesn't really matter in cases like matrix multiplication, where all threads loop the same number of times. But for something like a chess minimax algorithm, this "thread divergence" makes it difficult to actually port minimax to a GPU.
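To make the cost of divergence concrete, here is a toy Python sketch (not GPU code; the warp size and loop counts are made up) of what lockstep execution implies: every lane in a warp has to sit through the iteration count of the slowest lane.

```python
import random

WARP_SIZE = 32   # NVidia-style warp; an AMD wavefront would be 64
random.seed(0)

# Each "lane" (GPU thread) needs a different number of loop iterations,
# e.g. a minimax search that prunes at a different depth in each thread.
needed = [random.randint(500, 10_000) for _ in range(WARP_SIZE)]

useful_work = sum(needed)                 # iterations the threads actually need
lockstep_work = WARP_SIZE * max(needed)   # iterations the warp really executes,
                                          # because all lanes share one program counter

print(f"useful iterations:   {useful_work}")
print(f"executed iterations: {lockstep_work}")
print(f"SIMD efficiency:     {useful_work / lockstep_work:.0%}")
```

For matrix multiplication every lane needs the same count, so efficiency stays near 100%; for divergent searches it collapses.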
-------
Machine learning... (EDIT: or more specifically, Convolutional Neural Networks) is almost purely a matrix multiplication problem, making it ideal for the SIMD GPU architecture.
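A quick way to see why a convolution layer reduces to matrix multiplication is the classic im2col trick. A minimal numpy sketch (single channel, stride 1, no padding; the function name is mine):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a single-channel image into a row,
    so the convolution becomes one matrix multiplication."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols, (oh, ow)

x = np.random.rand(8, 8)    # toy 8x8 image
k = np.random.rand(3, 3)    # 3x3 filter
cols, (oh, ow) = im2col(x, 3, 3)
out = (cols @ k.ravel()).reshape(oh, ow)   # convolution as a single matmul

# Cross-check against a direct sliding-window convolution.
ref = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(ow)]
                for i in range(oh)])
assert np.allclose(out, ref)
```

With many filters and input channels, the filter vector becomes a matrix and the whole layer is one big GEMM, which is exactly what GPUs and TPUs are built for.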
CPUs have SIMD units by the way (SSE, AVX, etc. etc.), but GPUs are specialized systems that can only do SIMD. So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit.
-------
EDIT: Google's TPU goes beyond the GPU-model (SIMD architecture) and has dedicated matrix-multiplication units. These units can only perform matrix multiplication, nothing else, and are designed to do it as quickly and power-efficiently as possible.
NVidia's "Turing" line of GPUs have dedicated matrix multiplication units called "tensor units". NVidia basically grafted a dedicated FP16 matrix multiplication unit onto their GPUs.
"So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit."
Except when memory transfer latency is a bottleneck.
I've seen many examples of CPU SIMD beating the GPU, e.g. the Pathfinder v3 text renderer.
While memory transfer latency between the CPU and GPU is an issue, GPUs have absolutely incredible throughput when moving data from one place on the GPU to another. This is why researchers are looking at GPUs for running graph algorithms.
It’s interesting that the cycle is complete and we’re back to the 1975 situation: “Wouldn’t it be great to have a personal computer that could do all these things we currently do by connecting to a mainframe?”
All modern computers, even cell phones, have the capabilities of mainframes from the 1970s and 1980s.
Back in the 1980s, a personal computer was loaded with an operating system that could only support a single user. A "mainframe" was a system that could support more than one user at a time, and you'd connect to it using telnet.
Today, your Linux PC loaded with SSH effectively functions as a mainframe. Heck, you can load virtual machines that themselves pretend to be mainframes from the 1970s. And your cell phone uses the same multi-user feature: it runs each app as a different user for maximum security.
----
In contrast: GPUs are simply 1980s-style SIMD supercomputers. It turns out the SIMD supercomputer style was the fastest way to calculate where pixels go on the screen. (Conceptually, each pixel shader basically gets to run its own thread, so a 1920x1080 screen has 2,073,600 threads to run per frame.)
> GPUs are simply the 1980s style SIMD super-computers
May be a terminology thing, or a technology I don't know, but I thought '80s supercomputers like Crays were vector processors, whereas I'm not aware of any commercial super at the time being SIMD (AKA an array processor).
It seems to be a terminology confusion of mine. I did a bit of reading and it seems they're effectively the same thing implemented differently. An array processor in my mind was a lockstep SIMD thingy.
A vector processor as I understood it is nicely described here <https://www.quora.com/What-is-meant-by-an-array-processor-an... as "the earliest Crays had vector instructions, that quickly fed a stream of operands through a pipelined processor", which is not exactly the same implementation as SIMD, but after reflecting on your question, it does the same thing in the end.
So effectively no difference.
Also relevant <https://arstechnica.com/civis/viewtopic.php?t=401649> see reply of Accs who claims to have worked at cray, and also says in another post on the same thread "I think that Vector is nothing more than a special case of SIMD".
It is a wave, however, rather than a circle. There are oscillations, back and forth, between all sorts of things. It is inevitable because the competing ideas and systems are oscillating, all the way down. Nature is not perfectly in sync, and that is indeed what makes things interesting.
Usually the most apt metaphor is a spiral. We’ve gone round in a big circle, it feels like we’re back in the same place, except everything is completely different due to movement in a dimension which we didn’t even perceive.
And, well, you know, “words”... whatever works :-D
Is privacy the only driver? I think convenience, speed, and cost are still big concerns, just like they were back in the day.
Network requests are always very slow when compared to processor speed. You can buy a device for 1/500 the cost of a server, put it in your closet, and it will be faster at almost every task than offloading to the "cloud".
I work tangentially to the telecom space, and all anyone cares about is the "edge", i.e. Customer Premises Equipment and the like. With how cheap compute power is and how high network latency is, having centralized datacenters has become less and less desirable.
The recent ML-focused chips are really just high efficiency, low precision matrix multipliers. It turns out that operation can be made ~10x more efficient than previous processor designs, and is the bottleneck in inference for modern neural network models. Specialized hardware can also accelerate the training somewhat.
So having this hardware in your phone means that, for example, face recognition can run faster, and scene analysis during video recording takes less battery.
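As a rough illustration of the "low precision" part, here is a numpy sketch of symmetric int8 quantization around a matrix-vector product (a simplified version of what int8 inference does; shapes and scales are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)   # float32 weights
x = rng.standard_normal(256).astype(np.float32)          # input activations

# Symmetric per-tensor quantization: map floats to int8 with a single scale each.
w_scale = np.abs(W).max() / 127.0
x_scale = np.abs(x).max() / 127.0
W_q = np.round(W / w_scale).astype(np.int8)
x_q = np.round(x / x_scale).astype(np.int8)

# Integer matmul (accumulate in int32, as int8 hardware does), then rescale.
y_int8 = (W_q.astype(np.int32) @ x_q.astype(np.int32)) * (w_scale * x_scale)
y_fp32 = W @ x

rel_err = np.abs(y_int8 - y_fp32).max() / np.abs(y_fp32).max()
print(f"max relative error from int8 quantization: {rel_err:.3%}")
```

The int8 path is where these accelerators get most of their efficiency, and models usually tolerate the small error.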
But neural network inference speed is rarely the issue that forces both running models on servers and collecting user data. In the case of speech recognition, the problem is that training state-of-the-art models requires tens of thousands of hours of speech data. A recent Amazon paper actually used a million hours. For language models you need even more. Modern phones actually can perform speech recognition locally (put your phone in airplane mode, dictation still works), but it's using models trained on data from users who did talk to the server.
What I always wonder about is that AI requires data to learn from. When Google sends everything to their servers, they have data from all over the world, so they can build better AI. If their competitors implement privacy-respecting solutions, they are doomed to be behind Google, because they don't have the data to train their AI. My personal anecdote: my spoken English is quite bad and Apple's Siri does not understand me well. But Google Assistant does wonders; it's fascinating how well its speech recognition works.
I have a feeling that AI technology is inherently anti-privacy.
> I saw a system at Maker Faire that does voice recognition without the cloud (MOVI), and I keep thinking that having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.
Yes, powerful ARM embedded platforms are making this more and more possible and I'm at least excited about it.
Not only that, speech recognition and speech synthesis technologies are getting more efficient by the month. There was a paper just this past week (https://arxiv.org/pdf/1905.09263.pdf) which demonstrates speech synthesis with an order of magnitude faster performance. Advances like those make it easier to cram these technologies on local IoT platforms.
And there's certainly been a number of local focused AI projects over the past few years. I remember Snips.ai being decent for building local-only voice assistants. It worked on a Raspberry Pi, though you still have to hack together decent microphones for it.
All of this gives me hope for an open source voice assistant that will be competitive with the commercial options.
Now if the makers of smart accessories (lights, etc) would stop being anti-consumer for two minutes we might get decent accessories that don't require a cloud.
How much data would it require for a real-world voice "model" that is locally on a device, space-wise? I don't know much about machine learning's actual implementations within software, I've always been curious how you go from (say) Tensorflow -> embedded within real world software / APIs.
Google Translate's offline files seem to be about 50MB per language. Assume your speech recognition machinery breaks utterances down into phonemes, an alphabet a couple of times the size of the letter alphabet, plus some more entropy... My intuition is you should be in the 200-500MB range.
Disclaimer: I work in Google, but apparently have never touched any of this from the inside.
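On the "Tensorflow -> embedded" part of the question: the usual route today is to convert the trained model to TensorFlow Lite and ship the flatbuffer with the app. A hedged sketch using TF 2.x Keras APIs (the model and sizes are illustrative, not a real speech model):

```python
import tensorflow as tf

# Stand-in model; a real on-device acoustic model would be a recurrent/convolutional
# network with tens of millions of parameters.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(80,)),          # e.g. 80 mel-filterbank features
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(512),
])

# Back-of-envelope size: parameter count times bytes per weight.
params = model.count_params()
print(f"{params:,} params -> ~{params * 4 / 1e6:.1f} MB at fp32, "
      f"~{params / 1e6:.1f} MB at int8")

# Convert to a TFLite flatbuffer (with default weight quantization) for on-device use.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```

The same parameter-count arithmetic is where the 50MB-per-language and 200-500MB estimates above come from.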
And it's very accurate. I tested it with English and Greek, for which dictation systems are usually very bad, and was surprised how well it recognised every word.
I would hope at least chip designers/manufacturers will stop referring to chips designed and marketed specifically for AI/ML as "G"PUs, when they aren't really about graphics processing anymore.
Yes, and hopefully that will stop the problems I get with video drivers when installing a GPU which I just use for raw computation. Also, my PC has 10 video outputs now, which I think is insane!
As are AMD's newest announcements of Navi & Intel-smacking performance- they could probably lean over this way and jostle things around quite a bit if they wanted to.
I can't help but feel somewhat pessimistic about technological developments in the handheld device domain. I say this as a supporter of ARM though. I have no reason to dislike ARM, I'm just somewhat apprehensive regarding any more empowerment of the companies we keep in our pockets.
Talking to a computer is an interface like any other (mouse/keyboard).
There's no fundamental reason why talking to a computer would be inherently privacy invading, it just happens that most implementations today are, for technical/performance reasons.
Computing technology is more egalitarian today than it ever was. Unless you're gaming, doing software development or serious science-related work, you can get away with something baseline and dirt-cheap, and it will work just fine. Maybe slightly less so in mobile but still, usable compute really isn't that niche even in that space.
True, it's more egalitarian today than it was 10, 20, 30 years ago etc., but the trend I'm talking about has only just begun (perhaps 2 years ago, barely an upgrade cycle). Where will it be in 10 years?
What market forces stop vendors selling small quantities of far better tech at far better profits? Price discrimination is fundamental in marketing: different people will (can) pay different prices. An extreme example: I believe the military gets more advanced process nodes, when there are only small runs, before mass production.
To me "you get what you pay for" implies a linear relationship between cost and speed, and that has never really been the case. Sometimes you pay a lot more money for a little more speed.
There has always been a nonlinear ramp up for the latest and greatest - typically targeted to enterprise applications that are more time sensitive than cost sensitive.
If you are cringing at the idea of a 3k+ cpu (or 9k+ gpu), it quite literally wasn't built for you.
No, there was just a new model of processor and the cheaper variant of it was faster than the previous model's cheaper variant. Things haven't changed in that regard.
This, 1000%. Especially floating point arithmetic. Compilers aren't that great at vectorization (or more accurately, people aren't that great at writing algorithms that can be vectorized), and scalar operations on x86_64 have the same latency as their vectorized counterparts. When you account for how smart the pipelining/out-of-order engines on x86_64 CPUs are, even with additional/redundant arithmetic you can achieve >4x throughput for the same algorithm.
Audio is one of the big areas where we can see huge gains, and I think philosophies about optimization are changing. That said, there are myths out there like "recursive filters can't be vectorized" that need to be dispelled.
It has gotten very tricky on x86 since using the really wide SIMD instructions can more than halve your clock speed for ALL instructions running on that particular core.
iirc that's only with some processors using AVX2/512 instructions, which are still faster than scalar (depending on what you're doing around the math).
Things being tricky with AVX and cache-weirdness doesn't change the fact that if you're not vectorizing your arithmetic you're losing performance.
There's also an argument to be made that if you're writing performance-critical code you shouldn't optimize it for mid/bottom-end hardware, but not everyone agrees.
Was about to write the same comment. Plus, I must say, every time I hear that some function is going to be performed via a NN, I think my device is becoming more and more random in the way it behaves. Which I really don't like.
At least when some classical algorithm fails at performing certain task, people talk of the failure as a bug. Not as something that's "probably going to be solved with more training data".
Recent Snapchat and Instagram selfie filters
Google Keyboard's translation and prediction
Google Translate's camera (Lens) translation
Google Assistant's voice recognition
Google Message's instant replies
Almost all of them are inference workloads. I believe only Google Keyboard does on-device training in the background when the phone is charging.
Applying an existing, trained model (like what you described) is computationally cheap. Training new models takes many, many orders of magnitude more time and memory space to complete.
Much cheaper than training, for sure. But inference for image classification is generally on the order of hundreds of milliseconds when using a CPU.
That won't fly for real-time inference on video, for example, which is very desirable even for "image" uses, because it allows for live preview in the camera app.
So it turns out the hardware needed to run a pretrained model is pretty much the same as the hardware needed to train a model. In both cases, it means lots of matrix multiplication.
Of course, training a model takes longer given the same amount of processing power - but for applications like video processing, just applying the model can be pretty demanding.
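One way to see why the same hardware serves both: a single gradient-descent step on a linear layer is itself just a couple more matrix multiplications on the same shapes. A toy numpy sketch (sizes invented):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 784)) * 0.01   # model weights
X = rng.standard_normal((64, 784))          # one mini-batch of inputs
Y = rng.standard_normal((64, 10))           # targets

# Inference: one matmul (per layer).
pred = X @ W.T

# One training step for a squared-error loss: the gradient is just more matmuls.
err = pred - Y
grad = err.T @ X / len(X)
W -= 0.01 * grad
```

The difference is volume: training repeats this over the whole dataset for many epochs, while inference runs the forward pass once per input.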
Determining whether or not nearby ambient sounds indicate potentially illegal activity. Or, perhaps, mapping the content and emotional valence of conversations to understand whether or not you deserve a lower social credit score.
Training a model does require massive data and compute, but evaluating/using an already-trained model (e.g. running a Hot Dog/Not Hot Dog classifier) can be done on mobile hardware. Accelerating this could, for instance, allow it to run in real time on a video feed.
Aside from the runtime difference between training and inference, having on-device ML makes a lot of sense for other reasons.
There can be more guarantees over data privacy, since your data can stay on-device. It also reduces bandwidth as there's no need to upload data for classification to the cloud. And that also may mean it's faster, potentially real time, since you don't have that round trip latency.
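For a sense of what "stays on-device" looks like in practice, here is a hedged sketch of running an already-converted TFLite model locally with the Python interpreter API (the file name and input are placeholders):

```python
import numpy as np
import tensorflow as tf

# Load a model that was trained in the cloud but shipped with the app.
interpreter = tf.lite.Interpreter(model_path="classifier.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Fake input with whatever shape/dtype the model expects.
frame = np.zeros(inp["shape"], dtype=inp["dtype"])

interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()                       # no network round trip involved
scores = interpreter.get_tensor(out["index"])
print(scores)
```

Nothing leaves the device, and the latency is bounded by local compute rather than the network.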
This is not necessarily for phones. Lots of (virtually all?) low power IoT devices have ARM cores. There are plenty of environments where the cloud or compute power isn't available.
Machine learning is like any learning: there's the learning stage and the putting it into practice. ML in the cloud is like researchers coming up with a new way to slice bread; ML on your phone is like your local baker following the instructions in a new cookbook. You still need a skilled baker to follow the instructions.
It doesn't have to be for training the models. You can run versions of trained models locally on your phone.
Having a dedicated chip will allow for snappier calculation of those fancy Snapchat filters, language translation, image recognition, etc.
The way neural networks work is that each perceptron computes a linear equation of the form w1*x1 + w2*x2 + ... + wn*xn. Even if you train the model somewhere else, you still need to hold the weights locally in each perceptron in order to evaluate the model.
This requires hardware in which you can multiply these huge matrices quickly, even if the weights are downloaded from the cloud.
E.g. accurate edge detection is better done with ML-based mathematics, and with accompanying optimized transistor layouts this would enable better augmented-reality graphics, from things like Snapchat selfies to visualizing free-space CAD.
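In numpy terms, evaluating such a network with weights downloaded from the cloud is just a chain of matrix multiplications plus an activation. A minimal sketch (layer sizes invented; real weights would come from the trained model, not a random generator):

```python
import numpy as np

# W1/b1, W2/b2 stand in for weights trained in the cloud and stored on the device.
rng = np.random.default_rng(42)
W1, b1 = rng.standard_normal((128, 784)), rng.standard_normal(128)
W2, b2 = rng.standard_normal((10, 128)), rng.standard_normal(10)

def forward(x):
    """Each layer computes w1*x1 + w2*x2 + ... + wn*xn (+ bias) for every perceptron,
    i.e. one matrix-vector multiplication per layer."""
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2                 # output scores

x = rng.standard_normal(784)           # e.g. a flattened 28x28 image
print(forward(x))
```

The accelerator's job is simply to make those `W @ x` products fast and cheap.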
How does this NNPU compare to those from other vendors (Google TPU (~3 TOPS?), Intel NCS2 (1 TOPS), Kendryte RISC-V (0.5 TOPS), Nvidia Jetson (4 TOPS))?
Can you use tensorflow networks out of the box like the others provide?
I think it is worth pointing out that this new CPU, possibly landing in flagship Android phones in late 2019 or early 2020, would still only be equal to or slower than the Apple A10 used in the iPhone 8.
Assuming Apple continues as it has in the past, dropping the iPhone 7 and moving the iPhone 8 down to its price range, they would have an entry-level iPhone that is faster than 95% of all Android smartphones on the market.
What even is an "AI chip"? What's the difference from a GPU? As long as nobody can explain that, I have big doubts that it would be more than a GPU plus marketing. So no big deal if they don't provide one.
NN accelerators have very simple compute units (fused multiply-add plus look-up-table activation functions) but can run many of them in one cycle.
FTA: Because general-purpose processors such as CPUs and GPUs must provide good performance across a wide range of applications, they have evolved myriad sophisticated, performance-oriented mechanisms. As a side effect, the behavior of those processors can be difficult to predict, which makes it hard to guarantee a certain latency limit on neural network inference. In contrast, TPU design is strictly minimal and deterministic as it has to run only one task at a time: neural network prediction. You can see its simplicity in the floor plan of the TPU die.
Modern GPUs are extremely programmable, but this flexibility isn't that heavily used by neural networks. NN inference is pretty much just a huge amount of matrix multiplication.
Especially for mobile applications (most Arm customers), you pay extra energy for all that pipeline flexibility that isn't being used. A dedicated chip will save a bunch of power.
Neural network learning and inference primarily use matrix multiplication and addition, usually at lower bit depths. You can do this on GPUs to great success with massive parallelism, but the GPU is more generalized than this, so it takes more silicon and more power. With a TPU/neural processor you optimize the silicon for a very, very specific problem: generally multiplying large matrices and then adding the result to another matrix. On a GPU we decompose this into a large number of scalar calculations, and it parallelizes massively and does a good job; but a TPU is fed the matrices and that's all it's made to do, with so much silicon dedicated to matrix operations that it often finishes in a single cycle.
Another comment mentioned cores, and I don't think that's a good way of looking at it, as in most ways a TPU is back to very "few" but hyper-specialized "cores". There is essentially no parallelism in the programming model of a TPU or neural processor -- you feed it three matrices and it gives you the result. You move on to the next one.
An AI chip would essentially just be a chip that can do matrix multiplication very quickly, plus some addition. Each perceptron (neuron) is just fitting a linear equation, so if we had a chip that could support millions of perceptrons all fitting linear equations (i.e. with matrix multiplication), then it would be a huge win compared to a GPU, which is more general and less efficient for this specific task than dedicated silicon.
Even when you fight that annoyance with a content blocker, the page itself is aggressive to no end.
I scrolled down to see how long the article is. That somehow also triggered a redirect to the root.
How is it possible that the most user hostile news sites often get the most visibility? Even here on HN which is so user friendly. What are the mechanics behind this?
The url should be changed to a more user friendly news source. How about one of these?
And to vote on mobile I have to zoom in because the buttons are way too close to each other and give no feedback about which of the two was pressed (and there's no way to find out after the fact).
One tip: when you upvote or downvote, a link is added to the comment's header. It will either be "unvote" or "undown" depending on whether you up- or downvoted the comment.
If you click the "web" link at the top of the page, it brings you to a search for the post title. You should be able to find a number of alternate sources there
And when you open the page every time it wants to go through all the cookie and partner options instead of just saying "yup, we remember you deactivated everything. [change settings] [continue reading]"
You don't deactivate all cookies as such, you deactivate personalised tracking cookies.
So the site shouldn't place a cookie 'session_id=12345678', but a cookie 'techcrunch_tracking_enabled=False' that's not linked to any particular user would be just fine.
And you don't think they really stop tracking you just because you said you didn't like it, right? What would you even do against it? Sue each and every website owner on the planet somewhere in the EU where they might not even care?
The circle with the X is weird design, because it looks like a button that's supposed to close something, but as long as you don't click it, the article is perfectly readable. You just need to resist your reflexes to click any X in sight.
Hmm, I feel ARM performance is a bit like nuclear fusion: it's always the next generation that will deliver an order-of-magnitude performance increase. Yet somehow ARM single-core performance is still shit compared to x86. (No matter how much I hope and pray for that to change, because x86 needs to die.)
I'm afraid you're simply wrong here. Single-core performance of the (now sadly deceased) Qualcomm Amberwing was incredible, easily Xeon-like, and it had 46 cores per socket. The Cavium ThunderX2 also has great single-core performance coupled with many more cores than you can get on an Intel chip. Apple's iPhone cores are also supposed to be good. (Edit: these are ARM cores, but not ARM-as-in-the-company designs.)
Interesting. I was mostly thinking about ARM's own core designs, but never _really_ looked into the broader ecosystem.
From my personal experience, Apple chips are fast, but I can't really compare that to anything else. JS benchmarks match desktop performance, but that is not really what I was/am interested in.
I've tried some 'cloud ARM-based metal servers' which promised basically similar performance to Intel Atom CPUs, but they felt at least 10 times slower. So I gave up on the whole concept.
But I guess the take-away is that those specific systems were slow, not so much ARM in general. I mean, it does make sense that there is some scale difference between a Xeon core and a core meant for mobile use, now that I actually think about it ;)
I benchmarked a few image processing algorithms (OpenCV) on my iPhone XS. I was surprised that the processing was faster than on my quad-core i7 MacBook Pro 2012. Sure, the CPU is 6 generations behind, but there haven't been many performance improvements in the last 7 years on x64. Probably around +~50% single-thread.
Image processing is probably offloaded to some sort of custom silicon. If that is the case, then the generational performance uplift would be massive.
Consider this:
Decoding a big fat x265-encoded (H.265/HEVC) file on your 2012 MBP would be atrocious because it doesn't have hardware HEVC decode, but decoding it on your XS would probably be fine.
That could be true for some algorithms, but I have investigated the source code as well, and on ARM it uses NEON instructions if available (NEON being the equivalent of SSE and AVX2 on x86 anyway).
The result is the same with my own crafted image processing algorithms using single core. Apple makes really fast CPUs.
Not that much. Each generation brought maybe a 7% perf improvement, so over 6 years: 1.07^6 ≈ 1.50. Improvement mostly came from adding more cores.
This ~1.5x single-thread performance improvement matches the stats from the geekbench.com benchmark:
my MacBook Pro 2012 CPU: i7-3615QM => 3093 single-thread [1]
equivalent current-gen CPU with the same TDP (45W) and similar clock (2.3GHz): i7-8750H => 4617 single-thread [2]
Before or after meltdown, spectre, l1tf, foreshadow, zombieload, ridl, and fallout patches? I would honestly guess that these new round have cost a huge amount of progress. I certainly won't be applying the latest three; I spent too much on a nice machine to have it perform as though it were four years old.
Is this true for comparable TDP processors? Speaking as the owner of one of the few x86 Android phones, the Intel performance isn't better on small devices.
An iPad Pro's single-threaded geekbench performance is better than the 8th-gen i7 in my laptop, according to geekbench. Multithreaded is better on my laptop. This is comparing the iPad to just the CPU; my actual model is almost certainly better (particularly if you consider thermals). This comparison is also with Apple's additional (not inconsiderable) design work on top. I also sincerely doubt an iPad could sustain this. But the point is the same: an a12x macbook pro could be a serious contender.
Tensorflow has the XLA compiler for targeting accelerators. In practice the integration isn't necessarily amazing right now, but the plan is to make it much better and able to generalize more easily.
The OpenCL TensorFlow issue is still open and no Google dev has shown interest.
Still, I'm curious about your progress on this and when it could target AMD / Intel / ARM.
The day Apple releases their first ARM-powered laptop will be a turning point. This comment was written on an iPad, the best product to come out of Apple to this day.
While I don't agree with the first part, I do agree that the iPad is unique and worthy being the only Apple device my hard earned money has ever gone towards.
I have heard nothing but universal praise for the Pro model. But no point in upgrading just to upgrade - the beautiful thing about Apple products is that they seem to work forever.