This is interesting to me: lots of noise around "machine learning" on the GPU rather than graphics, which kind of validates Google's TPU, Apple's AI co-processors, and Nvidia's Jetson. I saw a system at Maker Faire that does voice recognition without the cloud (MOVI), and I keep thinking that having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.
GPUs are 1980s-style SIMD compute systems. A GPU is designed to run many, many, many threads. A Vega 64, for example, can run 163,840 threads at the same time (4096 shaders x 10-way SMT x 4 threads per unit)... while a Threadripper 2950X CPU can "only" run 32 threads at once (16 cores x 2-way SMT).
However, a GPU shader is far more constrained than a CPU thread. A GPU shader is "locked" into executing the same instruction stream as the other shaders in its group. That is to say: 64 "GPU threads" on AMD systems (a wavefront) share the same program counter / instruction pointer.
32 "GPU threads" on NVidia systems (a warp) share the same program counter.
This means that if one thread loops 10,000 times in a program, it forces the other 31 threads in its warp (NVidia) or the other 63 in its wavefront (AMD) to step through 10,000 iterations as well, even if those threads individually only need to loop 500 times.
This restriction doesn't really matter in cases like matrix multiplication, where all threads loop the same number of times. But for something like a chess minimax algorithm, this "thread divergence" makes it difficult to actually port minimax to a GPU.
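To make the cost of divergence concrete, here is a toy Python sketch (not GPU code; the warp size and loop counts are made up) of what lockstep execution implies: every lane in a warp has to sit through the iteration count of the slowest lane.

```python
import random

WARP_SIZE = 32   # NVidia-style warp; an AMD wavefront would be 64
random.seed(0)

# Each "lane" (GPU thread) needs a different number of loop iterations,
# e.g. a minimax search that prunes at a different depth in each thread.
needed = [random.randint(500, 10_000) for _ in range(WARP_SIZE)]

useful_work = sum(needed)                 # iterations the threads actually need
lockstep_work = WARP_SIZE * max(needed)   # iterations the warp really executes,
                                          # because all lanes share one program counter

print(f"useful iterations:   {useful_work}")
print(f"executed iterations: {lockstep_work}")
print(f"SIMD efficiency:     {useful_work / lockstep_work:.0%}")
```

For matrix multiplication every lane needs the same count, so efficiency stays near 100%; for divergent searches it collapses.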
-------
Machine learning... (EDIT: or more specifically, Convolutional Neural Networks) is almost purely a matrix multiplication problem, making it ideal for the SIMD GPU architecture.
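A quick way to see why a convolution layer reduces to matrix multiplication is the classic im2col trick. A minimal numpy sketch (single channel, stride 1, no padding; the function name is mine):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a single-channel image into a row,
    so the convolution becomes one matrix multiplication."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols, (oh, ow)

x = np.random.rand(8, 8)    # toy 8x8 image
k = np.random.rand(3, 3)    # 3x3 filter
cols, (oh, ow) = im2col(x, 3, 3)
out = (cols @ k.ravel()).reshape(oh, ow)   # convolution as a single matmul

# Cross-check against a direct sliding-window convolution.
ref = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(ow)]
                for i in range(oh)])
assert np.allclose(out, ref)
```

With many filters and input channels, the filter vector becomes a matrix and the whole layer is one big GEMM, which is exactly what GPUs and TPUs are built for.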
CPUs have SIMD units by the way (SSE, AVX, etc. etc.), but GPUs are specialized systems that can only do SIMD. So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit.
-------
EDIT: Google's TPU goes beyond the GPU-model (SIMD architecture) and has dedicated matrix-multiplication units. These units can only perform matrix multiplication, nothing else, and are designed to do it as quickly and power-efficiently as possible.
NVidia's "Turing" line of GPUs have dedicated matrix multiplication units called "tensor units". NVidia basically grafted a dedicated FP16 matrix multiplication unit onto their GPUs.
"So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit."
Except when memory transfer latency is a bottleneck.
I've seen many examples of CPU SIMD beating the GPU, e.g. the Pathfinder v3 text renderer.
While memory transfer latency between the CPU and GPU is an issue, GPUs have absolutely incredible throughput when moving data from one place on the GPU to another. This is why researchers are looking at GPUs for running graph algorithms.
It’s interesting that the cycle is complete and we’re back to the 1975 situation: “Wouldn’t it be great to have a personal computer that could do all these things we currently do by connecting to a mainframe?”
All modern computers, even cell phones, have the capabilities of mainframes from the 1970s and 1980s.
Back in the 1980s, a personal computer was loaded with an operating system that could only support a single user. A "mainframe" was a system that could support more than one user at a time, and you'd connect to it using telnet.
Today, your Linux PC loaded with SSH effectively functions as a mainframe. Heck, you can load virtual machines that themselves pretend to be mainframes from the 1970s. And your cell phone uses the same multi-user feature: it runs each app as a different user for maximum security.
----
In contrast: GPUs are simply 1980s-style SIMD supercomputers. It turns out the SIMD supercomputer style was the fastest way to calculate where pixels go on the screen. (Conceptually, each pixel shader basically gets to run its own thread, so a 1920x1080 screen has 2,073,600 threads to run per frame.)
> GPUs are simply the 1980s style SIMD super-computers
May be a terminology thing, or a technology I don't know, but I thought '80s supercomputers like Crays were vector processors, whereas I'm not aware of any commercial super at the time being SIMD (AKA an array processor).
It seems to be a terminology confusion of mine. I did a bit of reading and it seems they're effectively the same thing implemented differently. An array processor in my mind was a lockstep SIMD thingy.
A vector processor as I understood it is nicely described here <https://www.quora.com/What-is-meant-by-an-array-processor-an... as "the earliest Crays had vector instructions, that quickly fed a stream of operands through a pipelined processor", which is not exactly the same implementation as SIMD, but after reflecting on your question, it does the same thing in the end.
So effectively no difference.
Also relevant <https://arstechnica.com/civis/viewtopic.php?t=401649> see reply of Accs who claims to have worked at cray, and also says in another post on the same thread "I think that Vector is nothing more than a special case of SIMD".
It is a wave, however, rather than a circle. There are oscillations, back and forth, between all sorts of things. It is inevitable because the competing ideas and systems are oscillating, all the way down. Nature is not perfectly in sync, and that is indeed what makes things interesting.
Usually the most apt metaphor is a spiral. We’ve gone round in a big circle, it feels like we’re back in the same place, except everything is completely different due to movement in a dimension which we didn’t even perceive.
And, well, you know, “words”... whatever works :-D
Is privacy the only driver? I think convenience, speed, and cost are still big concerns, just like they were back in the day.
Network requests are always very slow when compared to processor speed. You can buy a device for 1/500 the cost of a server, put it in your closet, and it will be faster at almost every task than offloading to the "cloud".
I work tangentially to the telecom space, and all anyone cares about is the "edge", i.e. Customer Premises Equipment and the like. With how cheap compute power is and how high network latency is, having centralized datacenters has become less and less desirable.
The recent ML-focused chips are really just high efficiency, low precision matrix multipliers. It turns out that operation can be made ~10x more efficient than previous processor designs, and is the bottleneck in inference for modern neural network models. Specialized hardware can also accelerate the training somewhat.
So having this hardware in your phone means that, for example, face recognition can run faster, and scene analysis during video recording takes less battery.
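As a rough illustration of the "low precision" part, here is a numpy sketch of symmetric int8 quantization around a matrix-vector product (a simplified version of what int8 inference does; shapes and scales are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)   # float32 weights
x = rng.standard_normal(256).astype(np.float32)          # input activations

# Symmetric per-tensor quantization: map floats to int8 with a single scale each.
w_scale = np.abs(W).max() / 127.0
x_scale = np.abs(x).max() / 127.0
W_q = np.round(W / w_scale).astype(np.int8)
x_q = np.round(x / x_scale).astype(np.int8)

# Integer matmul (accumulate in int32, as int8 hardware does), then rescale.
y_int8 = (W_q.astype(np.int32) @ x_q.astype(np.int32)) * (w_scale * x_scale)
y_fp32 = W @ x

rel_err = np.abs(y_int8 - y_fp32).max() / np.abs(y_fp32).max()
print(f"max relative error from int8 quantization: {rel_err:.3%}")
```

The int8 path is where these accelerators get most of their efficiency, and models usually tolerate the small error.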
But neural network inference speed is rarely the issue that forces both running models on servers and collecting user data. In the case of speech recognition, the problem is that training state-of-the-art models requires tens of thousands of hours of speech data. A recent Amazon paper actually used a million hours. For language models you need even more. Modern phones actually can perform speech recognition locally (put your phone in airplane mode, dictation still works), but it's using models trained on data from users who did talk to the server.
What I always wonder about is that AI requires data to learn from. When Google sends everything to their servers, they have data from all over the world, so they can build better AI. If their competitors implement privacy-respecting solutions, they are doomed to be behind Google, because they don't have the data to train their AI. My personal anecdote: my spoken English is quite bad and Apple's Siri does not understand me well. But Google Assistant does wonders; it's fascinating how well its speech recognition works.
I have a feeling that AI technology is inherently anti-privacy.
> I saw a system at Maker Faire that does voice recognition without the cloud (MOVI), and I keep thinking that having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.
Yes, powerful ARM embedded platforms are making this more and more possible and I'm at least excited about it.
Not only that, speech recognition and speech synthesis technologies are getting more efficient by the month. There was a paper just this past week (https://arxiv.org/pdf/1905.09263.pdf) which demonstrates speech synthesis with an order of magnitude faster performance. Advances like those make it easier to cram these technologies on local IoT platforms.
And there's certainly been a number of local focused AI projects over the past few years. I remember Snips.ai being decent for building local-only voice assistants. It worked on a Raspberry Pi, though you still have to hack together decent microphones for it.
All of this gives me hope for an open source voice assistant that will be competitive with the commercial options.
Now if the makers of smart accessories (lights, etc) would stop being anti-consumer for two minutes we might get decent accessories that don't require a cloud.
How much data would it require for a real-world voice "model" that is locally on a device, space-wise? I don't know much about machine learning's actual implementations within software, I've always been curious how you go from (say) Tensorflow -> embedded within real world software / APIs.
Google Translate's offline files seem to be about 50MB per language. Assume your speech recognition machinery breaks utterances down into phonemes, an alphabet a couple of times the size of the letter alphabet, plus some more entropy... My intuition is you should be in the 200-500MB range.
Disclaimer: I work in Google, but apparently have never touched any of this from the inside.
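On the "Tensorflow -> embedded" part of the question: the usual route today is to convert the trained model to TensorFlow Lite and ship the flatbuffer with the app. A hedged sketch using TF 2.x Keras APIs (the model and sizes are illustrative, not a real speech model):

```python
import tensorflow as tf

# Stand-in model; a real on-device acoustic model would be a recurrent/convolutional
# network with tens of millions of parameters.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(80,)),          # e.g. 80 mel-filterbank features
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(512),
])

# Back-of-envelope size: parameter count times bytes per weight.
params = model.count_params()
print(f"{params:,} params -> ~{params * 4 / 1e6:.1f} MB at fp32, "
      f"~{params / 1e6:.1f} MB at int8")

# Convert to a TFLite flatbuffer (with default weight quantization) for on-device use.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("model.tflite", "wb") as f:
    f.write(converter.convert())
```

The same parameter-count arithmetic is where the 50MB-per-language and 200-500MB estimates above come from.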
And it's very accurate. I tested it with English and Greek, for which dictation systems are usually very bad, and was surprised how well it recognised every word.
I would hope at least chip designers/manufacturers will stop referring to chips designed and marketed specifically for AI/ML as "G"PUs, when they aren't really about graphics processing anymore.
Yes, and hopefully that will stop the problems I get with video drivers when installing a GPU which I just use for raw computation. Also, my PC has 10 video outputs now, which I think is insane!
As are AMD's newest announcements of Navi & Intel-smacking performance- they could probably lean over this way and jostle things around quite a bit if they wanted to.
I can't help but feel somewhat pessimistic about technological developments in the handheld device domain. I say this as a supporter of ARM though. I have no reason to dislike ARM, I'm just somewhat apprehensive regarding any more empowerment of the companies we keep in our pockets.
Talking to a computer is an interface like any other (mouse/keyboard).
There's no fundamental reason why talking to a computer would be inherently privacy invading, it just happens that most implementations today are, for technical/performance reasons.
Computing technology is more egalitarian today than it ever was. Unless you're gaming, doing software development or serious science-related work, you can get away with something baseline and dirt-cheap, and it will work just fine. Maybe slightly less so in mobile but still, usable compute really isn't that niche even in that space.
True, it's more egalitarian today than it was 10, 20, 30 years ago etc., but the trend I'm talking about has only just begun (perhaps 2 years ago, barely an upgrade cycle). Where will it be in 10 years?
What market forces stop vendors selling small quantities of far better tech at far better profits? Price discrimination is fundamental in marketing: different people will (can) pay different prices. An extreme example: I believe the military gets more advanced process nodes, when there are only small runs, before mass production.
To me "you get what you pay for" implies a linear relationship between cost and speed, and that has never really been the case. Sometimes you pay a lot more money for a little more speed.
There has always been a nonlinear ramp up for the latest and greatest - typically targeted to enterprise applications that are more time sensitive than cost sensitive.
If you are cringing at the idea of a 3k+ cpu (or 9k+ gpu), it quite literally wasn't built for you.
No, there was just a new model of processor and the cheaper variant of it was faster than the previous model's cheaper variant. Things haven't changed in that regard.
This, 1000%. Especially floating point arithmetic. Compilers aren't that great at vectorization (or more accurately, people aren't that great at writing algorithms that can be vectorized), and scalar operations on x86_64 have the same latency as their vectorized counterparts. When you account for how smart the pipelining/out-of-order engines on x86_64 CPUs are, even with additional/redundant arithmetic you can achieve >4x throughput for the same algorithm.
Audio is one of the big areas where we can see huge gains, and I think philosophies about optimization are changing. That said, there are myths out there like "recursive filters can't be vectorized" that need to be dispelled.
It has gotten very tricky on x86 since using the really wide SIMD instructions can more than halve your clock speed for ALL instructions running on that particular core.
iirc that's only with some processors using AVX2/512 instructions, which are still faster than scalar (depending on what you're doing around the math).
Things being tricky with AVX and cache-weirdness doesn't change the fact that if you're not vectorizing your arithmetic you're losing performance.
There's also an argument to be made that if you're writing performance-critical code you shouldn't optimize it for mid/bottom-end hardware, but not everyone agrees.
Was about to write the same comment. Plus, I must say, every time I hear that some function is going to be performed via a NN, I think my device is becoming more and more random in the way it behaves. Which I really don't like.
At least when some classical algorithm fails at performing certain task, people talk of the failure as a bug. Not as something that's "probably going to be solved with more training data".
Recent Snapchat and Instagram selfie filters
Google Keyboard's translation and prediction
Google Translate's camera (Lens) translation
Google Assistant's voice recognition
Google Message's instant replies
Almost all of them are inference workloads. I believe only Google Keyboard does on-device training in the background when the phone is charging.
Applying an existing, trained model (like what you described) is computationally cheap. Training new models takes many, many orders of magnitude more time and memory space to complete.
Much cheaper than training, for sure. But inference for image classification is generally on the order of hundreds of milliseconds when using a CPU.
That won't fly for real-time inference on video, for example, which is very desirable even for "image" uses, because it allows for live preview in the camera app.
So it turns out the hardware needed to run a pretrained model is pretty much the same as the hardware needed to train a model. In both cases, it means lots of matrix multiplication.
Of course, training a model takes longer given the same amount of processing power - but for applications like video processing, just applying the model can be pretty demanding.
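One way to see why the same hardware serves both: a single gradient-descent step on a linear layer is itself just a couple more matrix multiplications on the same shapes. A toy numpy sketch (sizes invented):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 784)) * 0.01   # model weights
X = rng.standard_normal((64, 784))          # one mini-batch of inputs
Y = rng.standard_normal((64, 10))           # targets

# Inference: one matmul (per layer).
pred = X @ W.T

# One training step for a squared-error loss: the gradient is just more matmuls.
err = pred - Y
grad = err.T @ X / len(X)
W -= 0.01 * grad
```

The difference is volume: training repeats this over the whole dataset for many epochs, while inference runs the forward pass once per input.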
Determining whether or not nearby ambient sounds indicate potentially illegal activity. Or, perhaps, mapping the content and emotional valence of conversations to understand whether or not you deserve a lower social credit score.
Training a model does require massive data and compute, but evaluating/using an already-trained model (e.g. running a Hot Dog/Not Hot Dog classifier) can be done on mobile hardware. Accelerating this could, for instance, allow it to run in real time on a video feed.
Aside from the runtime difference between training and inference, having on-device ML makes a lot of sense for other reasons.
There can be more guarantees over data privacy, since your data can stay on-device. It also reduces bandwidth as there's no need to upload data for classification to the cloud. And that also may mean it's faster, potentially real time, since you don't have that round trip latency.
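For a sense of what "stays on-device" looks like in practice, here is a hedged sketch of running an already-converted TFLite model locally with the Python interpreter API (the file name and input are placeholders):

```python
import numpy as np
import tensorflow as tf

# Load a model that was trained in the cloud but shipped with the app.
interpreter = tf.lite.Interpreter(model_path="classifier.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Fake input with whatever shape/dtype the model expects.
frame = np.zeros(inp["shape"], dtype=inp["dtype"])

interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()                       # no network round trip involved
scores = interpreter.get_tensor(out["index"])
print(scores)
```

Nothing leaves the device, and the latency is bounded by local compute rather than the network.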
This is not necessarily for phones. Lots of (virtually all?) low power IoT devices have ARM cores. There are plenty of environments where the cloud or compute power isn't available.
Machine learning is like any learning: there's the learning stage and the putting it into practice. ML in the cloud is like researchers coming up with a new way to slice bread; ML on your phone is like your local baker following the instructions in a new cookbook. You still need a skilled baker to follow the instructions.
It doesn't have to be for training the models. You can run versions of trained models locally on your phone.
Having a dedicated chip will allow for snappier calculation of those fancy Snapchat filters, language translation, image recognition, etc.
The way neural networks work is that each perceptron computes a linear equation of the form w1*x1 + w2*x2 + ... + wn*xn. Even if you train the model somewhere else, you still need to hold the weights locally in each perceptron in order to evaluate the model.
This requires hardware in which you can multiply these huge matrices quickly, even if the weights are downloaded from the cloud.
E.g. accurate edge detection is better done with ML-based mathematics, and with accompanying optimized transistor layouts this would enable better augmented-reality graphics, from things like Snapchat selfies to visualizing free-space CAD.
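In numpy terms, evaluating such a network with weights downloaded from the cloud is just a chain of matrix multiplications plus an activation. A minimal sketch (layer sizes invented; real weights would come from the trained model, not a random generator):

```python
import numpy as np

# W1/b1, W2/b2 stand in for weights trained in the cloud and stored on the device.
rng = np.random.default_rng(42)
W1, b1 = rng.standard_normal((128, 784)), rng.standard_normal(128)
W2, b2 = rng.standard_normal((10, 128)), rng.standard_normal(10)

def forward(x):
    """Each layer computes w1*x1 + w2*x2 + ... + wn*xn (+ bias) for every perceptron,
    i.e. one matrix-vector multiplication per layer."""
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2                 # output scores

x = rng.standard_normal(784)           # e.g. a flattened 28x28 image
print(forward(x))
```

The accelerator's job is simply to make those `W @ x` products fast and cheap.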
How does this NNPU compare to those from other vendors (Google TPU (~3 TOPS?), Intel NCS2 (1 TOPS), Kendryte RISC-V (0.5 TOPS), Nvidia Jetson (4 TOPS))?
Can you use tensorflow networks out of the box like the others provide?
I think it is worth pointing out that this new CPU, possibly landing in flagship Android phones in late 2019 or early 2020, would still only be equal to or slower than the Apple A10 used in the iPhone 8.
Assuming Apple continues as it has in the past, dropping the iPhone 7 and moving the iPhone 8 down to its price range, they would have an entry-level iPhone that is faster than 95% of all Android smartphones on the market.
What even is an "AI chip"? What's the difference from a GPU? As long as nobody can explain that, I have big doubts that it would be more than a GPU plus marketing. So no big deal if they don't provide one.
NN accelerators have very simple compute units (fused multiply-add plus look-up-table activation functions) but can run many of them in one cycle.
FTA: Because general-purpose processors such as CPUs and GPUs must provide good performance across a wide range of applications, they have evolved myriad sophisticated, performance-oriented mechanisms. As a side effect, the behavior of those processors can be difficult to predict, which makes it hard to guarantee a certain latency limit on neural network inference. In contrast, TPU design is strictly minimal and deterministic as it has to run only one task at a time: neural network prediction. You can see its simplicity in the floor plan of the TPU die.
Modern GPUs are extremely programmable, but this flexibility isn't that heavily used by neural networks. NN inference is pretty much just a huge amount of matrix multiplication.
Especially for mobile applications (most Arm customers), you pay extra energy for all that pipeline flexibility that isn't being used. A dedicated chip will save a bunch of power.
Neural network learning and inference primarily use matrix multiplication and addition, usually at lower bit depths. You can do this on GPUs to great success with massive parallelism, but the GPU is more generalized than this, so it takes more silicon and more power. With a TPU/neural processor you optimize the silicon for a very, very specific problem: generally multiplying large matrices and then adding the result to another matrix. On a GPU we decompose this into a large number of scalar calculations, and it parallelizes massively and does a good job; but a TPU is fed the matrices and that's all it's made to do, with so much silicon dedicated to matrix operations that it often finishes in a single cycle.
Another comment mentioned cores, and I don't think that's a good way of looking at it, as in most ways a TPU is back to very "few" but hyper-specialized "cores". There is essentially no parallelism in the programming model of a TPU or neural processor -- you feed it three matrices and it gives you the result. You move on to the next one.
An AI chip would essentially just be a chip that can do matrix multiplication very quickly, plus some addition. Each perceptron (neuron) is just fitting a linear equation, so if we had a chip that could support millions of perceptrons all fitting linear equations (i.e. with matrix multiplication), then it would be a huge win compared to a GPU, which is more general and less efficient for this specific task than dedicated silicon.
Even when you fight that annoyance with a content blocker, the page itself is aggressive to no end.
I scrolled down to see how long the article is. That somehow also triggered a redirect to the root.
How is it possible that the most user hostile news sites often get the most visibility? Even here on HN which is so user friendly. What are the mechanics behind this?
The url should be changed to a more user friendly news source. How about one of these?
And to vote on mobile I have to zoom in because the buttons are way too close to each other and give no feedback about which of the two was pressed (and there's no way to find out after the fact).
One tip: when you upvote or downvote, a link is added to the comment's header. It will either be "unvote" or "undown" depending on whether you up- or downvoted the comment.
If you click the "web" link at the top of the page, it brings you to a search for the post title. You should be able to find a number of alternate sources there
And when you open the page every time it wants to go through all the cookie and partner options instead of just saying "yup, we remember you deactivated everything. [change settings] [continue reading]"
You don't deactivate all cookies as such, you deactivate personalised tracking cookies.
So the site shouldn't place a cookie 'session_id=12345678', but a cookie 'techcrunch_tracking_enabled=False' that's not linked to any particular user would be just fine.
And you don't think they really stop tracking you just because you said you didn't like it, right? What would you even do against it? Sue each and every website owner on the planet somewhere in the EU where they might not even care?
The circle with the X is weird design, because it looks like a button that's supposed to close something, but as long as you don't click it, the article is perfectly readable. You just need to resist your reflexes to click any X in sight.
Hmm, I feel ARM performance is a bit like nuclear fusion: it's always the next generation that will deliver an order-of-magnitude performance increase. Yet somehow ARM single-core performance is still shit compared to x86. (No matter how much I hope and pray for that to change, because x86 needs to die.)
I'm afraid you're simply wrong here. Single-core performance of the (now sadly deceased) Qualcomm Amberwing was incredible, easily Xeon-like, and it had 46 cores per socket. The Cavium ThunderX2 also has great single-core performance coupled with many more cores than you can get on an Intel chip. Apple's iPhone cores are also supposed to be good. (Edit: these are ARM cores, but not ARM-as-in-the-company designs.)
Interesting. I was mostly thinking about ARM's own core designs, but never _really_ looked into the broader ecosystem.
From my personal experience, Apple chips are fast, but I can't really compare that to anything else. JS benchmarks match desktop performance, but that is not really what I was/am interested in.
I've tried some 'cloud ARM-based metal servers' which promised basically similar performance to Intel Atom CPUs, but they felt at least 10 times slower. So I gave up on the whole concept.
But I guess the take-away is that those specific systems were slow, not so much ARM in general. I mean, it does make sense that there is some scale difference between a Xeon core and a core meant for mobile use, now that I actually think about it ;)
I benchmarked a few image processing algorithms (OpenCV) on my iPhone XS. I was surprised that the processing was faster than on my quad-core i7 MacBook Pro 2012. Sure, the CPU is 6 generations behind, but there haven't been many performance improvements in the last 7 years on x64. Probably around +~50% single-thread.
Image processing is probably offloaded to some sort of custom silicon. If that is the case, then the generational performance uplift would be massive.
Consider this:
Decoding a big fat x265-encoded (H.265/HEVC) file on your 2012 MBP would be atrocious because it doesn't have hardware HEVC decode, but decoding it on your XS would probably be fine.
That could be true for some algorithms, but I have investigated the source code as well, and on ARM it uses NEON instructions if available (NEON being the equivalent of SSE and AVX2 on x86 anyway).
The result is the same with my own crafted image processing algorithms using single core. Apple makes really fast CPUs.
Not that much. Each generation brought maybe a 7% perf improvement, so over 6 years: 1.07^6 ≈ 1.50. Improvement mostly came from adding more cores.
This ~1.5x single-thread performance improvement matches the stats from the geekbench.com benchmark:
my MacBook Pro 2012 CPU: i7-3615QM => 3093 single-thread [1]
equivalent current-gen CPU with the same TDP (45W) and similar clock (2.3GHz): i7-8750H => 4617 single-thread [2]
Before or after meltdown, spectre, l1tf, foreshadow, zombieload, ridl, and fallout patches? I would honestly guess that these new round have cost a huge amount of progress. I certainly won't be applying the latest three; I spent too much on a nice machine to have it perform as though it were four years old.
Is this true for comparable TDP processors? Speaking as the owner of one of the few x86 Android phones, the Intel performance isn't better on small devices.
An iPad Pro's single-threaded geekbench performance is better than the 8th-gen i7 in my laptop, according to geekbench. Multithreaded is better on my laptop. This is comparing the iPad to just the CPU; my actual model is almost certainly better (particularly if you consider thermals). This comparison is also with Apple's additional (not inconsiderable) design work on top. I also sincerely doubt an iPad could sustain this. But the point is the same: an a12x macbook pro could be a serious contender.
Tensorflow has the XLA compiler for targeting accelerators. In practice the integration isn't necessarily amazing right now, but the plan is to make it much better and able to generalize more easily.
The OpenCL TensorFlow issue is still open and no Google dev has shown interest.
Still, I'm curious about your progress on this and when it could target AMD / Intel / ARM.
The day Apple releases their first ARM-powered laptop will be a turning point. This comment was written on an iPad, the best product to come out of Apple to this day.
While I don't agree with the first part, I do agree that the iPad is unique and worthy being the only Apple device my hard earned money has ever gone towards.
I have heard nothing but universal praise for the Pro model. But no point in upgrading just to upgrade - the beautiful thing about Apple products is that they seem to work forever.