
This is interesting to me, lots of noise around "machine learning" in the GPU rather than graphics which kind of validates Google's TPU, Apple's AI co-processors, and Nvidia's Jetson. Saw a system at Makerfaire that does voice recognition without the cloud (MOVI) and I keep thinking having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.



GPUs are 1980s-style SIMD compute systems. A GPU is designed to run many, many, many threads. A Vega64, for example, can keep 163,840 threads in flight (4096 shaders x 10-way SMT x 4 threads per unit), while a Threadripper 2950X CPU can "only" run 32 threads at once (16 cores x 2-way SMT).

However, a GPU shader is far more constrained than a CPU thread. A GPU shader is "locked" to executing the same instructions as the other shaders in its group. That is to say: 64 "GPU threads" on AMD systems share the same program counter / instruction pointer.

32 "GPU threads" on NVidia systems share the same program counter.

This means that if one thread loops 10,000 times in a program, the other 31 threads in its group (on NVidia) or the other 63 (on AMD) are forced to sit through those 10,000 iterations as well (even if they individually only need to loop 500 times).

This restriction doesn't really matter for matrix multiplication, where all threads loop the same number of times. But for something like a chess minimax algorithm, this "thread divergence" makes it difficult to actually port the algorithm to a GPU.
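To make the divergence point concrete, here's a minimal CUDA-style sketch (the kernel name divergent_loop is made up for illustration, not from any real codebase): each lane loops a data-dependent number of times, and the whole warp pays for its slowest lane.

    // Hypothetical CUDA kernel illustrating warp divergence.
    // Within a warp (32 threads on NVidia), the hardware keeps issuing the
    // loop until the slowest lane finishes, masking off lanes that are done.
    __global__ void divergent_loop(const int *iters, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        // If one lane needs 10,000 iterations and the rest need 500,
        // every lane in the warp effectively waits through 10,000 iterations.
        for (int i = 0; i < iters[tid]; ++i)
            acc += i * 0.5f;
        out[tid] = acc;
    }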

-------

Machine learning... (EDIT: or more specifically, Convolutional Neural Networks) is almost purely a matrix multiplication problem, making it ideal for the SIMD GPU architecture.

CPUs have SIMD units too, by the way (SSE, AVX, etc.), but GPUs are specialized systems that can only do SIMD. So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit.
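For contrast with the divergence example above, here's a naive matrix-multiply kernel (a sketch only, assuming square N x N row-major matrices; matmul_naive is an illustrative name): every thread runs the same N-iteration loop, which is exactly the shape of work SIMD hardware wants.

    // Naive N x N matrix multiply: one thread per output element.
    // Every thread executes the same N-iteration inner loop, so warps
    // never diverge -- the workload maps cleanly onto SIMD hardware.
    __global__ void matmul_naive(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }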

-------

EDIT: Google's TPU goes beyond the GPU-model (SIMD architecture) and has dedicated matrix-multiplication units. These units can only perform matrix multiplication, nothing else, and are designed to do it as quickly and power-efficiently as possible.

NVidia's "Turing" line of GPUs has dedicated matrix-multiplication units called "Tensor Cores". NVidia basically grafted a dedicated FP16 matrix-multiplication unit onto their GPUs.
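In CUDA those tensor units are exposed through the nvcuda::wmma ("warp matrix multiply-accumulate") intrinsics. A minimal sketch of one warp multiplying a single 16x16 FP16 tile and accumulating in FP32 (the kernel name wmma_tile is made up; no tiling, bounds checks, or shared-memory staging shown):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp cooperatively multiplies a 16x16 FP16 tile of A and B on the
    // tensor cores, accumulating into a 16x16 FP32 tile of C.
    __global__ void wmma_tile(const half *A, const half *B, float *C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }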


"So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit." Except if memory latency transfer is a bottleneck. I've seen many examples of SIMD beating GPU e.g pathfinder v3 text renderer


While transfer latency between the CPU and GPU is an issue, GPUs have absolutely incredible throughput when moving data from one place on the GPU to another place on the GPU. This is why researchers are looking at GPUs for running graph algorithms.


It’s interesting that the cycle is complete and we’re back to the 1975 situation: “Wouldn’t it be great to have a personal computer that could do all these things we currently do by connecting to a mainframe?”

Only the PDP-10 has been replaced by Google.


All modern computers, even cell phones, have the capabilities of mainframes from the 1970s and 1980s.

Back in the 1980s, a personal computer was loaded with an operating system that could only support a single user. A "mainframe" was a system that could support more than one user at a time, and you'd connect to it from a terminal (or over telnet).

Today, your Linux PC loaded with SSH effectively functions as a mainframe. Heck, you can load virtual machines that themselves pretend to be mainframes from the 1970s. Your cell phone uses this same multi-user machinery: it pretends each app is a different user for maximum security.

----

In contrast: GPUs are simply 1980s-style SIMD supercomputers. It turns out the SIMD-supercomputer style was the fastest way to calculate where pixels go on the screen. (Conceptually, each pixel shader basically gets its own thread, so a 1920x1080 screen has 2,073,600 threads to run per frame.)
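A sketch of that mapping in CUDA-ish compute code (illustrative only; real pixel shaders go through the graphics pipeline, and the kernel name shade is made up): one thread per pixel, all running the same tiny program.

    // One thread per pixel: a 1920x1080 frame launches 2,073,600 threads,
    // all executing the same short program in lockstep groups.
    __global__ void shade(uchar4 *framebuffer, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            framebuffer[y * width + x] =
                make_uchar4((unsigned char)(x % 256), (unsigned char)(y % 256), 128, 255);
    }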

The best architecture for running millions of simple threads is a GPU or SIMD computer (see https://en.wikipedia.org/wiki/Connection_Machine):

> The CM-1, depending on the configuration, has as many as 65,536 individual processors, each extremely simple, processing one bit at a time.


> GPUs are simply the 1980s style SIMD super-computers

May be a terminology thing, or a technology I don't know, but I thought '80s supercomputers like Crays were vector processors, whereas I'm not aware of any commercial super at the time being SIMD (AKA an array processor).


I'm not super familiar with the distinction you're making ("array processor" vs "vector processor"), do you have any pointer I could look at?


It seems to be a terminology confusion of mine. I did a bit of reading and it seems they're effectively the same thing implemented differently. An array processor in my mind was a lockstep SIMD thingy.

A vector processor as I understood it is nicely described here <https://www.quora.com/What-is-meant-by-an-array-processor-an... as "the earliest Crays had vector instructions, that quickly fed a stream of operands through a pipelined processor", which is not exactly the same implementation as SIMD, but after reflecting on your question, it does the same thing in the end.

So effectively no difference.

Also relevant <https://arstechnica.com/civis/viewtopic.php?t=401649>: see the reply from Accs, who claims to have worked at Cray and who also says in another post on the same thread, "I think that Vector is nothing more than a special case of SIMD".

So, my error, hope above helps.

FYI you might want to read up on systolic arrays just for fun <https://en.wikipedia.org/wiki/Systolic_array>.

@tntn: thanks, my carelessness.


CM-1, linked in 'dragontamer's comment.


It is a wave, however, rather than a circle. There are oscillations back and forth between all sorts of things. It is inevitable, because the competing ideas and systems are oscillating all the way down. Nature is not perfectly in sync, and that is indeed what makes things interesting.


Usually the most apt metaphor is a spiral. We’ve gone round in a big circle, it feels like we’re back in the same place, except everything is completely different due to movement in a dimension which we didn’t even perceive.

And, well, you know, “words”... whatever works :-D



Except back then the driver wasn't privacy.


Is privacy the only driver? I think convenience, speed, and cost are still big concerns, just like they were back in the day.

Network requests are always very slow when compared to processor speed. You can buy a device for 1/500 the cost of a server, put it in your closet, and it will be faster at almost every task than offloading to the "cloud".

I work tangential to the Telecom space, and all anyone cares about is the "edge", i.e. Customer Premises Equipment and the like. With how cheap compute power is and how high network latency is, having centralized datacenters has become less and less desirable.


Yes, it’s more like a cycle of reincarnation: the same problems and solutions keep coming back in different guises.


And those problems and solutions are also somewhat caused by / solved by more modern technology.


The recent ML-focused chips are really just high-efficiency, low-precision matrix multipliers. It turns out that operation can be made ~10x more efficient than on previous processor designs, and it is the bottleneck for inference in modern neural-network models. Specialized hardware can also accelerate training somewhat.

So having this hardware in your phone means that, for example, face recognition can run faster, and scene analysis during video recording takes less battery.

But neural-network inference speed is rarely the reason companies run models on servers and collect user data. In the case of speech recognition, the problem is that training state-of-the-art models requires tens of thousands of hours of speech data; a recent Amazon paper actually used a million hours. For language models you need even more. Modern phones actually can perform speech recognition locally (put your phone in airplane mode and dictation still works), but it's using models trained on data from users that did talk to the server.


What I always wonder about is that AI requires data to learn from. When Google sends everything to their servers, they have that data from all over the world, so they can build better AI. If their competitors implement privacy-respecting solutions, they are doomed to stay behind Google, because they don't have the data to train their AI. A personal anecdote: my spoken English is quite bad, and Apple's Siri does not understand me well. But Google Assistant does wonders; it's fascinating how well its speech recognition works.

I have a feeling that AI technology is inherently anti-privacy.


Gboard (Google's keyboard on Android) has had voice recognition without the cloud for three months now, no extra work from ARM needed.

https://techcrunch.com/2019/03/12/googles-new-voice-recognit...

Looks like they're not uploading clips to themselves either, at least it doesn't show up in https://myactivity.google.com

They still store audio when you use Ok Google though.


Google is rolling out on-device voice recognition too. They’ve integrated it into Gboard:

https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...


> Saw a system at Makerfaire that does voice recognition without the cloud (MOVI) and I keep thinking having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.

Yes, powerful ARM embedded platforms are making this more and more possible and I'm at least excited about it.

Not only that, speech recognition and speech synthesis technologies are getting more efficient by the month. There was a paper just this past week (https://arxiv.org/pdf/1905.09263.pdf) demonstrating speech synthesis that is an order of magnitude faster. Advances like those make it easier to cram these technologies onto local IoT platforms.

And there have certainly been a number of local-focused AI projects over the past few years. I remember Snips.ai being decent for building local-only voice assistants. It worked on a Raspberry Pi, though you still had to hack together decent microphones for it.

All of this gives me hope for an open source voice assistant that will be competitive with the commercial options.

Now if the makers of smart accessories (lights, etc) would stop being anti-consumer for two minutes we might get decent accessories that don't require a cloud.


Were you able to try it out yourself? Curious how good it is compared to "OK Google".

P.S. Chuck, you are a great inspiration and friend. Thank you for all the kindness you've shown me.


How much data would a real-world voice "model" that lives locally on a device require, space-wise? I don't know much about machine learning's actual implementations within software; I've always been curious how you go from (say) TensorFlow to a model embedded within real-world software / APIs.


Google Translate's offline files seem to be about 50MB per language. Assume your speech recognition machinery breaks utterances down into phonemes, with an alphabet a couple of times the size of the written one and some more entropy... My intuition is you should be in the 200-500MB range.

Disclaimer: I work at Google, but apparently have never touched any of this from the inside.


Earlier this year, Google introduced an Android "Live Transcribe" feature that runs locally without the need for an internet connection:

https://www.theverge.com/2019/2/4/18209546/google-live-trans...


And it's very accurate. I tested it with English and Greek, for which dictation systems are usually very bad, and was surprised how well it recognised every word.


It doesn’t exactly answer your question, but Apple has a decent description of how local “Hey Siri” detection works on the iPhone and watch:

https://machinelearning.apple.com/2017/10/01/hey-siri.html


The limiting factor is generally not memory but processing and power consumption.


Sure, my question was what the local algorithm/dataset looks like packaged up.



I would hope chip designers/manufacturers will at least stop referring to chips designed and marketed specifically for AI/ML as "G"PUs, when they aren't really about graphics processing anymore.


General Processing Units then? :)


That's very clever, and reminds me of DVD (Digital Video Disc to Digital Versatile Disc)


Maybe “Parallel Processing Unit”, PPU?


PPU is already used for "Picture Processing Unit", essentially the graphics chip of older 2D consoles.


It was also the "Power Processing Unit" in the IBM Cell chip that was in the PlayStation 3.


If a CPU is a Central Processing Unit then I guess we should call them DPUs: Distributed Processing Unit.


Guided processing unit. The CPU tells it what to process.


Yes, and hopefully that will stop the problems I get with video drivers when installing a GPU which I just use for raw computation. Also, my PC has 10 video outputs now, which I think is insane!


In the case of ARM, they seem to have announced Neural Processing Units (NPUs), which look more like a mass-market TPU.


>Saw a system at Makerfaire that does voice recognition without the cloud

Voice recognition without the cloud has been possible for years, this is nothing new.


You were at Maker Faire? Awesome, I'd love to hear your thoughts on the Faire in comparison with previous years.


What Nvidia's Jetson can achieve really is exciting.


As are AMD's newest announcements of Navi and Intel-smacking performance; they could probably lean over this way and jostle things around quite a bit if they wanted to.


> non-privacy-invading

> computers you could talk to

Pick one.

I can't help but feel somewhat pessimistic about technological developments in the handheld device domain. I say this as a supporter of ARM though. I have no reason to dislike ARM, I'm just somewhat apprehensive regarding any more empowerment of the companies we keep in our pockets.


Talking to a computer is an interface like any other (mouse/keyboard). There's no fundamental reason why talking to a computer would be inherently privacy-invading; it just happens that most implementations today are, for technical/performance reasons.


Of course, I understand what an interface is. There's nothing inherently invasive about this kind of interface, except that...

> ...it just happens that most implementations today are...



