
This is interesting to me, lots of noise around "machine learning" in the GPU rather than graphics which kind of validates Google's TPU, Apple's AI co-processors, and Nvidia's Jetson. Saw a system at Makerfaire that does voice recognition without the cloud (MOVI) and I keep thinking having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.



GPUs are 1980s-style SIMD compute systems. A GPU is designed to run many, many, many threads. A Vega64, for example, can keep 163,840 threads in flight (4096 shaders x 10-way SMT x 4 threads per unit), while a Threadripper 2950X CPU can "only" run 32 threads at once (16 cores x 2-way SMT).

However, a GPU shader is far more constrained than a CPU thread. A GPU shader is "locked" to executing the same instructions as the other shaders in its group. That is to say: 64 "GPU threads" on AMD systems share the same program counter / instruction pointer.

32 "GPU threads" on NVidia systems share the same program counter.

This means that if one thread loops 10,000 times in a program, the other 31 threads in its group (on NVidia) or the other 63 (on AMD) are forced to sit through those 10,000 iterations as well (even if they individually only need to loop 500 times).

This restriction doesn't really matter for matrix multiplication, where all threads loop the same number of times. But for something like a chess minimax algorithm, this "thread divergence" makes it difficult to actually port the algorithm to a GPU.
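To make the divergence point concrete, here's a minimal CUDA-style sketch (the kernel name divergent_loop is made up for illustration, not from any real codebase): each lane loops a data-dependent number of times, and the whole warp pays for its slowest lane.

    // Hypothetical CUDA kernel illustrating warp divergence.
    // Within a warp (32 threads on NVidia), the hardware keeps issuing the
    // loop until the slowest lane finishes, masking off lanes that are done.
    __global__ void divergent_loop(const int *iters, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        // If one lane needs 10,000 iterations and the rest need 500,
        // every lane in the warp effectively waits through 10,000 iterations.
        for (int i = 0; i < iters[tid]; ++i)
            acc += i * 0.5f;
        out[tid] = acc;
    }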

-------

Machine learning... (EDIT: or more specifically, Convolutional Neural Networks) is almost purely a matrix multiplication problem, making it ideal for the SIMD GPU architecture.

CPUs have SIMD units too, by the way (SSE, AVX, etc.), but GPUs are specialized systems that can only do SIMD. So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit.
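For contrast with the divergence example above, here's a naive matrix-multiply kernel (a sketch only, assuming square N x N row-major matrices; matmul_naive is an illustrative name): every thread runs the same N-iteration loop, which is exactly the shape of work SIMD hardware wants.

    // Naive N x N matrix multiply: one thread per output element.
    // Every thread executes the same N-iteration inner loop, so warps
    // never diverge -- the workload maps cleanly onto SIMD hardware.
    __global__ void matmul_naive(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }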

-------

EDIT: Google's TPU goes beyond the GPU-model (SIMD architecture) and has dedicated matrix-multiplication units. These units can only perform matrix multiplication, nothing else, and are designed to do it as quickly and power-efficiently as possible.

NVidia's "Turing" line of GPUs has dedicated matrix-multiplication units called "Tensor Cores". NVidia basically grafted a dedicated FP16 matrix-multiplication unit onto their GPUs.
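In CUDA those tensor units are exposed through the nvcuda::wmma ("warp matrix multiply-accumulate") intrinsics. A minimal sketch of one warp multiplying a single 16x16 FP16 tile and accumulating in FP32 (the kernel name wmma_tile is made up; no tiling, bounds checks, or shared-memory staging shown):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp cooperatively multiplies a 16x16 FP16 tile of A and B on the
    // tensor cores, accumulating into a 16x16 FP32 tile of C.
    __global__ void wmma_tile(const half *A, const half *B, float *C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }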


"So a GPU's SIMD unit is inevitably going to be bigger, more parallel, and more efficient than a CPU's SIMD unit." Except if memory latency transfer is a bottleneck. I've seen many examples of SIMD beating GPU e.g pathfinder v3 text renderer


While transfer latency between the CPU and GPU is an issue, GPUs have absolutely incredible throughput when moving data from one place on the GPU to another place on the GPU. This is why researchers are looking at GPUs for running graph algorithms.


It’s interesting that the cycle is complete and we’re back to the 1975 situation: “Wouldn’t it be great to have a personal computer that could do all these things we currently do by connecting to a mainframe?”

Only the PDP-10 has been replaced by Google.


All modern computers, even cell phones, have the capabilities of mainframes from the 1970s and 1980s.

Back in the 1980s, a personal computer was loaded with an operating system that could only support a single user. A "mainframe" was a system that could support more than one user at a time, and you'd connect to it from a terminal (or over telnet).

Today, your Linux PC loaded with SSH effectively functions as a mainframe. Heck, you can load virtual machines that themselves pretend to be mainframes from the 1970s. Your cell phone uses this same multi-user machinery: it pretends each app is a different user for maximum security.

----

In contrast: GPUs are simply 1980s-style SIMD supercomputers. It turns out the SIMD-supercomputer style was the fastest way to calculate where pixels go on the screen. (Conceptually, each pixel shader basically gets its own thread, so a 1920x1080 screen has 2,073,600 threads to run per frame.)
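A sketch of that mapping in CUDA-ish compute code (illustrative only; real pixel shaders go through the graphics pipeline, and the kernel name shade is made up): one thread per pixel, all running the same tiny program.

    // One thread per pixel: a 1920x1080 frame launches 2,073,600 threads,
    // all executing the same short program in lockstep groups.
    __global__ void shade(uchar4 *framebuffer, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            framebuffer[y * width + x] =
                make_uchar4((unsigned char)(x % 256), (unsigned char)(y % 256), 128, 255);
    }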

The best architecture for running millions of simple threads is a GPU or SIMD computer (see https://en.wikipedia.org/wiki/Connection_Machine):

> The CM-1, depending on the configuration, has as many as 65,536 individual processors, each extremely simple, processing one bit at a time.


> GPUs are simply the 1980s style SIMD super-computers

May be a terminology thing, or a technology I don't know, but I thought '80s supercomputers like Crays were vector processors, whereas I'm not aware of any commercial super at the time being SIMD (AKA an array processor).


I'm not super familiar with the distinction you're making ("array processor" vs "vector processor"), do you have any pointer I could look at?


It seems to be a terminology confusion of mine. I did a bit of reading and it seems they're effectively the same thing implemented differently. An array processor in my mind was a lockstep SIMD thingy.

A vector processor as I understood it is nicely described here <https://www.quora.com/What-is-meant-by-an-array-processor-an... as "the earliest Crays had vector instructions, that quickly fed a stream of operands through a pipelined processor", which is not exactly the same implementation as SIMD, but after reflecting on your question, it does the same thing in the end.

So effectively no difference.

Also relevant <https://arstechnica.com/civis/viewtopic.php?t=401649>: see the reply from Accs, who claims to have worked at Cray and who also says in another post on the same thread, "I think that Vector is nothing more than a special case of SIMD".

So, my error, hope above helps.

FYI you might want to read up on systolic arrays just for fun <https://en.wikipedia.org/wiki/Systolic_array>.

@tntn: thanks, my carelessness.


CM-1, linked in 'dragontamer's comment.


It is a wave, however, rather than a circle. There are oscillations back and forth between all sorts of things. It is inevitable, because the competing ideas and systems are oscillating all the way down. Nature is not perfectly in sync, and that is indeed what makes things interesting.


Usually the most apt metaphor is a spiral. We’ve gone round in a big circle, it feels like we’re back in the same place, except everything is completely different due to movement in a dimension which we didn’t even perceive.

And, well, you know, “words”... whatever works :-D



Except back then the driver wasn't privacy.


Is privacy the only driver? I think convenience, speed, and cost are still big concerns, just like they were back in the day.

Network requests are always very slow when compared to processor speed. You can buy a device for 1/500 the cost of a server, put it in your closet, and it will be faster at almost every task than offloading to the "cloud".

I work tangential to the Telecom space, and all anyone cares about is the "edge", i.e. Customer Premises Equipment and the like. With how cheap compute power is and how high network latency is, having centralized datacenters has become less and less desirable.


Yes, it’s more like a cycle of reincarnation: the same problems and solutions keep coming back in different guises.


And those problems and solutions are also somewhat caused by / solved by more modern technology.


The recent ML-focused chips are really just high-efficiency, low-precision matrix multipliers. It turns out that operation can be made ~10x more efficient than on previous processor designs, and it is the bottleneck for inference in modern neural-network models. Specialized hardware can also accelerate training somewhat.

So having this hardware in your phone means that, for example, face recognition can run faster, and scene analysis during video recording takes less battery.

But neural-network inference speed is rarely the reason companies run models on servers and collect user data. In the case of speech recognition, the problem is that training state-of-the-art models requires tens of thousands of hours of speech data; a recent Amazon paper actually used a million hours. For language models you need even more. Modern phones actually can perform speech recognition locally (put your phone in airplane mode and dictation still works), but it's using models trained on data from users that did talk to the server.


What I always wonder about is that AI requires data to learn from. When Google sends everything to their servers, they have that data from all over the world, so they can build better AI. If their competitors implement privacy-respecting solutions, they are doomed to stay behind Google, because they don't have the data to train their AI. A personal anecdote: my spoken English is quite bad, and Apple's Siri does not understand me well. But Google Assistant does wonders; it's fascinating how well its speech recognition works.

I have a feeling that AI technology is inherently anti-privacy.


Gboard (Google's keyboard on Android) has had voice recognition without the cloud for three months now, no extra work from ARM needed.

https://techcrunch.com/2019/03/12/googles-new-voice-recognit...

Looks like they're not uploading clips to themselves either, at least it doesn't show up in https://myactivity.google.com

They still store audio when you use Ok Google though.


Google is rolling out on-device voice recognition too. They’ve integrated it into Gboard:

https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...


> Saw a system at Makerfaire that does voice recognition without the cloud (MOVI) and I keep thinking having non-privacy-invading computers you could talk to would be a useful thing. Perhaps ARM will make that possible.

Yes, powerful ARM embedded platforms are making this more and more possible and I'm at least excited about it.

Not only that, speech recognition and speech synthesis technologies are getting more efficient by the month. There was a paper just this past week (https://arxiv.org/pdf/1905.09263.pdf) demonstrating speech synthesis that is an order of magnitude faster. Advances like those make it easier to cram these technologies onto local IoT platforms.

And there have certainly been a number of local-focused AI projects over the past few years. I remember Snips.ai being decent for building local-only voice assistants. It worked on a Raspberry Pi, though you still had to hack together decent microphones for it.

All of this gives me hope for an open source voice assistant that will be competitive with the commercial options.

Now if the makers of smart accessories (lights, etc) would stop being anti-consumer for two minutes we might get decent accessories that don't require a cloud.


Were you able to try it out yourself? Curious how good it is compared to "OK Google".

P.S. Chuck, you are a great inspiration and friend. Thank you for all the kindness you've shown me.


How much data would a real-world voice "model" that lives locally on a device require, space-wise? I don't know much about machine learning's actual implementations within software; I've always been curious how you go from (say) TensorFlow to a model embedded within real-world software / APIs.


Google Translate's offline files seem to be about 50MB per language. Assume your speech recognition machinery breaks utterances down into phonemes, with an alphabet a couple of times the size of the written one and some more entropy... My intuition is you should be in the 200-500MB range.

Disclaimer: I work at Google, but apparently have never touched any of this from the inside.


Earlier this year, Google introduced an Android "Live Transcribe" feature that runs locally without the need for an internet connection:

https://www.theverge.com/2019/2/4/18209546/google-live-trans...


And it's very accurate. I tested it with English and Greek, for which dictation systems are usually very bad, and was surprised how well it recognised every word.


It doesn’t exactly answer your question, but Apple has a decent description of how local “Hey Siri” detection works on the iPhone and watch:

https://machinelearning.apple.com/2017/10/01/hey-siri.html


The limiting factor is generally not memory but processing and power consumption.


Sure, my question was what the local algorithm/dataset looks like packaged up.



I would hope chip designers/manufacturers will at least stop referring to chips designed and marketed specifically for AI/ML as "G"PUs, when they aren't really about graphics processing anymore.


General Processing Units then? :)


That's very clever, and reminds me of DVD (Digital Video Disc to Digital Versatile Disc)


Maybe “Parallel Processing Unit”, PPU?


PPU is already used for "Picture Processing Unit", essentially the graphics chip of older 2D consoles.


It was also the "Power Processing Unit" in the IBM Cell chip that was in the PlayStation 3.


If a CPU is a Central Processing Unit then I guess we should call them DPUs: Distributed Processing Unit.


Guided processing unit. The CPU tells it what to process.


Yes, and hopefully that will stop the problems I get with video drivers when installing a GPU which I just use for raw computation. Also, my PC has 10 video outputs now, which I think is insane!


In the case of ARM, they seem to have announced Neural Processing Units (NPUs), which look more like a mass-market TPU.


>Saw a system at Makerfaire that does voice recognition without the cloud

Voice recognition without the cloud has been possible for years, this is nothing new.


You were at Maker Faire? Awesome, I'd love to hear your thoughts on the Faire in comparison with previous years.


What Nvidia's Jetson can achieve really is exciting.


As are AMD's newest announcements of Navi and Intel-smacking performance; they could probably lean over this way and jostle things around quite a bit if they wanted to.


> non-privacy-invading

> computers you could talk to

Pick one.

I can't help but feel somewhat pessimistic about technological developments in the handheld device domain. I say this as a supporter of ARM though. I have no reason to dislike ARM, I'm just somewhat apprehensive regarding any more empowerment of the companies we keep in our pockets.


Talking to a computer is an interface like any other (mouse/keyboard). There's no fundamental reason why talking to a computer would be inherently privacy-invading; it just happens that most implementations today are, for technical/performance reasons.


Of course, I understand what an interface is. There's nothing inherently invasive about this kind of interface, except that...

> ...it just happens that most implementations today are...



