> Local, app-embedded, and purpose-built targeted experts are clearly the future in my mind, for a variety of reasons. Looking at TPUs in Android devices and the neural engine in Apple hardware, it's pretty clear.
I think that’s only true for delay-intolerant or privacy-focused features. For most situations, a remote model running on an external server will outperform a local model. There is no thermal, battery, or memory headroom for the local model to ever do better. The cost is a mere hundred milliseconds of delay at most.
I expect most models triggered on consumer devices to run remotely, with a degraded local service option in case of connection problems.
Snapchat filters, iPhone photo processing/speech to text/always-on Hey Siri/OCR/object detection and segmentation - there are countless applications and features doing this on-device today (and for years). For something like the RAG approach I mentioned, the sync and coordination of your local content with a remote API would be more taxing on the battery, just in terms of the radio, than what we already see from on-device neural engines and TPUs as leveraged by the functionality I described.
These applications would also likely be very upload-heavy (photo/video inference - massive upload, tiny JSON response), which could end up taxing cell networks further. Even RAG is thousands of tokens in and a few hundred out (in most cases).
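To make the asymmetry concrete, here's a quick back-of-envelope in Python. All of the byte and token figures are rough assumptions for the sake of the sketch (a ~5 MB compressed photo, a ~2 KB JSON response, ~4 bytes per token), not measurements:

```python
# Rough, illustrative byte math for the upload/download asymmetry described above.
# Every constant here is an assumption, not a measured value.

PHOTO_UPLOAD_BYTES = 5 * 1024 * 1024   # assume a ~5 MB compressed photo sent for inference
JSON_RESPONSE_BYTES = 2 * 1024         # assume a ~2 KB JSON response (labels, boxes, etc.)

RAG_TOKENS_IN = 4000                   # "thousands of tokens in"
RAG_TOKENS_OUT = 300                   # "a few hundred out"
BYTES_PER_TOKEN = 4                    # rough average for English text

rag_upload = RAG_TOKENS_IN * BYTES_PER_TOKEN
rag_download = RAG_TOKENS_OUT * BYTES_PER_TOKEN

print(f"photo inference: {PHOTO_UPLOAD_BYTES / JSON_RESPONSE_BYTES:.0f}x more up than down")
print(f"RAG request:     {rag_upload / rag_download:.0f}x more up than down")
```

Under those assumptions a photo inference request is thousands of times heavier on the uplink than the downlink, and even a text-only RAG call is still upload-dominated.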
There's also the issue of Nvidia GPUs having > 1 year lead times and the exhaustion of GPUs available from various cloud providers. LLM training especially consumes tremendous resources, and that growth is leading to more and more contention for the available GPUs. People are going to be looking more and more to save the cloud and the big GPUs for what you really need them for - big training.
Plus, not everyone can burn $1m/day like ChatGPT.
If AI keeps expanding and eating more and more functionality, the remote-first approach just isn't sustainable.
There will likely always be some sort of blend (with the serious heavy lifting staying in the cloud, of course), but it's going to shift more and more to local and on-device. There's just no other way.
> Snapchat filters, iPhone photo processing/speech to text/always-on Hey Siri/OCR/object detection and segmentation - there are countless applications and features doing this on-device today (and for years)
But those are peanuts compared to what will be possible in the (near) future. You think content-aware fill is neat? Wait until you can zoom out of a photo 50% or completely change the angle.
That’ll cost gobs of processing power and thus time and battery, much more than a 20MB burst transfer of a photo and the back-synced modifications.
> If AI keeps expanding and eating more and more functionality, the remote-first approach just isn't sustainable.
It’ll definitely create a large moat around companies with lots of money or extremely efficient proprietary models.
> That’ll cost gobs of processing power and thus time and battery
The exact same thing was said about the functionality we're describing, yet there it is. Imagine describing that to someone in 2010 who's already complaining about iPhone battery life. The response would be a carbon copy of yours.
In the five years from the iPhone 8 to the iPhone 14, TOPS on the neural engine went from 0.6 to 17 [0]. The iPhone 15 more than doubled that and stands at 35 TOPS [1]. Battery life is better than ever, and that's a 58x gain in the neural engine alone, not even counting GPU, CPU, performance cores, etc.
Over that same period, Nvidia GPUs only increased about 9x [2] - they're already pushing the fundamentals much harder, so it's a law-of-large-numbers-ish issue.
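The arithmetic behind those multipliers, using the TOPS figures cited above (the ~9x GPU figure is taken at face value from [2]):

```python
# Quick arithmetic behind the growth figures above, using the cited TOPS numbers.
iphone8_neural_tops = 0.6    # iPhone 8, per [0]
iphone14_neural_tops = 17    # iPhone 14, per [0]
iphone15_neural_tops = 35    # iPhone 15, per [1]

print(f"iPhone 8 -> 14: {iphone14_neural_tops / iphone8_neural_tops:.0f}x")  # ~28x
print(f"iPhone 8 -> 15: {iphone15_neural_tops / iphone8_neural_tops:.0f}x")  # ~58x

# Contrast with the ~9x GPU gain cited from [2] over a similar window.
nvidia_gpu_gain = 9
print(f"neural engine scaled ~{iphone15_neural_tops / iphone8_neural_tops / nvidia_gpu_gain:.1f}x faster")
```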
So yeah, I won't have to wait long for zooming out of a photo 50%, completely changing the angle, or who knows what else to be done locally. In fact, for these use cases, increasingly advanced optics, processing, outside-visual-range sensors, etc., make my point even stronger - even more data going to the cloud when the device is best suited to do the work anyway.
Look at it this way - Apple sold over 97 million iPhones in 2023. Assuming the lower average of 17 TOPS per device, that's 1,649,000,000 combined TOPS out there.
Cloud providers benefit from optimization and inherent oversubscription, but by comparison, Nvidia sold somewhere around 500,000,000 TFLOPS worth of H100s last year.
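As a back-of-envelope, here's how those installed-base numbers fall out. The 97 million unit figure and the 17 TOPS average come from the comment above; the H100 unit count and per-card throughput are illustrative assumptions used to reconstruct the ~500,000,000 TFLOPS figure, not sourced numbers:

```python
# Back-of-envelope for the installed-base comparison above.
iphones_sold_2023 = 97_000_000
avg_neural_tops = 17                       # the "lower average" used above
iphone_fleet_tops = iphones_sold_2023 * avg_neural_tops
print(f"iPhone fleet: {iphone_fleet_tops:,} TOPS")      # 1,649,000,000

h100_units = 500_000                       # assumed annual shipments (illustrative)
h100_tflops_per_card = 1_000               # assumed per-card throughput (illustrative)
h100_fleet_tflops = h100_units * h100_tflops_per_card
print(f"H100 fleet:   {h100_fleet_tflops:,} TFLOPS")    # 500,000,000
```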
Mainframe and serial terminal to desktop to thin client and terminal server - around and around we go.