Short version: A Qwen-2.5 7B model that has been turned into a diffusion model.
A couple of notable things: first, that you can do this at all (left-to-right model -> out-of-order diffusion via finetuning), which is really interesting. Second, the final version beats the original by a small margin on some benchmarks. Third, it’s in the ballpark of Gemini Diffusion, although not competitive, which is to be expected for any 7B-parameter model.
A diffusion model comes with a lot of benefits in terms of parallelization and therefore speed; to my mind the architecture is a better fit for coding than strict left to right generation.
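For intuition on the parallelism point, here's a toy sketch of masked-diffusion decoding, where several positions get committed per pass instead of one token per pass as in left-to-right decoding. The "predictions" below are placeholders rather than Apple's actual setup; the sketch only illustrates the step count.

    # Toy illustration: masked-diffusion decoding fills several masked
    # positions per pass; left-to-right decoding would need 16 passes here.
    # The "predictions" are placeholders -- no real model involved.
    import random

    MASK = "<mask>"

    def denoise_step(tokens, fill_fraction=0.25):
        """Commit a fraction of the remaining masked positions in one pass."""
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        k = max(1, int(len(masked) * fill_fraction))
        for i in random.sample(masked, min(k, len(masked))):
            tokens[i] = f"tok{i}"  # stand-in for the predicted token
        return tokens

    seq = [MASK] * 16
    steps = 0
    while MASK in seq:
        seq = denoise_step(seq)
        steps += 1
    print(f"filled 16 positions in {steps} passes")  # vs. 16 left-to-right

In practice each pass is still a full forward pass over the sequence, so the win is fewer sequential steps, not less total compute.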
Overall, interesting. At some point these local models will get good enough for ‘real work’ and they will be slotted in at API providers rapidly. Apple’s game is on-device; I think we’ll see descendants of these start shipping with Xcode in the next year as just part of the coding experience.
Without having tried it, what keeps surprising me is how apparently very different architectures (and in other cases training data) lead to very similar outcomes. I'd expect results to vary a lot more.
I would expect a lot of attempts to fail, and those tend not to be published, or gather less attention. So if we have reached a local optimum, any technique that gets close to the current benchmarks is worth publishing as soon as results reach that point. All the ones that are too distant are discarded. In the end, all the papers you see are close to the current status quo.
It's possible that some of these new architectures / optimizations would allow us to go beyond the current benchmark scores, but probably with more training data, and money. But to get money you need to show results, which is what you see today. Scaling remains king; maybe one of these techniques is the 2025 "attention" paper, but even that one needed a lot of scaling to go from the 2017 version to ChatGPT.
It doesn't look like it got pushed that much, unfortunately. The article says they only added 20k examples to fine-tune at the end, but maybe the ceiling is much higher for diffusion?
But yeah, RWKV also ends up in a similar performance area at similar sizes - I wish someone would finally start using it at scale...
The data might be the limiting factor of current transformer architectures, but there's no reason to believe it's a general limiting factor of any language model (e.g. human brains are "trained" on orders of magnitude less data and still generally perform better than any model available today).
When we look at the small models suitable for running locally, by far the best programming model is DeepSeek-R1-0528-Qwen3-8B. In real-world usage it is quite comparable even to much bigger models.
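If anyone wants to kick the tires locally, a minimal sketch with llama-cpp-python looks roughly like this; the GGUF filename and quant level are assumptions, substitute whichever community quant you actually downloaded:

    # Minimal local-inference sketch (assumes llama-cpp-python is installed
    # and a GGUF quant of DeepSeek-R1-0528-Qwen3-8B is on disk).
    from llama_cpp import Llama

    llm = Llama(
        model_path="DeepSeek-R1-0528-Qwen3-8B-Q4_K_M.gguf",  # assumed filename
        n_ctx=8192,
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Write a Python function that parses ISO 8601 dates."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])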
> A diffusion model comes with a lot of benefits in terms of parallelization and therefore speed; to my mind the architecture is a better fit for coding than strict left to right generation.
I had a similar notion and am excited to see this research being done. My experience of writing code is that the structure of the whole system influences each individual part, which has always felt like a better match for a diffusion type model.
I suspect this is a 7B model because it’s an experiment, but I do like seeing Apple playing with smaller models - I think Google’s “no moat” memo is still fundamentally correct, either via better architectures or Moore’s law, and it seems like Apple thinks the same.
The "no moat" memo is way more complex than Google admitting an uncomfortable truth. The benefit massively from having seemingly internal documents leaked about how the play field is fair.
> to my mind the architecture is a better fit for coding
We have to see if it produces better results. Humans have a planning phase, followed by a part-by-part implementation phase. This is reasonably well emulated by plan/architect + codegen tools.
It's delusional to think that most software projects can be planned in advance beyond "there will be a beginning, a middle, and an end". People do it, but their efforts are in my experience generally ignored once implementation gets underway.
Planning in software isn’t about following the plan but mapping a viable route to avoid predictable issues. You’re always going to know more about a project as you build it and you should keep updating that plan.
That’s true at the project level. But surely when you sit down to actually work for a couple hours you think about what you are going to do, and then mostly do that.
In my experience it’s more fractal. Any subgoal, however small, may run into its own planning/thinking and then doing sequence, or even have you reconsider the higher-level plan. Of course, it somewhat depends on how run-of-the-mill the overall task is.
They predict more than just the second half of a word you are typing, but at the end of the day they're still just predicting what a human would have typed.
Most of the "magic" of large models are really just function calls, so as long as the small models have access to the same functions they work well. They fixed the "how many R's in Strawberry" issue by offloading the question to a function, not spending a godly amount of money/energy on training another model.
Oops, sorry, "tools". Gotta maintain the grift that these statistics-based lossy-text-compression bar tricks are "thinking".
Why can’t I backup an iOS device to a local NAS in the way I can use Time Machine, for example? (Rhetorical question; the answer is obviously that they want to sell more iCloud storage for that all-important services revenue).
The number of people that would buy a NAS over just spending $5/month for storage is well below one percent, and if you combine that with the requirement of not having a PC/Mac you may well end up in the hundreds…
There aren’t that many people who are willing to own a device from a company but not trust that company with their data.
I'm willing to bet that more people would back up their Android device if Google provided a first-party tool for user-friendly backups of Android devices to local computers.
Finder -> Locations -> Your iPhone -> Backup all the data on this iPhone to your Mac.
Once you have done this you can find the backup in "Manage Backups", right click on an entry and select "Show in Finder". From there you can copy it to your NAS.
Not as smooth as a Time Machine backup but it is possible.
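If you want to script that last copy step, something like this works on macOS, where Finder keeps backups under MobileSync; the NAS mount point is an assumption, point it at wherever your share is mounted:

    # Copy Finder-made iPhone backups to a mounted NAS share.
    # /Volumes/nas-backups/iphone is an assumed mount point -- adjust it.
    import shutil
    from pathlib import Path

    BACKUP_DIR = Path.home() / "Library/Application Support/MobileSync/Backup"
    NAS_MOUNT = Path("/Volumes/nas-backups/iphone")

    NAS_MOUNT.mkdir(parents=True, exist_ok=True)
    for backup in BACKUP_DIR.iterdir():
        if backup.is_dir():
            # dirs_exist_ok lets repeated runs refresh an existing copy
            shutil.copytree(backup, NAS_MOUNT / backup.name, dirs_exist_ok=True)
            print(f"copied {backup.name}")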
That’s a guide on how to backup an iPhone to a NAS using a computer.
Unsurprisingly, a reasonably capable general-purpose OS supports network file systems in a way transparent to applications, but that doesn’t help people using only an iOS device.
Did you actually read what you linked, or did you just paste in a random link from a search engine?
There are two methods presented: one only backs up the camera roll; the other requires plugging into a computer and manually clicking around, at which point you might as well use the first party backup built into Finder (or iTunes on Windows? Is that still a thing?), no random third party application needed. I also highly doubt their “backup every single content” claim.
It’s also a sneaky marketing article for that third party application, following the common SEO practice of giving you a half-assed solution capturing a frequent search term (in this case, “backup iPhone to Synology”), then plugging their own questionable thing as the better solution.
> I think Apple will ultimately destroy the data center
I think EVs destroying Ultra Large Container ships had better odds, and both are extremely unlikely. Data-center advantages Apple won't be able to overcome: compute density, cooling, cheap power, physical security to protect the software, scale + bandwidth, lower costs to customers from using contract manufacturers and/or commodity hardware.
There is no universe where large enterprises ditch their geo-located racks. Let alone hyperscalers, especially now that they are scrounging for energy, reneging on pledges on renewables, and paying big bucks to bring nuclear power stations online.
It’s easy to imagine a universe where the hyperscalers are in a bubble and they will eventually find a limit to adding classical compute and we will hit peak datacenter and shrink from there.
Not without fundamentally changing the way they think about computing and there seems to be zero willingness among their leadership to do that. In fact they seem to want to move things into the data center. That's why I'm shorting them.