HN is overloaded with AI stuff; it's hard to break through all the noise. I say this as someone very interested in AI. Even I skip some links because it's just too much.
I see it making claims about 10x efficiency, but what about tokens/second/watt? The only machines I've seen with the memory bandwidth to effectively do local inference are the M-series Arm chips in Macs.
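To make the bandwidth point concrete: local LLM decode is typically memory-bandwidth-bound, so you can sketch the ceiling with a back-of-envelope calculation. All numbers below are illustrative assumptions, not measured figures.

```python
# Back-of-envelope: every generated token streams all model weights once,
# so decode speed is roughly bounded by memory bandwidth / model size.
# The example numbers are assumptions for illustration, not benchmarks.

def decode_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: bandwidth divided by bytes streamed per token."""
    return bandwidth_gb_s / model_size_gb

def tokens_per_sec_per_watt(tokens_per_sec: float, package_watts: float) -> float:
    """The efficiency figure in question: throughput per watt."""
    return tokens_per_sec / package_watts

# e.g. ~400 GB/s of bandwidth against a ~40 GB quantized model at ~40 W package power
tps = decode_tokens_per_sec(400, 40)
print(tps)                               # → 10.0 tokens/s ceiling
print(tokens_per_sec_per_watt(tps, 40))  # → 0.25 tokens/s/W
```

This ignores prefill, cache traffic, and compute limits, but it explains why bandwidth (not raw FLOPS) dominates the comparison for local decoding.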
Because it's not faster than the Ryzen 395's GPU. Power efficiency doesn't matter as much as TTFT (time to first token) for desktop users, especially when they're tasking their AMD box as a dedicated inference machine.
Some older, pre-395 AMD articles suggested it would be possible to use the NPU for prefill and the GPU for decoding, and that this would be faster than using either alone, but we have yet to see that (even on Windows) for any usefully sized models; just toys like LLaMA-8B.
According to Geekbench, the M5 is on average ~17% faster than the 9950X in single-thread performance and ~30% slower in multi-thread performance.
Individual benchmarks tell a fuller story. These two chips are optimized for different use cases, with Apple leaning heavily toward low-latency single-thread throughput at low sustained power usage.
I think the point of the line of questioning is to illustrate that "tools" like a code interpreter act as scratch space for models to do work in, because the reasoning/thinking process has limitations much like our own.
I'm aware. I work in ewaste recycling, and most of the machines I come across are about 10 years old. I'm also a fan of JayzTwoCents. https://m.youtube.com/watch?v=ukb5tlT4IuQ
For anyone who doesn't know yet, there is a wide variety of ONVIF-supporting cameras that you can set up with a local NVR running Frigate. You can block internet access to the cameras so they can't create outbound connections, and only inbound connections to the video streams are allowed.
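For a concrete starting point, a minimal Frigate camera entry for an ONVIF/RTSP camera can look like the sketch below (the IP address, credentials, and stream path are placeholders; check your camera's documentation for the actual RTSP URL):

```yaml
# Illustrative Frigate config fragment for one camera.
# 192.168.20.10 and the stream path are placeholder values.
cameras:
  front_door:
    ffmpeg:
      inputs:
        - path: rtsp://user:pass@192.168.20.10:554/h264
          roles:
            - detect
            - record
```

The outbound-blocking part is done on your router or firewall (e.g. a deny rule for the camera VLAN toward the WAN), not in Frigate itself.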
Tailscale has a free tier and is a good option for remotely accessing your network and cameras.
As a former user of Zulip at a previous company, thank you for this software; I enjoyed using it. Maybe I'll set up a private instance for friends and family so I can enjoy it once again.
There's a difference between carrying ten pounds a short distance for a short time and carrying an extra two pounds over twenty hours of travel, across multiple connecting international flights in a single day. And it's not just an extra two pounds: it's an additional proprietary power cord, extra bulk, more mass moving in and out from under an airliner seat. It all adds up, especially when you're sleep deprived and physically exhausted.
Any amount of weight is annoying after that long, but if the extra laptop weight drops to 10% of your 25-pound bag, it's even less likely to be the deciding factor between "portable" and "barely portable".
Is AMD's CUDA compatibility layer, which transparently compiles existing CUDA code just fine, insufficient or buggy somehow? Or are you just stuck in the mindshare game and haven't reevaluated whether the AMD situation has changed this year?
I haven't checked out AMD's compatibility layer and know nothing about it. I tried to get vkFFT working alongside cuFFT for a specific computation, but couldn't get it working right; crickets on the GH issue I posted.
I use Vulkan for graphics, but Vulkan compute is a mess.
I'm not stuck in mindshare, and this isn't a political thing. I am just trying to get the job done, and have observed that no alternative has stepped up to nvidia's CUDA from a usability perspective.
> have observed that no alternative has stepped up to nvidia's CUDA from a usability perspective.
I'm saying this is a mindshare thing if you haven't evaluated ROCm / HIP. HIPify can convert CUDA source to HIP automatically, and HIP's syntax is very similar to CUDA's.
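To give a sense of how mechanical that conversion is, here's a toy sketch of the kind of renaming HIPify performs. The real tools (hipify-perl / hipify-clang) also handle headers, launch configuration, and many more APIs; the dict below is a tiny illustrative subset, not the actual tool.

```python
# Toy illustration: much of CUDA-to-HIP translation is renaming cuda*
# runtime calls to their hip* equivalents (these mappings are real HIP
# API names; the translator itself is a simplified stand-in).
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def hipify_line(line: str) -> str:
    """Apply the mechanical renames to one line of CUDA source."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        line = line.replace(cuda_name, hip_name)
    return line

print(hipify_line("cudaMalloc(&d_x, n * sizeof(float));"))
# → hipMalloc(&d_x, n * sizeof(float));
```

Kernel code itself (the `__global__` functions and `<<<grid, block>>>` launches) largely carries over unchanged, which is why ported codebases stay readable.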
I posted it here the same day I found and started using it, to almost no reaction.
[0] https://github.com/FastFlowLM https://fastflowlm.com/ https://huggingface.co/FastFlowLM