SRAM: 32 MiB * 8 = 256 MiB (ignoring 2 MiB * 8 = 16 MiB of PSUM which is not really general-purpose nor DMA-able)
Interconnect: 2560 GB/s (I think bidirectional, i.e. Jensen Math™)
----
At a 3nm process node the FLOP/s is _way_ lower than the competition. Compare to the B200, which does 2250 TFLOP/s BF16, 2x that in FP8 and 4x in FP4. TPU7x does 2307 TFLOP/s BF16, 2x in FP8 (no native FP4). HBM capacity also lags behind (vs ~192 GiB in 6 stacks for both TPU7x and B200).
The main redeeming qualities seem to be: software-managed SRAM size (double that of TPU7x; GPUs have L2 so not directly comparable) and on-paper raw interconnect BW (double that of TPU7x and more than the B200).
Correct --- found a remark on Twitter calling this "Jensen Math".
Same logic as when Nvidia quotes the "bidirectional bandwidth" of high-speed interconnects to make the numbers look big, instead of the more common per-direction BW, forcing everyone else to adopt the same metric in their marketing materials.
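(If the 2560 GB/s figure quoted above is indeed bidirectional, the more familiar per-direction number would be 1280 GB/s.)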
Anecdote: 9 years ago I was at MSFT. Hands forced by long GC pauses, many teams eventually turned to hand-rolling their own flavor of string_view in C#. It was literally xkcd.com/927 back then when you tried to interface with another team's packages and each side had its own same-but-different string_view class. Glad to see that finally enjoying language and stdlib support.
I do check the standard library for things that sound like they should be there because they're common enough. My experience tells me this approach is not as common as you would expect; the same went for C# at MSFT, and I don't know how many people using the framework knew about ArraySegment.
Lidars have been reporting per-point intensity values for quite a while. The dynamic range is definitely not 1 bit.
Many Lidar visualization tools will happily pseudocolor the intensity channel for you. Even with a mechanically scanning 64-line Lidar you can often read a typical US speed limit sign at ~50 meters in this view.
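For a rough idea of what that looks like in practice, here is a minimal sketch of pseudocoloring intensity. It assumes the scan is already available as an (N, 4) numpy array of x, y, z, intensity; the file name and the intensity scale are made up, both vary by sensor and driver:

```python
# Rough sketch of pseudocoloring Lidar intensity. Assumes the scan is an
# (N, 4) numpy array of x, y, z, intensity; file name and scale are made up.
import numpy as np
import matplotlib.pyplot as plt

points = np.load("scan.npy")                      # hypothetical file, shape (N, 4)
x, y, intensity = points[:, 0], points[:, 1], points[:, 3]

# Normalize intensity to [0, 1]; a percentile clip helps because
# retroreflectors (road signs, license plates) saturate the upper end.
lo, hi = np.percentile(intensity, [1, 99])
norm = np.clip((intensity - lo) / (hi - lo + 1e-9), 0.0, 1.0)

# Top-down view colored by intensity; lane paint and signs pop out.
plt.scatter(x, y, c=norm, cmap="viridis", s=0.5)
plt.gca().set_aspect("equal")
plt.colorbar(label="normalized intensity")
plt.show()
```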
Not OP but I think this could be an instance of leaky abstraction at work. Most of the time you hand-write an accelerator kernel hoping to optimize for runtime performance. If the abstraction/compiler does not fully insulate you from micro-architectural details that affect performance in non-trivial ways (e.g. the memory bank conflicts mentioned in the article), then you still end up with per-vendor implementations, or compile-time if-else blocks all over the place. This is less than ideal, but still arguably better than working with separate vendor APIs or, worse, completely separate toolchains.
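As an illustration of what those leaks tend to look like, here is a made-up sketch, not code from the article or from any real toolchain; the names (WARP_SIZE, SMEM_BANKS, plan_tile) and the specific numbers are assumptions that vary by part:

```python
# Made-up sketch of the per-vendor branching that leaks through a "portable"
# kernel layer. Names and numbers are illustrative assumptions only.
WARP_SIZE  = {"nvidia": 32, "amd": 64}   # threads per warp / wavefront
SMEM_BANKS = {"nvidia": 32, "amd": 32}   # shared-memory / LDS banks (varies by part)

def plan_tile(vendor: str, tile: int = 64) -> dict:
    warp, banks = WARP_SIZE[vendor], SMEM_BANKS[vendor]
    # Pad the shared-memory row stride when the tile width is a multiple of
    # the bank count, so a warp reading a column does not hammer one bank
    # (the classic bank-conflict workaround).
    stride = tile + 1 if tile % banks == 0 else tile
    # Block geometry is usually a multiple of the warp/wavefront size, so
    # even the launch configuration ends up micro-architecture dependent.
    return {"smem_stride": stride, "threads_per_block": 4 * warp}

print(plan_tile("nvidia"))  # {'smem_stride': 65, 'threads_per_block': 128}
print(plan_tile("amd"))     # {'smem_stride': 65, 'threads_per_block': 256}
```

Once enough of these show up, you effectively have per-vendor implementations again, just interleaved in one file.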
Because I was originally writing some very CPU intensive SIMD stuff, which Mojo is also fantastic for. Once I got that working and running nicely I decided to try getting the same algo running on GPU since, at the time, they had just open sourced the GPU parts of the stdlib. It was really easy to get going with.
I have not used Triton/CuTe/CUTLASS though, so I can't really compare against anything other than CUDA.
> The active class is clearly redundant here. If you want to style based on the .active selector, you could just as easily style with [aria-selected="true"] instead.
I vaguely remember (from 10+ years ago) that class selectors are much more performant than property selectors?
The TL;DW is: yes, class selectors are slightly more performant than attribute selectors, mostly because only the attribute _names_ are indexed, not the values. But 99% of the time, it's not a big enough deal to justify the premature optimization. I'd recommend measuring your selector performance first: https://developer.chrome.com/docs/devtools/performance/selec...
From first principles I think the concept can make sense: from car-specific, function-specific ECUs, to platform-shared (but still function-specific) ECUs, and then to zonal architectures and domain controllers. The goals: consolidate and generalize HW across the lineup, moving the model-specific bits into FW/SW/config (which amortizes development cost and simplifies certification), and also simplify wiring (saving precious copper, which is costly, messy, and heavy) because you can pretty much plug every miscellaneous sensor or actuator into its nearest "anchor point" without worrying (too much) about arbitrary ECU limitations.
This might sound like a pure implementation detail, but having the (non-safety-critical) "business logic" of a car in software gives the manufacturer the flexibility to late-bind behavior as new use cases and demands inevitably get discovered.
Something can simultaneously be a good idea, get buzzword'd by marketing, and/or deviate from the original intentions.