SRAM: 32 MiB * 8 = 256 MiB (ignoring 2 MiB * 8 = 16 MiB of PSUM which is not really general-purpose nor DMA-able)
Interconnect: 2560 GB/s (I think bidirectional, i.e. Jensen Math™)
----
At a 3nm process node the FLOP/s is _way_ lower than the competition. Compare to the B200, which does 2250 TFLOP/s BF16, 2x that in FP8 and 4x in FP4. TPU7x does 2307 TFLOP/s BF16, 2x in FP8 (no native FP4). HBM capacity also lags behind (vs ~192 GiB in 6 stacks for both TPU7x and B200).
The main redeeming qualities seem to be: software-managed SRAM size (double that of TPU7x; GPUs have L2 so not directly comparable) and on-paper raw interconnect BW (double that of TPU7x and more than the B200).
Correct --- found a remark on Twitter calling this "Jensen Math".
Same logic as when Nvidia quotes the "bidirectional bandwidth" of high-speed interconnects to make the numbers look big, instead of the more common per-direction BW, forcing everyone else to adopt the same metric in their marketing materials.
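(If the 2560 GB/s figure quoted above is indeed bidirectional, the more familiar per-direction number would be 1280 GB/s.)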
Anecdote: 9 years ago I was at MSFT. Hands forced by long GC pauses, many teams eventually turned to hand-rolling their own flavor of string_view in C#. It was literally xkcd.com/927 back then when you tried to interface with another team's packages and each side had its own same-but-different string_view class. Glad to see that finally enjoying language and stdlib support.
I do check the standard library for things that sound like they should be there because they're common enough. My experience tells me this approach is not as common as you would expect; the same went for C# at MSFT, and I don't know how many people using the framework knew about ArraySegment.
Lidars have been reporting per-point intensity values for quite a while. The dynamic range is definitely not 1 bit.
Many Lidar visualization tools will happily pseudocolor the intensity channel for you. Even with a mechanically scanning 64-line Lidar you can often read a typical US speed limit sign at ~50 meters in this view.
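For a rough idea of what that looks like in practice, here is a minimal sketch of pseudocoloring intensity. It assumes the scan is already available as an (N, 4) numpy array of x, y, z, intensity; the file name and the intensity scale are made up, both vary by sensor and driver:

```python
# Rough sketch of pseudocoloring Lidar intensity. Assumes the scan is an
# (N, 4) numpy array of x, y, z, intensity; file name and scale are made up.
import numpy as np
import matplotlib.pyplot as plt

points = np.load("scan.npy")                      # hypothetical file, shape (N, 4)
x, y, intensity = points[:, 0], points[:, 1], points[:, 3]

# Normalize intensity to [0, 1]; a percentile clip helps because
# retroreflectors (road signs, license plates) saturate the upper end.
lo, hi = np.percentile(intensity, [1, 99])
norm = np.clip((intensity - lo) / (hi - lo + 1e-9), 0.0, 1.0)

# Top-down view colored by intensity; lane paint and signs pop out.
plt.scatter(x, y, c=norm, cmap="viridis", s=0.5)
plt.gca().set_aspect("equal")
plt.colorbar(label="normalized intensity")
plt.show()
```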
Not OP but I think this could be an instance of leaky abstraction at work. Most of the time you hand-write an accelerator kernel hoping to optimize for runtime performance. If the abstraction/compiler does not fully insulate you from micro-architectural details that affect performance in non-trivial ways (e.g. the memory bank conflicts mentioned in the article), then you still end up with per-vendor implementations, or compile-time if-else blocks all over the place. This is less than ideal, but still arguably better than working with separate vendor APIs or, worse, completely separate toolchains.
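As an illustration of what those leaks tend to look like, here is a made-up sketch, not code from the article or from any real toolchain; the names (WARP_SIZE, SMEM_BANKS, plan_tile) and the specific numbers are assumptions that vary by part:

```python
# Made-up sketch of the per-vendor branching that leaks through a "portable"
# kernel layer. Names and numbers are illustrative assumptions only.
WARP_SIZE  = {"nvidia": 32, "amd": 64}   # threads per warp / wavefront
SMEM_BANKS = {"nvidia": 32, "amd": 32}   # shared-memory / LDS banks (varies by part)

def plan_tile(vendor: str, tile: int = 64) -> dict:
    warp, banks = WARP_SIZE[vendor], SMEM_BANKS[vendor]
    # Pad the shared-memory row stride when the tile width is a multiple of
    # the bank count, so a warp reading a column does not hammer one bank
    # (the classic bank-conflict workaround).
    stride = tile + 1 if tile % banks == 0 else tile
    # Block geometry is usually a multiple of the warp/wavefront size, so
    # even the launch configuration ends up micro-architecture dependent.
    return {"smem_stride": stride, "threads_per_block": 4 * warp}

print(plan_tile("nvidia"))  # {'smem_stride': 65, 'threads_per_block': 128}
print(plan_tile("amd"))     # {'smem_stride': 65, 'threads_per_block': 256}
```

Once enough of these show up, you effectively have per-vendor implementations again, just interleaved in one file.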
Because I was originally writing some very CPU intensive SIMD stuff, which Mojo is also fantastic for. Once I got that working and running nicely I decided to try getting the same algo running on GPU since, at the time, they had just open sourced the GPU parts of the stdlib. It was really easy to get going with.
I have not used Triton/CuTe/CUTLASS though, so I can't really compare against anything other than CUDA.
> The active class is clearly redundant here. If you want to style based on the .active selector, you could just as easily style with [aria-selected="true"] instead.
I vaguely remember (from 10+ years ago) that class selectors are much more performant than property selectors?
The TL;DW is: yes, class selectors are slightly more performant than attribute selectors, mostly because only the attribute _names_ are indexed, not the values. But 99% of the time, it's not a big enough deal to justify the premature optimization. I'd recommend measuring your selector performance first: https://developer.chrome.com/docs/devtools/performance/selec...
From first principles I think the concept can make sense: from car-specific, function-specific ECUs, to platform-shared (but still function-specific) ECUs, and then to zonal architectures and domain controllers. The goals: consolidate and generalize HW across the lineup, moving the model-specific bits into FW/SW/config (which amortizes development cost and simplifies certification), and also simplify wiring (saving precious copper, which is costly, messy, and heavy) because you can pretty much plug every miscellaneous sensor or actuator into its nearest "anchor point" without worrying (too much) about arbitrary ECU limitations.
This might sound like a pure implementation detail, but having the (non-safety-critical) "business logic" of a car in software gives the manufacturer the flexibility to late-bind behavior as new use cases and demands inevitably get discovered.
Something can simultaneously be a good idea, get buzzword'd by marketing, and/or deviate from the original intentions.