
> Per an analysis by TD Cowen cited by Bloomberg, hiked prices for server racks, cooling systems, chips, and other components could contribute to overall build cost rises of 5-15% on average.

I can agree that infrastructure generally has much lower margins than software businesses, but I don't know of any mega-project that didn't turn out costlier than planned during construction, and a 15% overrun is hardly something that has stopped them.


The article mostly focuses on the 2008-2014 era.

Yes. It is part of a series in which I cover Shockley -> Fairchild -> Intel, up to last month.

It might be clearer to link the original blog post (https://cognition.ai/blog/kevin-32b) rather than the Hugging Face model card.

The charts and formulas are attractive, but the CUDA snippet feels quite elementary.


If I recall correctly from the recent YC interview, the Windsurf founder noted their team leans more toward GTM than engineering. That makes this less likely to be a classic acquihire (as with Rockset) and more plausibly a data play rather than a product integration.

My current read is that this is a frontier lab acquiring large-scale training data—cheaply—from a community of “vibe coders”, instead of paying professional annotators. In that light, it feels more like a “you are the product” scenario, which likely won’t sit well with Windsurf’s paying customers.

Interesting times.


Agreed. It seems like a data play and a hedge to beef up its vibe-coding offering against upcoming Google and MS models so OpenAI doesn't lose API revenue. I would assume vibe coding consumes more tokens than most other text-based API usage.

Of all the billion-scale investment and acquisition news of the last 24 hours, this is the only one that makes sense. Especially after the record-breaking $15B round that Databricks closed last year.

The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that's not what is generally called SIMD in CUDA land.

Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (<https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...>). That significantly shrinks their applicability in most domains outside of video and string processing. I've had pretty high hopes for DPX on Hopper (<https://developer.nvidia.com/blog/boosting-dynamic-programmi...>) instructions and started integrating them in StringZilla last year, but the gains aren't huge.


Oh wow, TIL, thanks. I usually call stuff like that SWAR, and every now and then I try to think of a way to (fruitfully) use it. The "SIMD" in this case was just an allusion to warp-wide functions looking like how one might use SIMD in CPU code, as opposed to typical SIMT CUDA.
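
To make the allusion concrete, a rough sketch (the helper name is made up): the 32 lanes of a warp stand in for the lanes of a CPU SIMD register, and shuffle intrinsics play the role of lane permutes.

    __device__ int warp_sum(int x) {
        // Butterfly reduction across the 32 lanes of a warp.
        for (int offset = 16; offset > 0; offset >>= 1)
            x += __shfl_xor_sync(0xffffffffu, x, offset); // exchange with lane ^ offset
        return x; // every lane ends up holding the warp-wide sum
    }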

Also, StringZilla looks amazing -- I just became your 1000th GitHub follower :)


Thanks, appreciate the gesture :)

Traditional SWAR on GPUs is a fascinating topic. I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX (<https://github.com/ashvardanian/less_slow.cpp/pull/35>), but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU, though FMA/MIN/MAX operations in that PR might not be the clearest showcase of its strengths. Do you have a better example or use case in mind?
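
For concreteness, a rough sketch of what such a 64-bit SWAR primitive could look like (illustrative only, not code from that PR): eight uint8 lanes packed into one 64-bit word, added with the high bit of every byte masked off so carries can't spill into the neighboring lane.

    __host__ __device__ inline unsigned long long swar_add_u8x8(
        unsigned long long a, unsigned long long b) {
        const unsigned long long H = 0x8080808080808080ull; // high bit of every byte
        unsigned long long low = (a & ~H) + (b & ~H);       // add the low 7 bits per lane
        return low ^ ((a ^ b) & H);                         // patch the high bits back in
    }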


I don't -- unfortunately, I'm not too well-versed in this field! But I was a bit fascinated with SWAR after I randomly thought of how to prefix-sum with int multiplication, later finding out that it is indeed an old trick, as I suspected (I'm definitely not on this thread, btw): https://mastodon.social/@dougall/109913251096277108
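
For context, the usual form of that trick looks roughly like this (assuming every partial sum still fits in a byte, so no carry crosses a lane boundary): multiplying by 0x0101...01 turns byte k of the product into the inclusive prefix sum of bytes 0..k.

    __host__ __device__ inline unsigned long long prefix_sum_u8x8(unsigned long long x) {
        // x * (1 + 2^8 + 2^16 + ... + 2^56) shifts-and-adds x into itself,
        // so byte k accumulates bytes 0..k and the top byte holds the total.
        return x * 0x0101010101010101ull;
    }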

As for 64-bit... well, I mostly avoid using high-end GPUs, but I was under the impression that i64 arithmetic is just emulated. In fact, I was thinking of using the full warp as a "pipeline" to implement u32 division (mostly as a joke), almost like anti-SWAR. There was some old-ish paper detailing arithmetic latencies on GPUs, and division was roughly 32x the cost of multiplication (...or I could be misremembering).



Has anyone done/shared a recent benchmark comparing JNI call latency across Java runtimes? I’m exploring the idea of bringing my strings library to the JVM ecosystem, but in the past, JNI overhead has made this impractical.

Java has effectively replaced JNI with the Project Panama FFM API, which, depending on your use case, might perform quite a bit better than JNI used to. The Vector API is still stuck in incubator and a bit rough around the edges though, so SIMD might be a bit trickier.
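
For reference, a minimal sketch of an FFM downcall with the finalized API (Java 22+), binding libc's strlen without any JNI glue; the class name is arbitrary.

    import java.lang.foreign.*;
    import java.lang.invoke.MethodHandle;

    public class StrlenDemo {
        public static void main(String[] args) throws Throwable {
            Linker linker = Linker.nativeLinker();
            // Look up strlen in the default (libc) lookup and describe its signature.
            MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
            try (Arena arena = Arena.ofConfined()) {
                MemorySegment cString = arena.allocateFrom("StringZilla"); // NUL-terminated UTF-8
                System.out.println((long) strlen.invokeExact(cString));    // prints 11
            }
        }
    }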

Can you share a link to your "strings library"? I am curious about what it can do that a Java String cannot.

At this point, it doesn’t provide much novel functionality, but it should be faster than the standard libraries of most (or maybe all) programming languages.

https://github.com/ashvardanian/StringZilla


I can't speak for the whole industry, but we used it in older UForm <https://github.com/unum-cloud/uform> and saw good adoption, especially among those deploying on the Edge, where every little trick counts. It's hard to pin down exact numbers since most deployments didn't go through Hugging Face, but at the time, these models were likely among the more widely deployed by device count.


There is a "hush-hush open secret" between minutes 31 and 33 of the video :)


TL;DR: the same binary runs on Nvidia and ATI today, but it hasn't been announced yet.


`#embed`, finally! Fixing UB in range-based `for` loops is also a good one!

