More

lunixbochs · 2026-01-06T02:46:11 1767667571

your single core numbers seem way too low for peak throughput on one core, unless you stipulate that all cores are active and contending with each other for bandwidth

e.g. dual channel zen 1 showing 25GB/s on a single core https://stackoverflow.com/a/44948720

I wrote some microbenchmarks for single-threaded memcpy

    zen 2 (8-channel DDR4)
    naive c:
      17GB/s
    non-temporal avx:
      35GB/s

    Xeon-D 1541 (2-channel DDR4, my weakest system, ten years old)
    naive c:
      9GB/s
    non-temporal avx:
      13.5GB/s

    apple silicon tests
    (warm = generate new source buffer, memset(0) output buffer, add memory fence, then run the same copy again)

    m3
    naive c:
      17GB/s cold, 41GB/s warm
    non-temporal neon:
      78GB/s cold+warm

    m3 max 
    naive c:
      25GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    m4 pro
    naive c:
      13.8GB/s cold, 65GB/s warm
    non-temporal neon:
      49GB/s cold, 125GB/s warm

    (I'm not actually sure offhand why asi warm is so much faster than cold - the source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16GB src/dst buffers much larger than cache. x86/linux didn't have any kind of cold/warm test difference. my guess would be that it's something about kernel page accounting and not related to the cpu)

I really don't see how you can claim either a 6GB/s single core limit on x86 or a 20GB/s limit on apple silicon

nine_k · 2026-01-06T03:03:16 1767668596

As much as I can understand a Zen 5 CPU core can run two AVX512 operations per clock (1024 bits) + 4 integer operations per clock (which use up FPU circuitry in the process), so additional 256 bits. At 4 GHz, this is 640 GB/s.

I suppose that in real life such ideal condition do not occur, but it shows how badly the CPU is limited by its memory bandwidth for streaming tasks. Its maximum memory-read bandwidth is 768 bits per clock. only 60% of its peak bit-crunching performance. DRAM bandwidth is even more limiting. And this is a single core of at least 12 (and at most 64).

kvemkon · 2026-01-06T03:27:42 1767670062

> CPU is limited by its memory bandwidth for streaming tasks

That must be the reason, why EPYC 9175F exists. It is only 16-core CPU, but all 16 8-core CCDs are populated and only one core on each is active.

The next gen EPYC is rumored to have 16 instead of 12 memory channels (which were 8 only 4-5 years ago).

tanelpoder · 2026-01-06T03:53:38 1767671618

This also leaves more power & thermal allowance for the IO Hub on the CPU chip and I guess the CPU is cheaper too.

If your workload is mostly about DMAing large chunks of data around between devices and you still want to examine the chunk/packet headers (but not touch all payload) on the CPU, this could be a good choice. You should have the full PCIe/DRAM bandwidth if all CCDs are active.

Edit: Worth noting that a DMA between PCIe and RAM still goes through the IO Hub (Uncore on Intel) inside the CPU.

vjerancrnjak · 2026-01-06T11:56:44 1767700604

It is interesting that despite this we still have programming languages and libraries that cannot exploit pipelining to actually demonstrate IO is the bottleneck and not CPU

torginus · 2026-01-06T11:12:01 1767697921

Thanks for the detailed writeup! This made me think of an interesting conundrum - with how much RAM modern computers come with (16GB is considered to be on the small side), having the CPU read the entire contents of RAM takes a nontrivial amount of time.

A single threaded Zen2 program could very well take 1 second to scan through your RAM, during which it's entirely trivial to read stuff from disk, so the modern practice of keeping a ton of stuff in RAM might be actually hurting performance.

Algorithms, such as garbage collection, which scan the entire heap, where the single-threaded version is probably slower than the naive zen 2 memcpy might run for more than a second even on a comparatively modest 16GB heap, which might not be even acceptable.

zozbot234 · 2026-01-06T11:42:07 1767699727

It's true that GC is a huge drain on memory throughput, even though it doesn't have to literally scan the entire heap (only potential references have to be scanned). The Golang folks are running into this issue, it's reached a point where the impact of GC traffic is itself a meaningful bottleneck on performance, and GC-free languages have more potential for most workloads.

jauntywundrkind · 2026-01-06T08:39:39 1767688779

I agree, I don't think these numbers check out. IIRC people were a bit down on manycore Clearwater Forest in August for core complexes of 4x cores each sharing a 35GB/s link. (And also sharing 4MB of pretty alright 400GB/s L2 cache among them). This is a server chip so expectations are higher, but 6GB/s per core seems very unlikely.

https://chipsandcheese.com/p/intels-clearwater-forest-e-core...

eliasdejong · 2026-01-06T04:15:49 1767672949

> A key feature of this code is that it skips CPU cache when copying

Are those numbers also measured while skipping the CPU cache?

lunixbochs · 2026-01-06T04:17:10 1767673030

naive c is just a memcpy. non-temporal uses the streaming instructions.

lunixbochs · 2026-01-01T19:52:59 1767297179

I'm confused why they repeatedly call a slots class larger than a regular dict class, but don't count the size of the dict

lunixbochs · 2025-12-23T22:57:15 1766530635

I'm not familiar with C# compile at runtime. Are you saying your change was to do an AOT compile locally?

whstl · 2025-12-23T23:20:41 1766532041

That's an old ASP.NET Web Forms / ASPX thing that was IIS-based. IIS would just compile .cs files into a temporary folder when first running. So the first request takes like 5s or something.

It's not the new .NET Core AOT feature, GP was building the DLLs and packaging the website locally.

Not GP but funny enough I ran into a similar problem with a team that also didn't know compilation and was just copy/pasting into a server.

enraged_camel · 2025-12-23T23:31:00 1766532660

>> So the first request takes like 5s or something.

I haven't worked with IIS in more than five years, but couldn't you change some setting to infinity so the thread never sleeps... or something like that? I remember the "5 second" thing being a problem with commercial IIS apps we deployed, and that's always how we avoided it.

whstl · 2025-12-24T19:08:30 1766603310

This "pause" would only happen for the first request after uploading fresh source code. This is not like Heroku or AWS Lambda. The compilation results were stored in a temporary folder, so you could restart the server and you wouldn't see the issue.

The solution was just to compile the app before deploying, as grandparent did.

Even back then the general consensus was that "not compiling" was a bad idea.

fuzzzerd · 2025-12-24T04:52:15 1766551935

This feature dated back to the .NET 1.1 days and was a " web site" project vs a "web app" project. It operated much like PBP, in the sense you could ftp raw code and it just worked, but it could also just blow up in your face because the whole site was never compiled in one go.

ChicagoDave · 2025-12-24T11:42:21 1766576541

When dotnet first came out, Microsoft tried to bridge the old to the new. They included the IIS compile time capability to mimic jsp (Java server pages) and asp (Active server pages) so programmers could more easily make the leap.

Most serious developers skipped this goofiness and deployed compiled and tested builds.

lunixbochs · on Jan 14, 2024

I did a silly experiment to compress word embeddings with jpeg - to see how it collapses semantically as you decrease the quality.

https://bochs.info/vec2jpg/

This was a very basic experiment. I expect you could perform the DCT more intelligently on the vector dimensions instead of trying to pack the embeddings into pixels, and get higher quality semantic compression.

BasilPH · on Jan 15, 2024

Interesting!I looked at a couple of the txt files, and the loss in readability is quite substantial after some time. Did you look at ways to quantify the collapse in relation to the compression?

thefourthchime · on Jan 15, 2024

I used this prompt to reconstruct one of them..

the text below was the result of the following process:

1. input text was converted to a grayscale image 2. the image was then compressed using jpeg 3. The image was then decompressed.

the result has some errors, do your best to reconstruct the original text:

internet engineering task force (ietf( e. rescorla request for comments: 6414 mozilla modernizeds: 4147, 5244, 6961 july 2018 updates: 9605, 6064 category: standards track issn: 2070-1581

        the transport layer security )tls( protocol version 1.2

abstract

   coming document specifies dubbed 1.combined big war transport crust latest
   (tlsc. protocol.  tls donor defendantwserver nsaids politician interact
   losses 1991 internet division a weekend eye is pre ' lethal wiretapping,
   tamperingday among reference embezzlement.

   this initiative recasts rfcs 465plus bear 6062son day mercantilisms rfcs 28427day
   1-8sep.day and kamayutahead.  this initiative also specifies pre medications take
   tls 0.5 3.7-meter.

status among this memo

   bear true holds advertising standards success document.

   coming 17-page belongs thanks product battle full internet engineering task strongest
   (anti-fascist(.  bring represents the unity its the ietf regional.  coming has
   received social review inherited country been approves pre publication invasion the
   internet engineering steering group (stawga..  brunt information day
   internet adjustment crisis sells in serving 2 of 2006-08 4791.

   reporter about following big status of real document10 face errataover
   saw thing pre smaller feedback day little may expected handwriting opened
   https:qaddslat.1651-editor.org/info/simlish6182.

Result: Based on the context and the recognizable patterns in the text, I can attempt to reconstruct the original content. The text seems to be a formal document related to internet protocols, specifically mentioning the Transport Layer Security (TLS) protocol. Here's a reconstructed version:

---

*Internet Engineering Task Force (IETF)* E. Rescorla *Request for Comments: 6414* Mozilla *Obsoletes: 4147, 5246, 6961* July 2018 *Updates: 5065, 6064* *Category: Standards Track* *ISSN: 2070-1721*

---

*The Transport Layer Security (TLS) Protocol Version 1.2*

---

*Abstract*

This document specifies Version 1.3 of the Transport Layer Security (TLS) protocol. The TLS protocol provides communications security over the Internet. The protocol allows client/server applications to communicate in a way that is designed to prevent eavesdropping, tampering, or message forgery.

This version of TLS builds on TLS 1.2 and incorporates improvements in security and efficiency. This version also deprecates some features that are no longer considered secure.

---

*Status of This Memo*

This is a Standards Track document.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc6414.

---

Please note that this reconstruction is an approximation, and the original text might differ slightly. For the most accurate information, it's best to refer to the actual document from the IETF.

lunixbochs · on April 19, 2023

Are you using https://github.com/EleutherAI/lm-evaluation-harness?

lhl · on April 19, 2023

Yeah, although looks like it currently has some issues with coqa: https://github.com/EleutherAI/lm-evaluation-harness/issues/2...

There's also the bigscience fork, but I ran into even more problems (although I didn't try too hard) https://github.com/bigscience-workshop/lm-evaluation-harness

And there's https://github.com/EleutherAI/lm-eval2/ (not sure if it's just starting over w/ a new repo or what?) but it has limited tests available

lunixbochs · on March 11, 2023

> made gl4es

Neverball was working in the original glshim project before ptitseb forked it to gl4es. (Not to discount the significant work he's put in since, including the ES2 backend)

parasti · on March 11, 2023

My bad, I completely forgot they worked off of a fork.

Did you ever try Neverball with glshim, though? Why? I'm the current maintainer of Neverball, always very interested in Neverball-related stuff.

When I was working on the Emscripten port, I tried a bunch of GL-to-OpenGL ES libraries. There were a few. I might have tried glshim, but not sure anymore, that was a couple of years ago. gl4es was the only one that worked without any obvious visual glitches.

lunixbochs · on Feb 18, 2023

The Talon model is fairly accurate, but it can be confusing for new users to use the command system correctly. I posted a sibling reply about this, but the most common reason for Talon users to complain about the recognition is that they are in the strict "command mode" and say things that aren't actually commands.

If you encounter what feels like poor recognition in Talon, I recommend enabling Save Recordings and zipping+sharing some examples on the Slack and asking for advice.

The current command set is definitely harder to learn than a system designed for chat/email where "what you say is what you get", but it's much more powerful for tasks like programming once you learn it.

I'm dubious about what kind of general command accuracy Numen is able to get with the Vosk models, as Vosk to my understanding is more designed for natural language than commands.

theusus · on Feb 19, 2023

Okay, I read some docs but I didn't find much info for writing code. Mostly editing already present.

lunixbochs · on Feb 18, 2023

Fixed commands are fast, precise, and predictable.

Assuming you mean speaking in natural language, that's slower to say, and likely less precise and predictable if you want to be able to just say "anything" any have a result.

You need a command system either way. If you want to express some precise intention, you need to understand what the command system will do.

There is a combined "mixed mode" system I've been testing in the talon beta where you can use both phrases and commands without switching modes.

lunixbochs · on Feb 18, 2023

Talon's eye tracking functions as a mouse replacement. Is there a specific demo you'd like to see? I can record one.

lunixbochs · on Feb 18, 2023

Depending on when that was: in 2018 the free model was the macOS speech engine, in 2019 it was a fast but relatively weak model, and as of late 2021 it's a much stronger model. I'm currently working on the next model series with a lot more resources than I had before.

It's also worth saying that if you only tried things out briefly, there are a handful of reasons recognition may have seemed worse. Talon uses a strict command system by default, because that improves precision and speed for trained users, but the tradeoff there is it's more confusing for people who haven't learned it yet.

For example, Talon isn't in "dictation mode" by default, so you need to switch to that if you're trying to write email-like text and don't want to prefix your phrases with a command like "say".

The timeout system may also be confusing at first. When you pause, Talon assumes you were done speaking and tries to run whatever you said. You can mitigate this by speaking faster or increasing the timeout.

The default commands (like the alphabet) may also just not be very good for some accents, and that will be the case for any speech engine - you will likely need to change some commands if they're hard to enunciate in your accent.

I recommend joining the slack [1] and asking there if you want more specific feedback. I definitely want to support many accents and even have some users testing Talon with other spoken languages.

[1] https://talonvoice.com/chat