
They’re used by the internal register renamer/allocator: if it sees you’re storing a result to memory and then reusing the named register for a new result, it will allocate a new physical register so your instruction doesn’t stall waiting for the previous write to complete.

I do not understand what you want to say.

The register renamer allocates a new physical register when you attempt to write to the same register as a previous instruction; otherwise you would have to wait for that instruction to complete, and also for any instructions that still need to read the old value from that register.
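As a rough C-level sketch (function, variable, and register names are purely illustrative), this is the kind of false write-after-write dependency the renamer removes:

    // The architectural register holding t is written twice. Without
    // renaming, the second write would have to wait behind the first
    // value's consumers; with renaming, each write of t gets a fresh
    // physical register and the two chains can overlap.
    void two_chains(long a, long b, long *x, long *y) {
        long t;
        t = a * 3;  // first value of t: say rax -> physical reg P7
        *x = t;     // store of the first result
        t = b * 5;  // same rax, renamed to P12; no stall on the store
        *y = t;
    }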

When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

The only optimization is that if a following instruction attempts to read the value just stored to memory, it does not have to wait for the store to complete before loading the value back from memory; it gets the value directly from the store queue. But this has nothing to do with register renaming.
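A minimal sketch of that forwarding path (volatile is used here only to keep the compiler from optimizing the round trip away):

    // The load of *p right after the store is typically serviced from
    // the store queue, not from the cache hierarchy; no renaming involved.
    long store_forward(volatile long *p, long v) {
        *p = v;      // store enters the store queue
        return *p;   // load hits the in-flight store and is forwarded
    }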

Thus if your algorithm has already used all the visible register numbers, and you will still need all the values occupying those registers in the future, then you have to spill one register to memory, typically onto the stack, and the register renamer cannot do anything to prevent this.
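For illustration, a hypothetical function with more simultaneously-live values than x86-64 has general-purpose registers, which likely forces the compiler to spill:

    // The store through q may alias p, so all 17 loads must happen
    // before it, keeping 17 values live at once; with only 16 GPRs
    // (some reserved, e.g. rsp), at least one must be spilled to the
    // stack. Renaming supplies new physical copies behind existing
    // names, but it cannot invent new architectural names.
    long many_live(long *p, long *q) {
        long t0 = p[0], t1 = p[1], t2 = p[2], t3 = p[3],
             t4 = p[4], t5 = p[5], t6 = p[6], t7 = p[7],
             t8 = p[8], t9 = p[9], t10 = p[10], t11 = p[11],
             t12 = p[12], t13 = p[13], t14 = p[14], t15 = p[15],
             t16 = p[16];
        q[0] = 0;  // possible alias: loads cannot be sunk past this
        return t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8 + t9
             + t10 + t11 + t12 + t13 + t14 + t15 + t16;
    }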

This is why Intel will increase the number of architectural general-purpose registers of x86-64 from 16 to 32, matching Arm AArch64 and IBM POWER, with the APX ISA extension. APX will be available in the Nova Lake desktop/laptop CPUs and the Diamond Rapids server CPUs, both expected by the end of this year.

Register renaming is a typical example of the general strategy used when shared resources prevent concurrency: the shared resource must be multiplied, so that each concurrent task gets its own private copy.


> When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

You are of course correct about everything. But the extreme pedant in me can't help pointing out that there are in fact a few mainstream CPUs[1] that can rename memory to physical registers, at least in some cases. This is done explicitly to mitigate the cost of spilling. Edit: this is different from the store-forwarding optimization you mentioned.

[1] Ryzen for example: https://www.agner.org/forum/viewtopic.php?t=41
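For concreteness, a sketch of the spill/reload pattern memory renaming targets (the volatile local stands in for a compiler-generated stack slot; the instructions in the comments are illustrative):

    // On CPUs with memory renaming, the reload of the spilled value can
    // be satisfied at the rename stage, with zero load latency, instead
    // of going through address generation and a load unit.
    long spill_reload(long a, long b) {
        volatile long slot;  // stand-in for a spill slot on the stack
        slot = a;            // e.g. mov [rsp-8], rdi
        long r = b * 7;      // other work that needed the register
        return r + slot;     // e.g. mov rax, [rsp-8] - the renamed reload
    }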


That feature does not exist in every AMD Zen, but only in certain Zen generations, and seemingly at random, i.e. not in consecutive generations. This optimization has been introduced and then removed a couple of times, so it is not an optimization whose presence you can count on in a processor.

I believe it is not useful to group such an optimization with register renaming. The effect of register renaming is to replace a single register shared by multiple instructions with multiple registers, so that each instruction may use its own private register without interfering with the others.

On the other hand, the optimization you mention is better viewed as an enhancement of the one I mentioned, which is implemented in all modern CPUs: after a store instruction, the stored value persists for some time in the store queue, and subsequent instructions can access it there instead of going to memory.

With this additional optimization, stored values that are needed by subsequent instructions are retained in some temporary registers even after the store queue is drained to memory, for as long as they are still needed.

Unlike with register renaming, the purpose here is not to multiply the memory locations storing a value so that they can be accessed independently. The purpose is to cache the value close to the execution units, where it is available quickly, instead of fetching it from faraway memory.

As mentioned at your link, the most frequent case where this optimization pays off is when arguments are pushed onto the stack before invoking a function, and the invoked function then loads those arguments into registers. On CPUs where this optimization is implemented, passing arguments to the function bypasses the stack, becoming much faster.
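A sketch of that pattern, assuming 32-bit cdecl code (compiled with -m32; the exact stack offsets and instruction sequences are illustrative):

    // Every argument makes a round trip through the stack: pushes in
    // the caller, loads in the callee. This is exactly the store/load
    // pairing that memory renaming can short-circuit.
    int add3(int a, int b, int c) {
        return a + b + c;      // loads from [esp+4], [esp+8], [esp+12]
    }

    int caller(void) {
        return add3(1, 2, 3);  // push 3; push 2; push 1; call add3
    }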

However, this calling convention matters mainly for legacy 32-bit applications, because 64-bit programs pass most arguments in registers, so they do not need this optimization. The optimization is therefore more important on Windows, where it is more common to run ancient 32-bit executables that have never been recompiled to 64-bit.


Yes, it is not in all Zen CPUs.

I don't think it makes sense to distinguish it from renaming. It effectively aliases a memory location (or better, an offset off the stack pointer) with a physical register, treating named stack offsets as additional architectural registers. AFAIK this is done in the renaming stage.


The named stack offsets are treated as additional hidden registers, not as additional architectural registers.

You do not access them using architectural register numbers, as you would do with the renamed physical registers, but you access them with an indexed memory addressing mode.

The aliasing between a stack location and a hidden register is of the same nature as the aliasing between a stack location's true address in main memory and the place in the L1 cache where stack locations are normally cached in any other modern CPU.

This optimization, present in some Zen CPUs, just caches some stack locations even closer to the execution units of the CPU core than the L1 cache used for the same purpose in other CPUs, allowing those stack locations to be accessed as fast as registers.


The stack offset (or in general the memory location's address[1]) has a name (its unique address), exactly like an architectural register, so how can it be a hidden register?

In any case, as far as I know the feature is known as Memory Renaming, and it was discussed in academia decades before it showed up in actual consumer CPUs. It uses the renaming hardware, and it behaves more like renaming (zero-latency movs resolved at rename time, in the front end) than like an actual cache (which involves an AGU and a load unit and is resolved in the execution stages, in the OoO backend).

[1] more precisely, the feature seems to use address expressions to name the stack slots, instead of actual addresses, although it can handle offset changes after push/pop/call/ret, probably thanks to the Stack Engine that canonicalizes the offsets at the decode stage.


This has already been tried :)

IIRC, in 2016 a quad-core Intel CPU ran the original Crysis at ~15 fps.


Get the DGX Spark computers? They’re exactly what you’re trying to build.


They’re very slow.


They're okay, generally, but slow for the price. You're paying more for the ConnectX-7 networking than for inference performance.


Yeah, I wouldn’t complain if one dropped in my lap, but they’re not at the top of my list for inference hardware.

Although... Is it possible to pair a fast GPU with one? Right now my inference setup for large MoE LLMs has shared experts in system memory, with KV cache and dense parts on a GPU, and a Spark would do a better job of handling the experts than my PC, if only it could talk to a fast GPU.

[edit] Oof, I forgot these have only 128GB of RAM. I take it all back, I still don’t find them compelling.


The TB5 link (RDMA) is much slower than direct access to system memory.


Nvidia has been investing in confidential compute for inference workloads in the cloud; that covers physical ownership/attacks in their threat model.

https://www.nvidia.com/en-us/data-center/solutions/confident...

https://developer.nvidia.com/blog/protecting-sensitive-data-...


It's likely I'm mistaken about details here, but I _think_ tee.fail bypassed this technology, and the AT article covers exactly that.


> And yet, if I go to Youtube or just about any other modern site, it takes literally a minute to load and render, none of the UI elements are responsive, and the site is unusable for playing videos. Why? I'm not asking for anything the hardware isn't capable of doing.

But the website and web renderer are definitely not optimized for a netbook from 2010; even modern smartphones are better at rendering pages and video than your Atom (or even 8350U) computers.


> even modern smartphones are better

That's an understatement if I've ever seen one! For web rendering, single-threaded performance is what matters most, and smartphones have gotten crazy good single-core performance these days. The latest iPhone has faster single-core performance than most laptops.


Yes, but the parent comment definitely implied they weren't talking about people running the latest and best out there. Even mid-range smartphones today are leaps and bounds better than a 2010 Atom.


What do you mean by that? Most syscalls are still interrupt-based.


x86-64 introduced a `syscall` instruction to allow syscalls with lower overhead than going through interrupts. I don't know of any reason to prefer `int 80h` over `syscall` when the latter is available. For documentation, see for example https://www.felixcloutier.com/x86/syscall
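A minimal sketch of a raw Linux x86-64 syscall via GCC inline asm, here write(1, ...) with syscall number 1 (the kernel clobbers rcx and r11):

    #include <stddef.h>

    static long sys_write(int fd, const void *buf, size_t len) {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a" (ret)       // rax: return value
                          : "a" (1L),        // rax: __NR_write
                            "D" ((long) fd), // rdi
                            "S" (buf),       // rsi
                            "d" (len)        // rdx
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void) {
        sys_write(1, "hello\n", 6);
        return 0;
    }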


While AMD's syscall and Intel's sysenter can provide much higher performance than the old "int" instructions, both have been designed very badly, as explained by Linus himself in many places. It is extremely easy to use them in ways that do not work correctly, because of subtle bugs.

It is actually quite puzzling why both the Intel and AMD designers were so incompetent in specifying a "syscall" instruction, when well-designed instructions of this kind had been included in many other CPU ISAs for decades.

When not using an established operating system, where the implementation of "syscall" has been tested for many years and hopefully all bugs have been removed, there may be a reason to use the "int" instruction to transition into privileged mode, because it is relatively foolproof and requires a minimal amount of handling code.
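For comparison, the legacy 32-bit entry point looks like this (a sketch assuming i386 Linux, where write is syscall number 4 and arguments go in ebx, ecx, edx; compile with -m32):

    static long sys_write_int80(int fd, const void *buf, unsigned long len) {
        long ret;
        __asm__ volatile ("int $0x80"
                          : "=a" (ret)       // eax: return value
                          : "a" (4L),        // eax: __NR_write on i386
                            "b" ((long) fd), // ebx
                            "c" (buf),       // ecx
                            "d" (len)        // edx
                          : "memory");
        return ret;
    }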

Now Intel has specified FRED, a new mechanism for handling interrupts, exceptions and system calls, which has none of the defects of "int", "syscall" and "sysenter".

The first CPU implementing FRED should be Intel Panther Lake, to be launched by the end of this year, but surprisingly, when Intel recently gave a presentation about Panther Lake, not a word was said about FRED, even though it is expected to be Panther Lake's greatest innovation.

I hope the Panther Lake implementation of FRED is not buggy, which could have led Intel to disable it and postpone its introduction to a future CPU, as they have done many times in the past. For instance, the "sysenter" instruction was intended to be introduced in the Intel Pentium Pro, by the end of 1995, but because of bugs it was disabled and not documented until the Pentium II, in mid-1997, where it finally worked.


32-bit x86 also has sysenter/sysexit.


Only Intel. AMD has had its own "syscall" instead of Intel's "sysenter" since the K6 CPU, so x86-64 inherited that.

AMD's "syscall" corrects some defects of Intel's "sysenter", but unfortunately it introduces some new defects.

Details can be found in the Linux documentation, in comments by Linus Torvalds about the use of these instructions in the kernel.


Double-buffering a 4K framebuffer at 4 bytes per pixel is by itself 64 MB.
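Spelled out, assuming 3840x2160 and 4 bytes (32 bits) per pixel:

    3840 x 2160 pixels x 4 B x 2 buffers = 66,355,200 B ≈ 63 MiB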


> same hardware as the higher end models but needs a firmware bit flip

Is this firmware bit flip known? I couldn't find anything on Google.



This is incredible. I had no idea there was a Homebrew channel!


AUM is not theirs to keep, and market cap is a very deceptive metric, especially for banks, where liabilities dwarf the market cap.


My point is that it's a minor transaction for them.

