This is undoubtedly why they launched the Mac Mini today. They can ramp up a lot more power in that machine without a battery and with a larger, active cooler.
I'm much more interested in actual benchmarks. AMD has mostly capped their APU performance because DDR4 just can't keep the GPU fed (which is why the last two generations of consoles went with very wide GDDR5/6). Their solution is Infinity Cache, where they add a bunch of cache on-die to reduce the need to go off-chip. At just 16B transistors, Apple obviously didn't do this (at 6 transistors per SRAM cell, there are around 3.2B transistors in just 64MB of cache).
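A rough back-of-envelope for those figures, counting only the SRAM data arrays (6 transistors per bit) and ignoring tags, decoders, and sense amps, so real chips would need more:

    # Transistors for the SRAM data arrays alone, assuming 6 transistors per bit
    # and ignoring tag arrays, decoders, and sense amps (binary megabytes).
    def sram_transistors(megabytes, transistors_per_bit=6):
        bits = megabytes * 1024 * 1024 * 8
        return bits * transistors_per_bit

    for mb in (64, 128, 1024):
        print(f"{mb:>5} MB -> {sram_transistors(mb) / 1e9:.1f}B transistors")

    #   64 MB -> 3.2B transistors
    #  128 MB -> 6.4B transistors
    # 1024 MB -> 51.5B transistors (roughly the ~48B figure quoted below, which uses a decimal gigabyte)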
Okay, probably a stupid question, but solid state memory can be pretty dense: why don't we have huge caches, like a 1GB cache? As I understand it, cache memory doesn't give off heat like the computational part of the chip does, so heat dissipation probably wouldn't increase much with a larger chip package.
NAND flash is pretty dense, but way too slow. SRAM is fast but not at all dense, needing 6 transistors per bit.
For reference: https://en.m.wikipedia.org/wiki/Transistor_count lists the largest CPU as of 2019, AMD's EPYC Rome, at 39.54 billion MOSFETs, so even if you replaced the entire chip with SRAM you wouldn't quite reach 1GB!
NAND is nonvolatile, and the tradeoff for that is limited write cycles. We have an in-between in the form of 3D XPoint (Optane); Intel is still trying to figure out the best way to use it. It currently sits like an L6 cache after system DRAM.
Well, not just Intel. Optane is a new point in the memory hierarchy. That has a lot of implications for how software is designed; it's not something Intel can do all by itself.
SRAM is 6 transistors per bit, so you're talking about 48 billion transistors there, and that's ignoring the overhead of all the circuits around the cells themselves.
DRAM is denser, but difficult to build on the same process as logic.
That said, with chiplets and package integration becoming more common, who knows... One die of DRAM as a large cache combined with a logic die may start to make more sense. It's certainly something people have tried before; it just didn't really catch on.
I don't know the details, but the manufacturing process is pretty different. Trying to have one process that's good at both DRAM and logic at the same time is hard, because they optimize for different things.
Are you referring to latency due to propagation delay, where the worst case increases as you scale?
Would you mind elaborating a bit? I'm not following how this would significantly close the gap between SRAM and DRAM at 1GB. An SRAM cell itself is generally faster than a DRAM cell, and I understand that the circuitry beyond an SRAM cell is far simpler than DRAM's. Am I missing something?
Think of a circular library with a central atrium and bookshelves arranged in circles radiating out from the atrium. In the middle of the atrium you have your circular desk. You can put books on your desk to save yourself the trouble of having to go get them off the shelves. You can also move books to shelves that are closer to the atrium so they're quicker to get than the ones farther away.
So what's the problem? Well, your desk is the fastest place you can get books from but you clearly can't make your desk the size of the entire library, as that would defeat the purpose. You also can't move all of the books to the innermost ring of shelves, since they won't fit. The closer you are to the central atrium, the smaller the bookshelves. Conversely, the farther away, the larger the bookshelves.
Circuits don't follow this ideal model of concentric rings, but I think it's a nice rough approximation for what's happening here. It's a problem of geometry, not a problem of physics, and so the limitation is even more fundamental than the laws of physics. You could improve things by going to 3 dimensions, but then you would have to think about how to navigate a spherical library, and so the analogy gets stretched a bit.
Area is a big one. Why isn't L1 measured in megabytes? Because you can't put that much data close enough to the core.
Look at a Zen-based EPYC core: 32KB of L1 with 4-cycle latency, 512KB of L2 with 12-cycle latency, and 8MB of L3 with 37-cycle latency.
L1 to L2 is 3x slower for 16x more memory; L2 to L3 is 3x slower for 16x more memory.
You can reach 9x more area in 3x more cycles, so you can see how the cache scaling is basically quadratic (there's a lot more execution machinery competing for area with L1/L2, so it's not exact).
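As a quick sanity check of that quadratic intuition, using just the figures quoted above (a rough model, not an exact one):

    # "Capacity grows roughly with the square of latency": compare the Zen
    # figures above (4/12/37 cycles for 32KB/512KB/8MB).
    levels = [("L1", 32 * 1024, 4), ("L2", 512 * 1024, 12), ("L3", 8 * 1024 * 1024, 37)]

    for (n1, c1, t1), (n2, c2, t2) in zip(levels, levels[1:]):
        print(f"{n1}->{n2}: capacity x{c2 / c1:.0f}, latency x{t2 / t1:.1f}, "
              f"latency squared x{(t2 / t1) ** 2:.1f}")

    # L1->L2: capacity x16, latency x3.0, latency squared x9.0
    # L2->L3: capacity x16, latency x3.1, latency squared x9.5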
I am sure there are many factors, but the most basic one is that the more memory you have, the longer it takes to address that memory. I think it scales with the log of the RAM size, which is linear in the number of address bits.
Log-depth circuits are a useful abstraction, but the constraints of laying out circuits in physical space impose a delay scaling limit of O(n^(1/2)) for planar circuits (with a bounded number of layers) and O(n^(1/3)) for 3D circuits. The problem should be familiar to anyone who's drawn a binary tree on paper.
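To see why the layout term ends up dominating the decode term at large sizes, a quick illustration (arbitrary units, only the growth rates matter):

    # Going from a 32KB array to a 1GB array, the decode depth (log n) barely
    # grows, while the planar wire-length term n^(1/2) grows by orders of magnitude.
    from math import log2

    small, big = 2 ** 18, 2 ** 33      # 32KB and 1GB, in bits
    print(f"decode depth grows x{log2(big) / log2(small):.1f}")        # ~1.8x
    print(f"planar wire length grows x{(big / small) ** 0.5:.0f}")     # ~181x
    print(f"3D wire length grows x{(big / small) ** (1 / 3):.0f}")     # ~32x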
With densities so high, and circuit boards so small (when they want to be), that factor isn't very important here.
We regularly use chips with an L3 latency around 10 nanoseconds, going distances of about 1.5 centimeters. You can only blame a small fraction of a nanosecond on the propagation delays there. And let's say we wanted to expand sideways, with only a 1 or 2 nanosecond budget for propagation delays. With a relatively pessimistic assumption of signals going half the speed of light, that's a diameter of 15cm or 30cm to fit our SRAM into. That's enormous.
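Putting that arithmetic in one place, with the same (pessimistic) assumption of signals travelling at half the speed of light:

    # How far a signal moving at 0.5c gets within a given propagation-delay budget.
    v = 0.5 * 3e8                      # assumed signal velocity, m/s

    for budget_ns in (0.1, 1.0, 2.0):
        distance_cm = v * budget_ns * 1e-9 * 100
        print(f"{budget_ns:.1f} ns budget -> {distance_cm:.1f} cm of travel")

    # 0.1 ns budget -> 1.5 cm   (roughly the L3 distance mentioned above)
    # 1.0 ns budget -> 15.0 cm
    # 2.0 ns budget -> 30.0 cm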
In Zen 1 and Zen 2, cores have direct or indirect access to the shared L3 cache in the same CCX. In the cross-CCX case, the neighboring CCX's cache can be accessed over the in-package interconnect without going through system DRAM.
AnandTech speculates on a 128-bit DRAM bus[1], but AFAIK Apple hasn't revealed many details about the memory architecture. It'll be interesting to see what the overall memory bandwidth story looks like as hardware trickles out.
Apple being Apple, we won't know much before someone grinds down a couple chips to reveal what the interconnections are, but if you are feeding GPUs along with CPUs, a wider memory bus between DRAM and the SoC cache makes a lot of sense.
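For a sense of scale, some purely hypothetical peak numbers for a 128-bit bus; the memory types and transfer rates below are assumptions for illustration, not anything Apple has confirmed:

    # Peak bandwidth = bus width (bytes) * transfer rate; hypothetical configs only.
    def peak_gb_per_s(bus_bits, transfers_per_s):
        return bus_bits / 8 * transfers_per_s / 1e9

    print(peak_gb_per_s(128, 4.266e9))   # ~68 GB/s, e.g. LPDDR4X-4266 (assumed)
    print(peak_gb_per_s(128, 6.4e9))     # ~102 GB/s, e.g. LPDDR5-6400 (assumed)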
I am looking forward to compile time benchmarks for Chromium. I think this chip and SoC architecture may make the Air a truly fantastic dev box for larger projects.
As someone who works with a lot of interdisciplinary teams, I often understand concepts or processes they have names for but don't know the names until after they label them for me.
Until you use a concept so frequently that you need to label it to compress information for discussion purposes, you often don't have a name for it. Chances are, if you solve or attempt to solve a wide variety of problems, you'll see patterns and processes that overlap.
It’s often valuable to use jargon from another discipline in discussions. It sort of kicks discussions out of ruts. Many different disciplines use different terminology for similar basic principles. How those other disciplines extend these principles may lead to entirely different approaches and major (orders of magnitude) improvements. I’ve done it myself a few times.
On another note, the issue of “jargon” as an impediment to communication has led the US Military culture to develop the idea of “terms of art”. The areas of responsibility of a senior officer are so broad that they enter into practically every professional discipline. The officer has to know when they hear an unfamiliar term that they are being thrown off by terminology rather than lack of understanding. Hence the phrase “terms of art”. It flags everyone that this is the way these other professionals describe this, so don’t get thrown or feel dumb.
No one expects the officer to use (although they could) a “term of art”, but rather to understand and address the underlying principle.
It’s also a way to grease the skids of discussion ahead of time. “No, General, we won’t think you’re dumb if you don’t use the jargon, but what do you think of the underlying idea...”
Might be a good phrase to use in other professional cultures. In particular in IT, because of the recursion of the phrase “term of art” itself being a term of art until it’s generally accepted. GNU and all that...
This gets even more fun when several communities discover the same thing independently, and each comes up with a different name for it.
My favorite is the idea of "let's expand functions over a set of Gaussians". That is variously known as a Gabor wavelet frame, a coherent state basis [sic], a Gaussian wave packet expansion, and no doubt some others I haven't found. Worse still, the people who use each term don't know about any of the work done by people who use the other terms.
Reminds me of self-taught tech. I’ll often know the name/acronym, but pronounce it differently in my head than the majority of people. Decades ago GUI was “gee you eye” in my head but one day I heard it pronounced “gooey” and I figured it out but had a brief second of “hwat?” (I could also see “guy” or “gwee”.) It’s, of course, more embarrassing when I say it out loud first...
First time I went to a Python conference in SV, more than a decade ago, I kept hearing "Pie-thon" everywhere, and had no idea what the hell people were talking about.
It took me a solid half hour to at last understand this pie-thingy was Python... in my head I had always pronounced it the French way. Somewhat like "pee-ton", I don't know how to transcribe that "on" nasal sound... (googling "python prononciation en francais" should yield a soundtrack for the curious non-French speakers).
Picture 18 year old me in 1995. I got a 486SX laptop as a graduation present out of the blue from my estranged father. I wanted to add an external CD-ROM to it so I could play games and load software for college, and it had a SCSI port. I went to the local computer store and asked the guy for an "ess see ess eye" CD-ROM drive; he busted out laughing and said "oh you mean a scuzzy drive?" Very embarrassing for me at the time, but that's when I learned that computer acronyms have a preferred pronunciation, so I should try to learn them myself to avoid future confusion.
It shouldn't be; it should be a badge of honor of sorts. It points to somebody reading to expand their knowledge beyond what's available in oral form around them, so kudos to them!
It's even more visible in non-English-speaking countries. In Poland, at first everyone pronounces Java as "Yava", and after a while they start to switch to the proper English pronunciation. Many times it divides amateurs from professionals, but I wouldn't really know, because I don't work with Java.
Great story, yes. But there's no such thing as a "halzenfugel" in German as far as I can tell as a native speaker. Even www.duden.de, the official German dictionary, doesn't know that word ;-0
As a native English speaker and middling foreign-language speaker of German, "halzenfugel" sounds to me like a mock-German word that an English speaker would make up.
Hah, good to know. However, unless you are talking to people from the same domain, it's usually a better approach to spell out things instead of relying on terminology. Concepts and ideas translate much better across domains than terminology.
For example, AMD sells 12 and 16 core CPUs. The 12 core parts have 2 cores lasered out due to defects. If a particular node is low-yield, then it's not super uncommon to double up on some parts of the chip and use either the non-defective or best-performing one. You'd expect to see a combination of lasering and binning to adjust yields higher.
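To see why the spare-core trick pays off, here's a toy yield model; the defect density and per-core area are made-up illustrative numbers, and the simple Poisson model ignores the defect clustering mentioned downthread:

    # Toy Poisson yield model: probability a die ships with enough good cores.
    from math import exp, comb

    # Assumed: 0.1 defects/cm^2 and 0.05 cm^2 per core (illustrative numbers only).
    p_core_good = exp(-0.1 * 0.05)     # P(a given core has no defect)

    def yield_with_spares(total_cores, needed_cores, p_good):
        # P(at least `needed_cores` of `total_cores` come out defect-free)
        return sum(comb(total_cores, k) * p_good ** k * (1 - p_good) ** (total_cores - k)
                   for k in range(needed_cores, total_cores + 1))

    print(yield_with_spares(8, 8, p_core_good))   # every core must be perfect: ~96%
    print(yield_with_spares(8, 6, p_core_good))   # up to 2 cores can be fused off: ~99.99%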
That said, TSMC N5 has a very good defect rate, according to their slides on the subject[0].
Yep, for the MBA. I think for devs who can live with 16GB, the 7-GPU-core MacBook Air, at $300 less than the MacBook Pro, is a very interesting alternative.
Plus, defects tend to be clustered, which is a pretty lucky effect. Multiple defects on a single core don't really matter if you are throwing the whole thing away.