HBM3 to Infinity Cache

Alright. I’m going to talk a little about what I learned today. Eventually I want these posts to have a structure that supports in-depth learning and structured feedback. Part of the point, though, is to show that I’m thinking out loud and putting things together in a succinct way that makes sense and is useful to people. Let’s take baby steps to get there.

Today’s topic is hardware-software co-design, and I’m going to focus on the AMD MI300X. It’s an older chip that is used heavily for inference on the AMD side. When we talk about chip vendors, it’s usually NVIDIA and AMD. NVIDIA has been covered pretty exhaustively, so I figure AMD is the more interesting take: recent optimization papers rarely use AMD as the default hardware and tend to go with an H100 or H200 instead.

Now, let’s look at the MI300X. The MI stands for machine intelligence. There are two variants in the 300 family. The X has 192 gigabytes of HBM, which is the main memory. The A variant is smaller, at 128 gigabytes. So let’s focus on the MI300X.

What have I learned so far about memory? There are different levels in the memory hierarchy. At the bottom there is storage, which is usually something like an NVMe SSD. Above that is the main memory, and that’s the part we’re interested in here. When we look up a particular chip, one of the first things we look at is its memory.
When a spec sheet lists the MI300X at 192 gigabytes, that number refers to the main memory, and the main memory is the HBM. HBM stands for high bandwidth memory. Along with the model weights, it holds the KV tensors that accumulate during autoregressive decoding.

In other words, when we do inference, using the model for real work with real prompts and real users, the model generates tokens one at a time. Each new token adds key and value tensors (KV stands for key-value), and those tensors have to be stored somewhere. They are stored temporarily in HBM as the KV cache. That’s why a large HBM matters so much: with a KV cache, the biggest bottleneck right now is memory, not compute. Decoding is memory-bound, not compute-bound.

What’s the difference between compute-bound and memory-bound? Let’s start with compute-bound, because it’s relatively simple. A workload is compute-bound when its speed is limited by how fast the math units run. On this chip those are the matrix cores, which do the matrix multiplication. NVIDIA calls its equivalent tensor cores; on AMD they’re matrix cores. That matrix multiplication is what produces the predictions that come back to the user as tokens, so everything you see on screen after prompting an AI model went through those cores. A workload is memory-bound when the cores sit waiting for data to arrive from memory, so the limit is memory capacity or bandwidth rather than math throughput.
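To put the KV cache point in numbers, here’s a rough sizing sketch. The 192 gigabytes is the MI300X figure from above; everything else (80 layers, 8 KV heads, a head dimension of 128, FP16 values, 140 gigabytes of weights) is a hypothetical model configuration I picked for illustration, and the function names are my own.

```python
# Rough KV cache sizing. The model numbers are hypothetical placeholders,
# not measurements from a real deployment.

HBM_BYTES = 192 * 10**9  # MI300X HBM capacity, assuming 1 GB = 10**9 bytes


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache added per token: one key and one value per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(weight_bytes, per_token_bytes, hbm_bytes=HBM_BYTES):
    """Tokens that fit in whatever HBM is left after the weights are loaded."""
    free_bytes = max(hbm_bytes - weight_bytes, 0)
    return free_bytes // per_token_bytes


if __name__ == "__main__":
    # Hypothetical model: 80 layers, 8 KV heads, head_dim 128, FP16 values.
    per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
    weights = 140 * 10**9  # assumed weight footprint in bytes
    print("KV bytes per token:", per_token)
    print("Tokens that fit:", max_cached_tokens(weights, per_token))
```

Under those assumptions each token costs 327,680 bytes and about 158 thousand tokens fit. That budget is shared by every session running on the chip, so long contexts eat through it quickly.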
Unfortunately, right now the HBM is small relative to the number of tokens we want to keep around, so the KV cache can overwhelm it. As a result, we need solutions that use the HBM more efficiently and reduce our reliance on it until better hardware arrives. The direction of travel is larger HBM, eventually something on the order of one terabyte. For this MI300X we have 192 gigabytes, which is significant, but even a single session from a single user can overwhelm the HBM given a long enough context. That’s the background: we’re doing this with a purpose.

Now, how is the HBM connected to the rest of the chip? It goes through a silicon interposer, which acts like a bridge. What is a silicon interposer? It’s wiring inside the package, and it carries data between the HBM and the cache subsystem, which is called Infinity Cache. I’m talking specifically about the MI300X here. Think of the Infinity Cache as the L3 level: we have L1, L2, and then L3.

The Infinity Cache is there to help the compute side. The matrix cores do a lot of computation, and they need their data held temporarily somewhere with quick access. That’s why the Infinity Cache sits closer to the matrix cores than the HBM does: data that gets reused can stay there, so the matrix multiplications don’t have to go back to HBM every time.

So the silicon interposer is the bridge between the HBM and the Infinity Cache. I think it’s on-package, and it’s wiring that speeds up the transfer of data between the two. Think of it as data transfer, or memory transfer. We could also talk about its ceiling, which is the memory bandwidth, but for now it’s just data transfer between two layers of the memory hierarchy.
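One common way to reason about that bandwidth ceiling is arithmetic intensity: FLOPs performed per byte moved from memory. Here’s a minimal roofline-style sketch. The peak numbers are assumptions taken from my reading of the public MI300X spec sheet (roughly 1307 TFLOPS of dense FP16 and 5.3 TB/s of HBM bandwidth), so check them before relying on them. The function names are hypothetical.

```python
# Roofline-style check: is a matrix multiplication limited by compute or memory?
# Peak figures are assumed from the public MI300X spec sheet.

PEAK_FLOPS = 1307.4e12     # assumed dense FP16 peak, FLOP/s
PEAK_BANDWIDTH = 5.3e12    # assumed peak HBM bandwidth, bytes/s


def arithmetic_intensity(m, n, batch, bytes_per_elem=2):
    """FLOPs per byte for a (batch x n) activation times an (n x m) weight matrix."""
    flops = 2 * m * n * batch
    # Count weights, inputs, and outputs as each crossing the memory bus once.
    bytes_moved = bytes_per_elem * (m * n + batch * n + batch * m)
    return flops / bytes_moved


def classify(intensity):
    """Compare against the ridge point, where compute and memory limits meet."""
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH
    return "compute-bound" if intensity >= ridge else "memory-bound"


if __name__ == "__main__":
    for batch in (1, 4096):
        ai = arithmetic_intensity(8192, 8192, batch)
        print(f"batch={batch}: {ai:.1f} FLOP/byte, {classify(ai)}")
```

With a batch of one, which is what single-user decoding looks like, every weight is read for about one FLOP per byte, far below the ridge point, so the matrix cores wait on memory. A large batch reuses each weight many times and crosses into compute-bound territory.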
Beyond that, there’s another, better-known interconnect called Infinity Fabric. It’s a well-known part of the AMD ecosystem. It’s made up of multiple links, and it’s structured differently from NVIDIA’s interconnect, which is NVLink. Infinity Fabric is the high-speed interconnect that connects the chiplets. And what are chiplets? Chiplets are smaller dies, such as the accelerator dies that do the computation, packaged together so they act as one chip. For that to work, data has to move between the chiplets very quickly. So the whole purpose of Infinity Fabric is fast data transfer, meaning high bandwidth. When we talk about the path from HBM to the matrix cores (tensor cores, in NVIDIA terms), what we care about is how close it gets to the highest achievable memory bandwidth.

Let’s pause real quick. What did we cover so far? We talked about the memory hierarchy. We didn’t say much about registers, but we talked about the cache levels: L1, L2, and L3. The L3 is the Infinity Cache, which is 256 megabytes on the MI300X; the MI300A might be different. Then we have the HBM, which is 192 gigabytes and is connected through the silicon interposer. The card also talks to the host over PCIe, which I believe is Gen 5 on this chip rather than Gen 6.

And that’s essentially it. I’m trying to piece everything together at a high level. Right now I’m learning a lot about profiling, particularly finding hotspots, and trying to fit it all together in a way that makes sense. That’s where I am right now.
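As a recap aid, here’s a small sketch that models the two levels whose sizes I actually quoted and asks which is the closest one a given working set fits in. It assumes decimal units (1 MB = 10**6 bytes) and leaves out L1, L2, and registers because I haven’t given sizes for them. The class and function names are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

MB = 10**6  # decimal units, an assumption
GB = 10**9


@dataclass(frozen=True)
class MemoryLevel:
    name: str
    capacity_bytes: int


# Ordered from closest to the matrix cores to farthest away.
MI300X_LEVELS: List[MemoryLevel] = [
    MemoryLevel("Infinity Cache (L3)", 256 * MB),
    MemoryLevel("HBM", 192 * GB),
]


def closest_level_that_fits(
    working_set_bytes: int, levels: List[MemoryLevel] = MI300X_LEVELS
) -> Optional[MemoryLevel]:
    """Return the first (closest) level large enough, or None if nothing fits."""
    for level in levels:
        if working_set_bytes <= level.capacity_bytes:
            return level
    return None


if __name__ == "__main__":
    for size in (100 * MB, 50 * GB, 1000 * GB):
        level = closest_level_that_fits(size)
        print(size, "->", level.name if level else "does not fit on chip")
```

The gap between the two levels is the interesting part: with these units the HBM is 750 times larger than the Infinity Cache, so only a small slice of what lives in HBM can sit close to the matrix cores at any moment.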
Eventually I want to get into profiling a real piece of hardware, so stay tuned for that. For now, that’s all I wanted to talk about. I want to make sure I contribute at least one blog post or article every single day, even if it’s low quality, because that keeps the information flowing and documents what I’ve learned.

Eventually I also want to put this on Twitch or YouTube Live or some other streaming platform and work through my process live: how I learn, understand, think critically, and solve problems. I think that would be an interesting approach, because a lot of content creation right now is reactive in nature. There’s not a lot of what I’d call engineering work. Even when you do see engineering work, it’s often duplicated, often shallow, and not particularly transparent about methodology. I wonder if there’s a way to do this that is highly transparent yet meaningful, something that actually moves the needle. I’m still debating how best to approach that, but right now I’m leaning toward the Twitch or YouTube approach. Anyway, until next time.