How important is L3 cache for AMD processors?

Indeed, it makes sense to equip multi-core processors with dedicated memory that will be shared by all available cores. In this role, a fast third-level (L3) cache can significantly speed up access to data that is requested most often. Then the cores, if possible, will not have to access slow main memory (RAM).

At least in theory. AMD recently announced the Athlon II X4 processor, which is a model of the Phenom II X4 without the L3 cache, hinting that it is not that necessary. We decided to directly compare two processors (with and without L3 cache) to test how the cache affects performance.

How does the cache work?

Before we dive into the tests, it's important to understand some basics. The principle of how the cache works is quite simple. The cache buffers data as close to the processing cores of the processor as possible to reduce CPU requests to more distant and slow memory. On modern desktop platforms, the cache hierarchy includes as many as three levels that precede access to random access memory. Moreover, caches of the second and, in particular, third levels serve not only to buffer data. Their purpose is to prevent the processor bus from becoming overloaded when the cores need to exchange information.

Hits and misses

The effectiveness of cache architectures is measured by hit rate. Data requests that can be satisfied by the cache are considered hits. If a given cache level does not contain the required data, the request is passed further along the memory hierarchy, and a miss is counted. Misses, of course, mean more time is needed to obtain the information, so "bubbles" (idle cycles) and delays appear in the computing pipeline. Hits, on the contrary, allow maximum performance to be maintained.
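The hit-rate idea is easy to sketch in code. The following Python snippet is a toy illustration only: the 4-line fully associative cache with LRU eviction is an assumption for the sketch, not a model of any real CPU.

```python
# Toy illustration of hits and misses: a small fully associative cache
# with LRU eviction (sizes and policy are assumptions for the sketch).
from collections import OrderedDict

def hit_rate(accesses, cache_lines=4):
    cache = OrderedDict()                  # address -> None, ordered by recency
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # refresh: most recently used
        else:                              # miss: fetch and possibly evict
            if len(cache) >= cache_lines:
                cache.popitem(last=False)  # evict least recently used line
            cache[addr] = None
    return hits / len(accesses)

# A loop reusing four addresses fits the 4-line cache: almost every access hits.
print(hit_rate([0, 1, 2, 3] * 100))    # 0.99
# Five addresses cycling through a 4-line LRU cache thrash: every access misses.
print(hit_rate([0, 1, 2, 3, 4] * 100)) # 0.0
```

The second stream shows why misses are so costly: each one would stall the pipeline while data is fetched from the next, slower level.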

Cache entry, exclusivity, coherence

Replacement policies dictate how space is freed up in the cache for new entries. Because data written to the cache must eventually appear in main memory, a system may write it to memory at the same time as writing to the cache (write-through), or it may mark the cached data as "dirty" (write-back) and write it to memory only when it is evicted from the cache.
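The two write policies can be contrasted in a short sketch. Here `ram` stands in for main memory; the class and method names are illustrative, not a real API.

```python
# Sketch of the two write policies described above (names are illustrative).

class WriteThroughCache:
    def __init__(self, ram):
        self.ram, self.lines = ram, {}

    def write(self, addr, value):
        self.lines[addr] = value
        self.ram[addr] = value             # RAM is updated immediately

class WriteBackCache:
    def __init__(self, ram):
        self.ram, self.lines, self.dirty = ram, {}, set()

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)               # RAM update deferred; line is "dirty"

    def evict(self, addr):
        if addr in self.dirty:             # only dirty lines are written back
            self.ram[addr] = self.lines[addr]
            self.dirty.discard(addr)
        del self.lines[addr]

ram = {}
wb = WriteBackCache(ram)
wb.write(0x10, 42)
print(0x10 in ram)   # False: RAM is stale until the line is evicted
wb.evict(0x10)
print(ram[0x10])     # 42
```

Write-back trades memory-bus traffic for bookkeeping: RAM briefly holds stale data, which is exactly why the coherency protocols mentioned below are needed.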

Data in several cache levels can be stored exclusively, that is, without redundancy; then you won't find the same data lines in two different levels of the cache hierarchy. Or caches can work inclusively, that is, the lower cache levels are guaranteed to contain the data present in the upper cache levels (those closer to the processor core). AMD's Phenom uses an exclusive L3 cache, while Intel follows an inclusive cache strategy. Coherency protocols ensure the integrity and freshness of data across different cores, cache levels, and even processors.
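The two properties can be stated as simple invariants over the sets of cached line addresses (a deliberate simplification of real hardware, for illustration only):

```python
# Inclusive vs. exclusive hierarchies as set invariants (simplified sketch).

def is_inclusive(l2_lines, l3_lines):
    # Inclusive: every line held in L2 must also be present in L3.
    return l2_lines <= l3_lines

def is_exclusive(l2_lines, l3_lines):
    # Exclusive: no line may be duplicated across the two levels.
    return not (l2_lines & l3_lines)

l2 = {0xA0, 0xB0}
print(is_inclusive(l2, {0xA0, 0xB0, 0xC0}))  # True  (Intel-style hierarchy)
print(is_exclusive(l2, {0xC0, 0xD0}))        # True  (Phenom-style hierarchy)
```

The practical consequence: an exclusive design makes the capacities of the levels effectively add up, while an inclusive design spends some L3 capacity on duplicates but simplifies coherency checks.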

Cache size

A larger cache can hold more data, but tends to increase latency. In addition, a large cache consumes a considerable number of processor transistors, so it is important to find a balance between the transistor budget, die size, power consumption and performance/latency.

Associativity

Entries in RAM can be directly mapped to the cache, meaning there is only one cache position for a copy of data from RAM, or they can be n-way associative, meaning there are n possible locations in the cache where that data may be stored. A higher degree of associativity (up to fully associative caches) provides greater caching flexibility because existing data in the cache does not need to be rewritten. In other words, a high n-degree of associativity generally yields a higher hit rate, but it also increases latency because it takes more time to check all those ways for a hit. Typically, the highest degree of associativity is reasonable for the last level of caching, since the maximum capacity is available there, and searching for data outside of this cache means the processor must access slow RAM.
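The mapping itself is just modular arithmetic. In this sketch the 64-byte line size and the set counts are illustrative assumptions; real caches index on specific physical address bits.

```python
# How an address maps to a set in an n-way set-associative cache
# (line size and set counts are illustrative assumptions).

def cache_set(addr, num_sets, line_size=64):
    line = addr // line_size   # which cache line the address belongs to
    return line % num_sets     # which set may hold that line

addr = 0x12345
# Direct-mapped (1-way): 256 lines form 256 sets, one candidate position.
print(cache_set(addr, num_sets=256))  # 141
# 8-way: the same 256 lines form 32 sets, eight candidate positions each,
# so a lookup must compare eight tags instead of one.
print(cache_set(addr, num_sets=32))   # 13
```

This is the flexibility/latency trade-off in miniature: fewer sets means more candidate positions per line (fewer conflict misses), but more tag comparisons per lookup.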

Here are some examples: Core i5 and i7 use 32 KB of L1 cache with 8-way associativity for data and 32 KB of L1 cache with 4-way associativity for instructions. It's understandable that Intel wants instructions to be available faster and the L1 data cache to have a maximum hit rate. The L2 cache on Intel processors has 8-way associativity, and Intel's L3 cache is even smarter, since it implements 16-way associativity to maximize hits.

However, AMD is following a different strategy with the Phenom II X4 processors, which uses a 2-way associative L1 cache to reduce latency. To compensate for possible misses, the cache capacity was doubled: 64 KB for data and 64 KB for instructions. The L2 cache has 8-way associativity, like the Intel design, but AMD's L3 cache operates with 48-way associativity. But the decision to choose one cache architecture over another cannot be assessed without considering the entire CPU architecture. It is quite natural that test results have practical significance, and our goal was precisely a practical test of this entire complex multi-level caching structure.

Welcome to GECID.com! It is well known that clock speed and the number of processor cores directly affect performance, especially in multi-threaded workloads. We decided to find out what role the L3 cache plays here.

To study this question, the online store pcshop.ua kindly provided us with a dual-core processor with a nominal operating frequency of 3.7 GHz and 3 MB of 12-way associative L3 cache. Its opponent was a quad-core processor with two cores disabled and the clock frequency reduced to 3.7 GHz. Its L3 cache is 8 MB in size with 16-way associativity. In other words, the key difference between them lies precisely in the last-level cache: the Core i7 has 5 MB more of it.

If this significantly affects performance, then it will be possible to conduct another test with a representative of the Core i5 series, which has 6 MB of L3 cache on board.

But for now let's return to the current test. The participants are assisted by a video card and 16 GB of DDR4-2400 RAM. We will compare these systems at Full HD resolution.

First, let's start with out-of-sync live gameplay, in which it is impossible to clearly determine a winner. In Dying Light at maximum quality settings, both systems show a comfortable FPS level, although the processor and video card load was on average higher in the case of the Intel Core i7.

Arma 3 has a pronounced processor dependence, which means a larger amount of cache memory should play a positive role even at ultra-high graphics settings. Moreover, the load on the video card in both cases reached a maximum of 60%.

DOOM at ultra-high graphics settings allowed us to synchronize only the first few frames, where the advantage of the Core i7 is about 10 FPS. The desynchronization of the subsequent gameplay does not allow us to determine the degree of influence of the cache on the frame rate. In any case, the frequency stayed above 120 frames/s, so even 10 FPS does not have much impact on the comfort of play.

Evolve Stage 2 completes the mini-series of live gameplay. Here we would probably see a difference between the systems, since in both cases the video card is only about half loaded. It subjectively seems that the FPS level is higher in the case of the Core i7, but it is impossible to say for sure, since the scenes are not identical.

Benchmarks provide a more informative picture. For example, in GTA V you can see that outside the city the advantage of the 8 MB cache reaches 5-6 frames/s, and in the city up to 10 FPS, thanks to the higher load on the video card. At the same time, the video accelerator itself is in both cases far from fully loaded, and everything depends on the CPU.

We launched The Witcher 3 with extreme graphics settings and a high post-processing profile. In one of the scripted scenes, the advantage of the Core i7 in places reaches 6-8 FPS during sharp camera changes, when new data needs to be loaded. When the load on the processor and video card again reaches 100%, the difference shrinks to 2-3 frames.

The maximum graphics preset in XCOM 2 was not a serious test for either system, and the frame rate hovered around 100 FPS. But here, too, the larger amount of cache memory translated into a speed increase of 2 to 12 frames/s. And although neither processor managed to load the video card to the maximum, the 8 MB version was in places even better in this regard.

Dirt Rally, which we launched with the Very High preset, surprised us the most. At certain moments the difference reached 25 FPS solely due to the larger L3 cache, which allowed the video card to be loaded 10-15% better. However, the average benchmark figures showed a more modest victory for the Core i7: only 11 FPS.

An interesting situation has arisen with Rainbow Six Siege: on the street, in the first frames of the benchmark, the advantage of the Core i7 was 10-15 FPS. Indoors, the processor and video card load in both cases reached 100%, so the difference decreased to 3-6 FPS. But at the end, when the camera went outside the house, the Core i3 lag again exceeded 10 fps in places. The average figure turned out to be 7 FPS in favor of 8 MB of cache.

The Division at maximum graphics quality also responds well to an increase in cache memory. The first frames of the benchmark already fully loaded all the Core i3's threads, while the total load on the Core i7 was 70-80%. However, the difference in speed at these moments was only 2-3 FPS. A little later the load on both processors reached 100%, and at certain moments the advantage even shifted to the Core i3, but only by 1-2 frames/s. On average it was about 1 FPS in favor of the Core i7.

In turn, the Rise of the Tomb Raider benchmark at high graphics settings clearly showed, in all three test scenes, the advantage of the processor with the significantly larger amount of cache memory. Its average performance is 5-6 FPS better, and if you look carefully at each scene, in places the lag of the Core i3 exceeds 10 frames/s.

But when choosing the preset with very high settings, the load on the video card and processors increases, so in most cases the difference between the systems is reduced to a few frames. Only briefly can the Core i7 show more significant results. Its average advantage according to the benchmark results decreased to 3-4 FPS.

Hitman is also less affected by the L3 cache. Although here, with the ultra-high detail profile, the additional 5 MB provided better loading of the video card, translating into an extra 3-4 frames/s. That does not have a particularly critical impact on performance, but for purely sporting reasons it's nice that there is a winner.

The high graphics settings of Deus Ex: Mankind Divided immediately demanded maximum computing power from both systems, so the difference at best was 1-2 frames in favor of the Core i7, as the averages indicate.

Running it again with the ultra-high preset loaded the video card even more, so the impact of the processor on the overall speed became even less. Accordingly, the difference in L3 cache had virtually no effect on the situation and the average FPS differed by less than half a frame.

Based on the testing results, it can be noted that the L3 cache memory does have an impact on gaming performance, but it only appears when the video card is not loaded at full power. In such cases a 5-10 FPS increase could be obtained by increasing the cache roughly 2.5-fold. That is, it turns out that, all other things being equal, each additional MB of L3 cache memory adds only 1-2 FPS to the frame rate.
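The per-megabyte estimate above can be checked with simple arithmetic, using only figures from the text (a rough averaging, not a law):

```python
# Back-of-the-envelope check of the conclusion: the Core i7 has 8 MB of L3
# against the Core i3's 3 MB, and its lead in cache-sensitive scenes was
# roughly 5-10 FPS.
extra_mb = 8 - 3
for gain in (5, 10):
    print(f"{gain} FPS over {extra_mb} MB -> {gain / extra_mb:.1f} FPS per MB")
```

Dividing the 5-10 FPS spread by the 5 MB difference gives exactly the 1-2 FPS per megabyte quoted in the conclusion.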

So if we compare neighboring lines, for example Celeron and Pentium, or models with different amounts of L3 cache memory within the Core i3 series, then the main performance increase comes from higher frequencies, and then from the presence of additional processor threads and cores. Therefore, when choosing a processor, you should first of all focus on the main characteristics, and only then pay attention to the amount of cache memory.

That's all. Thank you for your attention. We hope this material was useful and interesting.



Today's article is not an independent piece - it simply continues our study of the performance of three generations of the Core architecture under equal conditions (started at the end of last year and continued recently). True, today we will take a small step to the side: the frequencies of the cores and cache memory will remain the same as before, but the capacity of the latter will decrease. Why is this necessary? For the purity of the experiment we used the "full" Core i7 of the last two generations, testing it with Hyper-Threading enabled and disabled, whereas for a year and a half now the Core i5 has been equipped with not 8 but 6 MiB of L3. It is clear that the impact of cache capacity on performance is not as great as is sometimes believed, but it exists, and there is no escaping it. Besides, Core i5 processors are more of a mass-market product than Core i7, and in the first generation no one "shortchanged" them in this regard. They were, however, limited a little differently: the UnCore clock speed in first-generation i5s was only 2.13 GHz, so our "Nehalem" at 2.4 GHz is not exactly a representative of the 700 line, but a slightly faster processor. Still, we considered it unnecessary to greatly expand the list of participants and redo the testing conditions - in any case, as we have warned more than once, testing this line provides no new practical information: real processors operate in completely different modes. But for those who want to thoroughly understand all the subtle points, we think such testing will be interesting.

Test bench configuration

We decided to limit ourselves to just four processors, with two main participants: both are quad-core Ivy Bridge, but with different L3 cache capacities. The third is "Nehalem HT": last time, in terms of the final score, it turned out to be almost identical to "Ivy Bridge plain". And the fourth is "plain Nehalem", which, as we have already said, is a little faster than a real first-generation Core i5 operating at 2.4 GHz (because in the 700 line the UnCore frequency was slightly lower), but not radically so. The comparison is interesting: on the one hand there are two steps of microarchitecture improvement, on the other the cache memory has been cut. A priori we can assume that the first will outweigh in most cases, but by how much - and in general, how comparable the "first" and "third" i5s are (adjusted for the UnCore frequency, of course; if many people want to see an absolutely accurate comparison, we will do it later) - is already a good topic for research.

Testing

Traditionally, we divide all tests into a number of groups and show on the diagrams the average result for each group of tests/applications (you can learn more about the testing methodology in a separate article). The results in the diagrams are given in points; the performance of the site's 2011 reference test system is taken as 100 points. It is based on an AMD Athlon II X4 620 processor, while the amount of memory (8 GB) and the video card are standard for all tests of the "main line" and can be changed only within the framework of special studies. For those interested in more detailed information, we traditionally offer a downloadable table in Microsoft Excel format, in which all the results are presented both converted into points and in "natural" form.

Interactive work in 3D packages

There is some effect of cache capacity, but it is less than 1%. Accordingly, both Ivy Bridges can be considered identical to each other, and architectural improvements allow the new Core i5 to easily outperform the old Core i7 in the same way as the new Core i7 does.

Final rendering of 3D scenes

In this case, of course, no amount of improvement can compensate for the increase in the number of processed threads, but the most important thing for us today is not that, but the complete absence of any influence of cache capacity on performance. Celeron and Pentium, as we have already established, are a different matter: rendering programs are sensitive to L3 capacity, but only when the latter is low. And 6 MiB for four cores, as we can see, is quite enough.

Packing and Unpacking

Naturally, these tasks are sensitive to cache memory capacity; however, here too the effect of increasing it from 6 to 8 MiB is quite modest: approximately 3.6%. More interesting, in fact, is the comparison with the first generation - architectural improvements allow the new i5 to beat even the old i7 at equal frequencies, but that is in the overall standings, owing to the fact that two of the four tests are single-threaded and another is dual-threaded. Data compression in 7-Zip is naturally fastest on "Nehalem HT": eight threads are always faster than four of comparable performance. But if we limit ourselves to just four, then our "Ivy Bridge 6M" loses not only to its progenitor but also to the old Nehalem: the microarchitecture improvements are completely offset by the reduction in cache memory capacity.

Audio encoding

What was somewhat unexpected was not the size of the difference between the two Ivy Bridges, but the fact that there was any difference at all. True, it is so small that it can be attributed to rounding or measurement error.

Compilation

Threads are important, but so is cache capacity. However, as usual, not too much - about 1.5%. More interesting is the comparison with the first-generation Core with Hyper-Threading disabled: "on points" the new Core i5 wins even at the same frequency, but one of the three compilers (Microsoft's, to be precise) took the same amount of time on both processors. Or even gave a 5-second advantage to the older one - despite the fact that in this program the "full-cache" Ivy Bridge is 4 seconds faster than Nehalem. In general, here too we cannot say that the reduction in L3 capacity greatly affected the second- and third-generation Core i5, but there are nuances.

Mathematical and engineering calculations

Again, less than 1% difference with the "older" die, and again a convincing victory over the first generation in all its forms. That is more the rule than the exception for such low-threaded tests, but why not confirm it once again? Especially in such a refined form, when (unlike tests in normal mode) the difference in frequencies - whether "standard" or arising from Turbo Boost - does not interfere.

Raster graphics

But even with a more complete utilization of multithreading, the picture does not always change. And the cache memory capacity does not give anything at all.

Vector graphics

And it’s the same here. True, only a couple of computation threads are needed.

Video encoding

This group is different - although here even Hyper-Threading does not allow Nehalem to fight on equal terms with the followers of newer generations. But they are not much hampered by the reduction in cache memory capacity. More precisely, it practically does not interfere at all, since the difference is again less than 1%.

Office software

As one might expect, there is no performance gain from increasing the cache capacity (more precisely, no drop from decreasing it). Although if you look at the detailed results, you can see that the only multi-threaded test in this group (namely text recognition in FineReader) runs about 1.5% faster with 8 MiB of L3 than with 6 MiB. It would seem - what is 1.5%? From a practical point of view, nothing. But from a research point of view it is already interesting: as we can see, it is the multi-threaded tests that most often lack cache memory. As a result, a difference (albeit a small one) sometimes turns up even where it seemingly shouldn't. There is nothing inexplicable about this: roughly speaking, in low-threaded tests we have 3-6 MiB per thread, while in multi-threaded tests only about 1.5 MiB. The former is plenty, but the latter may not be quite enough.

Java

However, the Java machine disagrees with this assessment, which is also understandable: as we have written more than once, it is well optimized not for x86 processors at all, but for phones and coffee makers, where there can be many cores but very little cache memory - and sometimes few cores as well, since both are expensive resources in terms of chip area and power consumption. And while something can be done about cores and megahertz, with the cache everything is more complicated: the quad-core Tegra 3, for example, has only 1 MiB. It is clear that the JVM can "squeeze out" more (like all bytecode systems), as we already saw when comparing Celeron and Pentium, but more than 1.5 MiB per thread, even if it could be useful, is simply not needed by the tasks included in SPECjvm 2008.

Games

We had high hopes for games, since they often turn out to be more demanding of cache memory capacity than even archivers. But this happens when there is very little of it, and 6 MiB, as we see, is enough. And, again, quad-core Core processors of any generation, even at a frequency of 2.4 GHz, are too powerful a solution for the gaming applications used, so the bottleneck will clearly not be them, but other components of the system. That's why we decided to dust off the modes with low quality graphics - it is clear that for such systems it is too synthetic, but our testing is all synthetic :)

When video cards and the like don't get in the way, the difference between the two Ivy Bridges reaches an "insane" 3%: in practice you can ignore it, but in theory it is a lot. Only in the archivers was the difference larger.

Multitasking environment

We've already seen this somewhere. Well, yes - when we tested six-core processors for LGA2011. And now the situation repeats itself: the load is multi-threaded, some of the programs used are "greedy" for cache memory, yet increasing it only reduces the average performance. How can this be explained? Perhaps only by arbitration becoming more complicated and the number of errors increasing. Note, moreover, that this happens only when the L3 capacity is relatively large and there are at least four simultaneously running computation threads - in the budget segment the picture is completely different. In any case, as our recent testing of Pentium and Celeron showed, for dual-core processors increasing L3 from 2 to 3 MiB adds 6% in performance. But it gives four- and six-core processors nothing, to put it mildly. Even less than nothing.

Total

The logical overall result: since no significant difference was found anywhere between processors with different L3 sizes, there is none “in general.” Thus, there is no reason to be upset about the reduction in cache memory capacity in the second and third generations of Core i5 - the predecessors of the first generation are not competitors to them anyway. And older Core i7s, on average, also demonstrate only a similar level of performance (of course, mainly due to the lag in low-threaded applications - and there are scenarios that, under equal conditions, they handle faster). But, as we have already said, in practice real processors are far from being on equal terms in terms of frequencies, so the practical difference between generations is greater than can be obtained in such studies.

Only one question remains open: we had to greatly reduce the clock frequency to ensure equal conditions with the first generation Core, but will the observed patterns persist in conditions closer to reality? After all, just because four low-speed computation threads do not see the difference between 6 and 8 MiB of cache memory, it does not follow that it will not be detected in the case of four high-speed ones. True, the opposite does not follow, so in order to finally close the topic of theoretical research, we will need one more laboratory work, which we will do next time.

When performing various tasks, the processor of your computer receives the necessary blocks of information from RAM. Having processed them, the CPU writes the obtained calculation results into memory and receives subsequent blocks of data for processing. This continues until the task is completed.

The above processes are carried out at very high speed. However, the speed of even the fastest RAM is significantly less than the speed of any weak processor. Every action, be it writing information to it or reading it from it, takes a lot of time. The speed of RAM is tens of times lower than the speed of the processor.

Despite this difference in information processing speed, the PC processor does not sit idle and does not wait for the RAM to issue and receive data. The processor is always working and all thanks to the presence of cache memory in it.

A cache is a special type of RAM. The processor uses cache memory to store those copies of information from the computer's main RAM that are likely to be accessed in the near future.

Essentially, cache memory acts as a high-speed memory buffer that stores information that the processor may need. Thus, the processor receives the necessary data tens of times faster than when reading it from RAM.

The main difference between cache memory and a regular buffer is the built-in logic. A buffer stores data indiscriminately and usually hands it out in a fixed order: "first in, first out" (FIFO) or "first in, last out" (LIFO). The cache, by contrast, holds data that is likely to be accessed in the near future. Thanks to this "smart" behavior, the processor can operate at full speed without waiting for data to be retrieved from the slower RAM.
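The distinction can be demonstrated in a short sketch: both structures below have the same capacity, but the FIFO buffer evicts in arrival order, while the LRU cache keeps recently used (hence likely to be reused) items. The workload and names are illustrative assumptions.

```python
# "Dumb" FIFO buffer vs. LRU cache at equal capacity (illustrative sketch).
from collections import OrderedDict

def count_hits(accesses, capacity, policy):
    store, hits = OrderedDict(), 0
    for item in accesses:
        if item in store:
            hits += 1
            if policy == "lru":
                store.move_to_end(item)    # a FIFO buffer ignores reuse
        else:
            if len(store) >= capacity:
                store.popitem(last=False)  # evict oldest (FIFO) or LRU entry
            store[item] = None
    return hits

stream = [1, 2, 3, 1, 4, 1, 5, 1, 6, 1] * 50  # item 1 is "hot"
print(count_hits(stream, capacity=3, policy="fifo"))
print(count_hits(stream, capacity=3, policy="lru"))  # more hits: keeps item 1
```

On this access pattern the LRU policy retains the frequently reused item and scores noticeably more hits than the plain buffer, which is the whole point of the "built-in logic".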

Basic types and levels of cache memory L1 L2 L3

Cache memory is built from static random access memory (SRAM) chips, which are installed on the system board or built into the processor. Compared to other types of memory, static memory can operate at very high speeds.

Cache speed depends on the size of the specific chip: the larger the chip, the more difficult it is to achieve a high operating speed. Taking this into account, the processor's cache memory is manufactured as several small blocks called levels. The most common today is the three-level cache system L1, L2, L3:

L1 cache - the smallest in size (only a few tens of kilobytes), but the fastest and the most important. It contains the data most frequently used by the processor and operates with minimal delay. Typically, the number of L1 blocks equals the number of processor cores, with each core accessing only its own L1 block.

L2 cache is inferior to L1 in speed but superior in size, which is already measured in hundreds of kilobytes. It is intended for temporary storage of important information whose probability of being accessed is lower than that of the information stored in the L1 cache.

L3 cache - the largest of the three levels (it can reach tens of megabytes), but also the slowest, though still significantly faster than RAM. The L3 cache is shared by all processor cores. It is designed for temporary storage of important data whose probability of being accessed is slightly lower than that of the information in the first two levels, L1 and L2. It also handles communication between the processor cores.
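The cascade through the levels described above can be sketched as a lookup chain. The latencies here are illustrative round numbers in CPU cycles, not measurements of any specific chip.

```python
# The L1 -> L2 -> L3 -> RAM cascade with illustrative latencies (in cycles).
LEVELS = [("L1", 4), ("L2", 12), ("L3", 40)]
RAM_LATENCY = 200

def access_cost(addr, contents):
    """Return which level satisfied the request and its cost in cycles."""
    for name, latency in LEVELS:
        if addr in contents.get(name, set()):
            return name, latency
    return "RAM", RAM_LATENCY            # miss in every level: go to memory

contents = {"L1": {0x10}, "L2": {0x10, 0x20}, "L3": {0x10, 0x20, 0x30}}
print(access_cost(0x10, contents))  # ('L1', 4)
print(access_cost(0x30, contents))  # ('L3', 40)
print(access_cost(0x99, contents))  # ('RAM', 200)
```

Even with these rough numbers, the point of the hierarchy is visible: an L3 hit is several times more expensive than an L1 hit, but still far cheaper than a trip to RAM.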

Some processor models are designed with two levels of cache memory, in which the L2 cache combines the functions of L2 and L3.

When is a large cache size useful?

You will feel a significant effect from a large cache when using archivers, in 3D games, and during video processing and encoding. In relatively "light" programs and applications (office software, media players, etc.) the difference is practically unnoticeable.

How important is L3 cache for AMD processors?

Indeed, it makes sense to equip multi-core processors with dedicated memory that will be shared by all available cores. In this role, a fast third-level (L3) cache can significantly speed up access to data that is requested most often. Then the cores, if possible, will not have to access slow main memory (RAM).

At least in theory. Recently AMD announced the Athlon II X4 processor, which is a Phenom II X4 model without L3 cache, hinting that it is not that necessary. We decided to directly compare two processors (with and without L3 cache) to test how the cache affects performance.

Click on the picture to enlarge.

How does the cache work?

Before we dive into the tests, it's important to understand some basics. The principle of how the cache works is quite simple. The cache buffers data as close to the processing cores of the processor as possible to reduce CPU requests to more distant and slow memory. On modern desktop platforms, the cache hierarchy includes as many as three levels that precede access to RAM. Moreover, caches of the second and, in particular, third levels serve not only to buffer data. Their purpose is to prevent the processor bus from becoming overloaded when the cores need to exchange information.

Hits and misses

The effectiveness of cache architectures is measured by hit rate. Data requests that can be satisfied by the cache are considered hits. If this cache does not contain the necessary data, then the request is passed further along the memory pipeline, and a miss is counted. Of course, misses lead to more time required to obtain information. As a result, “bubbles” (idles) and delays appear in the computing pipeline. Hits, on the contrary, allow you to maintain maximum performance.

Cache entry, exclusivity, coherence

Replacement policies dictate how space is freed up in the cache for new entries. Because data written to the cache must eventually appear in main memory, systems may do so at the same time as writing to the cache (write-through), or may mark the data areas as "dirty" (write-back) and write to memory. when it is evicted from the cache.

Data in several cache levels can be stored exclusively, that is, without redundancy. Then you won't find the same data lines in two different cache hierarchies. Or caches can work inclusively, that is, the lower cache levels are guaranteed to contain data present in the upper cache levels (closer to the processor core). AMD Phenom uses an exclusive L3 cache, while Intel follows an inclusive cache strategy. Coherency protocols ensure the integrity and freshness of data across different cores, cache levels, and even processors.

Cache size

A larger cache can hold more data, but tends to increase latency. In addition, a large cache consumes a considerable number of processor transistors, so it is important to find a balance between the transistor budget, die size, power consumption and performance/latency.

Associativity

Entries in RAM can be directly mapped to the cache, that is, there is only one cache position for a copy of data from RAM, or they can be n-way associative, that is, there are n possible locations in the cache where this data may be stored. Higher degrees of associativity (up to fully associative caches) provide greater caching flexibility because existing data in the cache does not need to be rewritten. In other words, a high n-degree of associativity guarantees a higher hit rate, but it also increases latency because it takes more time to check all those associations for a hit. Typically, the highest degree of association is reasonable for the last level of caching, since the maximum capacity is available there, and searching for data outside of this cache will result in the processor accessing slow RAM.

Here are some examples: Core i5 and i7 use 32 KB of L1 cache with 8-way associativity for data and 32 KB of L1 cache with 4-way associativity for instructions. It's understandable that Intel wants instructions to be available faster and the L1 data cache to have a maximum hit rate. The L2 cache on Intel processors has 8-way associativity, and the Intel L3 cache is even smarter, since it implements 16-way associativity to maximize hits.

AMD, however, follows a different strategy with the Phenom II X4 processors, which use a 2-way associative L1 cache to reduce latency. To compensate for the resulting misses, the capacity was doubled: 64 KB for data and 64 KB for instructions per core. The L2 cache is 8-way associative, like the Intel design, but AMD's L3 cache operates with 48-way associativity. Still, the decision to choose one cache architecture over another cannot be judged without considering the entire CPU architecture. Only test results have practical significance, and our goal was precisely a practical test of this entire complex multi-level caching structure.
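For reference, the number of sets in each of these designs follows directly from capacity, associativity, and line size (sets = capacity / (ways x line size)). The quick calculation below assumes 64-byte cache lines, which is typical for these processor generations:

```python
def num_sets(size_kb, ways, line_bytes=64):
    """Number of sets = capacity / (associativity x line size)."""
    return size_kb * 1024 // (ways * line_bytes)

print(num_sets(32, 8))         # Intel L1 data: 32 KB, 8-way  -> 64 sets
print(num_sets(64, 2))         # AMD L1 data:   64 KB, 2-way  -> 512 sets
print(num_sets(6 * 1024, 48))  # Phenom II L3:  6 MB, 48-way  -> 2048 sets
```

Note how AMD's low-associativity L1 ends up with many more sets to index, while the 48-way L3 concentrates its capacity into comparatively few, very wide sets.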

Every modern processor has a dedicated cache that stores instructions and data, ready for use almost instantly. This level is commonly referred to as the Level 1 (L1) cache, and it was first introduced on the 486DX processors. Recent AMD processors use 64 KB of L1 cache per core (separately for data and for instructions), while Intel processors use 32 KB of L1 cache per core (likewise split between data and instructions).


The second-level (L2) cache appeared on all processors after the release of the Pentium III, although its first implementation was in the Pentium Pro processor, where it sat in the package rather than on the chip itself. Modern processors carry up to 6 MB of on-chip L2 cache. As a rule, this capacity is shared between two cores, as on the Intel Core 2 Duo, for example. Typical L2 configurations provide 512 KB or 1 MB of cache per core; processors with a smaller L2 cache tend to occupy the lower price segment. Below is a diagram of early L2 cache implementations.

The Pentium Pro had the L2 cache in the processor packaging. In subsequent generations of Pentium III and Athlon, the L2 cache was implemented through separate SRAM chips, which was very common at that time (1998, 1999).

The subsequent move to 180 nm process technology finally allowed manufacturers to integrate the L2 cache onto the processor die.


The first dual-core processors simply reused existing designs by placing two dies in one package. AMD introduced a dual-core processor on a monolithic die, adding a memory controller and a crossbar switch, while Intel simply assembled two single-core dies in one package for its first dual-core processor.


The L2 cache was first shared between two computing cores on the Core 2 Duo processors. AMD went further and created its first quad-core Phenom from scratch, while Intel, to reduce costs, once again used a pair of dies, this time two dual-core Core 2 dies, for its first quad-core processor.

The third-level cache has existed since the early days of the Alpha 21164 processor (96 KB, introduced in 1995) and the IBM Power4 (256 KB, 2001). Among Intel's processors, the L3 cache first appeared with the Itanium 2 (an IA-64 design) and the Pentium 4 Extreme Edition (Gallatin), both in 2003, as well as the Xeon MP models (2006).

Early implementations simply provided another level in the cache hierarchy, while modern architectures use the L3 cache as a large shared buffer for inter-core data exchange in multi-core processors. This role is underlined by its high associativity: it is better to spend a little longer searching the cache than to have several cores fall back on very slow accesses to main RAM. AMD first brought L3 cache to a desktop processor with the already mentioned Phenom line. The 65 nm Phenom X4 contained 2 MB of shared L3 cache, while the current 45 nm Phenom II X4 carries 6 MB of shared L3 cache. Intel Core i7 and i5 processors use 8 MB of L3 cache.

Modern quad-core processors have dedicated L1 and L2 caches for each core, as well as a large L3 cache shared by all cores. The shared L3 cache also allows for the exchange of data that the cores can work on in parallel.