Hardware and software setup

What is the level 3 cache in a processor? How does the cache work? A brief excursion into history

    The chips on most modern desktops have four cores, but chip manufacturers have already announced plans to move to six cores, and 16-core processors are far from uncommon in high-end servers today.

    The more cores there are, the harder it becomes to distribute memory among them as they work together. As the core count grows, it pays more and more to minimize the time lost to managing memory during data processing, because data transfer speed lags behind the speed at which the processor works on data. A core can physically reach into another core's fast cache, or it can use its own slower one and save on data transfer time. The task is complicated by the fact that the amount of memory programs request does not map neatly onto the amount of cache memory of each type.

    Physically, only a very limited amount of memory can be placed right next to the processor: the L1 cache, whose capacity is tiny. Daniel Sanchez, Po-An Tsai, and Nathan Beckmann, researchers at MIT's Computer Science and Artificial Intelligence Laboratory, taught a computer to configure its different types of memory into a flexibly formed hierarchy for the running programs in real time. The new system, called Jenga, analyzes how much memory programs need and how often they access it, and redistributes the capacity of each of the three types of processor cache into combinations that improve efficiency and save energy.


    To begin with, the researchers measured the performance gain from combining static and dynamic memory when running programs on a single-core processor and derived a primary hierarchy: when it is better to use a combination of the two memory types, and when a single one suffices. Two parameters were evaluated: signal delay (latency) and the energy consumed while each program ran. Approximately 40% of the programs ran worse with a combination of memory types; the rest ran better. Having recorded which programs “like” mixed performance and which like sheer memory size, the researchers built their Jenga system.

    They tested four types of programs on a virtual computer with 36 cores. The programs tested were:

    • omnet - Objective Modular Network Testbed, a C++ simulation library and network simulator platform (blue in the figure)
    • mcf - a combinatorial optimization (vehicle scheduling) benchmark (red)
    • astar - a pathfinding (A* search) benchmark (green)
    • bzip2 - an archiver (purple)


    The picture shows where and how the data of each of the programs was processed. The letters show where each application runs (one per quadrant), the colors show where its data resides, and the shading indicates the second level of the virtual hierarchy when present.

    Cache levels

    The CPU cache is divided into several levels; general-purpose processors have up to three. The fastest is the first-level cache, the L1 cache, since it sits on the same chip as the processor. It consists of an instruction cache and a data cache. Some processors cannot function without an L1 cache at all. The L1 cache operates at the processor frequency and can be accessed on every clock cycle; it can often perform several read/write operations at the same time. Its capacity is usually small, no more than 128 KB.

    The L1 cache interacts with the second-level cache, L2, which is the second fastest. It is usually located either on the chip, like L1, or in close proximity to the core, for example in the processor cartridge; in older processors it sat in the chipset on the system board. L2 cache capacity ranges from 128 KB to 12 MB. In modern multi-core processors, the second-level cache on the same die is split per core: with a total cache size of 8 MB, each core gets 2 MB. Typically, the latency of an L2 cache located on the core die is 8 to 20 core cycles. In workloads with many accesses to a limited memory area, such as a DBMS, making full use of it can give a tenfold increase in performance.

    The L3 cache is usually even larger, although somewhat slower than L2 (because the bus between L2 and L3 is narrower than the bus between L1 and L2). L3 is usually located separately from the CPU core but can be large, more than 32 MB. The L3 cache is slower than the previous levels but still faster than RAM. In multiprocessor systems it is commonly shared. Using a third-level cache pays off only in a fairly narrow range of tasks and may not only fail to improve performance but, on the contrary, lead to a general decrease in system performance.

    Disabling the second- and third-level caches is most useful in mathematical workloads where the amount of data is smaller than the cache. In that case you can load all the data into the L1 cache at once and then process it.
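
    The effect of these levels is easy to observe. The C sketch below (a minimal example assuming a Linux system with clock_gettime; the exact thresholds depend on your particular CPU) chases pointers through a randomly shuffled working set of growing size: the time per access jumps each time the working set stops fitting into L1, then L2, then L3.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile size_t sink;                 /* keeps the compiler from optimizing the loop away */
    static unsigned long long rng_state = 88172645463325252ULL;

    static unsigned long long xorshift64(void) { /* small PRNG, avoids RAND_MAX limits */
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return rng_state;
    }

    static double chase_ns(const size_t *ring, size_t steps) {
        struct timespec t0, t1;
        size_t idx = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++)
            idx = ring[idx];                     /* each load depends on the previous one */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = idx;
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)steps;
    }

    int main(void) {
        for (size_t kb = 16; kb <= 64 * 1024; kb *= 2) {        /* 16 KB .. 64 MB working sets */
            size_t n = kb * 1024 / sizeof(size_t);
            size_t *ring = malloc(n * sizeof(size_t));
            if (!ring) return 1;
            for (size_t i = 0; i < n; i++) ring[i] = i;
            /* Sattolo's algorithm: one random cycle, so the chase visits every
               element and the hardware prefetcher cannot guess the next address. */
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = (size_t)(xorshift64() % i);
                size_t tmp = ring[i]; ring[i] = ring[j]; ring[j] = tmp;
            }
            printf("%6zu KB: %5.1f ns per access\n", kb, chase_ns(ring, 20 * 1000 * 1000));
            free(ring);
        }
        return 0;
    }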


    From time to time, Jenga reconfigures virtual hierarchies at the OS level to minimize the amount of data exchange, taking into account resource constraints and application behavior. Each reconfiguration consists of four steps.

    Jenga distributes data not only according to which programs are running - those that prefer a large single-speed memory or those that prefer the speed of mixed caches - but also according to the physical proximity of memory cells to the data being processed, regardless of what type of cache the program would request by default or by hierarchy. The main goal is to minimize signal delay and power consumption. Depending on how many memory types the program "likes", Jenga models the latency of each virtual hierarchy with one or two levels. Two-level hierarchies form a surface, one-level hierarchies form a curve. Jenga then projects the minimum latency onto the VL1 dimension, which yields two curves. Finally, Jenga uses these curves to select the best hierarchy (i.e. the VL1 size).
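
    The listing below is only a toy sketch of that selection step, not Jenga's actual code: the two latency "models" are invented placeholder curves, and the function simply projects the two-level surface onto the VL1 axis and picks the VL1 size with the lowest modeled latency, as described above.

    #include <float.h>
    #include <stdio.h>

    /* Placeholder latency models: made-up curves that merely fall as the
       allocated capacity grows. Jenga builds the real curves from the
       measured miss rates of each application. */
    static double latency_one_level(double vl1_mb) {
        return 4.0 + 60.0 / vl1_mb;
    }
    static double latency_two_level(double vl1_mb, double vl2_mb) {
        return 6.0 + 20.0 / vl1_mb + 60.0 / (vl1_mb + vl2_mb);
    }

    /* For every candidate VL1 size, take the better of the one-level estimate
       and the best two-level estimate (projecting the two-level surface onto
       the VL1 axis), then return the VL1 size with the lowest modeled latency. */
    static double pick_vl1_size(const double *vl1, int n1, const double *vl2, int n2) {
        double best_size = vl1[0], best_lat = DBL_MAX;
        for (int i = 0; i < n1; i++) {
            double lat = latency_one_level(vl1[i]);
            for (int j = 0; j < n2; j++) {
                double l = latency_two_level(vl1[i], vl2[j]);
                if (l < lat) lat = l;
            }
            if (lat < best_lat) { best_lat = lat; best_size = vl1[i]; }
        }
        return best_size;
    }

    int main(void) {
        double vl1[] = { 0.5, 1, 2, 4, 8 };       /* candidate VL1 sizes, MB */
        double vl2[] = { 4, 8, 16, 32 };          /* candidate VL2 sizes, MB */
        printf("chosen VL1 size: %.1f MB\n", pick_vl1_size(vl1, 5, vl2, 4));
        return 0;
    }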

    Using Jenga gave a tangible effect: the 36-core virtual chip ran 30 percent faster and used 85 percent less power. Of course, for now Jenga is only a simulation of a working computer, and it will be some time before you see real examples of this cache, and even longer before chip manufacturers adopt the technology, if they like it.

    Configuration of the hypothetical 36-core machine

    • Processors. 36 cores, x86-64 ISA, 2.4 GHz, Silvermont-like OOO: 8B-wide
      ifetch; 2-level bpred with 512x10-bit BHSRs + 1024x2-bit PHT, 2-way decode/issue/rename/commit, 32-entry IQ and ROB, 10-entry LQ, 16-entry SQ; 371 pJ/instruction, 163 mW/core static power
    • L1 caches. 32 KB, 8-way set-associative, split data and instruction caches,
      3-cycle latency; 15/33 pJ per hit/miss
    • Prefetchers. 16-entry stream prefetchers modeled after and validated against
      Nehalem
    • L2 caches. 128 KB private per-core, 8-way set-associative, inclusive, 6-cycle latency; 46/93 pJ per hit/miss
    • Coherence. 16-way, 6-cycle-latency directory banks for Jenga; in-cache L3 directories for others
    • Global NoC. 6×6 mesh, 128-bit flits and links, X-Y routing, 2-cycle pipelined routers, 1-cycle links; 63/71 pJ per router/link flit traversal, 12/4 mW router/link static power
    • SRAM (static memory banks). 18 MB, one 512 KB bank per tile, 4-way 52-candidate zcache, 9-cycle bank latency, Vantage partitioning; 240/500 pJ per hit/miss, 28 mW/bank static power
    • Stacked DRAM (multilayer dynamic memory). 1152 MB, one 128 MB vault per 4 tiles, Alloy with MAP-I, DDR3-3200 (1600 MHz), 128-bit bus, 16 ranks, 8 banks/rank, 2 KB row buffer; 4.4/6.2 nJ per hit/miss, 88 mW/vault static power
    • Main memory. 4 DDR3-1600 channels, 64-bit bus, 2 ranks/channel, 8 banks/rank, 8 KB row buffer; 20 nJ/access, 4 W static power
    • DRAM timings. tCAS=8, tRCD=8, tRTP=4, tRAS=24, tRP=8, tRRD=4, tWTR=4, tWR=8, tFAW=18 (all timings in tCK; stacked DRAM has half the tCK of main memory)


    The CPU (central processing unit) is a chip designed to execute program code and to carry out the computer's basic information-processing functions. The processor performs logical and arithmetic operations, manages computing processes, and coordinates the operation of the system's devices.

    The central processing unit is a chip that is either soldered onto the motherboard or installed in a socket on it.

    Consider the main characteristics of the microprocessor.

    The core is the main component of the CPU and determines most of its parameters, such as the socket type, the operating frequency range, and the front-side bus (FSB) frequency.

    The processor core has the following characteristics, discussed below: the size of the internal first- and second-level caches, voltage, heat dissipation, and so on.

    Number of cores - the number of cores in the CPU. Increasing the number of CPU cores increases its performance.

    First-level cache size (L1 cache)

    The L1 cache is high-speed memory with a capacity of 8 to 128 KB into which data is copied from RAM. The cache block is located on the processor core. Because data held in the cache is processed faster than data in RAM, keeping the main instructions in the cache increases CPU performance. For multi-core processors, the L1 cache size is specified per core.

    L2 cache size (L2 cache)

    The L2 cache is high-speed memory that serves the same purpose as the first-level cache but has a larger capacity, from 128 to 12288 KB. Processors with a large second-level cache are designed for resource-intensive tasks. Multi-core CPUs are usually characterized by their total L2 cache size.

    The L3 cache size (L3 cache) ranges from 0 to 16384 KB.

    The integrated L3 cache, together with the system bus, forms a high-speed channel for exchanging data with system memory. L3 cache is found mainly in processors intended for servers; it is present in processor lines such as the Itanium 2, Intel Pentium 4 Extreme Edition, Xeon DP, and others.
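
    For reference, on Linux with glibc you can ask the C library for these cache sizes directly; the minimal sketch below uses sysconf (values are reported in bytes and may be missing on some systems).

    #include <stdio.h>
    #include <unistd.h>

    static void show(const char *name, int sc) {
        long bytes = sysconf(sc);              /* glibc reads this from the CPU/kernel */
        if (bytes > 0)
            printf("%s: %ld KB\n", name, bytes / 1024);
        else
            printf("%s: not reported\n", name);
    }

    int main(void) {
        show("L1 data cache", _SC_LEVEL1_DCACHE_SIZE);
        show("L2 cache     ", _SC_LEVEL2_CACHE_SIZE);
        show("L3 cache     ", _SC_LEVEL3_CACHE_SIZE);
        return 0;
    }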

    The socket is the connector for installing a microprocessor on the motherboard. The socket type is determined by the processor manufacturer and by the number of pins. Different CPUs require different socket types.

    Processor clock frequency (MHz) is the number of operations (cycles) the processor performs per second. The higher this figure, the more productive the processor. Keep in mind, though, that this holds only for CPUs from the same manufacturer, since besides clock frequency the microprocessor's performance is affected by parameters such as L2 cache size, L3 cache frequency, and so on. The processor frequency is proportional to the FSB (bus) frequency.

    Bus frequency (front-side bus, FSB) is the clock frequency at which data is exchanged between the system bus and the processor.

    The data bus is a set of signal lines used by the processor to exchange data with the computer's internal devices.

    Heat dissipation (TDP, thermal design power) is a parameter that specifies how much heat the cooling system must remove for the processor to function normally. Values for this parameter range from 10 to 165 watts. It is only meaningful to compare heat dissipation for processors from the same manufacturer, since each manufacturer defines it differently.

    Virtualization Technology support
    The Virtualization Technology feature makes it possible to run several operating systems simultaneously on one computer.

    AMD64/EM64T technology support
    Thanks to this technology, microprocessors with a 64-bit architecture can work equally effectively with 32-bit and 64-bit applications. Processor lines with 64-bit architecture include the AMD Athlon 64, AMD Opteron, Core 2 Duo, Intel Xeon 64, and others. CPUs that support 64-bit addressing can work with more than 4 GB of RAM, which is not available to 32-bit processors. AMD calls its 64-bit extensions AMD64, while Intel calls them EM64T.
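
    A quick, hedged illustration of that last point: the small C program below prints the pointer width and tries to allocate a single 5 GiB block, something only a 64-bit build can even express (whether the allocation actually succeeds also depends on available memory and the operating system's overcommit policy).

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        printf("pointer size: %zu bits\n", sizeof(void *) * 8);
        unsigned long long want = 5ULL << 30;                /* 5 GiB */
        if (want > (unsigned long long)(size_t)-1) {
            /* On a 32-bit build size_t cannot even hold this request. */
            printf("a 32-bit build cannot address a 5 GiB block\n");
        } else {
            void *p = malloc((size_t)want);                  /* needs 64-bit addressing */
            printf("5 GiB allocation: %s\n", p ? "succeeded" : "failed");
            free(p);
        }
        return 0;
    }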

    The maximum operating temperature (from 54.8 to 105 °C) is the highest processor temperature at which normal operation is still possible. The operating temperature of the CPU depends on its load and on the quality of cooling. At low load and with normal heat removal the processor temperature is 25-40 °C; under heavy load it rises to 60-70 °C. Processors with high operating temperatures require cooling systems that provide efficient heat removal.

    Core voltage is the voltage the processor core requires in order to operate. It ranges from about 0.65 to 1.65 V.

    What is the dirtiest place on a computer? The recycle bin, you think? The user folders? The cooling system? Wrong! The dirtiest place is the cache! After all, it constantly has to be cleaned!

    In fact, a computer has many caches, and they serve not as waste dumps but as accelerators for hardware and applications. So where does their reputation as the "system garbage chute" come from? Let's look at what a cache is, what kinds there are, how it works, and why it needs clearing from time to time.

    The concept and types of cache memory

    A cache, or cache memory, is a special store of frequently used data that can be accessed tens, hundreds, or thousands of times faster than RAM or other storage media.

    Applications (web browsers, audio and video players, database editors, and so on), operating system components (the thumbnail cache, the DNS cache), and hardware (the CPU's L1-L3 caches, the video chip's framebuffer, drive buffers) all have their own cache memory. It is implemented in different ways, in software and in hardware.

    • A program's cache is simply a separate folder or file into which, for example, pictures, menus, scripts, multimedia and other content from visited sites are downloaded. This is the folder the browser dives into first when you open a web page again. Pulling a piece of content from local storage speeds up its loading (a minimal sketch of such a software cache in code follows this list).

    • In hard drives, in particular, the cache is a separate RAM chip with a capacity of 1-256 MB located on the electronics board. It receives information that has been read from the magnetic layer but not yet loaded into RAM, as well as the data most frequently requested by the operating system.

    • A modern central processor contains two or three main levels of cache memory (also called scratchpad memory), implemented as hardware blocks on the same chip. The fastest and smallest (32-64 KB) is the Level 1 (L1) cache; it runs at the same frequency as the processor. L2 occupies the middle position in speed and capacity (from 128 KB to 12 MB). L3 is the slowest and largest (up to 40 MB) and is absent on some models. L3 is slow only relative to its faster siblings; it is still hundreds of times faster than even the most productive RAM.
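
    As promised above, here is a minimal sketch of an application-level (software) cache in C: a tiny direct-mapped table that memoizes the results of a deliberately slow lookup. fetch_slow() and its cost are invented for the illustration; a real browser caches pictures and scripts in much the same spirit, only on disk.

    #include <stdio.h>
    #include <stdbool.h>

    #define CACHE_SLOTS 256

    struct slot { unsigned key; long value; bool valid; };
    static struct slot cache[CACHE_SLOTS];

    static long fetch_slow(unsigned key) {            /* stands in for disk or network work */
        long v = 0;
        for (unsigned i = 0; i < 1000000; i++) v += (key + i) % 7;
        return v;
    }

    static long fetch_cached(unsigned key) {
        struct slot *s = &cache[key % CACHE_SLOTS];   /* direct-mapped placement */
        if (s->valid && s->key == key)
            return s->value;                          /* cache hit: no slow work */
        s->key = key;                                 /* cache miss: fetch and store */
        s->value = fetch_slow(key);
        s->valid = true;
        return s->value;
    }

    int main(void) {
        for (int i = 0; i < 5; i++)
            printf("request %d -> %ld\n", i % 2, fetch_cached((unsigned)(i % 2))); /* repeats hit the cache */
        return 0;
    }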

    The processor's scratchpad memory stores constantly used data pulled in from RAM, along with machine-code instructions. The larger it is, the faster the processor.

    Today, three levels of caching are no longer the limit. With the Sandy Bridge architecture, Intel introduced an additional L0 cache in its products (it stores decoded micro-operations), and the highest-performance CPUs also have a fourth-level cache implemented as a separate chip.

    Schematically, the interaction of cache levels L0-L3 looks like this (using an Intel Xeon as an example):

    How it all works, in plain language

    To understand how cache memory works, imagine a person working at a desk. The folders and documents he uses all the time lie on the desk (in the cache). To reach them, he only has to stretch out a hand.

    Papers he needs less often are stored nearby on shelves (in RAM). To get them, he has to stand up and walk a few meters. And whatever he is not currently working on has been sent to the archive (written to the hard disk).

    The wider the desk, the more documents fit on it, and the faster the employee can get at more information (the larger the cache capacity, the faster a program or device works, in theory).

    Sometimes he makes mistakes: he keeps papers with incorrect information on the desk and uses them in his work, and the quality of his work suffers (cache errors lead to software and hardware failures). To correct the situation, he must throw away the documents with errors and put correct ones in their place (clear the cache memory).

    The desk has a limited area (cache memory is limited). Sometimes it can be expanded, for example by pushing up a second desk, and sometimes it cannot (the cache size can be increased if the program provides for it; a hardware cache cannot be changed, since it is fixed in silicon).

    Another way to speed up access to more documents than the desk can hold is to find an assistant who fetches papers from the shelves for the worker (the operating system can allocate some unused RAM for caching device data). But this is still slower than taking them straight off the desk.

    The documents at hand should be relevant to the current tasks; that is the employee's own responsibility. He needs to tidy up the papers regularly (evicting stale data from the cache falls "on the shoulders" of the applications that use it; some programs have an automatic cache-clearing function).

    If the employee forgets to keep the workplace in order and the documentation up to date, he can draw up a desk-cleaning schedule and use it as a reminder, or, as a last resort, entrust the job to the assistant (if an application that depends on its cache has become slower or keeps loading outdated data, use scheduled cache-cleaning tools or clear it manually every few days).

    We actually run into "caching" everywhere in life: buying groceries for the days ahead, doing various things in passing along the way, and so on. In essence, it is everything that saves us from unnecessary fuss and extra movement, streamlines life, and makes work easier. The computer does the same. In short, if there were no cache, it would run hundreds or thousands of times slower. And we would not like that.


    Cache memory is an array of ultra-fast memory that acts as a buffer between the system memory controller and the processor. This buffer stores the blocks of data the processor is currently working with, sharply reducing the number of processor accesses to slow system memory and thereby significantly increasing overall processor performance.

    Cache memory comes in three levels, labeled L1, L2, and L3.

    First-level cache (L1) - the fastest, but smaller in capacity than the rest. The processor core works with it directly. The L1 cache has the lowest latency (access time).
    Second-level cache (L2) - much larger in capacity than the first-level cache.
    Third-level cache (L3) - larger still, and slower than L2.

    In the classic design there were two levels of cache memory, the first and the second. The third level is organized differently from the L2 cache: if data has not yet been processed, or the processor needs to handle more urgent data, the older data is moved to the L3 cache to free up L2. L3 is larger and slower than L2 (the bus between L2 and L3 is narrower than the bus between L1 and L2), but it is still much faster than system memory.

    All data first goes into the second-level cache; there it is partially decoded for processing by the central processor and passed on toward the core.

    In the second-level cache a chain of instructions is built from the data, while the first-level cache "mirrors" the processor's internal instructions, which take into account the processor's specifics, its registers, and so on. The central processor does not have very many internal instructions, so the size of the first-level cache does not matter that much (in modern processors the L1 cache is 64-128 KB per core). Unlike L1, the L2 cache matters a great deal to the processor, which is why processors with the largest L2 caches show the highest performance.

    Processors also differ in how their cache memory is organized. In AMD processors, for example, the cache is strictly divided between the cores and labeled accordingly: 512x2 (Athlon 5200 and below) or 1024x2 (Athlon 5200 and above). In Intel Core 2 Duo processors the cache is not strictly divided, so each core can use as much of the shared cache as it needs, which suits workloads that do not take advantage of multiple cores. When all cores are in use, the cache memory is divided among them dynamically, depending on each core's load.
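
    To see the difference between a per-core cache and a shared one on a real machine, you can read the cache topology the Linux kernel exposes under sysfs; the sketch below assumes Linux (the index0..indexN layout varies by CPU, and the files may be absent in some virtual machines). A per-core L2 lists a single CPU in shared_cpu_list, while a shared L3 lists many.

    #include <stdio.h>

    static void print_file(const char *path) {
        char buf[128];
        FILE *f = fopen(path, "r");
        if (!f) return;
        if (fgets(buf, sizeof buf, f))
            printf("%s", buf);                    /* sysfs lines already end with '\n' */
        fclose(f);
    }

    int main(void) {
        char path[256];
        for (int idx = 0; idx < 8; idx++) {       /* index0..indexN describe L1i/L1d/L2/L3 */
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
            FILE *probe = fopen(path, "r");
            if (!probe) break;                    /* no more cache levels described */
            fclose(probe);
            printf("index%d level: ", idx);  print_file(path);
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/size", idx);
            printf("  size: ");              print_file(path);
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
            printf("  shared with CPUs: ");  print_file(path);
        }
        return 0;
    }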

    This is not about cash but about the cache memory of processors, and not only processors. Out of cache size, marketers have made yet another commercial fetish, especially around the caches of CPUs and hard drives (video cards have one too, but the marketing has not reached it yet). So there is processor XXX with a 1 MB L2 cache and an otherwise identical processor XYZ with a 2 MB cache. Guess which is better? Ah, don't answer right away!

    Cache memory is a buffer where things that can or must be set aside for later are stored. The processor does its work, and situations arise where intermediate data needs to be kept somewhere. In the cache, of course, since it is orders of magnitude faster than RAM: it sits in the processor die itself and usually runs at the same frequency. Some time later the processor fetches that data back and processes it again. Roughly speaking, it is like a potato sorter on a conveyor who, every time something other than a potato comes along (a carrot, say), throws it into a box; when the box is full, he gets up and carries it to the next room. At that moment the conveyor stops and sits idle. The volume of the box is the cache in this analogy. How much of it is needed - 1 MB or 12? Clearly, if the box is too small, too much time is spent carrying it out; but beyond a certain volume, further growth gives nothing. Give the sorter a box for 1000 kg of carrots and he will not fill it in a whole shift, and it will NOT make him TWICE AS FAST!

    There are other subtleties. First, a large cache can increase the latency of accessing it. Second, the probability of errors in it grows, for example during overclocking (how to determine the processor's stability or instability in that case, and whether the error occurs precisely in its cache, by testing L1 and L2, is a separate topic). Third, the cache eats up a decent share of the die area and of the processor's transistor budget. The same goes for hard-drive cache memory. If the processor architecture is strong, many applications will want 1024 KB of cache or more. If you have a fast HDD, 16 MB or even 32 MB is appropriate. But no 64 MB of cache will make a drive faster if it is a cut-down "green" version (Green WD) spinning at 5900 rpm instead of the usual 7200, even if the 7200 rpm drive has only 8 MB. Intel and AMD processors also use the cache differently (broadly speaking, AMD's is more efficient, and their processors are often comfortable with smaller amounts). In addition, Intel's cache is shared, whereas in AMD each core has its own. The fastest cache, L1, in AMD processors is 64 KB each for data and instructions, twice what Intel offers. A third-level L3 cache is usually present in top processors such as the AMD Phenom II 1055T X6 Socket AM3 2.8 GHz or its competitor, the Intel Core i7-980X. Above all, games love large caches, while many professional applications do NOT need them (see "A computer for rendering, video editing and professional applications"); more precisely, the most demanding of them are simply indifferent to it. But what you definitely should not do is choose a processor by cache size. The old Pentium 4 in its later incarnations had 2 MB of cache at clock speeds well past 3 GHz; compare its performance with a cheap dual-core Celeron E1*** running at around 2 GHz. The Celeron will wipe the floor with the old-timer. A more recent example: the high-frequency dual-core E8600 costs almost $200 (apparently because of its 6 MB cache), yet the Athlon II X4-620 at 2.6 GHz, which has only 2 MB, has no trouble taking that competitor apart.

    As the graphs show, neither in complex programs nor in CPU-demanding games will any amount of cache replace additional cores. An Athlon with 2 MB of cache (red) easily outperforms a Core 2 Duo with 6 MB of cache, even at a lower frequency and nearly half the price. Many people also forget that a cache is present in video cards, since they, too, essentially contain processors. A recent example is the GTX 460, where they manage to cut not only the bus width and memory size (which the buyer will notice) but also the shader cache, from 512 KB to 384 KB (which the buyer will NOT notice), and this, too, makes its own negative contribution to performance. It is also interesting to see how performance depends on cache size. Let's examine how quickly it grows as the cache grows, using the same processor as an example. As is well known, processors of the E6***, E4*** and E2*** series differ only in cache size (4, 2 and 1 MB respectively). Running at the same 2400 MHz, they show the following results.

    As you can see, the results do not differ much. I will say more: if a 6 MB processor had been involved, the result would have grown only a little further, because processors reach saturation. For models with 512 KB, however, the drop would be noticeable. In other words, 2 MB is enough even in games. Summing up: cache is good when there is ALREADY plenty of everything else. It is naive and silly to trade hard-drive speed or processor cores for cache size at the same price, because even the most capacious sorting box cannot replace another sorter. But there are good examples too. The Pentium Dual-Core in its early 65 nm revision had 1 MB of cache for two cores (the E2160 series and the like), while the later 45 nm revision of the E5200 series and onward has 2 MB, all other things (and most importantly, the PRICE) being equal. Of course, the latter is the one worth choosing.
