Hardware and software setup

Implementation of a multi-threaded game engine architecture. Operating systems and topology

Having dealt with the theory of multithreading, let's consider a practical example: the Pentium 4. Even while this processor was in development, Intel's engineers kept working to increase its performance without changing the programming interface. Five straightforward approaches were considered:
1. Increasing the clock frequency.
2. Placing two processors on one chip.
3. Introducing new functional blocks.
4. Lengthening the pipeline.
5. Using multithreading.
The most obvious way to improve performance is to raise the clock frequency without changing anything else. As a rule, each subsequent processor model has a somewhat higher clock speed than its predecessor. Unfortunately, with a head-on increase in clock speed, developers run into two problems: increased power consumption (relevant for laptops and other battery-powered computing devices) and overheating (which demands more efficient heat sinks).
The second method, placing two processors on one chip, is relatively simple, but it roughly doubles the area occupied by the chip. If each processor gets its own cache, the number of chips per wafer is halved, which also doubles the production cost. If both processors share a common cache, the large growth in area can be avoided, but another problem arises: the amount of cache per processor is halved, and this inevitably hurts performance. Moreover, while professional server applications can fully exploit the resources of multiple processors, ordinary desktop programs have far less internal parallelism.
Introducing new functional blocks is also not difficult, but it is important to strike a balance. What is the point of a dozen ALUs if the chip cannot issue instructions to the pipeline fast enough to keep all those blocks busy?
A pipeline with more stages, able to divide tasks into smaller pieces and process them in shorter clock periods, improves performance on the one hand, but on the other hand amplifies the penalty of branch mispredictions, cache misses, interrupts, and other events that disrupt the normal flow of instructions through the processor. Moreover, fully realizing the capabilities of a lengthened pipeline requires raising the clock frequency, and that, as we know, leads to higher power consumption and heat dissipation.
Finally, there is multithreading. The advantage of this technology is that introducing an additional software thread puts to work hardware resources that would otherwise stand idle. Experimental studies by Intel's developers showed that a 5% increase in chip area spent on implementing multithreading yields a performance gain of 25% for many applications. The first multithreaded Intel processor was the 2002 Xeon. Subsequently, starting with the 3.06 GHz model, multithreading was introduced into the Pentium 4 line. Intel calls its implementation of multithreading in the Pentium 4 hyperthreading.
The basic principle of hyperthreading is the simultaneous execution of two program threads (or processes; the processor does not distinguish processes from threads). The operating system treats a hyperthreaded Pentium 4 as a two-processor complex with shared caches and main memory, and schedules each program thread separately. Thus, two applications can run at the same time. For example, a mail program might send or receive messages in the background while the user interacts with some other application; a daemon and a user program then execute simultaneously, as if two processors were available to the system.
Application programs that can execute in multiple threads can use both "virtual processors". For example, video-editing programs typically let users apply filters to all frames; such filters adjust brightness, contrast, color balance, and other frame properties. The program can then assign one virtual processor to process the even frames and the other to process the odd ones, and the two will work entirely independently of each other.
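As a sketch of how such a split might look in software (the Frame type and the filter here are made-up stand-ins, not any real editor's API), two threads can each take every second frame:

#include <cstddef>
#include <thread>
#include <vector>

struct Frame { std::vector<unsigned char> pixels; };

// Stand-in for a brightness/contrast filter applied to one frame.
void apply_filter(Frame& f) {
    for (auto& p : f.pixels)
        p = static_cast<unsigned char>(p * 9 / 10);
}

void process_all(std::vector<Frame>& frames) {
    // One worker takes even-indexed frames, the other odd-indexed ones;
    // the two never touch the same frame, so no locking is needed.
    auto worker = [&frames](std::size_t start) {
        for (std::size_t i = start; i < frames.size(); i += 2)
            apply_filter(frames[i]);
    };
    std::thread even(worker, std::size_t{0});
    std::thread odd(worker, std::size_t{1});
    even.join();
    odd.join();
}

int main() {
    std::vector<Frame> clip(100, Frame{std::vector<unsigned char>(1024, 200)});
    process_all(clip);
}
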
Since the software threads access the same hardware resources, the threads must be coordinated. In the context of hyperthreading, Intel's developers identified four useful resource-sharing strategies: resource duplication, and hard, threshold, and full resource sharing. Let's look at each.
Let's start with resource duplication. Some resources are duplicated precisely in order to support program threads. For example, since each thread needs individual control, a second program counter is required. A second table for mapping the architectural registers (EAX, EBX, and so on) onto physical registers is also needed, and the interrupt controller is likewise duplicated, since interrupts are handled for each thread individually.
Next comes hard partitioning of resources (partitioned resource sharing) between the program threads. For example, if the processor has a queue between two functional stages of the pipeline, half of its slots can be given to thread 1 and the other half to thread 2. Hard partitioning is easy to implement, does not create imbalance, and keeps the program threads fully independent of one another. If all resources are partitioned this way, one processor effectively turns into two. On the other hand, a situation can arise in which one thread is not using resources that the other thread could put to work but has no right to access. Resources that could otherwise be used sit idle.
The opposite of hard partitioning is full resource sharing. In this scheme, any program thread can access any resource, and requests are serviced in the order in which they arrive. Consider a situation in which a fast thread, consisting mainly of addition and subtraction operations, coexists with a slow thread performing multiplications and divisions. If instructions are fetched from memory faster than the multiplications and divisions complete, the number of instructions fetched for the slow thread and queued in the pipeline gradually grows. Eventually these instructions fill the queue, and the fast thread stalls for lack of space in it. Full resource sharing solves the problem of resources being used sub-optimally, but creates an imbalance in their consumption: one thread can slow down or stop another.
The intermediate scheme is threshold resource sharing, under which any program thread can dynamically obtain a certain (limited) share of a resource. For replicated resources this approach provides flexibility without the threat of one thread idling because it cannot obtain resources. If, for example, each thread is forbidden to occupy more than 3/4 of the instruction queue, the slow thread's increased appetite for resources will not interfere with the execution of the fast one.
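A software analogy of the threshold scheme (purely illustrative; the real Pentium 4 enforces this in logic, not in code) is a 16-slot queue in which neither of two producer threads may hold more than 12 slots, i.e. 3/4:

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class ThresholdQueue {
    static constexpr std::size_t kSlots = 16;
    static constexpr std::size_t kPerThreadCap = 12;   // 3/4 of the queue
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::pair<int, int>> q;                 // (owner thread, payload)
    std::size_t held[2] = {0, 0};

public:
    void push(int owner, int item) {
        std::unique_lock<std::mutex> lock(m);
        // Block while the queue is full OR this thread already holds its cap.
        cv.wait(lock, [&] { return q.size() < kSlots && held[owner] < kPerThreadCap; });
        q.emplace(owner, item);
        ++held[owner];
        cv.notify_all();
    }
    int pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !q.empty(); });
        auto [owner, item] = q.front();
        q.pop();
        --held[owner];       // free a slot belonging to the original owner
        cv.notify_all();
        return item;
    }
};

int main() {
    ThresholdQueue q;
    std::thread fast([&] { for (int i = 0; i < 100; ++i) q.push(0, i); });
    std::thread slow([&] { for (int i = 0; i < 100; ++i) q.push(1, i); });
    std::thread consumer([&] { for (int i = 0; i < 200; ++i) q.pop(); });
    fast.join(); slow.join(); consumer.join();
}

However the consumer drains the queue, the slow producer can never crowd the fast one out of its last four slots.
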
The Pentium 4's hyperthreading combines different resource-sharing strategies, attempting to sidestep the problems of each. Duplication is applied to the resources that both program threads need constant access to (in particular, the program counter, the register-mapping table, and the interrupt controller). Duplicating these resources enlarges the chip area by only 5%, a quite reasonable price for multithreading. Resources available in such volume that a single thread can hardly monopolize them (for example, cache lines) are allocated dynamically. Access to the resources that control the operation of the pipeline (in particular, its numerous queues) is partitioned: each program thread gets half the slots. The main pipeline of the Netburst architecture implemented in the Pentium 4 is shown in Fig. 8.7; the white and gray areas in that illustration represent the allocation of resources between the white and gray program threads.
As the figure shows, all the queues are partitioned, with half the slots allocated to each program thread, so neither thread can constrain the other's work. The allocation and register-renaming logic is likewise partitioned. The scheduler's resources are shared dynamically, but with a threshold, so no thread can occupy all the slots of a queue. The remaining pipeline stages are fully shared.
However, multithreading is not that simple; even this advanced technique has drawbacks. Hard partitioning costs little, but dynamic partitioning, especially the threshold kind, requires tracking resource consumption at run time. Besides, in some cases programs run much better without multithreading than with it. Suppose, for example, that there are two program threads, each of which needs 3/4 of the cache to function properly. Executed in turn, each would perform well, with few cache misses (which, as we know, carry extra overhead). Executed in parallel, each would suffer far more cache misses, and the net result would be worse than without multithreading.
For more information on the Pentium 4 multithreading mechanism, see .

But as frequencies conquered new heights, raising them further became harder, since doing so drove up processor TDP. So developers began to grow processors "in width" instead, adding cores, and the concept of multi-core was born.

Literally 6-7 years ago multi-core processors were practically unheard of. Multi-core processors, from IBM for instance, had existed before, but the first dual-core processor for desktop computers appeared only in 2005 and was called the Pentium D. Also in 2005, AMD released the dual-core Opteron, but for server systems.

In this article we will not dwell on the historical facts, but will discuss modern multi-core as one of the characteristics of a CPU. Most importantly, we need to figure out what this multi-core actually delivers in terms of performance, for the processor and for you and me.

Increased performance with multi-core

The principle behind raising processor performance with several cores is to split the execution of threads (various tasks) across the cores. In short, almost every process running on your system has several threads.

Let me note right away that the operating system can create many threads and run them all seemingly at the same time even if the processor is physically single-core. This is the principle behind familiar Windows multitasking (for example, listening to music and typing at the same time).


Take an antivirus program as an example. One thread scans the computer, another updates the antivirus database (we have simplified things greatly in order to grasp the general concept).

And consider what will happen in two different cases:

a) Single-core processor. Since two threads run at the same time, we have to create for the user the appearance (visually) of this simultaneity. The operating system does something clever: it switches between executing the two threads (the switches are nearly instantaneous, and the time slices are measured in milliseconds). That is, the system "runs" the update for a bit, then abruptly switches to the scan, then back to the update. To you and me it looks as though the two tasks are executing simultaneously. But what is lost? Performance, of course. So let's look at the second option.

b) Multi-core processor. Here no such switching occurs. The system dispatches each thread to a separate core, which lets us get rid of the performance-killing switching from thread to thread (let's idealize the situation). The two threads run simultaneously; that is the principle of multi-core and multithreading. Ultimately, the scan and the update will complete much faster on a multi-core processor than on a single-core one. But there is a catch: not every program supports multi-core, and not every program can be optimized this way. In reality things are far from the ideal we have described. But every day developers release more and more programs whose code is well optimized for execution on multi-core processors.
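A toy experiment illustrating the difference (a rough sketch: busy_work is an artificial stand-in for the scan and the update, and real timings vary): run two CPU-bound tasks in two threads and time them. On one core the wall-clock time is near the sum of the two tasks; on two or more cores it approaches the time of a single task.

#include <chrono>
#include <cstdio>
#include <thread>

volatile unsigned long long sink = 0;   // prevents the loop from being optimized away

void busy_work() {
    unsigned long long acc = 0;
    for (unsigned long long i = 0; i < 200000000ULL; ++i)
        acc += i;
    sink = sink + acc;
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    std::thread scan(busy_work), update(busy_work);   // the "scan" and the "update"
    scan.join();
    update.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("both tasks finished in %lld ms\n", static_cast<long long>(ms));
}
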

Are multi-core processors necessary? An everyday perspective

When choosing a processor for a computer (specifically, when thinking about the number of cores), you should determine the main kinds of tasks it will perform.

To broaden your knowledge of computer hardware, you can read the material about processor sockets.

Dual-core processors can be called the starting point, since it makes no sense to go back to single-core solutions. But dual-core processors differ. It might be a less-than-fresh Celeron or a Core i3 on Ivy Bridge; likewise with AMD, a Sempron or a Phenom II. Naturally, owing to their other parameters their performance will differ greatly, so you need to look at the whole picture and weigh core count alongside the other processor characteristics.

For example, the Core i3 on Ivy Bridge has Hyper-Threading technology, which lets it process 4 threads simultaneously (the operating system sees 4 logical cores instead of 2 physical ones). The Celeron cannot boast of that.

But let's return to thinking about the required tasks. If a computer is needed for office work and web surfing, a dual-core processor is enough.

When it comes to gaming performance, you need 4 cores or more to be comfortable in most games. But here the catch pops up: not all games are optimized for 4-core processors, and those that are may not be optimized as efficiently as we would like. Still, in principle, a quad-core processor is currently the optimal solution for games.


As of today, the 8-core AMD processors are redundant for games; it is the number of cores that is excessive, while per-core performance falls short, but they have other advantages. Those same 8 cores help greatly in tasks that need serious, high-quality multi-threaded work, such as video rendering (encoding) or server computing. For such tasks 6, 8, or more cores are needed. And soon games will be able to load 8 or more cores properly, so looking ahead everything is quite rosy.

Do not forget that plenty of tasks still generate a single-threaded load. Ask yourself: do I need that 8-core chip or not?

Summing up, I want to note once more that the advantages of multi-core show themselves in "heavy" multi-threaded computational work. If you do not play games with exorbitant requirements and do not do specific kinds of work that demand serious computing power, then spending money on expensive multi-core processors simply makes no sense.


In this article I will try to describe the terminology used to characterize systems capable of executing several programs in parallel: multi-core, multi-processor, multi-threaded. Different kinds of parallelism appeared in the IA-32 CPU at different times and in a somewhat inconsistent order. It is easy to get confused by all of this, especially since operating systems carefully hide the details from less sophisticated application programs.

The purpose of the article is to show that, for programs running on them, the whole variety of possible configurations of multiprocessor, multi-core, and multi-threaded systems creates opportunities both for abstraction (ignoring the differences) and for exploiting the specifics (the ability to discover the configuration programmatically).

A warning about the ®, ™ signs in the article

My note explains why company employees must use these marks in public communications. In this article they had to be used quite often.

CPU

Of course, the oldest, most commonly used, and most ambiguous term is "processor".

In the modern world, a processor is the thing (the package) we buy in a beautiful retail box or a not-so-beautiful OEM wrapper: an indivisible entity inserted into a socket on the motherboard. Even if there is no socket and it cannot be removed, that is, if it is soldered down, it is still one chip.

Mobile systems (phones, tablets, laptops) and most desktops have a single processor. Workstations and servers sometimes boast two or more processors on the same motherboard.

Support for multiple CPUs in one system requires numerous design changes. At a minimum, one must provide their physical connection (several sockets on the motherboard), solve processor identification (see later in this article, as well as my note), coordinate memory accesses, and deliver interrupts (the interrupt controller must be able to route interrupts to several processors); and, of course, the operating system must play along. Unfortunately, I could not find a documented mention of the first multiprocessor system built on Intel processors, but Wikipedia claims that Sequent Computer Systems was already shipping them in 1987, using Intel 80386 processors. Widespread support for several chips in one system became available starting with the Intel® Pentium.

If there are several processors, each has its own socket on the board, and each has complete, independent copies of all resources: registers, execution units, caches. What they share is the common memory, the RAM. Memory can be connected to them in various and rather non-trivial ways, but that is a separate story beyond the scope of this article. What matters is that in any case the executing programs must be given the illusion of a homogeneous shared memory accessible from every processor in the system.


Ready for takeoff! Intel® Desktop Board D5400XS

Core

Historically, multi-core in Intel IA-32 appeared later than Intel® HyperThreading, but in the logical hierarchy it comes next.

It would seem that if a system has more processors, its performance is higher (on tasks that can use all the resources). However, if the cost of communication between them is too high, all the gains from parallelism are killed by the long delays in passing shared data around. This is exactly what is observed in multiprocessor systems: the processors are far apart both physically and logically. Communicating effectively under such conditions requires inventing specialized buses such as the Intel® QuickPath Interconnect, and of course the power consumption, size, and price of the final solution do not shrink as a result. High integration of components should come to the rescue: the circuits executing the parts of a parallel program need to be dragged closer to each other, preferably onto a single die. In other words, one processor should contain several cores, identical to each other in everything but working independently.

The first IA-32 multi-core processors from Intel were introduced in 2005. Since then, the average number of cores in server, desktop, and now mobile platforms has been growing steadily.

Unlike two single-core processors in one system, which share only the memory, two cores can also share caches and the other resources responsible for interacting with memory. Most often the first-level caches remain private (each core has its own), while the second and third levels can be either shared or private. This organization reduces the delay of data delivery between neighboring cores, especially if they are working on a common task.


Micrograph of a quad-core Intel processor codenamed Nehalem. Separate cores, a shared L3 cache, as well as QPI links to other processors and a shared memory controller are highlighted.

Hyperthread

Until about 2002, the only way to get an IA-32 system capable of executing two or more programs in parallel was to use a multiprocessor system. The Intel® Pentium® 4, as well as the Xeon line codenamed Foster (Netburst), introduced a new technology, hyperthreads: Intel® HyperThreading (hereinafter HT).

There is nothing new under the sun: HT is a special case of what the literature calls simultaneous multithreading (SMT). Unlike "real" cores, which are complete and independent copies, with HT only some of the internal units, primarily those storing the architectural state (the registers), are duplicated within one processor. The execution units responsible for organizing and processing the data exist in a single copy and at any moment are used by at most one of the threads. Like cores, hyperthreads share the caches, though starting from which level depends on the particular system.

I will not try to cover all the pros and cons of SMT designs in general and HT designs in particular; the interested reader can find a fairly detailed discussion of the technology in many sources, Wikipedia among them. However, I will note the following important point, which explains the current limits on the number of hyperthreads in real products.

Thread Limits
In what cases is the presence of "dishonest" multi-core in the form of HT justified? If one application thread cannot load all the execution units inside the core, they can be "lent" to another thread. This is typical of applications whose bottleneck is not computation but data access, that is, ones that frequently generate cache misses and have to wait for data to be delivered from memory. During that time, a core without HT would be forced to idle. HT allows the free execution units to be switched quickly to another architectural state (which is exactly what is duplicated) and execute its instructions. This is a special case of a trick called latency hiding, in which one long operation, during which useful resources sit idle, is masked by the parallel execution of other tasks. If the application already utilizes the core's resources heavily, hyperthreading will yield no speedup; "honest" cores are needed here.

Typical scenarios for desktop and server applications designed for general-purpose machine architectures do have potential for parallelism exploitable via HT, but this potential is quickly "used up". Perhaps for this reason, on almost all IA-32 processors the number of hardware hyperthreads does not exceed two: in typical scenarios the gain from three or more hyperthreads would be small, while the loss in die size, power consumption, and cost would be considerable.

The situation is different in the typical tasks run on video accelerators, so those architectures tend to use SMT with a larger number of threads. Since the Intel® Xeon Phi coprocessors (introduced in 2010) are ideologically and genealogically quite close to video cards, they have four hyperthreads per core, a configuration unique to IA-32.

Logical processor

Of the three "levels" of parallelism described (processors, cores, hyperthreadings), some or even all of them may be missing in a particular system. This is influenced BIOS settings(multi-core and multi-threading are disabled independently), microarchitecture considerations (for example, HT was absent from the Intel® Core™ Duo, but was brought back with the release of Nehalem), and system events (multi-processor servers can turn off failed processors in case of malfunctions and continue to fly) on the rest). How is this multi-layered zoo of concurrency visible to the operating system and, ultimately, to applications?

For convenience, let us denote the number of processors, cores, and threads in a system by the triple (x, y, z), where x is the number of processors, y is the number of cores in each processor, and z is the number of hyperthreads in each core. Hereafter I will call this triple the topology, an established term that has little to do with the branch of mathematics of the same name. The product p = xyz defines the number of entities called the system's logical processors. It defines the total number of independent application process contexts in a shared-memory system that execute in parallel and that the operating system is forced to reckon with. I say "forced" because it cannot control the execution order of two processes sitting on different logical processors. This also applies to hyperthreads: although they run "sequentially" on one core, the specific order is dictated by the hardware and is neither observable nor controllable by programs.

Most often, the operating system hides the physical topology of the system it runs on from end applications. For example, it will present each of the following three topologies: (2, 1, 1), (1, 2, 1), and (1, 1, 2), as two logical processors, although the first has two processors, the second two cores, and the third merely two threads.


The Windows task manager shows 8 logical processors; but how many processors, cores, and hyperthreads is that?


Linux top shows 4 logical processors.

This is quite convenient for application developers: they rarely have to deal with hardware features that are insignificant for them.
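The portable C++ view of the machine is exactly this flattened number; a minimal sketch:

#include <iostream>
#include <thread>

int main() {
    // Reports p = x*y*z logical processors (or 0 if the count cannot be
    // determined), with no hint of how it decomposes into packages,
    // cores, and hyperthreads.
    std::cout << std::thread::hardware_concurrency()
              << " logical processors\n";
}
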

Determining the topology in software

Of course, abstracting the topology into a single number of logical processors in some cases creates plenty of grounds for confusion and misunderstanding (in heated Internet disputes). Computing applications that want to squeeze maximum performance out of the hardware demand fine-grained control over where their threads are placed: closer together, on adjacent hyperthreads, or on the contrary farther apart, on different processors. The speed of communication between logical processors within one core or processor is much higher than the speed of data transfer between processors, and the possibility of non-uniformity in the organization of RAM complicates the picture further.

Information about the topology of the system as a whole, and about the position of each logical processor in it, is available on IA-32 through the CPUID instruction. Since the advent of the first multiprocessor systems, the logical-processor identification scheme has been extended several times; today parts of it live in leaves 1, 4, and 11 of CPUID. Which leaf to consult can be determined from the following flowchart, taken from the article:

I will not bore you here with all the details of the individual parts of this algorithm; if there is interest, the next part of this article can be devoted to it. I refer the interested reader to , where the question is analyzed in as much detail as possible. Here I will first briefly describe what the APIC is and how it relates to topology, and then look at working with leaf 0xB (eleven in decimal), which is at the moment the latest word in "APIC construction".

APIC ID
The local APIC (advanced programmable interrupt controller) is a device (now part of the processor) responsible for handling the interrupts arriving at a particular logical processor. Each logical processor has its own APIC, and each APIC in the system must have a unique APIC ID value. This number is used by the interrupt controllers for addressing when delivering messages, and by everyone else (the operating system, for example) for identifying logical processors. The specification of this interrupt controller has evolved from the Intel 8259 PIC through Dual PIC, APIC, and xAPIC to x2APIC.

At present, the width of the number stored in the APIC ID has reached a full 32 bits, though in the past it was limited to 16, and earlier still to just 8 bits. Today remnants of the old days are scattered all over CPUID, but CPUID.0xB returns all 32 bits of the APIC ID in EDX. Each logical processor that independently executes the CPUID instruction will report a different value.

Clarification of family ties
The APIC ID value by itself says nothing about the topology. To find out which two logical processors sit inside one physical one (i.e., are hyperthread "siblings"), which two sit inside the same processor, and which are in entirely different processors, their APIC ID values have to be compared. Depending on the degree of kinship, some of their bits will coincide. This information is contained in the sub-leaves of CPUID.0xB, which are selected by an operand in ECX. Each sub-leaf describes, in EAX, the position of the bit field of one topology level (more precisely, the number of bits by which the APIC ID must be shifted right to strip off the lower topology levels), and, in ECX, the type of that level: hyperthread, core, or processor.

Logical processors within one core will have identical APIC ID bits everywhere except the SMT field; logical processors within one processor, everywhere except the Core and SMT fields. Since the number of sub-leaves of CPUID.0xB can grow, this scheme will allow topologies with more levels to be described should the need arise, and even intermediate levels to be introduced between existing ones.
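A minimal sketch of such an enumeration (assuming GCC or Clang on an x86 machine whose CPUID supports leaf 0xB; the field layout follows the description above):

#include <cpuid.h>
#include <cstdio>

int main() {
    if (__get_cpuid_max(0, nullptr) < 0x0B) {
        std::puts("CPUID leaf 0xB is not supported on this CPU");
        return 1;
    }
    unsigned eax, ebx, ecx, edx;
    for (unsigned sub = 0; ; ++sub) {
        __cpuid_count(0x0B, sub, eax, ebx, ecx, edx);
        unsigned type = (ecx >> 8) & 0xFF;   // 0 = invalid, 1 = SMT, 2 = core
        if (type == 0)
            break;                           // no more topology levels
        // EAX[4:0]: how far to shift the APIC ID right to strip this level.
        std::printf("level %u: type = %s, shift = %u\n",
                    sub, type == 1 ? "SMT" : "core", eax & 0x1F);
    }
    // EDX holds the executing logical processor's full 32-bit x2APIC ID.
    std::printf("x2APIC ID of this logical processor: %u\n", edx);
    return 0;
}

Masking off the low "shift" bits of two APIC IDs and comparing the results answers the kinship question directly.
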

An important consequence of this scheme is that the set of all APIC IDs of all logical processors in the system may contain "holes", i.e., the IDs need not run sequentially. For example, on a multi-core processor with HT disabled, all APIC IDs may turn out to be even, because the least significant bit, which encodes the hyperthread number, will always be zero.

Note that CPUID.0xB is not the only source of information about logical processors available to the operating system. The list of all processors available to it, along with their APIC ID values, is also encoded in the MADT ACPI table.

Operating systems and topology

Operating systems provide logical processor topology information to applications through their own interfaces.

On Linux, topology information is available in the /proc/cpuinfo pseudo-file, as well as in the output of the dmidecode command. In the example below I filter the contents of cpuinfo on a system with four logical processors, leaving only the topology-related entries:


$ cat /proc/cpuinfo | grep "processor\|physical\ id\|siblings\|core\|cores\|apicid"
processor       : 0
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
processor       : 1
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 1
initial apicid  : 1
processor       : 2
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
processor       : 3
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3

In FreeBSD, the topology is reported via the sysctl mechanism in the kern.sched.topology_spec variable as XML:


$ sysctl kern.sched.topology_spec
kern.sched.topology_spec:
<groups>
 <group>
  <cpu>0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <children>
   <group>
    <cpu>0, 1, 2, 3, 4, 5, 6, 7</cpu>
    <children>
     <group>
      <cpu>0, 1</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group>
      <cpu>2, 3</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group>
      <cpu>4, 5</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group>
      <cpu>6, 7</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
    </children>
   </group>
  </children>
 </group>
</groups>

In MS Windows 8, topology information can be seen in the Task Manager.

Introduction. Computer technology is developing at a rapid pace. Computing devices are becoming more powerful, smaller, and more convenient, but raising device performance has lately become a big problem. In 1965, Gordon Moore (one of Intel's founders) concluded that "the number of transistors placed on an integrated-circuit chip doubles every 24 months."

The first work on creating multiprocessor systems began in the 1970s. For a long time, the performance of conventional single-core processors was raised by increasing the clock frequency (up to 80% of performance was determined by clock frequency alone) while simultaneously increasing the number of transistors on the chip. Fundamental laws of physics stopped this process: chips began to overheat, and the process technology began to approach the size of silicon atoms. These factors led to the following:

  • Leakage currents grew, and with them heat dissipation and power consumption.
  • The processor became much "faster" than memory; performance suffered from the latency of accessing RAM and loading data into the cache.
  • The notion of the "von Neumann bottleneck" arose, denoting the inefficiency of the processor architecture when executing a program.

Multiprocessor systems (as one way to solve the problem) did not become widespread, since they required expensive and hard-to-manufacture multiprocessor motherboards. So performance was raised by other means. The concept of multithreading, the simultaneous processing of several instruction streams, proved effective.

Hyper-Threading Technology (HTT) is a super-threading technology that allows a processor to run multiple program threads on a single core. According to many experts, it was HTT that became the prerequisite for creating multi-core processors. A processor's execution of several program threads at once is called thread-level parallelism (TLP).

To unleash the potential of a multi-core processor, an executable program must engage all the computing cores, which is not always achievable. Old serial programs able to use only one core will not run faster on a new generation of processors, so programmers are becoming ever more involved in the development of new microprocessors.
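A sketch of the kind of restructuring this demands (illustrative only): a summation that a serial program would run on one core, split evenly across all available logical processors.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1 << 22, 1);
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long long> partial(n, 0);
    std::vector<std::thread> pool;
    std::size_t chunk = data.size() / n;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == n) ? data.size() : begin + chunk;
        // Each core sums its own slice; no two threads share any element.
        pool.emplace_back([&, begin, end, t] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& th : pool) th.join();
    std::cout << std::accumulate(partial.begin(), partial.end(), 0LL) << "\n";
}
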

1. General concepts

Architecture in a broad sense is a description of a complex system consisting of many elements.

Semiconductor structures (microcircuits) evolve as they develop; therefore the principles of processor construction, the number of elements they comprise, and the way their interaction is organized change constantly. CPUs with the same basic structural principles are said to be processors of the same architecture, and those principles themselves are called the processor's architecture (or microarchitecture).

The microprocessor (or processor) is the main component of a computer. It processes information, executes programs, and controls other devices in the system. The power of the processor determines how fast programs will run.

The core is the basis of any microprocessor. It consists of millions of transistors located on a silicon die. The microprocessor is divided into special cells called general-purpose registers (GPRs). The work of the processor on the whole consists of fetching instructions and data from memory in a certain order and executing them. In addition, to raise the PC's speed, the microprocessor is equipped with an internal cache: the cache is the processor's internal memory, used as a buffer (to smooth over delays in communicating with RAM).

The Intel processors used in IBM-compatible PCs have more than a thousand instructions and belong to the processors with an extended instruction set: CISC processors (CISC - Complex Instruction Set Computing).

1.1 High performance computing. Parallelism

The pace of development of computing technology is easy to trace: from ENIAC (the first general-purpose electronic digital computer), with a performance of several thousand operations per second, to the Tianhe-2 supercomputer (1000 trillion floating-point operations per second). This means the speed of computation has grown a trillion times in 60 years. Creating high-performance computing systems is one of the most difficult scientific and technical tasks. While the computational speed of the hardware itself has grown only a few million times, the overall speed of computation has grown a trillion times. This effect is achieved by using parallelism at all stages of computation. Parallel computing requires finding a rational distribution of memory, reliable ways of transferring information, and coordination of the computational processes.

1.2 Symmetric multiprocessing

Symmetric multiprocessing (SMP) is an architecture of multiprocessor systems in which several processors have access to a common shared memory. It is a very widespread architecture, widely used in recent times.

With SMP, several processors work in the computer simultaneously, each on its own task. An SMP system with a good operating system rationally distributes tasks among the processors, ensuring an even load on each. A problem arises with memory access, however, since even uniprocessor systems need a relatively long time for it; accesses to RAM in SMP thus happen sequentially: first one processor, then another.

Owing to the features listed above, SMP systems are used almost exclusively in science, industry, and business, and extremely rarely in ordinary offices. Besides the high cost of the hardware implementation, such systems need very expensive, high-quality software that supports multi-threaded execution of tasks. Ordinary programs (games, text editors) will not work effectively on SMP systems because they lack that degree of parallelism. Moreover, if you adapt a program for an SMP system, it becomes extremely inefficient on single-processor systems, which forces you to maintain several versions of the same program for different systems. An exception is, for example, ABLETON LIVE (a program for creating music and preparing DJ sets), which supports multiprocessor systems. Even a regular program will run a little faster on a multiprocessor system than on a single processor, because so-called hardware interrupts (which suspend the program for handling by the kernel) can be serviced on another, free processor.

An SMP system (like any other system based on parallel computing) places heightened demands on the memory-bus bandwidth. This often limits the number of processors in a system: modern SMP systems work effectively with up to 16 processors.

Since the processors share memory, it must be used rationally and the data kept coordinated. In a multiprocessor system, several caches end up serving a shared memory resource. Cache coherence is the property of caches that ensures the integrity of data stored in the individual caches of a shared resource. This concept is a special case of memory coherence, in which several cores have access to a common memory (ubiquitous in modern multi-core systems). Described in general terms, the picture is this: the same block of data can be loaded into different caches, where the data is processed differently.

If no data-change notifications are used, an error will occur. Cache coherence is designed to resolve such conflicts and keep the data in the caches consistent.
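An illustrative sketch of the software side of this contract: the coherence protocol keeps the cores' copies of a cache line consistent, but the program must still order its own accesses, here with std::atomic (with a plain int the result would be unpredictable):

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<long> counter{0};
    // Both threads hammer the same cache line; fetch_add makes every
    // increment an indivisible read-modify-write.
    auto work = [&] {
        for (int i = 0; i < 1000000; ++i)
            counter.fetch_add(1);
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::cout << counter << "\n";   // always 2000000
}
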

SMP systems belong to the MIMD (Multiple Instruction, Multiple Data) class in Flynn's classification of computing systems (Flynn is a Stanford University professor and co-founder of Palyn Associates). By this classification, almost all kinds of parallel systems can be assigned to MIMD.

Multiprocessor systems are divided into types according to the principle of memory use. This approach distinguishes the following important types of multiprocessor systems: multiprocessors (systems with common, shared memory) and multicomputers (systems with separate memory). Shared data used in parallel computing require synchronization; the task of data synchronization is one of the most important problems, and solving it in the design of multiprocessor and multi-core systems, and accordingly of the necessary software, is a priority for engineers and programmers. Data can also be shared while physically distributing the memory; this approach is called non-uniform memory access (NUMA).

These systems include:

  • Systems where only the processors' individual caches are used to represent data (cache-only memory architecture).
  • Systems with cache coherence of the local caches of different processors (cache-coherent NUMA).
  • Systems that provide shared access to the processors' individual memory without hardware-level cache coherence (non-cache-coherent NUMA).

The problem of building multiprocessor systems can be simplified by using distributed shared memory, but this method considerably increases the complexity of parallel programming.

1.3 Simultaneous multithreading

Given all the above disadvantages of symmetric multiprocessing, it makes sense to develop other ways of improving performance. If you analyze the operation of each individual transistor in the processor, you notice a very interesting fact: most computational operations do not involve all the processor's components (according to recent studies, about 30% of all transistors). Thus, if the processor is performing, say, a simple arithmetic operation, most of it sits idle and could be used for other calculations. So, if the processor is currently executing floating-point operations, an integer arithmetic operation can be loaded into the free part. To increase the load on the processor, one can add speculative (lookahead) execution of operations, which greatly complicates the processor's hardware logic. If the program defines in advance threads (sequences of instructions) that can be executed independently of each other, the task is considerably simplified (this method is easily implemented in hardware). This idea, due to Dean Tullsen (who developed it in 1995 at the University of Washington), is called simultaneous multithreading. It was later developed by Intel under the name hyperthreading. Thus, one processor executing many threads is perceived by the Windows operating system as several processors. Using this technology again requires an appropriate level of software. The maximum effect of multithreading technology is about 30%.

1.4 Multi-core

Multithreading technology is the implementation of multi-core at the software level. A further increase in performance, as always, requires changes to the processor's hardware. Complicating systems and architectures is not always effective; there is an opposite view: "everything ingenious is simple!" Indeed, to raise a processor's performance it is not at all necessary to raise its clock frequency or complicate the logical and hardware components; it is enough to rationalize and refine the existing technology. This path is very profitable: there is no need to fight increased processor heat dissipation or to develop new, expensive fabrication equipment. This approach was implemented within the framework of multi-core technology: several computing cores realized on one chip. Taking the original processor and comparing the performance gains from the various ways of boosting it, it is clear that multi-core technology is the best option.

If we compare the architectures of a symmetric multiprocessor and a multi-core processor, they turn out to be almost identical. The cores' cache memory can be multi-level (local and shared, and data from RAM can be loaded directly into the second-level cache). Given the advantages of the multi-core processor architecture, manufacturers focus on it. The technology proved quite cheap to implement and universal, which made it possible to bring it to the broad market. Moreover, this architecture has made its own adjustment to Moore's law: "the number of computing cores in a processor will double every 18 months."

Looking at today's computer-hardware market, one can see that four- and eight-core processors dominate. Processor manufacturers say, moreover, that processors with hundreds of computing cores will soon reach the market. As has been said repeatedly before, the full potential of a multi-core architecture is revealed only with high-quality software. Thus, the spheres of computer hardware and software production are very closely linked.

For the information industry, the beginning of the 21st century coincided with shifts that can be described as "tectonic". Signs of the new era include the use of service-oriented architectures (SOA), cluster configurations, and much, much else, including multi-core processors. But the fundamental cause of what is happening is, of course, the progress of semiconductor physics, which has brought growth in the number of logic elements per unit area, obeying Gordon Moore's law. The number of transistors on a chip already runs into the hundreds of millions and will soon pass the billion mark, and so the well-known law of dialectics postulating the relationship between quantitative and qualitative change inevitably manifests itself. Under the changed conditions a new category, complexity, comes to the fore, and systems become complex both at the micro level (processors) and at the macro level (corporate information systems).

To some extent, what is happening in the modern computer world can be compared with the evolutionary transition that occurred millions of years ago, when multicellular organisms appeared. By then the complexity of the single cell had reached a certain limit, and subsequent evolution followed the path of developing infrastructural complexity. The same is happening with computer systems: the complexity of a single processor core, like the monolithic architecture of corporate information systems, has reached a certain maximum. Now, at the macro level, there is a transition from monolithic systems to component systems (ones composed of services), with developers' attention focused on infrastructure middleware, while at the micro level new processor architectures are emerging.

Quite recently the idea of complexity began to lose its everyday meaning and turn into an independent factor. Complexity has not yet been fully comprehended, and the attitude toward it has not been fully defined, although, oddly enough, for almost 50 years there has been a separate scientific discipline called "the theory of complex systems". (Recall that in this theory "complex" refers to a system whose individual components are combined in a non-linear way; such a system is not simply the sum of its components, as happens in linear systems.) One can only wonder that systems theory has not yet been taken up by the specialists and companies whose work leads them to create such complex systems by means of information technology.

Bottleneck of von Neumann architecture

At the micro level, the transition from single-core processors to multi-core ones (Chip MultiProcessors, CMP) can be seen as analogous to the transition from unicellular to multicellular organisms. CMP offers one way to overcome the inherent weakness of modern processors: the "bottleneck" of the von Neumann architecture.

Here is what John Backus said at the Turing Award ceremony in 1977: "What is a von Neumann computer? When John von Neumann and others proposed their original architecture 30 years ago, the idea seemed elegant and practical and simplified a whole range of engineering and programming problems. And although the conditions that existed at the time of its publication have since changed radically, we still identify our notions of computers with that old concept. In its simplest form, a von Neumann computer consists of three parts: a central processing unit (CPU), memory, and a channel connecting them that serves to exchange data between the CPU and memory, in small portions (one word at a time). I propose to call this channel the 'von Neumann bottleneck'. Surely there must be a less primitive solution than pumping a huge amount of data through this narrow bottleneck. Such a channel not only creates a traffic problem; it is also an 'intellectual bottleneck' that imposes word-at-a-time thinking on programmers, keeping them from thinking in higher conceptual categories."

Backus became famous above all for creating Fortran in the mid-1950s, which for the following few decades was the most popular tool for writing computational programs. But later, apparently, Backus came to a deep awareness of its weaknesses and realized that he had developed "the most von Neumann-like language" of all high-level languages. Hence the main thrust of his criticism was aimed primarily at imperfect programming methods.

Since Backus's speech there has been notable progress in programming: functional and object-oriented technologies appeared, and with their help what Backus called the "intellectual von Neumann bottleneck" was overcome. However, the architectural root cause of the phenomenon, the congenital disease of the channel between memory and processor, its limited bandwidth, has not gone away despite 30 years of technological progress. Over the years the problem has only worsened, since memory speed grows far more slowly than processor performance and the gap between them widens.

The von Neumann computer architecture is not the only one possible. From the point of view of how the exchange of instructions between processor and memory is organized, all computers can be divided into four classes:

  • SISD (Single Instruction, Single Data): one instruction stream, one data stream;
  • SIMD (Single Instruction, Multiple Data): one instruction stream, many data streams;
  • MISD (Multiple Instruction, Single Data): many instruction streams, one data stream;
  • MIMD (Multiple Instruction, Multiple Data): many instruction streams, many data streams.

This classification shows that the von Neumann machine is a special case falling into the SISD category. Possible improvements within SISD are limited to adding pipelines and other extra functional units, plus various caching methods. Two other categories of architecture (SIMD, which includes vector processors, and the pipelined MISD architectures) were implemented in several projects but never became mainstream. Staying within this classification, the only way past the limits of the "bottleneck" is to develop architectures of the MIMD class, which admit many approaches: various parallel and cluster architectures as well as multi-threaded processors.

A few years ago, owing to technological limitations, all multi-threaded processors were built around a single core, and such multithreading was called "simultaneous": Simultaneous MultiThreading (SMT). With the advent of multi-core processors, an alternative kind of multithreading appeared: Chip MultiProcessors (CMP).

Features of multi-threaded processors

The transition from simple single-threaded processors to logically more complex multi-threaded ones involves overcoming specific difficulties not encountered before. The functioning of a device in which the execution process is divided into agents, or threads, has two distinctive features:

  • the principle of nondeterminism: in a multithreaded application, the process splits into interacting agent threads without any predetermined ordering;
  • the principle of uncertainty: exactly how resources will be distributed among the agent threads is likewise unknown in advance.

Because of these features, the operation of a multithreaded processor differs fundamentally from deterministic computation by the von Neumann scheme: the current state of the process cannot be defined as a linear function of the previous state and the input data, even though each individual process can be regarded as a von Neumann micro-machine. (Applied to the behavior of threads, one could even use the term "strangeness", borrowed from quantum physics.) These features bring the multithreaded processor closer to the notion of a complex system, yet from a purely practical standpoint it is clear that at the level of program execution there can be no question of nondeterminism or uncertainty, let alone strangeness. A correctly executing program cannot be strange.

In the most general view, a multithreaded processor works with primitives of two types. The first is a resource that supports thread execution, controlled via a mutex (from Mutual Exclusion); the second is events. How a particular mutex is physically realized depends on the chosen scheme, SMT or CMP. In either case, process execution comes down to the next thread capturing the mutex for the duration of its work and then releasing it. While a mutex is occupied by one thread, another thread cannot acquire it. The particular procedure for passing mutex ownership from one thread to another can be random; it depends on how the control is implemented, for instance in a particular operating system. In any case, control must be organized so that the resources guarded by mutexes are allocated correctly and the effect of uncertainty is suppressed.
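A minimal sketch of a mutex in exactly those terms (using the C++ standard library; the "resource" here is just a shared counter): whichever thread captures the mutex holds the resource for the duration of its work, and the other cannot enter until it is released.

#include <iostream>
#include <mutex>
#include <thread>

std::mutex resource;      // guards the shared resource below
long shared_value = 0;

void agent() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> hold(resource);  // capture the mutex
        ++shared_value;                              // use the guarded resource
    }                                                // released on scope exit
}

int main() {
    std::thread t1(agent), t2(agent);
    t1.join();
    t2.join();
    std::cout << shared_value << "\n";   // always 200000
}
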

Events are objects that signal changes in the external environment. A thread can suspend itself until some event occurs, or it can signal its own state to another thread. This is how events interact with each other, and continuity of data between events must be ensured: the waiting agent has to be informed that the data is ready for it. And just as the effect of uncertainty must be suppressed when distributing mutexes, it must be suppressed when working with events. The SMT scheme was first implemented in the Compaq Alpha 21464 processors, as well as in the Intel Xeon MP and Itanium.
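A sketch of an event in the same spirit (again with standard C++; a condition variable plays the role of the event): the waiting agent sleeps until it is told the data is ready, instead of polling for it.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable data_ready;   // the "event"
bool ready = false;
int payload = 0;

int main() {
    std::thread consumer([] {
        std::unique_lock<std::mutex> lock(m);
        data_ready.wait(lock, [] { return ready; });  // standby until signaled
        std::cout << "got " << payload << "\n";
    });
    std::thread producer([] {
        {
            std::lock_guard<std::mutex> lock(m);
            payload = 42;             // prepare the data
            ready = true;
        }
        data_ready.notify_one();      // inform the waiting agent
    });
    producer.join();
    consumer.join();
}
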

Logically, CMP is simpler: parallelism is provided by each thread being handled by its own core. But if an application cannot be divided into threads, it is (absent special measures) handled by one core, and overall processor performance is then limited by the speed of that core. At first glance a processor built on the SMT scheme is more flexible, and hence preferable. But that holds only at low transistor densities. Once frequencies are measured in gigahertz and the number of transistors on a chip approaches a billion, so that signal-propagation delays exceed switching times, the advantage goes to the CMP microarchitecture, in which the related computing elements are localized.

However, physical parallelization makes CMP rather inefficient at sequential computation. To overcome this shortcoming, an approach called speculative multithreading is used. In Russian the word "speculative" has a negative connotation, so we will call such multithreading "conditional". This approach assumes hardware or software support for dividing a sequential application into conditional threads, coordinating their execution, and integrating the results in memory.

Evolution of the CMP

The first mass-produced CMP processors were aimed at the server market. Regardless of vendor, they were essentially two independent superscalar processors on one substrate. The main motivation behind such designs is to reduce volume so that more processors can be "packed" into a given enclosure, raising the power density per unit volume (critical for today's data centers). Some additional savings then accrue at the overall system level, since the processors on one chip use common system resources such as high-speed links. Usually, adjacent processors share only a common system interface (Fig. 1, b).

Advocates of CMP processors justify the further increase in the number of cores (beyond two) by the character of the server workload, which distinguishes this class of machine from embedded systems and systems designed for massive computing. A server is required to deliver high aggregate throughput, while the latency of an individual request is less critical. A trivial example: a user may simply not notice a millisecond delay in the appearance of an updated Web page, but is quite sensitive to server overload, which can cause service outages.

The specifics of the load give CMP processors another noticeable advantage. For example, by replacing a single-core processor with a dual-core one, you can halve the clock frequency while keeping the same throughput. The processing time of a single request may then, in theory, double, but since the physical separation of threads relaxes the bottleneck of the von Neumann architecture, the total delay grows by much less than a factor of two. With the lower frequency and complexity of each core, energy consumption drops significantly, and as the number of cores grows, these arguments in favor of CMP only become stronger. The next logical step is therefore to assemble several cores and give them a common cache memory, as in the Hydra project (Figure 1, c). One can then make the cores themselves more complex and multi-threaded, which is what the Niagara project did (Figure 1, d).
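
The energy argument can be made concrete with the textbook dynamic-power model P ≈ C·V²·f; the model and the numbers below are our illustration, as the article itself gives no formula. Halving frequency also permits a lower supply voltage, so power falls superlinearly:

    // Back-of-the-envelope power comparison under the dynamic-power model
    // P ~ C * V^2 * f (an assumed textbook model, not from the article).
    #include <iostream>

    int main() {
        double C = 1.0;                       // effective capacitance (normalized)
        double V1 = 1.0, f1 = 1.0;            // one core at full frequency
        double P_single = C * V1 * V1 * f1;

        // Two cores at half frequency; assume the lower frequency permits
        // a supply voltage of 0.8 of nominal (an illustrative value).
        double V2 = 0.8, f2 = 0.5;
        double P_dual = 2 * C * V2 * V2 * f2;

        std::cout << "single-core power: " << P_single << '\n'   // 1.0
                  << "dual-core power:   " << P_dual   << '\n';  // 0.64
    }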

The complexity of processors has another important manifestation. Designing a product with billions of components is becoming an ever more labor-intensive task, despite the use of automation tools. It is telling that the IA-64 architecture has now been "polished" for more than a decade. Designing a CMP processor is much simpler: given a mature core, it can be replicated in the required quantity, and design work reduces to creating the internal infrastructure of the die. Moreover, the uniformity of the cores simplifies the design of motherboards, which comes down to scaling; ultimately only the parameters of the I/O subsystems change.

Despite these arguments, there is not yet sufficient evidence to state unambiguously that CMP is superior to SMT. Experience in building SMT processors is far greater: since the mid-1980s, several dozen experimental products and several production processors have been created. The history of CMP is still short: if the Texas Instruments TMS 320C8x family of specialized signal processors is set aside, the first successful project was Hydra, developed at Stanford University. Among university research projects aimed at building CMP processors, three more are known: Wisconsin Multiscalar, Carnegie-Mellon Stampede, and MIT M-machine.

Hydra microprocessor

The Hydra die consists of four processor cores based on the well-known MIPS RISC architecture. Each core has its own instruction cache and data cache, and all cores share a common L2 cache. The processors execute the usual MIPS instruction set plus the Store Conditional (SC) instruction for implementing synchronization primitives. The processors and the L2 cache are connected by read/write buses, supplemented by auxiliary address and control buses. All of these buses are virtual: logically each behaves as a single wired bus, while physically it is divided into many segments by repeaters and buffers, which keeps the cores running at full speed.
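
The point of Store Conditional is that the store succeeds only if no other core has touched the location since the matching load, which is exactly what synchronization primitives need. As a portable analogue (a sketch, not Hydra code; on MIPS-family targets a compare-and-exchange typically compiles to an ll/sc loop), a spinlock might look like this:

    // Sketch of a spinlock of the kind that Load Linked / Store Conditional
    // instruction pairs make possible. compare_exchange_weak is a portable
    // analogue that typically compiles to an ll/sc loop on MIPS targets.
    #include <atomic>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            bool expected = false;
            // Retry until the "store conditional" succeeds: the store takes
            // effect only if no other core touched the location in between.
            while (!locked.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire)) {
                expected = false; // compare_exchange rewrites it on failure
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };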

The read/write bus plays the role of a system bus. Because it is located on the die, it has enough throughput to exchange data with the cache memory in a single cycle. Such exchange performance is hard to achieve even in the most expensive traditional multiprocessor architectures because of physical limits on the number of external processor pins. The efficient cache buses prevent a bottleneck from forming between the cores and memory.

Testing Hydra on workloads with explicit parallelism, typical of Web and server applications, showed that the performance of four cores, compared to one, increases by a factor of 3 to 3.8, that is, almost linearly. This gives reason to believe that processors of this type will "fit" quite successfully into applications that today use servers with an SMP architecture. But it is also clear that the processor must handle sequential applications efficiently, so one of the most important tasks is implementing conditional multithreading. In Hydra it is implemented at the hardware level, a choice justified by the fact that it imposes no additional parallel-programming costs on the programmer.

Conditional multithreading is based on splitting a program's instruction sequence into threads that can be executed in parallel. Naturally, such threads may be logically interdependent, and a special synchronization mechanism is built into the processor to coordinate them. The essence of its operation is that if some thread needs data from a parallel thread and the data is not yet ready, the execution of that thread is suspended. In effect, the elements of nondeterminism discussed above manifest themselves here. The synchronization process is quite complex, since all possible dependencies between threads and all synchronization conditions must be identified. Conditional parallelization makes it possible to parallelize programs without prior knowledge of their properties. Importantly, the synchronization mechanism is dynamic: it works without intervention by a programmer or a compiler, which is capable only of static division of an application into threads. Tests of the model on various benchmarks showed that conditional multithreading can increase processor performance severalfold, and the more explicit the parallelism a benchmark exhibits, the smaller this gain.
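
In software terms, this suspend-until-ready discipline is what a future/promise pair expresses. A minimal sketch of the idea (the producer/consumer names and the computed value are our illustration, a software analogue of Hydra's hardware mechanism):

    // Sketch of inter-thread dependency handling: a thread that needs a value
    // not yet produced by a parallel thread is suspended until it is ready.
    #include <future>
    #include <iostream>
    #include <thread>

    int main() {
        std::promise<int> dependency;
        std::future<int> value = dependency.get_future();

        std::thread producer([&] {
            int result = 6 * 7;              // some computation in thread A
            dependency.set_value(result);    // publish the result
        });
        std::thread consumer([&] {
            // Blocks here until the producer has delivered the data.
            std::cout << "dependent value: " << value.get() << '\n';
        });

        producer.join();
        consumer.join();
    }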

In 2000, the company Afara was founded in strict secrecy by Professor Kunle Olukotun of Stanford University and the well-known processor designer Les Kohn, who had worked at Intel and Sun Microsystems. Kohn was one of the authors of the i860 and i960 RISC processors at the former and of UltraSPARC-I at the latter. Under his leadership, Hydra was redesigned around processor cores based on the SPARC architecture. In 2002, Afara was bought by Sun Microsystems; that was the end of the Hydra project and the beginning of Niagara.

Niagara - "fusion" of MAJC and Hydra

The UltraSPARC T1 processor, better known as Niagara, has two main predecessors - Hydra and MAJC.

In the mid-1990s, on the wave of enthusiasm for specialized Java processors, Sun Microsystems attempted to create a Very Long Instruction Word (VLIW) processor. The initiative was called MAJC (Microprocessor Architecture for Java Computing). As in other projects launched at the time (such as Intel's IA-64 Itanium), the idea was to move some of the most complex operations into the compiler. The freed-up transistor logic could then be used to build more efficient functional units, providing a productive exchange of commands and data between the CPU, cache memory, and main memory; in this way the von Neumann bottleneck was to be overcome.

MAJC differed from most processors in its lack of specialized coprocessors (subunits), the usual name for functional units dedicated to operations on integers, floating-point numbers, or multimedia data. In MAJC, all functional units were identical and able to perform any operation, which on the one hand reduced the efficiency of individual operations but on the other hand increased the utilization of the processor as a whole.

Niagara embodies the best of the two alternative approaches to multithreading, SMT and CMP. At first glance it looks very similar to Hydra, though it would be more accurate to call Hydra a "mock-up" of Niagara: besides having twice as many cores, the latter can process four threads on each of them.

The Niagara processor provides hardware support for executing 32 threads, divided into eight groups of four. Each group has its own processing channel, a SPARC pipeline (Figure 2), which is a processor core built to the SPARC V9 architecture. Each SPARC pipeline contains a first-level cache for instructions and data. Together, the 32 threads share a 3 MB L2 cache divided into four banks. A switch connects the eight cores, the L2 cache banks, and the other shared CPU resources, and supports a transfer rate of 200 GB/s. In addition, the switch has a port for the I/O subsystem and channels to DDR2 DRAM memory with an exchange rate of 20 GB/s; the maximum memory capacity is 128 GB.

The Niagara project targets the Solaris operating system, so all applications that run under Solaris can run on the new processor without any changes. Application software perceives Niagara as 32 discrete processors.
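
From the application's point of view this is easy to picture: a portable concurrency query would report 32 logical processors on such a machine, and a naive server could simply start one worker per reported processor. A sketch (the worker body is a placeholder):

    // Sketch: how application software "sees" a Niagara-class machine.
    // hardware_concurrency() reports the number of logical processors the
    // OS exposes (32 here); it may return 0 if the value is unknown.
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "logical processors: " << n << '\n';

        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([i] { (void)i; /* placeholder: serve requests */ });
        for (auto& w : workers) w.join();
    }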

Cell project

IBM proposed its own approach to building multi-core processors with the Cell project, which it describes as a "heterogeneous chip multiprocessor". The Cell architecture is also called the Cell Broadband Engine Architecture (CBEA). The Cell multiprocessor consists of one 64-bit IBM Power Architecture core and eight specialized coprocessors operating on a "single instruction, multiple data" scheme; IBM calls these coprocessors Synergistic Processor Units (SPUs). The design is well suited to processing large data streams, for example in cryptography, in various multimedia and scientific applications such as the fast Fourier transform, and in matrix operations. The Cell architecture was created by a group of researchers from IBM Research together with colleagues from the IBM Systems Technology Group, Sony, and Toshiba; its first application was to be multimedia devices requiring large volumes of computation.

The Synergistic Processor Unit is built around its own instruction set architecture (ISA). Instructions are 32 bits long and address three operands in a register file of 128 registers, each 128 bits wide.
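
The flavor of such an ISA can be sketched in ordinary C++: a 128-bit register can be modeled as four 32-bit lanes, and a three-operand instruction applies one operation to all lanes at once. The types and the operation below are our illustration, not actual SPU code:

    // Sketch of the SPU's three-operand, 128-bit SIMD style: one instruction,
    // many data. A "register" is modeled here as four 32-bit lanes; real SPU
    // code would use the native ISA or intrinsics, not this illustration.
    #include <array>
    #include <cstdint>
    #include <iostream>

    using Reg128 = std::array<std::int32_t, 4>; // 4 x 32-bit lanes = 128 bits

    // Three-operand form: dst = a + b, applied to every lane at once.
    Reg128 simd_add(const Reg128& a, const Reg128& b) {
        Reg128 dst;
        for (int lane = 0; lane < 4; ++lane) dst[lane] = a[lane] + b[lane];
        return dst;
    }

    int main() {
        Reg128 a{1, 2, 3, 4}, b{10, 20, 30, 40};
        Reg128 c = simd_add(a, b);              // one "instruction", four results
        for (auto v : c) std::cout << v << ' ';
        std::cout << '\n';
    }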

In the future, the use of Cell will not be limited to gaming systems. Next in line are high-definition television, home servers, and even supercomputers.
