Hardware and software setup

Nvidia CUDA: computing on the graphics card, or the death of the CPU? Nvidia CUDA is a technology for non-graphical calculations on GPUs.

A new technology is like a newly emerged evolutionary species. A strange creature, unlike the many old-timers. Sometimes awkward, sometimes funny. And at first its new qualities seem entirely unsuited to this settled and stable world.

However, a little time passes, and it turns out that the newcomer runs faster, jumps higher and is altogether stronger. And it eats more flies than its retrograde neighbors. Then those same neighbors begin to understand that there is no point in quarreling with the former clumsy creature. Better to be friends with it, and better still to organize a symbiosis. Then, you look, and there will be more flies for everyone.

GPGPU (General-Purpose Graphics Processing Units) technology long existed only in the theoretical calculations of brainy academics. How else? To propose a radical change to a computing process that had settled over decades, entrusting the calculation of its parallel branches to the video card, is something only theorists are capable of.

The CUDA technology logo reminds us that it grew up in the depths of 3D graphics.

But GPGPU technology was not destined to gather dust on the pages of university journals for long. Having fluffed up the feathers of its best qualities, it attracted the attention of manufacturers. This is how CUDA, an implementation of GPGPU on NVIDIA GeForce GPUs, was born.

Thanks to CUDA, GPGPU technologies have become mainstream. Now only the most short-sighted and laziest developer of programming systems fails to claim CUDA support for their product. IT publications considered it an honor to present the details of the technology in numerous plump popular-science articles, and competitors hurriedly sat down with their templates and cross-compilers to develop something similar.

Public recognition is a dream not only of aspiring starlets, but also of newly emerging technologies. And CUDA got lucky. It is on everyone's lips: people talk about it and write about it.

The trouble is that they write about it as if they were still discussing GPGPU in thick scientific journals. They bombard the reader with terms like "grid", "SIMD", "warp", "host", "texture and constant memory", immerse him up to the crown in diagrams of nVidia GPU organization, lead him along the winding paths of parallel algorithms and (the strongest move) show long listings of C code. As a result, at the input of the article we have a reader fresh and burning with the desire to understand CUDA, and at the output the same reader, but with a head swollen from a mess of facts, diagrams, code, algorithms and terms.

Meanwhile, the goal of any technology is to make our life easier. And CUDA does a great job of it. The results of its work will convince any skeptic better than hundreds of schemes and algorithms.

Far from everywhere

CUDA is supported by high-performance NVIDIA Tesla supercomputers.

And yet, before looking at the results of CUDA's work in making life easier for the average user, it is worth understanding all of its limitations. It is like a genie: any wish, but only one. CUDA has its Achilles' heels too. One of them is the limited set of platforms on which it can work.

The video cards manufactured by nVidia that support CUDA are presented in a special list called CUDA Enabled Products. The list is quite impressive, but easy to classify. CUDA support is granted to:

    nVidia GeForce models of the 8, 9, 100, 200 and 400 series with at least 256 MB of video memory on board. Support extends to both desktop and mobile cards.

    The vast majority of nVidia Quadro desktop and mobile graphics cards.

    All solutions of the nvidia ION netbook series.

    High-performance nVidia Tesla solutions for HPC (High Performance Computing) and supercomputing, used both for personal computing and for building scalable cluster systems.

Therefore, before using CUDA-based software products, it is worth checking this list of favorites.

In addition to the graphics card itself, CUDA support requires an appropriate driver. The driver is the link between the central and graphics processors, acting as a kind of software interface through which program code and data reach the multi-core treasury of the GPU. To be sure not to make a mistake, nVidia recommends visiting the drivers page and downloading the most recent version.

...but the process

How does CUDA work? How can the complex process of parallel computing on the GPU's particular hardware architecture be explained without plunging the reader into an abyss of specialized terms?

You can try to do this by imagining how the central processor executes the program in symbiosis with the graphics processor.

Architecturally, the central processing unit (CPU) and its graphics counterpart (GPU) are built differently. If we draw an analogy with the automotive world, the CPU is a station wagon, one of those they call a "barn". It looks like a passenger car, but at the same time (from the developers' point of view) it is a jack of all trades: it plays the role of a small truck, a bus and an oversized hatchback all at once. Universal, in short. It has few cores-cylinders, but they "pull" almost any task, and an impressive cache memory can hold a heap of data.

The GPU, on the other hand, is a sports car. It has only one function: to deliver the pilot to the finish line as quickly as possible. So no large trunk of memory, no extra seats. But it has hundreds of times more cores-cylinders than the CPU.

With CUDA, GPGPU developers do not need to master the intricacies of programming graphics engines such as DirectX and OpenGL.

Unlike the central processor, which is capable of solving any task, including graphics, but with average performance, the graphics processor is adapted to a high-speed solution of one task: turning heaps of polygons at the input into a bunch of pixels at the output. Moreover, this task can be solved in parallel on hundreds of relatively simple computing cores as part of the GPU.

So what can a tandem of station wagon and sports car do? CUDA works roughly like this: the program runs on the CPU until it reaches a piece of code that can be executed in parallel. Then, instead of crawling along on two (or even eight) cores of the coolest CPU, that piece is handed over to hundreds of GPU cores. The execution time of this section shrinks several times over, which means the execution time of the whole program shrinks as well.

Technologically, almost nothing changes for the programmer. CUDA program code is written in the C language - more precisely, in its special dialect, "C with streams". Developed at Stanford, this extension of the C language is called Brook. The interface that delivers the Brook code to the GPU is the driver of a CUDA-capable video card. It organizes the processing of this section of the program so that, to the programmer, the GPU looks like a CPU coprocessor - much like the math coprocessor in the early days of personal computers. With the advent of Brook, CUDA-capable video cards and drivers for them, any programmer became able to reach the GPU from his programs. Previously this shamanism belonged to a narrow circle of the elite who had spent years honing programming techniques for the DirectX or OpenGL graphics engines.
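Here is a minimal sketch of that coprocessor model using the CUDA runtime API (the kernel name, sizes and launch parameters are illustrative assumptions, not code from the article): the serial part of the program runs on the CPU, and the data-parallel loop is handed to the GPU as a kernel.

// saxpy.cu - a minimal CUDA program; build with: nvcc -o saxpy saxpy.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// The kernel: each GPU thread processes one element of the former loop.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the last, partially filled block
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Ordinary serial code on the CPU: prepare the data.
    float *x = (float*)malloc(bytes);
    float *y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The parallel section goes to the "coprocessor": copy in, launch, copy out.
    float *dx, *dy;
    cudaMalloc((void**)&dx, bytes);
    cudaMalloc((void**)&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, dx, dy);

    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);    // expect 4.0

    cudaFree(dx); cudaFree(dy);
    free(x); free(y);
    return 0;
}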

Into this barrel of honey - the praise of CUDA - it is worth adding a spoonful of tar, that is, restrictions. Not every task that needs programming is suitable for solving with CUDA. You will not get any acceleration in routine office tasks, but you could well trust CUDA to calculate the behavior of thousands of identical fighters in World of Warcraft. That, though, is an example pulled out of thin air, so let us look at what CUDA already solves very effectively.

Righteous works

CUDA is a very pragmatic technology. Having implemented its support in their video cards, nVidia quite rightly expected that the CUDA banner would be picked up by many enthusiasts both in the university environment and in commerce. And so it happened. CUDA-based projects are alive and well.

NVIDIA PhysX

Advertising the next gaming masterpiece, manufacturers often emphasize its 3D realism. But no matter how real the 3D game world is, if the elementary laws of physics, such as gravity, friction, hydrodynamics, are implemented incorrectly, the falsity will be felt instantly.

One of the features of the NVIDIA PhysX physics engine is realistic cloth simulation.

Implementing algorithms for computer simulation of basic physical laws is a very labor-intensive task. The best-known companies in this field are the Irish company Havok, with its cross-platform Havok Physics engine, and the Californian Ageia, progenitor of the world's first physics processor (PPU - Physics Processing Unit) and the corresponding PhysX physics engine. The first of them, although acquired by Intel, is now actively working on optimizing the Havok engine for ATI video cards and AMD processors. Ageia, with its PhysX engine, became part of nVidia, and nVidia then solved the rather difficult task of adapting PhysX to the CUDA technology.

Statistics made this possible: it has been shown that no matter how complex the rendering a GPU performs, some of its cores still sit idle. It is on these cores that the PhysX engine runs.

Thanks to CUDA, the lion's share of calculations related to the physics of the game world came to be performed on the video card, and the freed-up power of the central processor was thrown at other gameplay tasks. The result was not long in coming. According to experts, gameplay performance with PhysX running on CUDA has increased by at least an order of magnitude. The plausibility of the physical laws has also grown: CUDA takes on the routine calculation of friction, gravity and other familiar things for multidimensional objects. Now not only the heroes and their equipment fit perfectly into the laws of the physical world we know, but also dust, fog, blast waves, flame and water.

CUDA version of the NVIDIA Texture Tools 2 texture compression package

Do you like realistic objects in modern games? Then thank the texture developers. But the more realism in a texture, the larger its volume and the more precious memory it occupies. To avoid this, textures are compressed in advance and dynamically decompressed as needed - and compression and decompression are continuous calculations. To work with textures, nVidia released the NVIDIA Texture Tools package. It supports efficient compression and decompression of DirectX textures (the so-called HFC format). The second version of this package boasts support for the BC4 and BC5 compression algorithms implemented in DirectX 11. But the main thing is that NVIDIA Texture Tools 2 supports CUDA. According to nVidia, this gives a 12-fold increase in performance in texture compression and decompression tasks, which means gameplay frames will load faster and delight the player with their realism.

The NVIDIA Texture Tools 2 package is tailored to work with CUDA. The increase in performance when compressing and decompressing textures is evident.

Using CUDA can significantly improve the efficiency of video surveillance.

Real-time video stream processing

Say what you like, but the current world, in terms of espionage, is much closer to the world of Orwellian Big Brother than it seems. The gaze of video cameras is felt by both car drivers and visitors to public places.

Full-flowing rivers of video information pour into processing centers and ... run into a bottleneck: a person. In most cases, a human is the last instance monitoring the video world, and not the most efficient one - blinking, getting distracted and trying hard not to fall asleep.

Thanks to CUDA, it became possible to implement algorithms for simultaneously tracking multiple objects in a video stream. The process runs in real time at a full 30 fps. Compared with the implementation of such an algorithm on modern multi-core CPUs, CUDA gives a two- to three-fold increase in performance, and that, you must agree, is quite a lot.

Video conversion, audio filtering

The Badaboom video converter was the first to use CUDA to speed up conversion.

It is nice to watch a new release in Full HD quality and on the big screen. But you cannot take a big screen with you on the road, and a Full HD video codec will make the low-powered processor of a mobile gadget hiccup. Conversion comes to the rescue. But most of those who have tried it in practice complain about the long conversion time. That is understandable: the process is routine and well suited to parallelization, and executing it on the CPU is not very optimal.

CUDA, however, copes with it with a bang. The first harbinger is the Badaboom converter from Elemental. Its developers did not miscalculate in choosing CUDA: tests show that a standard hour-and-a-half movie is converted to iPhone/iPod Touch format in less than twenty minutes, whereas on the CPU alone this process takes more than an hour.

CUDA also helps professional music lovers. Any of them would give half the kingdom for an effective FIR crossover - a set of filters that split the sound spectrum into several bands. The process is very labor-intensive and, with a large amount of audio material, sends the sound engineer out for a "smoke break" of several hours. A CUDA-based implementation of the FIR crossover speeds up its work hundreds of times.
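Why it parallelizes so well is easy to see in a sketch: every output sample of a FIR filter is an independent sum over the filter taps, so one GPU thread can compute one sample. This is only an illustrative kernel under that assumption, not code from any actual crossover product.

// Illustrative sketch of one FIR band: y[n] = sum over k of h[k] * x[n - k].
// Each thread computes one output sample independently of all the others.
__global__ void fir_band(const float *x, const float *h, float *y,
                         int n_samples, int n_taps)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= n_samples)
        return;

    float acc = 0.0f;
    for (int k = 0; k < n_taps; ++k) {
        int idx = n - k;
        if (idx >= 0)               // samples before the start are treated as zero
            acc += h[k] * x[idx];
    }
    y[n] = acc;
}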

CUDA Future

Having made GPGPU technology a reality, CUDA is not going to rest on its laurels. As happens everywhere, a feedback principle is at work in CUDA: now not only does the architecture of nVidia video processors influence the development of CUDA SDK versions, but CUDA technology itself forces nVidia to rethink the architecture of its chips. An example of such feedback is the nVidia ION platform, whose second version is specially optimized for CUDA tasks. This means that even in relatively inexpensive hardware solutions consumers will get all the power and brilliant capabilities of CUDA.

DirectX – a set of low-level programming interfaces (APIs) for creating games and other high-performance multimedia applications. Includes support for high-performance 2D and 3D graphics, sound and input devices.

Direct3D (D3D) – an interface for outputting three-dimensional primitives (geometric bodies). Part of DirectX.

OpenGL (from Open Graphics Library, literally "open graphics library") – a specification defining a programming-language-independent, cross-platform interface for writing applications that use two-dimensional and three-dimensional computer graphics. Includes over 250 functions for drawing complex 3D scenes from simple primitives. Used in video games, virtual reality and visualization in scientific research. On the Windows platform it competes with Direct3D.

OpenCL (from Open Computing Language, literally "open computing language") – a framework for writing computer programs involving parallel computing on various graphics (GPU) and central (CPU) processors. The OpenCL framework includes a programming language and an application programming interface (API). OpenCL provides parallelism at the instruction level and at the data level and is an implementation of the GPGPU technique.

GPGPU (from General-Purpose Graphics Processing Units, literally "general-purpose GPUs") – a technique for using the graphics processor of a video card for the general-purpose calculations usually carried out by the CPU.

shader – a program used in three-dimensional graphics to determine the final parameters of an object or image, originally for shading synthesized images. It typically includes descriptions of light absorption and scattering, texture mapping, reflection and refraction, shading, surface displacement and post-processing effects of arbitrary complexity. Complex surfaces can be rendered using simple geometric shapes.

rendering – visualization; in computer graphics, the process of obtaining an image from a model using software.

SDK (Software Development Kit) – a set of software development tools.

CPU (Central Processing Unit) – the central processor (microprocessor); the device that executes machine instructions; the piece of hardware responsible for performing computational operations (assigned by the operating system and application software) and coordinating the work of all devices.

GPU (Graphics Processing Unit) – the graphics processor; a separate device in a computer or game console that performs graphics rendering (visualization). Modern GPUs are very efficient at processing and rendering computer graphics realistically. The graphics processor in modern video adapters is used as a 3D graphics accelerator, but in some cases it can also be used for computation (GPGPU).

CPU problems

For a long time, the increase in the performance of traditional processors came mainly from a sequential increase in clock frequency (about 80% of performance was determined by the clock frequency) together with an increase in the number of transistors on a single chip. But a further increase in clock frequency (above 3.8 GHz the chips simply overheat!) runs into a number of fundamental physical barriers, since the manufacturing process has come close to atomic dimensions (the size of a silicon atom is approximately 0.543 nm):

First, as the die shrinks and the clock frequency rises, the leakage current of the transistors increases. This leads to higher power consumption and higher heat dissipation;

Second, the benefits of higher clock speeds are offset in part by memory access latencies, as memory access times do not match increasing clock speeds;

Third, for some applications, traditional serial architectures become inefficient as clock speeds increase due to the so-called “Von Neumann bottleneck,” a performance bottleneck resulting from the sequential flow of computation. At the same time, resistive-capacitive signal transmission delays increase, which is an additional bottleneck associated with an increase in the clock frequency.

Development of the GPU

The development of GPUs and CPUs proceeded in parallel:

November 2008 - Intel introduced the 4-core Intel Core i7 line based on the next-generation Nehalem microarchitecture. The processors operate at clock frequencies of 2.6-3.2 GHz and are made on a 45-nm process.

December 2008 - shipments of the quad-core AMD Phenom II 940 (codename Deneb) began. It operates at 3 GHz and is produced on a 45-nm process.

May 2009 - AMD introduced a version of the ATI Radeon HD 4890 GPU with the core clock raised from 850 MHz to 1 GHz, the first graphics processor to run at 1 GHz. The chip's processing power grew from 1.36 to 1.6 teraflops thanks to the frequency increase. The processor contains 800 (!) cores and supports GDDR5 video memory, DirectX 10.1, ATI CrossFireX and all the other technologies inherent in modern video card models. The chip is made on 55-nm technology.

Main differences of the GPU

The distinctive features of the GPU (compared with the CPU) are:

– an architecture that is maximally aimed at increasing the speed of calculating textures and complex graphic objects;

– the peak computing power of a typical GPU is much higher than that of a CPU;

– thanks to its dedicated pipeline architecture, the GPU is much more efficient at processing graphics information than the CPU.

"The Crisis of the Genre"

"The Crisis of the Genre" for matured by 2005 - it was then that they appeared. But, despite the development of technology, the growth in the productivity of conventional decreased markedly. At the same time performance GPU continues to grow. So, by 2003, this revolutionary idea crystallized - use the computing power of the graphic. GPUs have become actively used for “non-graphical” computing (physics simulation, signal processing, computational mathematics/geometry, database operations, computational biology, computational economics, computer vision, etc.).

The main problem was that there was no standard interface for programming the GPU. Developers used OpenGL or Direct3D, but this was very inconvenient. NVIDIA Corporation (one of the largest manufacturers of graphics, media and communication processors as well as wireless media processors, founded in 1993) took up the development of a unified and convenient standard - and introduced the CUDA technology.

How it started

2006 - NVIDIA demonstrates CUDA™, the start of a revolution in GPU computing.

2007 - NVIDIA releases the CUDA architecture (the original version of the CUDA SDK was presented on February 15, 2007); it earns a "Best New" nomination from Popular Science magazine and a "Readers' Choice" award from the publication HPCWire.

2008 - NVIDIA CUDA technology wins in the "Technical Excellence" category from PC Magazine.

What CUDA is

CUDA (Compute Unified Device Architecture, literally "unified device computing architecture") – an architecture (a set of software and hardware) that allows general-purpose computations to be performed on the GPU, with the GPU effectively acting as a powerful coprocessor.

NVIDIA CUDA™ technology is the only C-language development environment that lets developers create software for solving complex computational problems in less time, thanks to the processing power of GPUs. Millions of CUDA-capable GPUs are already at work around the world, and thousands of programmers are already using the (free!) CUDA tools to accelerate applications and solve the most complex resource-intensive tasks - from video and audio encoding to oil and gas exploration, product modeling, medical imaging and scientific research.

CUDA gives developers the ability, at their own discretion, to organize access to the instruction set of the graphics accelerator, manage its memory, and organize complex parallel calculations on it. A CUDA-capable graphics accelerator becomes a powerful programmable open architecture, much like today's CPUs. All this gives the developer low-level, distributed and high-speed access to the hardware, making CUDA a necessary foundation for building serious high-level tools such as compilers, debuggers, mathematical libraries and software platforms.

Uralsky, a lead technology specialist at NVIDIA, compares the GPU and the CPU this way: "The CPU is an SUV. It will go anywhere, anytime, but not very fast. A GPU is a sports car. On a bad road it simply will not get anywhere, but give it good pavement and it will show a speed the SUV has never dreamed of!"

Capabilities of CUDA technology

Devices for turning personal computers into small supercomputers have been known for a long time. Back in the 1980s, so-called transputers were offered on the market, inserted into the then-common ISA expansion slots. At first their performance in suitable tasks was impressive, but then the growth in the performance of universal processors accelerated, they strengthened their position in parallel computing, and transputers lost their point. Although such devices still exist today as all sorts of specialized accelerators, their scope is often narrow and they have not gained wide use.

Recently, however, the baton of parallel computing has moved to the mass market, one way or another connected with three-dimensional games. General-purpose devices with multi-core processors for the parallel vector computing used in 3D graphics achieve a peak performance that general-purpose processors cannot reach. Of course, maximum speed is achieved only on a range of convenient tasks and has its limits, but such devices have already begun to be used widely in areas for which they were not originally intended. A great example of such a parallel processor is the Cell processor, developed by the Sony-Toshiba-IBM alliance and used in the Sony PlayStation 3 game console, as well as all modern video cards from the market leaders, Nvidia and AMD.

We will not touch Cell today: although it appeared earlier and is a universal processor with additional vector capabilities, it is not our subject. For 3D video accelerators, the first technologies for general-purpose non-graphical computation, GPGPU (General-Purpose computation on GPUs), appeared several years ago. Modern video chips contain hundreds of mathematical execution units, and this power can significantly speed up many computationally intensive applications. The current generations of GPUs have a flexible enough architecture that, together with high-level programming languages and hardware-software architectures like the one discussed in this article, opens up these possibilities and makes them much more accessible.

The creation of GPGPU was prompted by the emergence of sufficiently fast and flexible shader programs that modern video chips can execute. The developers decided to have the GPU calculate not only the image in 3D applications but also other parallel computations. In GPGPU, the graphics APIs were used for this: OpenGL and Direct3D, with data passed to the video chip as textures and computational programs loaded as shaders. The disadvantages of this approach are the relatively high programming complexity, the low speed of data exchange between CPU and GPU, and other limitations that we will discuss below.

GPU computing has evolved, and is evolving, very rapidly. Subsequently, the two major video chip manufacturers, Nvidia and AMD, developed and announced their respective platforms, called CUDA (Compute Unified Device Architecture) and CTM (Close To Metal, or AMD Stream Computing). Unlike previous GPU programming models, these were built with direct access to the hardware capabilities of the video cards. The platforms are not compatible with each other: CUDA is an extension of the C programming language, while CTM is a virtual machine executing assembler code. But both platforms eliminated some of the important limitations of previous GPGPU models, which used the traditional graphics pipeline and the corresponding Direct3D or OpenGL interfaces.

Of course, open standards that use OpenGL seem to be the most portable and universal, they allow you to use the same code for video chips from different manufacturers. But such methods have a lot of disadvantages, they are much less flexible and not as convenient to use. In addition, they prevent the use of the specific features of certain video cards, such as the fast shared (shared) memory present in modern computing processors.

That is why Nvidia released the CUDA platform, a C-like programming language with its own compiler and libraries for GPU computing. Of course, writing the optimal code for video chips is not at all that easy and this task requires long manual work, but CUDA just reveals all the possibilities and gives the programmer more control over the hardware capabilities of the GPU. It is important that Nvidia CUDA support is available for the G8x, G9x and GT2xx chips used in Geforce 8, 9 and 200 series video cards, which are very widespread. The final version of CUDA 2.0 has now been released, which has some new features, such as support for double precision calculations. CUDA is available on 32-bit and 64-bit Linux, Windows, and MacOS X operating systems.

Difference between CPU and GPU in Parallel Computing

The growth of the clock frequencies of universal processors has run into physical limitations and high power consumption, and their performance is increasingly being raised by placing several cores in one chip. The processors sold now contain only up to four cores (further growth will not be fast); they are intended for general applications and use MIMD (multiple instruction, multiple data). Each core works separately from the others, executing different instructions for different processes.

Specialized vector capabilities (SSE2 and SSE3) for 4-component (single-precision floating-point) and two-component (double-precision) vectors have appeared in general-purpose processors due to the increased demands of graphics applications in the first place. That is why for certain tasks the use of GPUs is more profitable, because they were originally made for them.

For example, in Nvidia video chips the main unit is a multiprocessor with eight to ten cores and hundreds of ALUs in total, several thousand registers and a small amount of shared memory. In addition, the video card contains fast global memory accessible to all multiprocessors, local memory in each multiprocessor, and special memory for constants.

Most importantly, these multiple multiprocessor cores in the GPU are SIMD (single instruction stream, multiple data stream) cores. And these cores execute the same instructions at the same time, this style of programming is common in graphics algorithms and many scientific tasks, but requires specific programming. But this approach allows you to increase the number of execution units due to their simplification.

So, let's list the main differences between CPU and GPU architectures. CPU cores are designed to execute a single stream of sequential instructions at maximum performance, while GPUs are designed to quickly execute large numbers of parallel instruction streams. General purpose processors are optimized to achieve high performance on a single instruction stream that processes both integers and floating point numbers. The memory access is random.

CPU designers try to get as many instructions as possible to execute in parallel in order to increase performance. To do this, starting with the Intel Pentium processors, superscalar execution appeared, providing the execution of two instructions per clock, and the Pentium Pro distinguished itself by out-of-order execution of instructions. But the parallel execution of a sequential stream of instructions has certain basic limitations, and by increasing the number of execution units, a multiple increase in speed cannot be achieved.

Video chips have a simple and parallel operation from the very beginning. The video chip takes a group of polygons at the input, performs all the necessary operations, and produces pixels at the output. Processing of polygons and pixels is independent, they can be processed in parallel, separately from each other. Therefore, due to the inherently parallel organization of work in the GPU, a large number of execution units are used, which are easy to load, in contrast to the sequential flow of instructions for the CPU. In addition, modern GPUs can also execute more than one instruction per clock (dual issue). Thus, the Tesla architecture, under certain conditions, launches the MAD+MUL or MAD+SFU operations simultaneously.

The GPU also differs from the CPU in its principles of memory access. In the GPU, memory access is coherent and easily predictable: if a texture texel is read from memory, then a little later the turn of the neighboring texels will come. Writes are similar: a pixel is written to the framebuffer, and a few cycles later the one next to it will be written. Therefore the memory organization differs from that used in the CPU. The video chip, unlike universal processors, simply does not need a large cache memory, and textures require only a few kilobytes (up to 128-256 in current GPUs).

And in itself, the work with memory for the GPU and CPU is somewhat different. So, not all CPUs have built-in memory controllers, and all GPUs usually have several controllers, up to eight 64-bit channels in the Nvidia GT200 chip. In addition, video cards use faster memory, and as a result, video chips have many times more memory bandwidth available, which is also very important for parallel calculations that operate with huge data streams.

In general-purpose processors, large numbers of transistors and chip area go to instruction buffers, hardware branch prediction, and huge amounts of on-chip cache memory. All these hardware blocks are needed to speed up the execution of a few instruction streams. Video chips spend transistors on arrays of execution units, flow control units, small shared memory, and multi-channel memory controllers. The above does not speed up the execution of individual threads, it allows the chip to process several thousand threads that are simultaneously executing on the chip and require high memory bandwidth.

About differences in caching. General purpose CPUs use cache to increase performance by reducing memory access latency, while GPUs use cache or shared memory to increase bandwidth. CPUs reduce memory access latencies with large caches and code branch prediction. These hardware pieces take up most of the chip area and consume a lot of power. Video chips get around the problem of memory access delays by simultaneously executing thousands of threads - while one of the threads is waiting for data from memory, the video chip can perform calculations of another thread without waiting and delays.

There are many differences in multithreading support as well. The CPU executes 1-2 computation threads per processor core, and video chips can support up to 1024 threads per multiprocessor, of which there are several in the chip. And if switching from one thread to another for the CPU costs hundreds of cycles, then the GPU switches several threads in one cycle.

In addition, CPUs use SIMD (single instruction, multiple data) blocks for vector computing, and GPUs use SIMT (single instruction, multiple threads) for scalar thread processing. SIMT does not require the developer to convert data to vectors and allows arbitrary branching in streams.

In short, unlike modern universal CPUs, video chips are designed for parallel computing with a large number of arithmetic operations. And a much larger share of GPU transistors works for its intended purpose - processing data arrays - rather than controlling the execution (flow control) of a few sequential computational threads. A diagram of how much chip area various kinds of logic occupy in the CPU and in the GPU makes the point.

As a result, the basis for the effective use of the power of the GPU in scientific and other non-graphical calculations is the parallelization of algorithms into hundreds of execution units available in video chips. For example, many applications of molecular modeling are well suited for calculations on video chips, they require large computing power and are therefore convenient for parallel computing. And the use of multiple GPUs provides even more computing power for solving such problems.

Performing calculations on the GPU shows excellent results in algorithms that use parallel data processing. That is, when the same sequence of mathematical operations is applied to a large amount of data. In this case, the best results are achieved if the ratio of the number of arithmetic instructions to the number of memory accesses is large enough. This places less demands on flow control, and the high density of math and large amount of data eliminates the need for large caches, as on the CPU.

As a result of all the differences described above, the theoretical performance of video chips significantly exceeds the performance of the CPU. Nvidia provides the following graph of CPU and GPU performance growth over the past few years:

Naturally, these data are not without a share of slyness. Indeed, on the CPU it is much easier to achieve theoretical figures in practice, and the figures are given for single precision in the case of the GPU, and for double precision in the case of the CPU. In any case, single-precision is enough for some parallel tasks, and the difference in speed between universal and graphic processors is very large, and therefore the game is worth the candle.

The first attempts to apply calculations on the GPU

Video chips have been used for parallel mathematical calculations for quite some time. The very first attempts at such use were extremely primitive and limited to certain hardware features, such as rasterization and Z-buffering. But in the current century, with the advent of shaders, matrix calculations began to be accelerated. In 2003, a separate section at SIGGRAPH was allocated to GPU computing, and it was named GPGPU (General-Purpose computation on GPU) - general-purpose computing on the GPU.

The best known is BrookGPU, a compiler for the Brook stream programming language, designed to perform non-graphical computations on the GPU. Before it appeared, developers using the capabilities of video chips for computation chose one of the two common APIs, Direct3D or OpenGL. This seriously limited the use of the GPU, because 3D graphics use shaders and textures, which parallel programmers are not required to know about; they work with threads and cores. Brook helped ease their task. These streaming extensions to the C language, developed at Stanford University, hid the 3D API from programmers and presented the video chip as a parallel coprocessor. The compiler parsed a .br file with C++ code and extensions, producing code linked to a library with DirectX, OpenGL or x86 support.

Naturally, Brook had many shortcomings, which we will discuss in more detail later. But even its mere appearance caused a significant surge of attention from Nvidia and ATI to the initiative of computing on GPUs, since the development of these capabilities seriously changed the market going forward, opening up a whole new sector of it - parallel computing based on video chips.

Later, some researchers from the Brook project joined the Nvidia development team to introduce a hardware-software parallel computing strategy and open up a new market segment. The main advantage of this Nvidia initiative was that the developers know all the capabilities of their GPUs down to the smallest detail, there is no need to go through the graphics API, and the hardware can be worked with directly through the driver. The result of this team's efforts is Nvidia CUDA (Compute Unified Device Architecture), a new hardware-software architecture for parallel computing on Nvidia GPUs and the subject of this article.

Areas of application of parallel computations on the GPU

To understand what advantages the transfer of calculations to video chips brings, we will present the average figures obtained by researchers around the world. On average, when transferring calculations to the GPU, in many tasks acceleration is achieved by 5-30 times compared to fast universal processors. The biggest numbers (of the order of 100x speedup and even more!) are achieved on code that is not very well suited for calculations using SSE blocks, but is quite convenient for the GPU.

These are just some examples of speedups of synthetic code on the GPU versus SSE vectorized code on the CPU (according to Nvidia):

  • Fluorescence microscopy: 12x;
  • Molecular dynamics (non-bonded force calc): 8-16x;
  • Electrostatics (direct and multi-level Coulomb summation): 40-120x and 7x.

And there is a table that Nvidia is very fond of showing at all its presentations; we will dwell on it in more detail in the second part of the article, devoted to specific examples of practical applications of CUDA computing.

As you can see, the numbers are very attractive, especially the 100-150-fold gains are impressive. In the next CUDA article, we'll take a closer look at some of these numbers. And now we list the main applications in which GPU computing is now used: analysis and processing of images and signals, physics simulation, computational mathematics, computational biology, financial calculations, databases, dynamics of gases and liquids, cryptography, adaptive radiation therapy, astronomy, processing sound, bioinformatics, biological simulations, computer vision, data mining, digital cinema and television, electromagnetic simulations, geographic information systems, military applications, mining planning, molecular dynamics, magnetic resonance imaging (MRI), neural networks, oceanographic research, particle physics, protein folding simulation, quantum chemistry, ray tracing, imaging, radar, reservoir simulation, artificial intelligence, satellite data analysis, seismic exploration, surgery, ultrasound, video conferencing.

Details of many of these applications can be found on the Nvidia website in the corresponding section. As you can see, the list is quite large, but that is not all! It can be continued, and it is safe to assume that in the future other areas of application for parallel calculations on video chips will be found that we have no idea about yet.

Nvidia CUDA Capabilities

CUDA technology is Nvidia's hardware-software computing architecture, based on an extension of the C language, which provides access to the instruction set of the graphics accelerator and lets its memory be managed for parallel computing. CUDA helps implement algorithms that run on the GPUs of Geforce video accelerators of the eighth generation and later (Geforce 8, Geforce 9, Geforce 200 series), as well as Quadro and Tesla.

Although the complexity of GPU programming with CUDA is quite high, it is lower than with early GPGPU solutions. Such programs require partitioning of the application across multiple multiprocessors similar to MPI programming, but without sharing the data that is stored in the shared video memory. And since CUDA programming for each multiprocessor is similar to OpenMP programming, it requires a good understanding of memory organization. But, of course, the complexity of developing and porting to CUDA is highly dependent on the application.

The developer kit contains many code examples and is well documented. The learning process will take about two to four weeks for those already familiar with OpenMP and MPI. The API is based on the extended C language, and to translate code from this language, the CUDA SDK includes the nvcc command-line compiler, based on the open Open64 compiler.

We list the main characteristics of CUDA:

  • unified software and hardware solution for parallel computing on Nvidia video chips;
  • a wide range of supported solutions, from mobile to multi-chip systems;
  • the standard C programming language;
  • standard libraries for numerical analysis FFT (Fast Fourier Transform) and BLAS (Linear Algebra);
  • optimized data exchange between CPU and GPU;
  • interaction with the OpenGL and DirectX graphics APIs;
  • support for 32- and 64-bit operating systems: Windows XP, Windows Vista, Linux and MacOS X;
  • the ability to develop at a low level.

Regarding operating system support, it should be added that all major Linux distributions are officially supported (Red Hat Enterprise Linux 3.x/4.x/5.x, SUSE Linux 10.x), but according to enthusiasts CUDA works fine on other builds as well: Fedora Core, Ubuntu, Gentoo, etc.

The CUDA Development Environment (CUDA Toolkit) includes:

  • nvcc compiler;
  • FFT and BLAS libraries;
  • profiler;
  • gdb debugger for GPU;
  • CUDA runtime driver, included with the standard Nvidia drivers;
  • programming manual;
  • CUDA Developer SDK (source code, utilities and documentation).

The example source code includes: parallel bitonic sort, matrix transposition, parallel prefix summation of large arrays, image convolution, discrete wavelet transform, examples of interaction with OpenGL and Direct3D, use of the CUBLAS and CUFFT libraries, option price calculation (Black-Scholes formula, binomial model, Monte Carlo method), the Mersenne Twister parallel random number generator, large-array histogram calculation, noise reduction, and a Sobel filter (edge detection).

Benefits and Limitations of CUDA

From a programmer's point of view, the graphics pipeline is a set of processing stages. The geometry block generates triangles, and the rasterization block generates the pixels displayed on the monitor. The traditional GPGPU programming model is built on top of this pipeline.

To move computations to the GPU within such a model, a special approach is needed. Even element-by-element addition of two vectors requires drawing a figure to the screen or to an off-screen buffer. The figure is rasterized, the color of each pixel is calculated by a given program (a pixel shader), the program reads the input data from textures for each pixel, adds them, and writes the result to the output buffer. And all these numerous operations are needed for something that is written as a single operator in a conventional programming language!
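For contrast, here is a hedged sketch of what the same element-by-element addition looks like as a CUDA kernel (the names are illustrative): one statement per thread, with no textures or rasterization pass.

// Each thread adds one pair of elements.
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}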

Therefore the use of GPGPU for general-purpose computing was limited by being too hard for developers to learn. There were plenty of other restrictions too, because a pixel shader is just a formula for the dependence of the final pixel color on its coordinates, and the pixel shader language is a language for writing such formulas with C-like syntax. The early GPGPU methods were a clever trick for harnessing the power of the GPU, but without any convenience: the data is represented by images (textures), and the algorithm by a rasterization process. Note also the very specific model of memory and execution.

Nvidia's hardware-software architecture for GPU computing differs from previous GPGPU models in that it allows programs for the GPU to be written in real C, with standard syntax, pointers, and a minimum of extensions for accessing the computing resources of video chips. CUDA does not depend on graphics APIs and has features designed specifically for general-purpose computing.

Advantages of CUDA over the traditional approach to GPGPU computing:

  • the CUDA application programming interface is based on the standard C programming language with extensions, which simplifies the process of learning and implementing the CUDA architecture;
  • CUDA provides access to 16 KB of shared memory per multiprocessor, which can be used to organize a cache with a higher bandwidth than texture fetches;
  • more efficient data transfer between system and video memory;
  • no need for graphics APIs with redundancy and overhead;
  • linear memory addressing, and gather and scatter, the ability to write to arbitrary addresses;
  • hardware support for integer and bit operations.

Main limitations of CUDA:

  • lack of recursion support for executable functions;
  • the minimum block width is 32 threads;
  • proprietary CUDA architecture owned by Nvidia.

The weaknesses of programming with previous GPGPU methods are that these methods do not use vertex shader execution units in previous non-unified architectures, data is stored in textures and output to an off-screen buffer, and multi-pass algorithms use pixel shader units. GPGPU limitations include: insufficiently efficient use of hardware capabilities, memory bandwidth limitations, no scatter operation (only gather), mandatory use of the graphics API.

The main advantages of CUDA over previous GPGPU methods stem from the fact that this architecture is designed to efficiently use non-graphics computing on the GPU and uses the C programming language without requiring algorithms to be ported to a form convenient for the concept of the graphics pipeline. CUDA offers a new GPU computing path that does not use graphics APIs, offering random memory access (scatter or gather). Such an architecture is free from the disadvantages of GPGPU and uses all the execution units, and also expands the capabilities through integer mathematics and bit shift operations.

In addition, CUDA opens up some hardware features not available from the graphics APIs, such as shared memory. This is a small amount of memory (16 kilobytes per multiprocessor) that blocks of threads have access to. It allows you to cache the most frequently accessed data and can provide faster performance than using texture fetches for this task. This, in turn, reduces the throughput sensitivity of parallel algorithms in many applications. For example, it is useful for linear algebra, fast Fourier transform, and image processing filters.

Memory access is also more convenient in CUDA. Code written against the graphics API outputs data as 32 single-precision floating-point values (RGBA values to eight render targets simultaneously) into predefined areas, whereas CUDA supports scatter writes: an unlimited number of records to any address. Such advantages make it possible to run some algorithms on the GPU that cannot be implemented efficiently with GPGPU methods based on the graphics API.

Also, graphics APIs necessarily store data in textures, which requires packing large arrays into textures beforehand, complicating the algorithm and forcing the use of special addressing. CUDA, on the other hand, allows data to be read at any address. Another advantage of CUDA is the optimized communication between CPU and GPU. And for developers who want access to the low level (for example, when implementing another programming language), CUDA offers the possibility of low-level assembly language programming.
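A hedged sketch of what such gather and scatter look like in a kernel (the names, and the assumption that destination indices do not collide, are illustrative): each thread reads from and writes to arbitrary, data-dependent addresses.

// Gather from arbitrary source addresses and scatter to arbitrary destinations,
// something the pixel-shader path could not do.
__global__ void gather_scatter(const float *values, const int *src_index,
                               const int *dest_index, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[dest_index[i]] = values[src_index[i]];   // assumes dest_index entries are unique
}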

History of CUDA development

CUDA development was announced together with the G80 chip in November 2006, and a public beta of the CUDA SDK was released in February 2007. Version 1.0 came out in June 2007, timed to the launch of Tesla solutions based on the G80 chip for the high-performance computing market. Then, at the end of the year, the CUDA 1.1 beta was released which, despite the small increase in version number, introduced quite a lot of new things.

Among the things that appeared in CUDA 1.1, we can note the inclusion of CUDA functionality in the regular Nvidia video drivers. This meant that the requirements for any CUDA program could simply specify a Geforce 8 series video card or higher and a minimum driver version of 169.xx. This is very important for developers: if these conditions are met, CUDA programs will work for any user. Also added were asynchronous execution overlapped with data copying (only on G84, G86, G92 and later chips), asynchronous transfer of data to video memory, atomic memory access operations, support for 64-bit versions of Windows, and the possibility of multi-chip CUDA operation in SLI mode.
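A hedged sketch of that asynchronous copy/compute overlap (the kernel name and sizes are illustrative assumptions; real overlap also requires page-locked host memory and a chip that supports it):

#include <cuda_runtime.h>

__global__ void process(float *data, int n)            // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run_async(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);              // page-locked host memory, needed for real overlap
    cudaMalloc((void**)&d_buf, bytes);
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The copies and the kernel are queued in one stream; the CPU is free in the meantime.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);                      // wait for the whole chain to finish
    cudaStreamDestroy(stream);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
}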

At the moment, the version for solutions based on the GT200 is CUDA 2.0, released along with the Geforce GTX 200 line (a beta appeared back in the spring of 2008). The second version brings: support for double-precision calculations (hardware support only on the GT200); Windows Vista (32- and 64-bit) and Mac OS X are finally supported; debugging and profiling tools have been added; 3D textures are supported; and data transfer has been optimized.

As for double-precision calculations, their speed on the current hardware generation is several times lower than single precision; the reasons were discussed in our earlier material. The implementation of this support in the GT200 is that the FP32 blocks are not reused to obtain results at a quarter of the rate; instead, Nvidia decided to add dedicated computing blocks for FP64 calculations. And in the GT200 there are ten times fewer of them than FP32 blocks (one double-precision block per multiprocessor).

In reality the performance can be even lower, since the architecture is optimized for 32-bit reads from memory and registers; besides, double precision is not needed in graphics applications, and in the GT200 it is there more so that it simply exists. Modern quad-core processors, moreover, show not much lower real performance. But even being 10 times slower than single precision, this support is useful for mixed-precision schemes. One common technique is to obtain an initial approximate result in single precision and then refine it in double precision. Now this can be done directly on the video card, without sending intermediate data to the CPU.

Another useful feature of CUDA 2.0 has nothing to do with the GPU, oddly enough. It's just now possible to compile CUDA code into highly efficient multi-threaded SSE code for fast execution on the CPU. That is, now this feature is suitable not only for debugging, but also real use on systems without an Nvidia graphics card. After all, the use of CUDA in normal code is constrained by the fact that Nvidia video cards, although the most popular among dedicated video solutions, are not available in all systems. And before version 2.0, in such cases, two different codes would have to be written: for CUDA and separately for the CPU. And now you can run any CUDA program on the CPU with high efficiency, albeit at a lower speed than on video chips.

Nvidia CUDA Supported Solutions

All CUDA-enabled graphics cards can help accelerate most demanding tasks, from audio and video processing to medical and scientific research. The only real limitation is that many CUDA programs require a minimum of 256 megabytes of video memory, and this is one of the most important specifications for CUDA applications.

An up-to-date list of CUDA-enabled products can be found on the Nvidia website. At the time of this writing, CUDA computing was supported by all products of the Geforce 200, Geforce 9 and Geforce 8 series, including mobile products starting with the Geforce 8400M, as well as the Geforce 8100, 8200 and 8300 chipsets, modern Quadro cards and all Tesla models: S1070, C1060, C870, D870 and S870.

We note in particular that, along with the new Geforce GTX 260 and 280 video cards, corresponding high-performance computing solutions were announced: the Tesla C1060 and S1070 (shown in the photo above), which will be available for purchase this fall. They use the same GPU, the GT200 - one in the C1060, four in the S1070. But, unlike gaming solutions, they carry four gigabytes of memory per chip. Of the minuses, one can perhaps note only the lower memory frequency and bandwidth compared with gaming cards: 102 GB/s per chip.

Composition of Nvidia CUDA

CUDA includes two APIs: a high-level one (the CUDA Runtime API) and a low-level one (the CUDA Driver API). The two cannot be used at the same time in one program; you must use one or the other. The high-level one works "on top" of the low-level one: all runtime calls are translated into simple instructions processed by the low-level Driver API. But even the "high-level" API assumes knowledge of how Nvidia video chips are designed and operate; there is no excessively high level of abstraction there.
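A hedged illustration of the difference in flavor, using signatures as they looked in the CUDA 2.x era (treat the exact details as an assumption): allocating device memory through each API. In a real program you would pick only one of the two.

#include <cuda_runtime.h>   // Runtime API
#include <cuda.h>           // Driver API

// High-level Runtime API: device and context management are implicit.
void alloc_with_runtime(size_t bytes)
{
    void *d_ptr;
    cudaMalloc(&d_ptr, bytes);
    cudaFree(d_ptr);
}

// Low-level Driver API: the program initializes the device and creates the context itself.
void alloc_with_driver(size_t bytes)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr d_ptr;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuMemAlloc(&d_ptr, bytes);
    cuMemFree(d_ptr);
    cuCtxDestroy(ctx);
}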

There is another level, even higher - two libraries:

CUBLAS – the CUDA version of BLAS (Basic Linear Algebra Subprograms), designed for computing linear algebra problems and using direct access to GPU resources;

CUFFT – the CUDA version of the Fast Fourier Transform library, widely used in signal processing. The following transform types are supported: complex-complex (C2C), real-complex (R2C) and complex-real (C2R).

Let's take a closer look at these libraries. CUBLAS consists of standard linear algebra algorithms translated to the CUDA language; at the moment only a certain set of basic CUBLAS functions is supported. The library is very easy to use: you create matrix and vector objects in video memory, fill them with data, call the required CUBLAS functions, and load the results from video memory back into system memory. CUBLAS contains special functions for creating and destroying objects in GPU memory, as well as for reading and writing data to this memory. Supported BLAS functions: levels 1, 2 and 3 for real numbers, and level 1 plus CGEMM for complex numbers. Level 1 is vector-vector operations, level 2 is vector-matrix operations, level 3 is matrix-matrix operations.
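A hedged sketch of that workflow using the legacy CUBLAS interface of the period (treat the exact signatures as an assumption): a single SAXPY call, y = alpha * x + y, with both vectors living in video memory.

#include <cublas.h>

void cublas_saxpy_example(int n, float alpha, float *h_x, float *h_y)
{
    float *d_x, *d_y;

    cublasInit();                                       // initialize the library
    cublasAlloc(n, sizeof(float), (void**)&d_x);        // create objects in GPU memory
    cublasAlloc(n, sizeof(float), (void**)&d_y);

    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // fill them with data
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);              // call the required BLAS level-1 routine

    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // load the result back into system memory

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
}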

CUFFT, the CUDA variant of the Fast Fourier Transform, is widely used and very important in signal analysis, filtering and so on. CUFFT provides a simple interface for efficient FFT computation on Nvidia GPUs without the need to develop a custom FFT for the GPU. The CUDA FFT variant supports 1D, 2D and 3D transforms of complex and real data and batch execution of multiple 1D transforms in parallel; the sizes of 2D and 3D transforms must lie within certain limits, and for 1D a size of up to 8 million elements is supported.
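A hedged sketch of a forward 1D complex-to-complex transform with CUFFT, performed in place on the GPU (signatures as in the CUDA 2.x era; treat the details as an assumption):

#include <cufft.h>
#include <cuda_runtime.h>

void fft_forward(cufftComplex *h_signal, int n)
{
    cufftComplex *d_signal;
    cudaMalloc((void**)&d_signal, n * sizeof(cufftComplex));
    cudaMemcpy(d_signal, h_signal, n * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);                // one transform of size n
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    cudaMemcpy(h_signal, d_signal, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_signal);
}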

Fundamentals of creating programs on CUDA

To understand the text below, you need to know the basic architectural features of Nvidia video chips. The GPU consists of several Texture Processing Clusters. Each cluster consists of an enlarged texture fetch unit and two or three streaming multiprocessors, each of which in turn contains eight computing devices and two super function units. All instructions are executed according to the SIMD principle: one instruction is applied to all threads in a warp (a term from the textile industry; in CUDA it is a group of 32 threads, the minimum amount of data processed by a multiprocessor). Nvidia calls this execution method SIMT (single instruction, multiple threads).

Each multiprocessor has certain resources of its own. There is a special shared memory of 16 kilobytes per multiprocessor. It is not a cache: the programmer can use it for any purpose, much like the Local Store of the SPUs in Cell processors. Shared memory allows threads of the same block to exchange information. All threads of one block are always executed by the same multiprocessor, while threads from different blocks cannot exchange data; this limitation must be kept in mind. Shared memory is fast except when several threads access the same memory bank at once. Multiprocessors can also access video memory, but with higher latency and lower bandwidth. To speed up access and reduce the frequency of video memory accesses, each multiprocessor has 8 kilobytes of cache for constants and texture data.
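To make intra-block cooperation concrete, here is a small hypothetical kernel that sums the 256 values handled by one block in shared memory; the kernel name and block size are ours, and the block size is assumed to be a power of two:

    // Each 256-thread block produces one partial sum; blocks cannot see each other's buf.
    __global__ void blockSum(const float *in, float *blockResults)
    {
        __shared__ float buf[256];                       // lives in the 16 KB shared memory

        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                                 // wait until the whole block has loaded

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];           // tree reduction within the block
            __syncthreads();
        }

        if (tid == 0)
            blockResults[blockIdx.x] = buf[0];           // one result per block goes to global memory
    }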

The multiprocessor has 8192 or 16384 registers (for G8x/G9x and GT2xx, respectively), shared among all threads of all blocks executing on it. The maximum number of blocks per multiprocessor for G8x/G9x is eight, and the number of warps is 24 (768 threads per multiprocessor). In total, the top video cards of the GeForce 8 and 9 series can process up to 12288 threads at a time. The GeForce GTX 280, based on the GT200, supports up to 1024 threads per multiprocessor; it has 10 clusters of three multiprocessors each, processing up to 30720 threads. Knowing these limits lets you optimize algorithms for the available resources.

The first step in porting an existing application to CUDA is profiling it and identifying the code sections that act as bottlenecks. If some of these sections are suitable for fast parallel execution, they are rewritten using C with CUDA extensions for execution on the GPU. The program is compiled with the Nvidia-supplied compiler, which generates code for both the CPU and the GPU. When the program runs, the CPU executes its portions of the code, while the GPU executes the CUDA code with the heaviest parallel computations. The part intended for the GPU is called the kernel; the kernel defines the operations to be performed on the data.

The video chip receives the kernel and creates copies of it for each data element. These copies are called threads. A thread contains its own counter, registers, and state. For large data volumes, such as image processing, millions of threads are launched. Threads execute in groups of 32 called warps, and warps are assigned to specific streaming multiprocessors. Each multiprocessor consists of eight cores, stream processors that execute one MAD instruction per clock. Executing one 32-thread warp therefore takes four multiprocessor cycles (at the shader domain frequency, which is 1.5 GHz and higher).

A multiprocessor is not a traditional multi-core processor; it is built for multithreading and supports up to 32 warps at a time. Each clock cycle the hardware chooses which warp to execute and switches between them without losing cycles. By analogy with a central processor, this is like executing 32 programs simultaneously and switching between them every clock cycle with no context-switch penalty. CPU cores actually run one program at a time and switch to another with a delay of hundreds of cycles.

CUDA Programming Model

CUDA uses a parallel computing model in which each SIMD processor executes the same instruction on different data elements in parallel. The GPU is a computing device, a coprocessor (device) to the central processor (host), with its own memory, that processes a large number of threads in parallel. The kernel is a function executed on the GPU by these threads (the analogy from 3D graphics is a shader).

As noted above, a video chip differs from a CPU in that it can process tens of thousands of threads at once, which is natural for graphics workloads, which parallelize well. Each thread is scalar and does not require data to be packed into 4-component vectors, which is more convenient for most tasks. The number of logical threads and thread blocks exceeds the number of physical execution units, which gives good scalability across the company's entire product range.

The programming model in CUDA groups threads. Threads are combined into thread blocks, which interact with each other through shared memory and synchronization points, and blocks in turn are combined into a grid. The program (kernel) is executed over a grid of thread blocks, see the figure below; one grid is executed at a time. Each block can be one-, two-, or three-dimensional in shape and may contain up to 512 threads on current hardware.
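As an illustration of this indexing scheme, here is a small hypothetical kernel that brightens a 2D image; each thread derives its pixel coordinates from the built-in blockIdx, blockDim, and threadIdx variables (the kernel and parameter names are ours):

    // One thread per pixel of a width x height image stored row by row in global memory.
    __global__ void brighten(float *image, int width, int height, float gain)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index within the grid
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index within the grid

        if (x < width && y < height)                     // the grid may be slightly larger than the image
            image[y * width + x] *= gain;
    }

On the host side, such a kernel would be launched over a two-dimensional grid of two-dimensional blocks, for example blocks of 16 x 16 threads.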

Within a block, threads are executed in groups of 32 called warps; a warp is the minimum unit of work scheduled on a multiprocessor. Since such granularity is not always convenient, CUDA lets you work with blocks containing 64 to 512 threads.

Grouping blocks into grids gets around these limits and lets the kernel be applied to a larger number of threads in a single call. It also helps with scaling: if the GPU does not have enough resources, it executes blocks sequentially; otherwise blocks can run in parallel, which matters for distributing work optimally across video chips of different levels, from mobile and integrated chips to high-end cards.

CUDA memory model

The memory model in CUDA allows byte addressing and supports both gather and scatter. Each stream processor has a fairly large number of registers available, up to 1024. Access to them is very fast; they can hold 32-bit integers or floating point numbers.

Each thread has access to the following types of memory:

Global memory - the largest amount of memory available to all multiprocessors on the video chip; its size ranges from 256 megabytes to 1.5 gigabytes on current solutions (and up to 4 GB on Tesla). It offers high bandwidth, more than 100 gigabytes per second on top Nvidia solutions, but very high latency of several hundred cycles. It is not cached and supports generic load and store instructions and ordinary memory pointers.

Local memory - a small amount of memory that only one stream processor can access. It is relatively slow, just like global memory.

Shared memory - a 16-kilobyte (on video chips of the current architecture) block of memory with shared access for all stream processors in a multiprocessor. It is very fast, as fast as registers. It enables thread interaction, is managed directly by the developer, and has low latency. Its advantages: it can be used as a programmer-managed first-level cache, it reduces the latency of data accesses by the execution units (ALUs), and it cuts down the number of global memory accesses.

Constant memory - a 64-kilobyte memory area (the same on current GPUs), readable but not writable by the multiprocessors. It is cached at 8 kilobytes per multiprocessor. It is rather slow: a latency of several hundred cycles when the required data is not in the cache.

Texture memory - a block of memory available for reading by all multiprocessors. Data is fetched through the texture units of the video chip, so linear interpolation of data comes at no extra cost. It is cached at 8 kilobytes per multiprocessor. It is as slow as global memory: hundreds of cycles of latency when the data is not in the cache.

Naturally, global, local, texture, and constant memory are physically the same memory: the video card's local video memory. They differ in caching algorithms and access models. The CPU can update and read only the external memory types: global, constant, and texture.
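To show how these memory spaces appear in source code, here is a minimal sketch that keeps filter coefficients in constant memory; the symbol, kernel, and helper names are invented for illustration:

    #include <cuda_runtime.h>

    // The coefficients live in the 64 KB constant memory space, cached per multiprocessor.
    __constant__ float coeff[16];

    __global__ void applyFilter(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * coeff[i % 16];              // every thread reads the same cached constants
    }

    void setupCoefficients(const float *hostCoeff)
    {
        // Only the CPU can write constant memory, and it does so before launching the kernel.
        cudaMemcpyToSymbol(coeff, hostCoeff, 16 * sizeof(float));
    }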

From the above it is clear that CUDA implies a special approach to development, not quite the same as that used in CPU programs. You have to keep the different memory types in mind: local and global memory are not cached, and their access latency is much higher than that of register memory, because they are physically located in separate memory chips.

A typical, though not mandatory, problem-solving pattern looks like this (a minimal sketch illustrating it follows the list):

  • the task is divided into subtasks;
  • the input data is divided into blocks that fit into shared memory;
  • each block is processed by a thread block;
  • the subblock is loaded into shared memory from the global one;
  • corresponding calculations are performed on data in shared memory;
  • the results are copied from shared memory back to global.
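Here is a hedged sketch of this pattern for a simple case, transposing a matrix tile by tile through shared memory; the tile size and kernel name are ours:

    #define TILE 16

    // Transposes a matrix with "width" columns and "height" rows, one 16x16 tile per thread block.
    __global__ void transposeTiled(const float *in, float *out, int width, int height)
    {
        __shared__ float tile[TILE][TILE];               // the sub-block staged in shared memory

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // 1. load the sub-block from global memory into shared memory
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // 2. write the transposed sub-block back to global memory
        int tx = blockIdx.y * TILE + threadIdx.x;        // transposed coordinates
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < height && ty < width)
            out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];
    }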

Programming environment

CUDA includes runtime libraries:

  • a common component that provides built-in vector types and a subset of runtime library calls supported on both the CPU and the GPU;
  • a CPU component for controlling one or more GPUs;
  • a GPU component that provides GPU-specific functions.

The main process of a CUDA application runs on the general-purpose processor (host) and launches multiple copies of kernel processes on the video card. The CPU code does the following: it initializes the GPU, allocates memory on the video card and in system memory, copies constants to the video card's memory, launches several copies of the kernel on the video card, copies the result back from video memory, frees the memory, and exits.

As an example, consider the host-side CPU code for vector addition in CUDA.
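Since the original listing is not reproduced in this text, below is a minimal sketch of what such code typically looks like; the kernel name, array size, and launch configuration are ours, chosen purely for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Kernel: one thread adds one pair of elements.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc((void**)&d_a, bytes);                      // allocate memory on the video card
        cudaMalloc((void**)&d_b, bytes);
        cudaMalloc((void**)&d_c, bytes);

        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // copy input data to video memory
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);        // launch copies of the kernel

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // copy the result back
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);          // free video memory and exit
        free(h_a); free(h_b); free(h_c);
        return 0;
    }

The structure follows exactly the sequence described above: allocate, copy in, launch, copy out, free.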

The functions executed by the video chip have the following limitations: no recursion, no static variables inside the functions, and no variable number of arguments. Two types of memory management are supported: linear memory accessed by 32-bit pointers, and CUDA arrays accessed only through texture fetching functions.

CUDA programs can interact with graphics APIs: they can render data generated by the program, or read rendering results and process them with CUDA (for example, when implementing post-processing filters). To do this, graphics API resources can be mapped (their addresses obtained) into the CUDA global memory space. The following graphics resource types are supported: buffer objects (PBO/VBO) in OpenGL, and vertex buffers and textures (2D, 3D, and cube maps) in Direct3D 9.
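A hedged sketch of the OpenGL side of this interaction, using the buffer-object mapping calls of that generation of the runtime API; the pixel buffer object is assumed to have been created by OpenGL beforehand, and the fill kernel is ours:

    #include <cuda_gl_interop.h>

    // Paints an OpenGL pixel buffer object from a CUDA kernel without leaving video memory.
    __global__ void fillBuffer(uchar4 *pixels, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            pixels[i] = make_uchar4(255, 0, 0, 255);           // solid red, just for illustration
    }

    void processPBO(GLuint pbo, int pixelCount)
    {
        cudaGLRegisterBufferObject(pbo);                       // register the PBO with CUDA

        uchar4 *d_pixels = 0;
        cudaGLMapBufferObject((void**)&d_pixels, pbo);         // map it into CUDA's address space

        fillBuffer<<<(pixelCount + 255) / 256, 256>>>(d_pixels, pixelCount);

        cudaGLUnmapBufferObject(pbo);                          // hand the buffer back to OpenGL
        cudaGLUnregisterBufferObject(pbo);
    }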

CUDA application compilation steps:

CUDA C source files are compiled with the NVCC program, a wrapper that invokes other tools: cudacc, g++, cl, and so on. NVCC generates CPU code, which is compiled together with the rest of the application written in plain C, and PTX object code for the video chip. Executable files with CUDA code require the CUDA runtime library (cudart) and the CUDA core library (cuda).
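For reference, compiling the hypothetical vector-addition example above might look like this (the file name is ours; the --ptx option emits the intermediate PTX code instead of an executable, which is handy for inspecting what the compiler produced):

    nvcc -o vecadd vecadd.cu     (builds the executable; host parts go to the system C/C++ compiler)
    nvcc --ptx vecadd.cu         (emits vecadd.ptx, the intermediate code for the video chip)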

Optimization of programs on CUDA

Naturally, a review article cannot cover serious optimization issues in CUDA programming, so we will only touch on the basics. To use CUDA effectively, you need to forget the usual methods of writing CPU programs and use algorithms that parallelize well across thousands of threads. It is also important to find the best place to store data (registers, shared memory, and so on), to minimize data transfers between the CPU and GPU, and to use buffering.

In general terms, optimizing a CUDA program means finding the right balance between block size and block count. More threads per block reduce the impact of memory latency but also reduce the number of registers available per thread. A block of 512 threads can also be inefficient; Nvidia itself recommends blocks of 128 or 256 threads as a compromise between latency and register count.

The main optimization points for CUDA programs: use shared memory as actively as possible, since it is much faster than the card's global video memory; reads and writes to global memory should be coalesced whenever possible. To achieve this, use special data types that read and write 32/64/128 bits of data in a single operation. If reads are hard to coalesce, you can try texture fetches.
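As a hedged illustration of the wide data types mentioned above, here is a small kernel that scales an array by reading and writing one float4 (128 bits) per thread; the names and the assumption that the array length is a multiple of four are ours:

    // Each thread moves 128 bits at once, which helps the hardware coalesce the accesses of a warp.
    __global__ void scale4(float4 *data, int count4, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count4) {
            float4 v = data[i];          // one 128-bit load instead of four 32-bit loads
            v.x *= factor;
            v.y *= factor;
            v.z *= factor;
            v.w *= factor;
            data[i] = v;                 // one 128-bit store
        }
    }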

Conclusions

The hardware and software architecture presented by Nvidia for calculations on CUDA video chips is well suited for solving a wide range of tasks with high parallelism. CUDA runs on a large number of Nvidia video chips, and improves the GPU programming model by greatly simplifying it and adding a lot of features such as shared memory, the ability to synchronize threads, double precision calculations, and integer operations.

CUDA is a technology available to every software developer; any programmer who knows the C language can use it. You just have to get used to the different programming paradigm inherent in parallel computing. But if the algorithm parallelizes well in principle, the time spent studying and programming with CUDA will pay off many times over.

It is likely that, given how widespread video cards are, the development of parallel computing on the GPU will strongly affect the high-performance computing industry. These possibilities have already aroused great interest in scientific circles, and not only there. After all, the potential to accelerate well-parallelizable algorithms by dozens of times at once, and on affordable hardware, which is no less important, does not come along often.

General-purpose processors develop rather slowly and show no such performance leaps. Grand as it may sound, everyone who needs fast computing can now get an inexpensive personal supercomputer on their desk, sometimes without even spending additional money, since Nvidia video cards are so widespread. Not to mention the efficiency gains in GFLOPS/$ and GFLOPS/W that GPU makers are so fond of citing.

The future of much of computing clearly lies in parallel algorithms; almost all new solutions and initiatives point in that direction. For now, though, the development of new paradigms is at an early stage: you have to create threads manually and schedule memory access, which makes things harder than conventional programming. But CUDA has taken a step in the right direction, and it clearly looks like a successful solution, especially if Nvidia manages to convince as many developers as possible of its benefits and prospects.

But, of course, GPUs will not replace CPUs; in their current form they are not designed for that. While video chips gradually move towards the CPU, becoming more and more universal (single and double precision floating point calculations, integer calculations), CPUs are becoming more and more "parallel", acquiring many cores and multithreading technologies, not to mention SIMD units and heterogeneous processor projects. Most likely, the GPU and CPU will simply merge in the future. Many companies, including Intel and AMD, are known to be working on such projects. And it does not matter whether the GPU is absorbed by the CPU or vice versa.

In this article we have mostly talked about the benefits of CUDA. But there is a fly in the ointment. One of the few drawbacks of CUDA is its poor portability. The architecture works only on this company's video chips, and not on all of them, but starting with the GeForce 8 and 9 series and the corresponding Quadro and Tesla products. Yes, there are many such solutions in the world; Nvidia cites a figure of 90 million CUDA-compatible video chips. That is great, but competitors offer their own alternatives to CUDA: AMD has Stream Computing, and Intel will have Ct in the future.

Which technology will win, become widespread, and outlive the rest, only time will tell. But CUDA has good chances, because compared with Stream Computing, for example, it offers a more mature and convenient programming environment based on ordinary C. Perhaps a third party will settle the matter by issuing a common solution. For instance, Microsoft has promised compute shaders in the next DirectX update, version 11, which could become a sort of middle ground that suits everyone, or almost everyone.

Judging by preliminary information, this new shader type borrows a lot from the CUDA model, so programming in this environment now gives immediate benefits and the skills needed for the future. From a high-performance computing point of view, DirectX also has the clear disadvantage of poor portability, since the API is limited to the Windows platform. However, another standard is being developed: the open, multi-platform OpenCL initiative, which is backed by most companies, including Nvidia, AMD, Intel, IBM, and many others.

In the next CUDA article, we will look at specific practical applications of scientific and other non-graphical computing that developers around the world have built with Nvidia CUDA.

Let's go back in history, to 2003, when Intel and AMD were racing to build the most powerful processor. In just a few years clock speeds rose significantly as a result of this race, especially after the release of the Intel Pentium 4.

But the race was quickly approaching its limit. After a wave of huge clock speed increases (between 2001 and 2003 the Pentium 4 clock speed doubled from 1.5 to 3 GHz), users had to be content with the tenths of a gigahertz that manufacturers managed to squeeze out (from 2003 to 2005 clock speeds rose only from 3 to 3.8 GHz).

Even architectures optimized for high clock speeds, such as Prescott, began to run into difficulties, and this time not only with manufacturing. Chip makers simply ran into the laws of physics. Some analysts even predicted that Moore's law would cease to hold. That did not happen. The original meaning of the law is often misrepresented: it refers to the number of transistors on the surface of a silicon die. For a long time, growth in the number of transistors in a CPU was accompanied by a corresponding rise in performance, which led to the distortion. But then things got more complicated. CPU architects ran into the law of diminishing returns: the number of transistors that had to be added for a given performance gain kept growing, leading to a dead end.



While CPU makers were tearing their hair out trying to solve their problems, GPU makers continued to benefit remarkably from Moore's law.

Why didn't they end up in the same dead end as the designers of the CPU architecture? The reason is very simple: CPUs are designed to get the best performance on a stream of instructions that process different data (both integers and floating point numbers), perform random memory access, and so on. Until now, developers have been trying to provide greater instruction parallelism - that is, to execute as many instructions as possible in parallel. So, for example, superscalar execution appeared with the Pentium, when under certain conditions it was possible to execute two instructions per clock. The Pentium Pro received out-of-order execution of instructions, which made it possible to optimize the performance of computing units. The problem is that the parallel execution of a sequential stream of instructions has obvious limitations, so blindly increasing the number of computing units does not give a gain, since most of the time they will still be idle.

On the contrary, the work of the GPU is relatively simple. It consists of taking a group of polygons on one side and generating a group of pixels on the other. Polygons and pixels are independent of each other, so they can be processed in parallel. Thus, in the GPU, it is possible to allocate a large part of the crystal for computing units, which, unlike the CPU, will actually be used.




The GPU differs from the CPU in other ways too. Memory access in the GPU is highly coherent: when a texel is read, a few cycles later the adjacent texel will be read; when a pixel is written, a few cycles later the neighboring one will be written. By organizing memory intelligently, you can get performance close to the theoretical bandwidth. This means the GPU, unlike the CPU, does not need a huge cache, since its role is only to speed up texturing operations. A few kilobytes holding the few texels used in bilinear and trilinear filters are enough.




Long live GeForce FX!

The two worlds remained separate for a long time. We used the CPU (or even several CPUs) for office tasks and Internet applications, while the GPU was good only for accelerating rendering. But one feature changed everything: the advent of programmable GPUs. At first, CPUs had nothing to fear. The first so-called programmable GPUs (NV20 and R200) were hardly a threat: the number of instructions in a program was limited to about 10, and they operated on very exotic data types, such as 9- or 12-bit fixed-point numbers.




But Moore's law again showed its best side. The growing transistor count not only increased the number of computing units but also improved their flexibility. The appearance of the NV30 can be considered a significant step forward for several reasons. Gamers did not much like the NV30 cards, but the new GPUs introduced two features that were set to change the perception of GPUs as more than just graphics accelerators:

  • support for single-precision floating-point calculations (even if it did not comply with the IEEE 754 standard);
  • support for more than a thousand instructions.

So we got all the conditions that are able to attract pioneering researchers who always want to get additional computing power.

The idea of using graphics accelerators for mathematical calculations is not new. The first attempts were made back in the 1990s. Of course, they were very primitive, limited for the most part to using certain hardware features, such as rasterization and Z-buffers, to speed up tasks like route finding or drawing Voronoi diagrams.




In 2003, with the advent of evolved shaders, a new bar was reached: this time matrix calculations were performed on the GPU. This was the year an entire section of SIGGRAPH ("Computations on GPUs") was dedicated to the new area of IT. This early initiative was called GPGPU (General-Purpose computation on GPU). And an early turning point was the emergence of BrookGPU.

To understand the role of BrookGPU, you need to understand how things were before it appeared. The only way to use GPU resources in 2003 was through one of the two graphics APIs, Direct3D or OpenGL. Consequently, developers who wanted to harness the power of the GPU for their computations had to rely on those two APIs. The problem was that they were not always experts in graphics card programming, which made access to the technology very difficult. 3D programmers think in terms of shaders, textures, and fragments, while specialists in parallel programming think in terms of threads, kernels, scatter operations, and so on. Therefore, at first, analogies had to be drawn between the two worlds.

  • Stream - a stream of elements of the same type; on the GPU it can be represented by a texture. Its closest analogue in classical programming is an array.
  • Kernel - a function applied independently to each element of the stream; the equivalent of a pixel shader. In classical programming, the closest analogy is the body of a loop applied to a large number of elements.
  • To read the results of applying a kernel to a stream, a texture must be created. There is no equivalent on the CPU, which has full access to memory.
  • The memory location to write to (in scatter operations) is controlled through the vertex shader, since the pixel shader cannot change the coordinates of the pixel it is processing.

As you can see, even with the above analogies, the task does not look simple. And that is where Brook came to the rescue. The name refers to extensions to the C language ("C with streams"), as the Stanford developers called them. In essence, Brook's task was to hide all the components of the 3D API from the programmer, making it possible to present the GPU as a coprocessor for parallel computing. To do this, the Brook compiler processed a .br file containing C++ code and extensions and generated C++ code linked to a library with support for different back ends (DirectX, OpenGL ARB, OpenGL NV3x, x86).




Brook deserves credit on several counts, the first being that it brought GPGPU out of the shadows and into the public eye. After the project was announced, a number of IT sites were overly optimistic, suggesting that the release of Brook cast doubt on the future of CPUs, which would soon be replaced by more powerful GPUs. As we can see, even five years later this has not happened, and to be honest, we do not think it ever will. On the other hand, looking at the successful evolution of CPUs, which are increasingly oriented towards parallelism (more cores, SMT multithreading, wider SIMD units), and of GPUs, which on the contrary are becoming more universal (single-precision floating point, integer calculations, support for double precision), it looks like the GPU and CPU will eventually simply merge. What happens then? Will GPUs be swallowed up by CPUs, as happened with math coprocessors? Quite possibly. Intel and AMD are working on such projects today. But a lot can still change.

But back to our topic. Brook's achievement was to popularize the GPGPU concept; it greatly simplified access to GPU resources, which allowed more and more users to master the new programming model. On the other hand, despite all of Brook's qualities, there was still a long way to go before GPU resources could be used for computing routinely.

One problem is related to the different levels of abstraction and, in particular, to the excessive overhead created by the 3D API, which can be quite noticeable. But more serious was the compatibility problem, which the Brook developers could do nothing about. There is fierce competition between GPU manufacturers, so they often optimize their drivers. While such optimizations are mostly good for gamers, they could break Brook compatibility in an instant. It is therefore hard to imagine using this API in production code intended to run in the field, and for a long time Brook remained the preserve of amateur researchers and programmers.

However, Brook's success was enough to attract the attention of ATI and Nvidia, and they were interested in such an initiative, as it could expand the market, opening up a new important sector for companies.

Researchers originally involved in the Brook project quickly joined the development teams in Santa Clara to present a global strategy for developing the new market. The idea was to create a combination of hardware and software suited to GPGPU tasks. Since Nvidia's developers know all the secrets of their GPUs, they could avoid relying on the graphics API and instead communicate with the graphics processor through the driver. Of course, this brings its own problems. So the CUDA (Compute Unified Device Architecture) development team created a set of software layers for working with the GPU.




As you can see in the diagram, CUDA provides two APIs.

  • High-level API: CUDA Runtime API;
  • low-level API: CUDA Driver API.

Since the high-level API is implemented on top of the low-level one, each runtime function call is broken down into simpler instructions processed by the Driver API. Note that the two APIs are mutually exclusive: a programmer can use one or the other, but it is not possible to mix calls to both. The term "high-level API" is relative, though: even the Runtime API is one that many would consider low-level, yet it still provides functions that are very convenient for initialization and context management. Don't expect a particularly high level of abstraction; you still need a fair amount of knowledge about nVidia GPUs and how they work.

The Driver API is even harder to work with and requires more effort to launch GPU processing. On the other hand, the low-level API is more flexible, giving the programmer extra control when needed. Both APIs can work with OpenGL or Direct3D resources (only version 9 of Direct3D as of today). The benefit is obvious: CUDA can be used to create resources (geometry, procedural textures, etc.) that are passed to the graphics API or, conversely, the 3D API can send its rendering results to a CUDA program that performs post-processing. There are many examples of such interaction, and the advantage is that resources remain in GPU memory and do not need to be transferred over the PCI Express bus, which is still a bottleneck.

It should be noted, however, that sharing resources in video memory is not always ideal and can lead to some headaches. For example, when the resolution or color depth changes, graphics data takes priority, so if the frame buffer needs more resources, the driver will happily take them from CUDA applications, which will simply crash with an error. Not very elegant, but this situation should not occur very often. And since we are on the subject of drawbacks: if you want to use multiple GPUs for CUDA applications, you must first disable SLI mode, otherwise CUDA applications will only be able to "see" one GPU.

Finally, the third software layer consists of libraries, two to be exact.

  • CUBLAS, which contains the necessary blocks for computing linear algebra on the GPU;
  • CUFFT, which supports the calculation of Fourier transforms, is an algorithm widely used in the field of signal processing.

Before we dive into CUDA, let's define a few terms scattered throughout the nVidia documentation. The company has chosen a very specific terminology that takes getting used to. First of all, a thread in CUDA has nowhere near the same meaning as a CPU thread, nor is it the equivalent of a thread in our GPU articles. A GPU thread in this case is the basic set of data to be processed. Unlike CPU threads, CUDA threads are very "light": a context switch between two threads is by no means a resource-intensive operation.

The second term often found in the CUDA documentation is warp. There is no confusion here, since the term has no everyday analogue (unless you are a fan of Star Trek or the Warhammer games). It actually comes from the textile industry, where the weft yarn is pulled through the warp yarns stretched on the loom. A warp in CUDA is a group of 32 threads and is the minimum amount of data processed in SIMD fashion by CUDA multiprocessors.

But such granularity is not always convenient for the programmer, so in CUDA, instead of working with warps directly, you work with blocks containing 64 to 512 threads.

Finally, these blocks are combined into grids. The advantage of this grouping is that the number of blocks processed simultaneously by the GPU is closely tied to the hardware resources, as we will see below. Grouping blocks into grids abstracts away this limitation and lets the kernel be applied to more threads in a single call without worrying about fixed resources; the CUDA libraries handle all of this. The model also scales well: if the GPU has few resources, it executes the blocks sequentially; if there are many computing processors, blocks can run in parallel. In other words, the same code can run on entry-level GPUs, on top-end models, and even on future ones.

There are a couple more terms in the CUDA API: the CPU is called the host and the GPU the device. If this little introduction hasn't scared you off, it's time to get to know CUDA better.

If you regularly read Tom's Hardware Guide, the architecture of the latest GPUs from nVidia is familiar to you. If not, we recommend reading the article "nVidia GeForce GTX 260 and 280: a new generation of graphics cards". With regard to CUDA, nVidia presents the architecture a little differently, showing some details that were previously hidden.

As you can see from the illustration above, the nVidia shader core consists of several texture processor clusters (Texture Processor Cluster, TPC). The 8800 GTX, for example, used eight clusters, the 8800 GTS six, and so on. Each cluster essentially consists of a texture unit and two streaming multiprocessors. A multiprocessor includes a front end, which reads and decodes instructions and dispatches them for execution, and a back end, consisting of eight computing devices and two super function units (SFU, Super Function Unit), where instructions are executed according to the SIMD principle: one instruction is applied to all threads in the warp. nVidia calls this way of execution SIMT (single instruction, multiple threads). It is important to note that the back end runs at twice the frequency of the front end. In practice, this means the back end looks twice as "wide" as it actually is (i.e. like a 16-channel SIMD unit instead of an eight-channel one). Streaming multiprocessors work like this: each cycle, the front end selects a warp ready for execution and starts executing an instruction. For the instruction to be applied to all 32 threads in the warp, the back end would need four clock cycles, but since it runs at twice the frequency of the front end, only two cycles are needed (in terms of the front end). Therefore, so that the front end does not sit idle for a cycle and the hardware stays maximally loaded, ideally instructions can be alternated every cycle: a classic instruction on one cycle and an SFU instruction on the next.

Each multiprocessor has a certain set of resources worth understanding. There is a small area of memory called shared memory, 16 KB per multiprocessor. This is by no means a cache: the programmer can use it at his discretion. That is, we have something close to the Local Store of the SPUs on Cell processors. This detail is quite interesting, as it emphasizes that CUDA is a combination of software and hardware technologies. This memory area is not used by pixel shaders, as nVidia wryly puts it: "we don't like it when pixels talk to each other".

This memory area opens up the possibility of exchanging information between threads within one block. The limitation is important: all threads of a block are guaranteed to be executed by a single multiprocessor. By contrast, the assignment of blocks to multiprocessors is not specified at all, and two threads from different blocks cannot exchange information with each other during execution. So using shared memory is not so simple. Still, shared memory is justified in all cases except when several threads try to access the same memory bank, causing a conflict; in other situations, access to shared memory is as fast as access to registers.

Shared memory is not the only memory the multiprocessors can access. They can also use video memory, but with lower bandwidth and higher latency. Therefore, to reduce how often this memory is accessed, nVidia equipped the multiprocessors with a cache (about 8 KB per multiprocessor) that stores constants and textures.

A multiprocessor has 8,192 registers, common to all threads of all blocks active on it. The number of active blocks per multiprocessor cannot exceed eight, and the number of active warps is limited to 24 (768 threads). The 8800 GTX can therefore process up to 12,288 threads at a time. All these limits are worth knowing, because they let you optimize an algorithm based on the available resources.

Optimizing a CUDA program therefore consists in finding the optimal balance between the number of blocks and their size. More threads per block helps hide memory latency, but reduces the number of registers available per thread. Moreover, a block of 512 threads would be inefficient, since only one such block could be active on a multiprocessor, wasting the capacity of 256 threads. Therefore, nVidia recommends blocks of 128 or 256 threads, which gives the best compromise between latency hiding and register count for most kernels.

From a programming point of view, CUDA consists of a set of extensions to the C language, reminiscent of BrookGPU, plus a few specific API calls. Among the extensions are type qualifiers applied to functions and variables. The key word here is __global__, which, placed before a function, indicates that the function is a kernel: it will be called by the CPU and executed on the GPU. The __device__ qualifier marks a function that will be executed on the GPU (which CUDA calls the "device") but can only be called from the GPU (in other words, from another __device__ function or from a __global__ function). Finally, the optional __host__ qualifier denotes a function that is called by the CPU and executed by the CPU, in other words, a regular function.

There are a number of restrictions on __device__ and __global__ functions: they cannot be recursive (that is, call themselves) and they cannot have a variable number of arguments. Also, since __device__ functions reside in GPU memory space, it makes sense that their address cannot be taken. Variables also have qualifiers that indicate where they will be stored: a variable prefixed with __shared__ is stored in the streaming multiprocessor's shared memory. Calling a __global__ function is slightly different: you must specify an execution configuration, namely the size of the grid to which the kernel will be applied and the size of each block. Take, for example, a kernel with the following signature.

__global__ void Func(float* parameter);

It will be called as

Func<<< Dg, Db >>> (parameter);

where Dg is the grid size and Db is the block size. These two variables refer to the new vector type introduced with CUDA.
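For illustration, launching this kernel over a 256 x 256 grid of 16 x 16 blocks could look like the fragment below (dim3 is the vector type in question; the sizes are arbitrary and the parameter is assumed to point to memory already allocated on the device):

    dim3 Dg(256, 256);            // the grid: 256 x 256 blocks
    dim3 Db(16, 16);              // each block: 16 x 16 = 256 threads
    Func<<< Dg, Db >>>(devPtr);   // devPtr: a float* previously allocated in video memory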

The CUDA API also contains functions for working with video memory: cudaMalloc to allocate memory, cudaFree to release it, and cudaMemcpy to copy data between RAM and VRAM in either direction.
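A brief sketch of how these calls fit together, including the copy-direction constants and the error codes the runtime returns (the buffer names and size are ours):

    #include <stdio.h>
    #include <cuda_runtime.h>

    void copyRoundTrip(void)
    {
        const size_t bytes = 1024 * sizeof(float);
        float hostBuf[1024] = {0};
        float *devBuf = 0;

        cudaError_t err = cudaMalloc((void**)&devBuf, bytes);       // allocate in VRAM
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return;
        }

        cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice); // RAM -> VRAM
        cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost); // VRAM -> RAM
        cudaFree(devBuf);                                           // release the VRAM block
    }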

We will end this review with a very interesting way in which a CUDA program is compiled: compilation is done in several steps. First, the CPU-specific code is extracted and passed to the standard compiler. Code destined for the GPU is first converted to the PTX intermediate language. It is similar to assembly language and allows you to study the code in search of potential inefficient sections. Finally, the last phase is to translate the intermediate language into GPU specific instructions and create the binary.

Reading through the nVidia documentation makes you want to try CUDA for yourself. Indeed, what could be better than evaluating an API by writing your own program? That is when most of the problems surface, even if everything looks perfect on paper. Practice is also the best test of how well you have understood the principles laid out in the CUDA documentation.

Diving into such a project is quite easy. A large number of free but high-quality tools are available for download today. For our test we used Visual C++ Express 2005, which has everything needed. The hardest part was finding a program that would not take weeks to port to the GPU, yet was interesting enough that our efforts would not be wasted. In the end, we chose a piece of code that takes a height map and computes the corresponding normal map. We will not go into the details of the function, since it is hardly interesting here. In short, the program deals with the curvature of areas: for each pixel of the source image, a matrix is applied that determines the color of the resulting pixel in the generated image from the adjacent pixels, using a more or less complex formula. The advantage of this function is that it is very easy to parallelize, so the test shows the possibilities of CUDA well.


Another advantage is that we already have an implementation on the CPU, so we can compare its result with the CUDA version - and not reinvent the wheel.

We repeat once again that the purpose of the test was to get acquainted with the CUDA SDK tools, not to compare the CPU and GPU versions. Since this was our first attempt at a CUDA program, we did not expect high performance. And since this part of the code is not critical, the CPU version was not optimized, so a direct comparison of the results is of limited interest.

Performance

However, we measured execution time to see whether there is an advantage to using CUDA even with the roughest implementation, or whether long and tedious practice is needed to get any gain from the GPU. The test machine was taken from our development lab: a laptop with a Core 2 Duo T5450 processor and a GeForce 8600M GT graphics card running Vista. This is far from a supercomputer, but the results are interesting precisely because the test is not "tuned" for the GPU. It is always nice to see nVidia show huge gains on systems with monstrous GPUs and plenty of bandwidth, but in practice many of the 70 million CUDA-capable GPUs in today's PC market are nowhere near as powerful, so our test is justified.

For a 2048 x 2048 pixel image, we got the following results.

  • CPU, 1 thread: 1419 ms;
  • CPU, 2 threads: 749 ms;
  • CPU, 4 threads: 593 ms;
  • GPU (8600M GT), blocks of 256 threads: 109 ms;
  • GPU (8600M GT), blocks of 128 threads: 94 ms;
  • GPU (8800 GTX), blocks of 128 or 256 threads: 31 ms.

Several conclusions can be drawn from these results. Let's start with the fact that, despite the talk about the supposed laziness of programmers, we did modify the initial CPU version to use several threads. As mentioned, the code is ideal for this situation: all that is required is to split the source image into as many zones as there are threads. Note that going from one thread to two on our dual-core CPU gives nearly linear acceleration, which also points to the parallel nature of the test program. Quite unexpectedly, the four-thread version was also faster, although that is very strange on our processor: one would rather expect a drop in efficiency from the overhead of managing extra threads. How can this be explained? It is hard to say, but the Windows thread scheduler is probably responsible; in any case, the result is repeatable. With smaller textures (512x512) the gain from splitting was less pronounced (about 35% versus 100%), and the four-thread version behaved more logically, with no gain over the two-thread version. The GPU was still faster, but less dramatically so (the 8600M GT was three times faster than the two-thread version).




The second significant observation is that even the slowest GPU implementation turned out to be almost six times faster than the best CPU version. For a first program and an unoptimized algorithm, the result is very encouraging. Note also that we got a noticeably better result with smaller blocks, although intuition might suggest otherwise. The explanation is simple: our program uses 14 registers per thread, so 256-thread blocks need 3,584 registers per block, and 768 threads are required for full multiprocessor load, as shown above. In our case that is three blocks, or 10,752 registers. But a multiprocessor has only 8,192 registers, so it can keep only two such blocks active. With 128-thread blocks, in contrast, we need 1,792 registers per block; dividing 8,192 by 1,792 and rounding down, we get four blocks. In practice the number of threads is the same (512 per multiprocessor, although 768 is theoretically needed for full load), but more blocks give the GPU extra flexibility in memory access: when an operation with long latency is in flight, it can start executing instructions from another block while waiting for the results. Four blocks clearly reduce latency, especially since our program makes multiple memory accesses.

Analysis

Finally, despite what we said above, we could not resist the temptation and ran the program on an 8800 GTX, which was three times faster than the 8600, regardless of block size. You might think that in practice, given the respective architectures, the ratio would be four times or more: 128 ALUs/shader processors versus 32, and higher clock speeds (1.35 GHz versus 950 MHz), but it did not work out that way. Most likely, the limiting factor was memory access. To be more precise, the source image is accessed as a multidimensional CUDA array, a rather convoluted term for what is nothing more than a texture. This does bring a few benefits:

  • accesses benefit from the texture cache;
  • we are using wrapping mode, which does not need to handle image borders, unlike the CPU version.

We could also benefit from "free" filtering with normalized addressing (coordinates between 0 and 1 instead of pixel positions), but this is hardly useful in our case. As you know, the 8600 has 16 texture units compared to 32 on the 8800 GTX, so the ratio between the two architectures is only two to one. Add the frequency difference and we get (32 x 0.575) / (16 x 0.475) = 2.4, close to the "three to one" we actually measured. This theory also explains why the block size changes little on the G80: the ALUs are still limited by the texture units.




Besides the promising results, our first encounter with CUDA went very well, given the not-so-favorable conditions chosen. Developing on a Vista laptop means using CUDA SDK 2.0, still in beta, with driver 174.55, also in beta. Despite this, we cannot report any unpleasant surprises, only the usual initial errors during the first debugging session, when our still very buggy program tried to address memory outside the allocated space.

The monitor started flickering wildly, then the screen went black... until Vista ran the driver recovery service and everything was fine again. Still, it is somewhat startling if you are used to seeing an ordinary Segmentation Fault in standard programs like ours. Finally, a small criticism of nVidia: in all the documentation available for CUDA, there is no small guide that walks you step by step through setting up a development environment for Visual Studio. It is not a big problem, since the SDK is full of examples that can be studied to understand the skeleton of CUDA applications, but a beginner's guide would be nice.




Nvidia introduced CUDA with the release of the GeForce 8800. At the time the promises looked very tempting, but we held our enthusiasm until the real test. Indeed, at the time it looked more like marking out territory to stay on the GPGPU wave. Without an available SDK, it is hard to say whether you are looking at yet another marketing dummy that will never work. It would not be the first time a good initiative was announced too early and never materialized for lack of support, especially in such a competitive sector. Now, a year and a half after the announcement, we can safely say that nVidia has kept its word.

The SDK quickly reached beta in early 2007 and has been updated rapidly ever since, proving the importance of the project to nVidia. Today CUDA is developing very nicely: the SDK is available in beta version 2.0 for the major operating systems (Windows XP and Vista, Linux, plus 1.1 for Mac OS X), and nVidia has dedicated an entire section of its site to developers.

On a more professional level, the impression from the first steps with CUDA was very positive. Even if you are familiar with GPU architecture, you can figure it out easily. When an API looks clear at first glance, you quickly begin to believe you will get convincing results. But won't computing time be eaten up by the many transfers between CPU and GPU? And how do you use these thousands of threads with almost no synchronization primitives? We began our experiments with all these fears in mind. They quickly dissipated when the first version of our algorithm, trivial as it was, turned out to be significantly faster than on the CPU.

So CUDA is not just a lifeline for researchers who want to convince university officials to buy them a GeForce. CUDA is an already fully available technology that any C programmer can use, provided they are willing to invest the time and effort to get used to the new programming paradigm. The effort will not be wasted if your algorithms parallelize well. We would also like to thank nVidia for providing complete, high-quality documentation in which novice CUDA programmers will find answers.

What does CUDA need to become a recognized API? In one word: portability. We know the future of IT lies in parallel computing; everyone is already preparing for the change, and all initiatives, both software and hardware, point in that direction. At the moment, however, the development of the paradigms is still at an early stage: we create threads manually and try to schedule access to shared resources, all of which is manageable as long as the number of cores can be counted on the fingers of one hand. But in a few years, when processors number in the hundreds, that will no longer be possible. With the release of CUDA, nVidia has taken the first step towards solving this problem, but of course the solution suits only this company's GPUs, and not even all of them: only the GeForce 8 and 9 series (and their Quadro/Tesla derivatives) can run CUDA programs today, plus the new 260/280 line, of course.




Nvidia may boast that it has sold 70 million CUDA-compatible GPUs worldwide, but that is still not enough to become the de facto standard, especially as competitors are not sitting idly by. AMD offers its own SDK (Stream Computing), and Intel has announced a solution (Ct), although it is not yet available. A standards war is coming, and there is clearly no room in the market for three competitors, unless another player such as Microsoft comes out with a common API proposal, which would of course make life easier for developers.

So nVidia has plenty of difficulties ahead on the road to getting CUDA adopted. Technologically we undoubtedly have a successful solution before us, but it remains to convince developers of its prospects, and that will not be easy. However, judging by the many recent announcements and news around the API, the future does not look bleak at all.

CUDA Technology

Vladimir Frolov, [email protected]

Annotation

The article discusses CUDA technology, which allows a programmer to use video cards as powerful computing units. The tools provided by Nvidia make it possible to write programs for the graphics processing unit (GPU) in a subset of the C++ language, freeing the programmer from using shaders and understanding how the graphics pipeline works. The article provides examples of programming with CUDA and various optimization techniques.

1. Introduction

The development of computing technologies over the past decades has been at a rapid pace. So fast that processor developers have almost come to the so-called "silicon dead end". The unbridled growth of the clock frequency became impossible due to a number of serious technological reasons.

This is partly why all manufacturers of modern computing systems are moving towards increasing the number of processors and cores, rather than increasing the frequency of one processor. The number of central processing unit (CPU) cores in advanced systems is already 8.

Another reason is the relatively low speed of the RAM. No matter how fast the processor works, the bottlenecks, as practice shows, are not arithmetic operations at all, but unsuccessful memory accesses - cache misses.

GPUs (Graphics Processing Units), however, went down the path of parallelism much earlier. Today's video cards, such as the GF8800GTX, can have up to 128 processors. The performance of such systems, if programmed skillfully, can be quite significant (Fig. 1).

Fig. 1. Number of floating point operations for CPU and GPU

When the first video cards first appeared on the market, they were fairly simple (compared to the central processor), highly specialized devices designed to take the burden of visualizing two-dimensional data from the processor. With the development of the gaming industry and the emergence of such three-dimensional games as Doom (Fig. 2) and Wolfenstein 3D (Fig. 3), a need arose for 3D visualization.

Figures 2, 3. The games Doom and Wolfenstein 3D

Since the creation of the first Voodoo video cards by 3Dfx (1996) and until 2001, only a fixed set of operations on input data was implemented in the GPU.

Programmers had no choice in the rendering algorithm, and to increase flexibility, shaders appeared - small programs that are executed by the video card for each vertex or for each pixel. Their tasks included transformations over the vertices and shading - the calculation of illumination at a point, for example, according to the Phong model.

Although shaders are currently very advanced, it should be understood that they were developed for the highly specialized tasks of 3D transformation and rasterization. While GPUs are evolving towards general-purpose multiprocessor systems, shader languages remain highly specialized.

They can be compared to the FORTRAN language in the sense that, like FORTRAN, they were the first, but designed to solve only one type of problem. Shaders are of little use for solving problems other than 3D transformations and rasterization, just as FORTRAN is not suitable for solving problems not related to numerical calculations.

Today there is a trend of non-traditional use of video cards for solving problems in the areas of quantum mechanics, artificial intelligence, physical calculations, cryptography, physically correct visualization, reconstruction from photographs, recognition, etc. These tasks are inconvenient to solve within the graphics APIs (DirectX, OpenGL), since these APIs were created for completely different applications.

The development of General Programming on GPU (GPGPU) logically led to the emergence of technologies aimed at a wider range of tasks than rasterization. As a result, Nvidia created the Compute Unified Device Architecture (or CUDA for short) technology, and rival ATI created the STREAM technology.

It should be noted that at the time of writing this article, STREAM technology was far behind CUDA in development, and therefore it will not be considered here. We will focus on CUDA, a GPGPU technology that allows you to write programs in a subset of the C++ language.

2. Fundamental difference between CPU and GPU

Let us briefly consider some of the significant differences between the areas and features of the applications of the central processor and the video card.

2.1. Capabilities

The CPU is originally adapted for solving general tasks and works with arbitrarily addressable memory. Programs on the CPU can directly access any cells of linear and homogeneous memory.

For GPUs, this is not the case. As you will learn after reading this article, there are as many as 6 types of memory in CUDA. You can read from any cell that is physically accessible, but not write to all cells. The reason is that the GPU is in any case a specific device designed for specific purposes. This restriction was introduced in order to increase the speed of certain algorithms and reduce the cost of equipment.

2.2. Memory performance

The age-old problem of most computing systems is that memory is slower than the processor. CPU manufacturers solve it by introducing caches: the most frequently used areas of memory are placed in cache memory running at or near the processor frequency. This saves time when accessing the most frequently used data and lets the processor spend its cycles on the actual calculations.

Note that caches are actually transparent to the programmer. Both when reading and when writing, data does not immediately go to RAM, but passes through caches. This allows, in particular, to quickly read some value immediately after writing.

On the GPU (here we mean GeForce 8 series video cards), there are also caches, and they are also important, but this mechanism is not as powerful as on the CPU. Firstly, not all types of memory are cached, and secondly, the caches are read-only.

On the GPU, slow memory accesses are hidden using parallel computing. While some tasks are waiting for data, others are working, ready for calculations. This is one of the main principles of CUDA, which allows you to greatly improve the performance of the system as a whole.

3. CUDA core

3.1. Streaming model

The computing architecture of CUDA is based on the concept of single instruction, multiple data (SIMD) and the concept of a multiprocessor.

The concept of SIMD implies that one instruction allows you to process many data at the same time. For example, the addps command on the Pentium 3 and newer Pentiums allows you to add 4 single precision floating point numbers at the same time.
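For comparison, here is a hedged sketch of the same idea on the CPU side using SSE intrinsics (the intrinsic _mm_add_ps compiles to the addps instruction; the function and variable names are illustrative):

#include <xmmintrin.h>

// Add two arrays of 4 floats with a single SIMD instruction (addps).
void add4(const float* a, const float* b, float* result)
{
    __m128 va = _mm_loadu_ps(a);      // load 4 single-precision floats
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);   // one instruction adds all 4 pairs
    _mm_storeu_ps(result, vr);        // store the 4 results
}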

A multiprocessor is a multi-core SIMD processor that allows only one instruction to be executed on all cores at any given time. Each multiprocessor core is scalar, i.e. it does not support vector operations in its purest form.

Before continuing, let's introduce a couple of definitions. Note that the device and host in this article will not be understood at all as most programmers are used to. We will use such terms in order to avoid discrepancies with the CUDA documentation.

By device we will mean a video adapter that supports the CUDA driver, or another specialized device designed to execute programs using CUDA (such as, for example, NVIDIA Tesla). In this article we consider the GPU only as a logical device, avoiding specific implementation details.

By host we will mean the program in ordinary computer RAM that uses the CPU and performs control functions for working with the device.

In effect, the part of your program that runs on the CPU is the host, and your video card is the device. Logically, the device can be thought of as a set of multiprocessors (Fig. 4) plus the CUDA driver.

Fig. 4. Device

Suppose that we want to run a certain procedure on our device in N threads (that is, we want to parallelize its work). According to the CUDA documentation, let's call this procedure the kernel.

A feature of the CUDA architecture is a block-grid organization, which is unusual for multi-threaded applications (Fig. 5). At the same time, the CUDA driver independently distributes device resources between threads.

Fig. 5. Organization of threads

In Fig. 5 the kernel is labeled Kernel. All threads executing this kernel are combined into blocks (Block), and the blocks, in turn, are combined into a grid (Grid).

As seen in Fig. 5, two-dimensional indices are used to identify threads. The CUDA developers provided the ability to work with three-dimensional, two-dimensional or simple (one-dimensional) indices, depending on what is more convenient for the programmer.

In general, indices are three-dimensional vectors. For each thread, the following are known: the thread index within its block (threadIdx) and the index of its block within the grid (blockIdx). At startup, all threads differ only in these indices. In fact, it is through these indices that the programmer exercises control, determining which part of the data is processed in each thread.
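As a hedged illustration (the kernel and variable names below are not from the original text), this is how these indices are typically turned into a global data index:

// Each thread computes its global index from blockIdx and threadIdx
// and processes exactly one array element.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against surplus threads
        data[i] *= factor;
}

// Host-side launch: n threads split into blocks of 256 (illustrative size).
// scale<<<(n + 255) / 256, 256>>>(dev_data, 2.0f, n);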

The answer to the question of why the developers chose such an organization is not trivial. One reason is that one block is guaranteed to execute on a single multiprocessor of the device, while one multiprocessor can execute several different blocks. Other reasons will become clear later in the article.

A block of tasks (threads) is executed on the multiprocessor in parts, or pools, called warps. The current warp size in video cards with CUDA support is 32 threads. Tasks inside the warp pool are executed in SIMD style, i.e. all threads within a warp can execute only one instruction at a time.

One caveat should be made here. In the architectures current at the time of this writing, the number of processors inside one multiprocessor is 8, not 32. It follows that the entire warp is not executed at once; it is divided into 4 parts that are executed sequentially (because the processors are scalar).

But, firstly, the CUDA developers do not strictly fix the warp size; in their documents they speak of the warp size parameter, not of the number 32. Secondly, from a logical point of view it is the warp that is the minimum grouping of threads about which we can say that all threads inside it execute simultaneously - and no such assumption can be made about the rest of the system.

3.1.1. Branching

The question immediately arises: if at the same time all the threads inside the warp execute the same instruction, then what about branches? After all, if branching occurs in the program code, then the instructions will already be different. Here, a standard solution for SIMD programming is used (Fig. 6).

Fig. 6. Organization of branching in SIMD

Let's have the following code:

A;
if (cond) B; else C;
D;

In the case of SISD (Single Instruction Single Data), we execute statement A, check the condition, then execute statement B (if the condition is true) or statement C (otherwise), and finally statement D.

Now let's say we have 10 threads executing in SIMD style. In all 10 threads, we execute statement A, then we check the cond condition and it turns out that in 9 out of 10 threads it is true, and in one thread it is false.

It is clear that we cannot send 9 threads to execute statement B and the one remaining thread to execute statement C, because only one instruction can be executed in all threads simultaneously. In this case we have to do the following: first we temporarily "mask off" the single divergent thread so that it does not corrupt anyone's data, and execute the 9 remaining threads with statement B. Then we mask off those 9 threads and run the one remaining thread with statement C. After that the threads merge again and all execute statement D at the same time.

The result is rather sad: not only are processor resources wasted on idling in the masked threads, but, much worse, we end up having to execute BOTH branches.

However, not everything is as bad as it might seem at first glance. A very big advantage of the technology is that these tricks are performed dynamically by the CUDA driver and they are completely transparent to the programmer. At the same time, when dealing with SSE instructions of modern CPUs (namely, in the case of trying to execute 4 copies of the algorithm at the same time), the programmer himself must take care of the details: combine data by quadruples, do not forget about alignment, and generally write at a low level, in fact, like in assembler.

One very important conclusion follows from the above. Branches do not cause performance degradation by themselves. Only branches where threads diverge within the same warp are harmful. If the threads diverge within the same block but in different warps, or within different blocks, this has no effect at all.
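A hedged illustration of the difference (the kernel name and data layout are made up for this example), assuming a warp size of 32:

__global__ void branching_example(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: the condition depends on the thread's position inside the
    // warp, so every warp executes both branches one after the other.
    if (threadIdx.x % 2 == 0) data[i] += 1.0f;
    else                      data[i] -= 1.0f;

    // Not divergent: threadIdx.x / 32 is the warp index inside the block,
    // so all 32 threads of any given warp take the same branch.
    if ((threadIdx.x / 32) % 2 == 0) data[i] *= 2.0f;
    else                             data[i] *= 0.5f;
}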

3.1.2. Communication between threads

At the time of writing this article, any interaction between threads (synchronization and data exchange) was possible only within a block. That is, it is impossible to organize interaction between threads of different blocks using only documented features.

As for undocumented features, it is highly discouraged to use them. The reason for this is that they rely on the specific hardware features of a particular system.

Synchronization of all tasks within a block is performed by calling the __syncthreads() function. Data exchange is possible through shared memory, since it is common to all tasks within the block.
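A minimal sketch of such an exchange (the kernel is illustrative, with 256 threads per block assumed):

// Each thread writes its value to shared memory, the block synchronizes,
// and each thread then reads its neighbour's value - an exchange that is
// only possible between threads of one block.
__global__ void neighbour_exchange(const float* in, float* out)
{
    __shared__ float buf[256];                    // one array per block

    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];

    __syncthreads();                              // all writes to buf are now visible

    int neighbour = (tid + 1) % blockDim.x;
    out[blockIdx.x * blockDim.x + tid] = buf[neighbour];
}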

3.2. Memory

There are six types of memory in CUDA (Fig. 7). These are registers, local, global, shared, constant and texture memory.

Such an abundance is due to the specifics of the video card and its primary purpose, as well as the desire of developers to make the system as cheap as possible, sacrificing either versatility or speed in various cases.

Fig. 7. Types of memory in CUDA

3.2.0. Registers

Whenever possible, the compiler tries to place all local variables of functions in registers. Access to such variables is carried out with maximum speed. In the current architecture, 8192 32-bit registers are available per multiprocessor. In order to determine how many registers are available to one thread, this number (8192) must be divided by the block size (the number of threads in it).

With a typical split of 64 threads per block, there are only 128 registers (there are some objective criteria, but 64 is suitable on average for many tasks). Realistically, nvcc will never allocate 128 registers. Usually it does not give more than 40, and the rest of the variables will go into local memory. This happens because several blocks can be executed on one multiprocessor. The compiler tries to maximize the number of concurrently running blocks. For greater efficiency, one should try to occupy less than 32 registers. Then, theoretically, 4 blocks (8 warps, if 64 threads in one block) can be launched on one multiprocessor. However, the amount of shared memory occupied by threads must also be taken into account here, since if one block occupies the entire shared memory, two such blocks cannot be executed simultaneously on a multiprocessor.

3.2.1. Local memory

In cases where the local data of procedures is too large, or the compiler cannot calculate some constant step for them when they are accessed, it can place them in local memory. This can be facilitated, for example, by casting pointers for types of different sizes.

Physically, local memory is analogous to global memory, and works at the same speed. At the time of this writing, there were no mechanisms to explicitly prohibit the compiler from using local memory for specific variables. Since it is quite difficult to control local memory, it is better not to use it at all (see Section 4 “Optimization Recommendations”).

3.2.2. Global memory

The CUDA documentation cites arbitrary addressing of global memory as one of the main achievements of the technology: you can read from any memory cell, and you can also write to an arbitrary cell (this is usually not the case on a GPU).

However, this versatility is paid for with speed. Global memory is not cached and works very slowly, so the number of global memory accesses should be minimized in any case.

Global memory is needed mainly to save the results of the program before sending them to the host (into ordinary DRAM memory). The reason is that global memory is the only kind of memory whose contents the device can write and the host can then read back.

Variables declared at file scope with the __device__ qualifier are placed in global memory (the __global__ qualifier is reserved for kernel functions). Global memory can also be allocated dynamically by calling the cudaMalloc(void** mem, size_t size) function on the host. This function cannot be called from the device, so memory allocation must be handled by the host program running on the CPU. Data from the host can be sent to the device by calling the cudaMemcpy function:

cudaMemcpy(void* gpu_mem, void* cpu_mem, int size, cudaMemcpyHostToDevice);

In the same way, you can do the reverse procedure:

cudaMemcpy(void* cpu_mem, void* gpu_mem, int size, cudaMemcpyDeviceToHost);

This call is also made from the host.
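Putting the calls together, a minimal host-side round trip (inside some host function) might look like this; all names here are illustrative:

const int N = 1024;
size_t size = N * sizeof(float);

float* cpu_mem = (float*)malloc(size);
float* gpu_mem = NULL;

cudaMalloc((void**)&gpu_mem, size);                          // allocate global memory on the device

cudaMemcpy(gpu_mem, cpu_mem, size, cudaMemcpyHostToDevice);  // host -> device

// my_kernel<<<N / 256, 256>>>(gpu_mem);                     // launch some kernel (illustrative)

cudaMemcpy(cpu_mem, gpu_mem, size, cudaMemcpyDeviceToHost);  // device -> host

cudaFree(gpu_mem);                                           // release device memory
free(cpu_mem);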

When working with global memory it is important to follow the coalescing rule. The main idea is that threads should access consecutive memory cells in chunks of 4, 8 or 16 bytes, and the very first thread must access an address aligned to a 4-, 8- or 16-byte boundary respectively. The addresses returned by cudaMalloc are aligned to at least a 256-byte boundary.
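A hedged contrast of the two access patterns (both kernels are illustrative):

__global__ void copy_coalesced(const float* in, float* out)
{
    // Neighbouring threads read neighbouring 4-byte words: the accesses
    // of a half-warp merge into a few wide memory transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void copy_strided(const float* in, float* out)
{
    // Neighbouring threads read cells 64 bytes apart: the accesses cannot
    // be coalesced and each one becomes a separate transaction.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 16;
    out[i] = in[i];
}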

3.2.3. Shared memory

Shared memory is uncached but fast memory. It is recommended to use it as a managed cache. Only 16 KB of shared memory is available per multiprocessor. Dividing this number by the number of tasks in a block, we get the maximum amount of shared memory available per thread (if it is to be used independently in all threads).

The hallmark of shared memory is that it is addressed the same for all tasks within a block (Figure 7). It follows that it can be used to exchange data between threads of only one block.

It is guaranteed that the contents of shared memory are preserved for as long as the block is executing on the multiprocessor. However, once the block running on a multiprocessor has been replaced, it is not guaranteed that the contents left by the old block are preserved. Therefore you should not try to synchronize tasks between blocks by leaving data in shared memory and hoping for its safety.

Variables declared with the __shared__ qualifier are placed in shared memory.

__shared__ float mem_shared[BLOCK_SIZE]; // BLOCK_SIZE - number of threads in the block (illustrative name)

It should be emphasized again that there is only one shared memory for a block. Therefore, if you want to use it just as a managed cache, you should refer to different elements of the array, for example, like this:

float x = mem_shared[threadIdx.x];

Where threadIdx.x is the x index of the thread inside the block.

3.2.4. Constant memory

The constant memory is cached, as shown in Fig. 4. The cache exists in a single instance for one multiprocessor, which means that it is common for all tasks within the block. On the host, you can write something to constant memory by calling the cudaMemcpyToSymbol function. From the device, constant memory is read-only.

Constant memory is very convenient to use. You can place data of any type in it and read it with a simple assignment.

#define N 100

__constant__ int gpu_buffer[N];

void host_function()
{
    int cpu_buffer[N];
    cudaMemcpyToSymbol(gpu_buffer, cpu_buffer, sizeof(int) * N);
}

// __global__ means that device_kernel is a kernel that can be run on the GPU
__global__ void device_kernel()
{
    int a = gpu_buffer[0];
    int b = gpu_buffer[0] + gpu_buffer[N - 1];

    // gpu_buffer[0] = a; ERROR! constant memory is read-only
}

Since a cache is used for constant memory, access to it is generally quite fast. The only, but very significant, drawback of constant memory is that its size is only 64 KB (for the entire device). It follows that it makes sense to store only a small amount of frequently used data in constant memory.

3.2.5. Texture memory

Texture memory is cached (Fig. 4). There is only one cache for each multiprocessor, which means that this cache is shared by all tasks within the block.

The name of texture memory (and, unfortunately, functionality) is inherited from the concepts "texture" and "texturing". Texturing is the process of applying a texture (just a picture) to a polygon during rasterization. Texture memory is optimized for 2D data sampling and has the following features:

    fast fetching of fixed-size values (byte, word, double or quad word) from a one-dimensional or two-dimensional array;

    normalized addressing with floating-point coordinates in the interval [0, 1). Values stored as low-precision integers can then be fetched using normalized addressing; the resulting value will be a float4 word mapped to the interval [0, 1].

An example of working with texture memory (the texture reference declaration is added here so that the fragment compiles):

// texture reference declared at file scope
texture<uint4, 1, cudaReadModeElementType> tex;

uint4* gpu_memory = NULL;
cudaMalloc((void**)&gpu_memory, N * sizeof(uint4));       // allocate memory on the GPU

// set the texture parameters
tex.addressMode[0] = cudaAddressModeWrap;                 // Wrap mode
tex.addressMode[1] = cudaAddressModeWrap;
tex.filterMode = cudaFilterModePoint;                     // nearest value
tex.normalized = false;                                   // do not use normalized addressing

cudaBindTexture(0, tex, gpu_memory, N * sizeof(uint4));   // from now on this memory is treated as texture memory

cudaMemcpy(gpu_memory, cpu_buffer, N * sizeof(uint4), cudaMemcpyHostToDevice); // copy data to the GPU

// __global__ means that device_kernel is the kernel to be parallelized
__global__ void device_kernel()
{
    uint4 a = tex1Dfetch(tex, 0);   // data can only be fetched this way!
    uint4 b = tex1Dfetch(tex, 1);
    int c = a.x * b.y;
    ...
}

3.3. Simple example

As a simple example, consider the cppIntegration program from the CUDA SDK. It demonstrates how to work with CUDA and how to use nvcc (Nvidia's compiler for its subset of C++) in combination with MS Visual Studio, which greatly simplifies the development of CUDA programs.
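The sample itself ships with the SDK and is not reproduced here; as a hedged stand-in, a complete program in the same spirit - a .cu file compiled by nvcc and driven by ordinary host code - can be as small as this (all names are illustrative):

// minimal.cu - build with:  nvcc minimal.cu -o minimal
#include <cstdio>

__global__ void add_one(int* data)
{
    data[threadIdx.x] += 1;            // each thread increments its own element
}

int main()
{
    const int N = 8;
    int host_data[N] = {0, 1, 2, 3, 4, 5, 6, 7};

    int* dev_data = NULL;
    cudaMalloc((void**)&dev_data, N * sizeof(int));
    cudaMemcpy(dev_data, host_data, N * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<1, N>>>(dev_data);       // one block of N threads

    cudaMemcpy(host_data, dev_data, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);

    for (int i = 0; i < N; i++)
        printf("%d ", host_data[i]);   // expected output: 1 2 3 4 5 6 7 8
    return 0;
}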

4. Optimization recommendations

4.1. Break down your task correctly

Not all tasks are suited to SIMD architectures. If your task is not suited to it, it may not be worth using the GPU at all. But if you are determined to use the GPU, try to break the algorithm into parts that can be executed efficiently in the SIMD style. If necessary, change the algorithm used to solve your problem, or invent a new one that maps well onto SIMD. A typical example of a task that suits the GPU well is pyramidal addition of array elements; a sketch is given below.
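A hedged sketch of pyramidal (tree-like) addition within one block (the kernel and buffer size are illustrative, not from the original text):

// Pyramidal summation of one block's elements in shared memory.
__global__ void block_sum(const float* in, float* block_results)
{
    __shared__ float buf[256];                   // 256 threads per block assumed

    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // At every step half of the active threads add a neighbour's value to
    // their own, halving the number of partial sums.
    for (int step = blockDim.x / 2; step > 0; step /= 2)
    {
        if (tid < step)
            buf[tid] += buf[tid + step];
        __syncthreads();
    }

    if (tid == 0)
        block_results[blockIdx.x] = buf[0];      // partial sum of this block
}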

4.2. Selecting the type of memory

Put your data in texture or constant memory if all tasks in the same block access the same memory location or closely spaced locations. Two-dimensional data can be processed efficiently using the tex1Dfetch and tex2D functions; texture memory is specially optimized for 2D sampling.

Use global memory in combination with shared memory if all tasks access different, widely spaced areas of memory randomly (with very different addresses or coordinates, if it is 2D/3D data). A typical scheme:

global memory => shared memory

__syncthreads();

process the data in shared memory

__syncthreads();

shared memory => global memory

4.3. Turn on the memory counters

The --ptxas-options=-v compiler flag makes the compiler report exactly how much memory of each kind (registers, shared, local, constant) you are using. If the compiler resorts to local memory, you will know about it for certain. Analyzing the amounts and types of memory used can greatly help you optimize the program.
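With the usual SDK setup the flag is simply added to the nvcc command line, for example (the file name is illustrative):

nvcc --ptxas-options=-v my_kernel.cu

For each kernel, ptxas then reports the number of registers and the amount of shared, constant and local memory it uses.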

4.4. Try to minimize the use of registers and shared memory

The more registers or shared memory a kernel uses, the fewer threads (or rather warps) can run simultaneously on a multiprocessor, because multiprocessor resources are limited. Therefore a small increase in register or shared memory usage can in some cases cut performance in half - precisely because exactly half as many warps are now executing simultaneously on a multiprocessor.

4.5. Shared memory instead of local

If the Nvidia compiler has for some reason placed data in local memory (this is usually noticeable as a very sharp drop in performance in places where nothing resource-intensive happens), find out exactly which data ended up in local memory and move it into shared memory.

Often the compiler places a variable in local memory when it is not used frequently. A typical example is an accumulator in which you build up a value computed in a loop. If the loop body is large in code size (but not in execution time!), the compiler may put your accumulator in local memory, because it is used relatively rarely and there are few registers. The performance loss in such a case can be noticeable.

If you really use a variable rarely, it is better to place it explicitly in global memory (with the __device__ qualifier).

Although it may seem convenient to let the compiler automatically put such variables in local memory, it really isn't. With later modifications of the program it will not be easy to find the bottleneck if the variable starts to be used more often; the compiler may or may not move it into registers. When the qualifier is written explicitly, the programmer is far more likely to pay attention to it.

4.6. Loop unrolling

Loop unrolling is a standard performance trick in many systems. Its essence is to perform more work on each iteration, thereby reducing the total number of iterations and hence the number of conditional jumps the processor has to perform.

Here is how you can unroll the loop that sums an array (of integers, for example):

int a[N];
int summ = 0;
for (int i = 0; i < N; i += 4)
    summ += a[i] + a[i + 1] + a[i + 2] + a[i + 3];

Of course, loops can be unrolled manually (as shown above), but this is unproductive work. It is much better to use C++ templates in combination with inline functions.

template <int n, class T>
class ArraySumm
{
public:
    __device__ static T exec(const T* arr) { return arr[0] + ArraySumm<n - 1, T>::exec(arr + 1); }
};

template <class T>
class ArraySumm<0, T>
{
public:
    __device__ static T exec(const T* arr) { return 0; }
};

for (int i = 0; i < N; i += 4)
    summ += ArraySumm<4, int>::exec(a + i);

One interesting feature of the nvcc compiler should be noted: it always inlines __device__ functions by default (there is a special __noinline__ directive to override this).

Therefore you can be sure that an example like the one above will unfold into a plain sequence of statements and will be in no way inferior in efficiency to hand-written code. In the general case (outside nvcc), however, you cannot be sure of this, since inline is only a hint that the compiler may ignore, so it is not guaranteed that your functions will be inlined.

4.7. Data alignment and 16-byte fetches

Align data structures on a 16-byte boundary. In this case the compiler can use special instructions for them that load 16 bytes of data at once.

If a structure occupies 8 bytes or less, you can align it to 8 bytes. In that case you can fetch two variables at once by combining two 8-byte variables into one structure using a union or pointer casting. Casting should be used with care, because the compiler may then place the data in local memory rather than in registers. A sketch of the idea is given below.
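A hedged sketch of the alignment idea (the type names are illustrative; __align__ is the CUDA alignment specifier):

// 16-byte alignment lets the compiler load the whole structure with a
// single 128-bit memory instruction.
struct __align__(16) vertex
{
    float x, y, z;        // 12 bytes of data, padded to 16 by the alignment
};

// Two 8-byte values packed together so that both are fetched at once.
struct __align__(16) pair64
{
    double a;
    double b;
};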

4.8. Shared memory bank conflicts

Shared memory is organized as 16 (only!) memory banks with a granularity of 4 bytes. While a warp of threads is executed on a multiprocessor, it is split into two halves (if warp-size = 32) of 16 threads, which access shared memory in turn.

Tasks in different halves of a warp never conflict over shared memory. But when tasks of the same half-warp access the same memory banks, collisions occur and, as a result, performance drops. Tasks within one half-warp may access different areas of shared memory with a certain stride without conflicts.

The optimal strides are 4, 12, 28, ..., 2^n - 4 bytes (Fig. 8).

Fig. 8. Optimal strides

Non-optimal strides are 1, 8, 16, 32, ..., 2^n bytes (Fig. 9).

Fig. 9. Non-optimal strides
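For illustration, a hedged sketch of conflict-free versus conflicting access (the kernel and sizes are made up for this example):

__global__ void bank_example(const float* in, float* out)
{
    __shared__ float buf[256];

    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Conflict-free: consecutive threads of a half-warp read consecutive
    // 4-byte words, i.e. 16 different banks.
    float a = buf[tid];

    // Two-way conflict: with a stride of two words (8 bytes) threads 0 and 8
    // of the same half-warp fall into the same bank.
    float b = buf[(tid * 2) % 256];

    out[blockIdx.x * blockDim.x + tid] = a + b;
}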

4.9. Minimize host <=> device data transfers

Try to transfer intermediate results to the host for CPU processing as rarely as possible. Implement, if not the whole algorithm, then at least its main part on the GPU, leaving the CPU only the control tasks.

5. CPU/GPU portable math library

The author of this article has written a portable library, MGML_MATH, for working with simple spatial objects; its code works both on the device and on the host.

MGML_MATH can be used as a framework for writing portable (or hybrid) CPU/GPU systems that compute physical, graphical or other spatial problems. Its main advantage is that the same code can be used both on the CPU and on the GPU, while speed remains the primary requirement the library is built around.
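The library itself is not reproduced here; as a hedged sketch of the general approach such a hybrid library can take, functions can be marked so that nvcc compiles them for both sides while an ordinary C++ compiler sees plain inline functions (the names below are illustrative, not MGML_MATH's own):

// nvcc defines __CUDACC__, so the same header works for host and device builds.
#ifdef __CUDACC__
#define PORTABLE __host__ __device__
#else
#define PORTABLE
#endif

struct vec3 { float x, y, z; };

PORTABLE inline float dot(const vec3& a, const vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

PORTABLE inline vec3 cross(const vec3& a, const vec3& b)
{
    vec3 r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}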

6. Literature

    Chris Kaspersky. Program Optimization Technique. Efficient Use of Memory. - St. Petersburg: BHV-Petersburg, 2003. - 464 p.

    NVIDIA CUDA Programming Guide 1.1 (http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf)

    NVIDIA CUDA Programming Guide 1.1, pp. 14-15.

    NVIDIA CUDA Programming Guide 1.1, p. 48.
