Microprocessor Design/GPU

A GPU (graphics processing unit), occasionally called a visual processing unit (VPU), is a specialized processor, typically mounted on a video card (graphics card), that offloads 3D graphics rendering from the microprocessor. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor whose peak arithmetic throughput and memory bandwidth substantially outpace those of its CPU counterpart.

Characteristics of GPU

Computational requirements are large. Real-time rendering requires billions of pixels per second, and each pixel requires hundreds or more operations. GPUs must deliver an enormous amount of compute performance to satisfy the demand of complex real-time applications.
Throughput is more important than latency. GPU implementations of the graphics pipeline prioritize throughput over latency. The human visual system operates on millisecond time scales, while operations within a modern processor take nanoseconds. This six-order-of-magnitude gap means that the latency of any individual operation is unimportant. As a consequence, the graphics pipeline is quite deep, perhaps hundreds to thousands of cycles, with thousands of primitives in flight at any given time.
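
The throughput-versus-latency argument above can be checked with some rough arithmetic. The following is a minimal sketch; the clock rate, pipeline depth, frame rate, and resolution are illustrative assumptions, not figures from any particular GPU.

```python
# Back-of-the-envelope numbers; every constant here is an illustrative assumption.
CLOCK_HZ = 1_000_000_000        # assumed 1 GHz clock
PIPELINE_DEPTH_CYCLES = 1_000   # assumed depth ("hundreds to thousands of cycles")
FRAME_RATE_HZ = 60              # assumed display refresh rate
PIXELS_PER_FRAME = 1920 * 1080  # assumed screen resolution


def pipeline_latency_seconds(depth_cycles: int, clock_hz: int) -> float:
    """Time for one primitive to travel through the whole pipeline."""
    return depth_cycles / clock_hz


def frame_budget_seconds(frame_rate_hz: int) -> float:
    """Time available to produce one complete frame."""
    return 1.0 / frame_rate_hz


latency = pipeline_latency_seconds(PIPELINE_DEPTH_CYCLES, CLOCK_HZ)
budget = frame_budget_seconds(FRAME_RATE_HZ)
pixels_per_second = PIXELS_PER_FRAME * FRAME_RATE_HZ

print(f"one trip through the pipeline: {latency * 1e6:.1f} microseconds")
print(f"time budget for one frame: {budget * 1e3:.1f} milliseconds")
print(f"pipeline trips that fit in one frame budget: {budget / latency:.0f}")
print(f"pixels to deliver per second: {pixels_per_second:,}")
```

Under these assumptions a single trip through even a deep pipeline takes a tiny fraction of the frame budget, so what limits the frame rate is how many pixels complete per second, not how long any one of them takes.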

GPU Architecture

The Graphics Pipeline

The input to the GPU is a list of geometric primitives, typically triangles, in a 3-D world coordinate system. Through many steps, those primitives are shaded and mapped onto the screen, where they are assembled to create a final picture. It is instructive to first explain the specific steps in the canonical pipeline before showing how the pipeline has become programmable.

Vertex Operations: The input primitives are formed from individual vertices. Each vertex must be transformed into screen space and shaded, typically by computing its interaction with the lights in the scene. Because typical scenes have tens to hundreds of thousands of vertices, and each vertex can be computed independently, this stage is well suited for parallel hardware.
Primitive Assembly: The vertices are assembled into triangles, the fundamental hardware-supported primitive in today’s GPUs.
Rasterization: Rasterization is the process of determining which screen-space pixel locations are covered by each triangle. Each triangle generates a primitive called a fragment at each screen-space pixel location that it covers. Because many triangles may overlap at any pixel location, each pixel’s color value may be computed from several fragments.
Fragment Operations: Using color information from the vertices and possibly fetching additional data from global memory in the form of textures (images that are mapped onto surfaces), each fragment is shaded to determine its final color. Just as in the vertex stage, each fragment can be computed in parallel. This stage is typically the most computationally demanding stage in the graphics pipeline.
Composition: Fragments are assembled into a final image with one color per pixel, usually by keeping the closest fragment to the camera for each pixel location.
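
The five stages listed above can be sketched as a tiny software renderer. This is only an illustration of the data flow, not how GPU hardware is built: the 8x8 framebuffer, the simplified vertex transform, the convention that smaller depth means closer, and all function names are assumptions made for the example.

```python
from dataclasses import dataclass

WIDTH, HEIGHT = 8, 8  # assumed tiny framebuffer so the result is easy to inspect


@dataclass
class Vertex:
    x: float
    y: float
    z: float      # depth; smaller means closer to the camera (assumed convention)
    color: tuple  # (r, g, b), each channel in [0, 1]


def vertex_stage(v):
    """Vertex operations: map x, y from [-1, 1] to pixel coordinates.

    A real pipeline would also apply transformation matrices and lighting.
    """
    return Vertex((v.x + 1.0) * 0.5 * WIDTH, (v.y + 1.0) * 0.5 * HEIGHT,
                  v.z, v.color)


def primitive_assembly(vertices, indices):
    """Group transformed vertices into triangles, three indices at a time."""
    return [tuple(vertices[i] for i in indices[k:k + 3])
            for k in range(0, len(indices), 3)]


def edge(a, b, px, py):
    """Signed-area term telling which side of edge a->b the point lies on."""
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x)


def rasterize(tri):
    """Yield one fragment (x, y, depth, color) per covered pixel centre."""
    a, b, c = tri
    area = edge(a, b, c.x, c.y)
    if area == 0:
        return  # a degenerate triangle covers nothing
    for y in range(HEIGHT):
        for x in range(WIDTH):
            px, py = x + 0.5, y + 0.5
            w0 = edge(b, c, px, py) / area  # barycentric weight of vertex a
            w1 = edge(c, a, px, py) / area  # barycentric weight of vertex b
            w2 = edge(a, b, px, py) / area  # barycentric weight of vertex c
            if w0 >= 0 and w1 >= 0 and w2 >= 0:  # pixel centre is inside
                depth = w0 * a.z + w1 * b.z + w2 * c.z
                color = tuple(w0 * ca + w1 * cb + w2 * cc
                              for ca, cb, cc in zip(a.color, b.color, c.color))
                yield x, y, depth, color


def fragment_stage(color):
    """Fragment operations: only a clamp here; real shaders sample textures."""
    return tuple(min(1.0, max(0.0, ch)) for ch in color)


def composite(fragments):
    """Composition: keep the closest fragment at each pixel (a depth test)."""
    depth_buf = [[float("inf")] * WIDTH for _ in range(HEIGHT)]
    color_buf = [[(0.0, 0.0, 0.0)] * WIDTH for _ in range(HEIGHT)]
    for x, y, depth, color in fragments:
        if depth < depth_buf[y][x]:
            depth_buf[y][x] = depth
            color_buf[y][x] = color
    return color_buf


def render(vertices, indices):
    transformed = [vertex_stage(v) for v in vertices]
    triangles = primitive_assembly(transformed, indices)
    fragments = [(x, y, d, fragment_stage(c))
                 for tri in triangles for x, y, d, c in rasterize(tri)]
    return composite(fragments)


RED, GREEN = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
scene_vertices = [
    # far red triangle
    Vertex(-1, -1, 0.9, RED), Vertex(1, -1, 0.9, RED), Vertex(-1, 1, 0.9, RED),
    # near green triangle overlapping one corner of the red one
    Vertex(-1, -1, 0.1, GREEN), Vertex(0, -1, 0.1, GREEN), Vertex(-1, 0, 0.1, GREEN),
]
image = render(scene_vertices, [0, 1, 2, 3, 4, 5])
```

Where the two triangles overlap, both produce a fragment for the same pixel, and the composition step keeps the green one because its depth is smaller. The per-vertex and per-fragment functions touch no shared state, which is what makes those stages easy to run in parallel.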

GPU Functions & Computing

The programmable units of the GPU follow a single-program, multiple-data (SPMD) programming model. For efficiency, the GPU processes many elements (vertices or fragments) in parallel using the same program. Each element is independent of the other elements, and in the base programming model, elements cannot communicate with each other. All GPU programs must be structured in this way: many parallel elements, each processed by a single program. Each element can operate on 32-bit integer or floating-point data with a reasonably complete general-purpose instruction set. Elements can read data from a shared global memory and, with the newest GPUs, also write back to arbitrary locations in shared global memory. This programming model is well suited to straight-line programs, as many elements can be processed in lockstep running the exact same code. Code written in this manner is single instruction, multiple data (SIMD).
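
The SPMD model described above can be sketched in a few lines. This is a sequential stand-in, not GPU code: `launch` is a hypothetical helper that plays the role of the hardware scheduler, and `saxpy_kernel` is an example program chosen for illustration.

```python
def saxpy_kernel(i, a, x, y, out):
    """The single program: every element runs this same code on its own index i."""
    out[i] = a * x[i] + y[i]  # reads shared input, writes only its own slot


def launch(kernel, n, *args):
    """Stand-in for the hardware scheduler (hypothetical helper).

    A GPU would run the n instances in parallel. This loop runs them one after
    another, which gives the same result because instances never communicate.
    """
    for i in range(n):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because the kernel contains no branches, every instance executes exactly the same instruction sequence, which is the straight-line case that maps well onto lockstep SIMD execution.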

GPU Accelerated Computing

GPU accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, engineering, and enterprise applications. GPU accelerated computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user's perspective, applications simply run significantly faster.
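
How much faster an application runs depends on how much of it is offloaded. A rough way to estimate this is Amdahl's law, which is not part of the original text but follows from the split described above: the portion left on the CPU still takes its full time. The sketch below assumes a hypothetical 100x speedup on the offloaded portion and ignores the cost of moving data between CPU and GPU memory.

```python
def overall_speedup(offloaded_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law: the part left on the CPU limits the whole-application gain."""
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be between 0 and 1")
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    cpu_part = 1.0 - offloaded_fraction           # still runs at CPU speed
    gpu_part = offloaded_fraction / gpu_speedup   # runs gpu_speedup times faster
    return 1.0 / (cpu_part + gpu_part)


for fraction in (0.50, 0.90, 0.99):
    gain = overall_speedup(fraction, 100.0)
    print(f"{fraction:.0%} of runtime offloaded -> {gain:.1f}x overall")
```

With these assumed numbers, offloading half of the runtime gives roughly a 2x gain and offloading 99% gives roughly 50x, so large speedups require that nearly all of the runtime be in the compute-intensive portion.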

Figure: How GPU acceleration works

GPGPU Computing

GPGPU (general-purpose computing on graphics processing units) is a methodology for high-performance computing (HPC) that uses graphics processing units to crunch data. The characteristics of graphics algorithms that have enabled the development of extremely high-performance special-purpose graphics processors also appear in other HPC algorithms. The same special-purpose hardware can therefore be used to accelerate those algorithms as well.

Applications

Algorithms well-suited to GPGPU implementation are those that exhibit two properties: they are data parallel and throughput intensive. Data parallel means that a processor can execute the operation on different data elements simultaneously. Throughput intensive means that the algorithm is going to process lots of data elements, so there will be plenty to operate on in parallel. Taking advantage of these two properties, GPUs achieve extreme performance by incorporating lots (hundreds) of relatively simple processing units to operate on many data elements simultaneously. Perhaps not surprisingly, pixel-based applications such as computer vision and video and image processing are very well suited to GPGPU technology, and for this reason, many of the commercial software packages in these areas now include GPGPU acceleration.
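
A pixel-based operation makes the two properties concrete. In the sketch below, converting an image to grayscale is data parallel because each output value depends only on its own input pixel, and it is throughput intensive because a real image has millions of pixels. The function names and the tiny 2x2 image are assumptions for illustration; the weights are the common Rec. 601 luma coefficients in integer form.

```python
def to_gray(pixel):
    """Per-pixel operation: the output depends only on this pixel's (r, g, b)."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000  # integer Rec. 601 luma weights


def grayscale(image):
    """Apply the same operation to every pixel; the order does not matter."""
    return [[to_gray(p) for p in row] for row in image]


image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
print(grayscale(image))  # [[255, 0], [76, 29]]
```

On a GPU, each call to `to_gray` could be assigned to a separate processing unit, since no call needs the result of any other.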

Difference between GPU and CPU

The CPU (central processing unit) has often been called the brains of the PC. Increasingly, that brain is being enhanced by another part of the PC, the GPU (graphics processing unit), which has been called its soul. Architecturally, the CPU is composed of only a few cores with lots of cache memory and can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously. The ability of a GPU with 100+ cores to process thousands of threads can accelerate some software by 100x over a CPU alone. The GPU achieves this acceleration while being more power-efficient and cost-efficient than a CPU.

Figure: CPU and GPU

GPU APIs

VAAPI : VAAPI (Video Acceleration API) is an open-source library and API specification that provides access to graphics hardware acceleration capabilities for video processing. It consists of a main library and driver-specific acceleration backends for each supported hardware vendor. Supported hardware includes:

Intel® GMA X4500HD
Intel® HD Graphics (in Intel® 2010 Core™ i7/i5/i3 processor family)
Intel® HD Graphics 2000/3000 (in 2nd Generation Intel® Core™ i7/i5/i3 Processor family)
Intel® HD Graphics 2500/4000 (in 3rd Generation Intel® Core™ i7/i5/i3 Processor family)

Video Decode and Presentation API for Unix : The Video Decode and Presentation API for Unix (VDPAU) provides a complete solution for decoding, post-processing, compositing, and displaying compressed or uncompressed video streams. These video streams may be combined (composited) with bitmap content, to implement OSDs and other application user interfaces.

API Partitioning

VDPAU is split into two distinct modules:

Core API
Window System Integration Layer

The intent is that most VDPAU functionality exists and operates identically across all possible Windowing Systems. This functionality is the Core API. However, a small amount of functionality must be included that is tightly coupled to the underlying Windowing System. This functionality is the Window System Integration Layer. Possible examples include:

Creation of the initial VDPAU VdpDevice handle, since this act requires intimate knowledge of the underlying Window System, such as specific display handle or driver identification.
Conversion of VDPAU surfaces to/from underlying Window System surface types, e.g. to allow manipulation of VDPAU-generated surfaces via native Window System APIs.
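
The partitioning described above is a general layering pattern, and a small sketch may make it easier to picture. The code below does not use VDPAU and none of its names come from the VDPAU headers; `WindowSystemLayer`, `FakeX11Layer`, and `CoreApi` are hypothetical classes that only mirror the split between window-system-specific and window-system-independent functionality.

```python
from abc import ABC, abstractmethod


class WindowSystemLayer(ABC):
    """Hypothetical integration layer: the only window-system-specific part."""

    @abstractmethod
    def create_device(self) -> int:
        """Return an opaque device handle obtained from the window system."""

    @abstractmethod
    def to_native_surface(self, surface_id: int) -> str:
        """Describe a core surface as a window-system surface."""


class FakeX11Layer(WindowSystemLayer):
    """Stand-in backend; it needs a display name and screen number to work."""

    def __init__(self, display_name: str, screen: int):
        self.display_name = display_name
        self.screen = screen

    def create_device(self) -> int:
        return 1  # placeholder handle

    def to_native_surface(self, surface_id: int) -> str:
        return f"{self.display_name}.{self.screen}/surface-{surface_id}"


class CoreApi:
    """Hypothetical core layer: behaves the same whichever backend made the device."""

    def __init__(self, device: int):
        self.device = device
        self._next_surface = 0

    def create_surface(self) -> int:
        self._next_surface += 1
        return self._next_surface


layer = FakeX11Layer(":0", 0)
core = CoreApi(layer.create_device())
surface = core.create_surface()
print(layer.to_native_surface(surface))  # :0.0/surface-1
```

Only the integration layer knows about displays and screens. Supporting another window system would mean writing one more subclass, while the core class stays unchanged.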

GPU Applications

Gaming : GPUs were originally developed for 3D gaming on PCs. Modern GPUs have also enabled game developers to build animated characters that bring the maps to life.
Productivity : Microsoft Office 2010 offers GPU acceleration for some of its graphical elements, such as WordArt and PowerPoint transitions.
Video Editing : Video editing demands heavy use of system resources even on high-end PCs. Consumer applications, such as Adobe Premiere Elements 9, offer features previously available only to professionals. Transitions such as page curl, sphere, and card flip are all GPU-accelerated in Premiere Elements 9, as are effects such as refraction and ripple. A graphics card with an AMD Radeon GPU speeds up both preview and final rendering.
