Friday, February 6, 2015

Life of a triangle - NVIDIA's logical pipeline

Hi, while gathering public material on how the hardware works, I tried to create a compressed architecture image. It is based on images and information taken from the listed NVIDIA sources. It may not be free of errors, but it should help clear up some misconceptions (and hopefully not spawn more ;) ).

GPUs are super parallel work distributors

Why all this complexity? In graphics we have to deal with data amplification that creates lots of variable workloads. Each drawcall may generate a different number of triangles. The number of vertices after clipping is different from what our triangles were originally made of. After back-face and depth culling, not all triangles may need pixels on the screen. The screen size of a triangle can mean it requires millions of pixels or none at all.

As a consequence modern GPUs let their primitives (triangles, lines, points) follow a logical pipeline, not a physical pipeline. In the old days before G80's unified architecture (think DX9 hardware, ps3, xbox360), the pipeline was represented on the chip with the different stages and work would run through it one after another. G80 essentially reused some units for both vertex and fragment shader computations, depending on the load, but it still had a serial process for the primitives/rasterization and so on. With Fermi the pipeline became fully parallel, which means the chip implements a logical pipeline (the steps a triangle goes through) by reusing multiple engines on the chip.

Let's say we have two triangles A and B. Parts of their work could be in different logical pipeline steps. A has already been transformed and needs to be rasterized. Some of its pixels could be running pixel-shader instructions already, while others are being rejected by depth-buffer (Z-cull), others could be already being written to framebuffer, and some may actually wait. And next to all that, we could be fetching the vertices of triangle B. So while each triangle has to go through the logical steps, lots of them could be actively processed at different steps of their lifetime. The job (get drawcall's triangles on screen) is split into many smaller tasks and even subtasks that can run in parallel. Each task is scheduled to the resources that are available, which is not limited to tasks of a certain type (vertex-shading parallel to pixel-shading).

Think of a river that fans out: parallel pipeline streams that are independent of each other, each on its own timeline, some branching more than others. If we color-coded the units of a GPU based on the triangle or drawcall they are currently working on, it would be multi-color blinkenlights :)

GPU architecture

Since Fermi, NVIDIA GPUs have shared a similar principal architecture. There is a Giga Thread Engine which manages all the work that's going on. The GPU is partitioned into multiple GPCs (Graphics Processing Clusters), each of which has multiple SMs (Streaming Multiprocessors) and one Raster Engine. There are lots of interconnects in this process, most notably a Crossbar that allows work migration across GPCs or other functional units like ROP (render output unit) subsystems.

The work that a programmer thinks of (shader program execution) is done on the SMs. Each SM contains many Cores which do the math operations for the threads. One thread could be a vertex- or pixel-shader invocation, for example. Those cores and other units are driven by Warp Schedulers, which manage groups of 32 threads as warps and hand over the instructions to be performed to Dispatch Units. The code logic is handled by the scheduler and not inside a core itself, which just sees something like "sum register 4234 with register 4235 and store in 4230" from the dispatcher. A core itself is rather dumb, compared to a CPU where a core is pretty smart. The GPU puts the smartness into higher levels: it conducts the work of an entire ensemble (or multiple if you will).

How many of these units are actually on the GPU (how many SMs per GPC, how many GPCs..) depends on the chip configuration itself. As you can see above, GM204 has 4 GPCs with 4 SMs each, but Tegra X1 for example has 1 GPC and 2 SMs, both with Maxwell design. The SM design itself (number of cores, instruction units, schedulers...) has also changed from generation to generation (see first image) and helped make the chips so efficient that they can be scaled from high-end desktop to notebook to mobile.

The logical pipeline

For the sake of simplicity several details are omitted. We assume the drawcall references some index- and vertexbuffer that is already filled with data and lives in the DRAM of the GPU and uses only vertex- and pixelshader (GL: fragmentshader).

  1. The program makes a drawcall in the graphics api (DX or GL). This reaches the driver at some point which does a bit of validation to check if things are "legal" and inserts the command in a GPU-readable encoding inside a pushbuffer. A lot of bottlenecks can happen here on the CPU side of things, which is why it is important programmers use apis well, and techniques that leverage the power of today's GPUs.
  2. After a while or explicit "flush" calls, the driver has buffered up enough work in a pushbuffer and sends it to be processed by the GPU (with some involvement of the OS). The Host Interface of the GPU picks up the commands which are processed via the Front End.
  3. We start our work distribution in the Primitive Distributor by processing the indices in the indexbuffer and generating triangle work batches that we send out to multiple GPCs.

  4. Within a GPC, the Poly Morph Engine of one of the SMs takes care of fetching the vertex data from the triangle indices (Vertex Fetch).
  5. After the data has been fetched, warps of 32 threads are scheduled inside the SM and will be working on the vertices.
  6. The SM's warp scheduler issues the instructions for the entire warp in-order. The threads run each instruction in lock-step and can be masked out individually if they should not actively execute it. There can be multiple reasons for requiring such masking, for example when the current instruction is part of the "if (true)" branch and the thread-specific data evaluated to "false", or when a loop's termination criterion was reached in one thread but not another. Therefore having lots of branch divergence in a shader can significantly increase the time spent for all threads in the warp. Threads cannot advance individually, only as a warp! Warps, however, are independent of each other.
  7. The warp's instruction may be completed at once or may take several dispatch turns. For example, the SM typically has fewer units for load/store than for basic math operations.
  8. As some instructions take longer to complete than others, especially memory loads, the warp scheduler may simply switch to another warp that is not waiting for memory. This is the key concept behind how GPUs overcome the latency of memory reads: they simply switch out groups of active threads. To make this switching very fast, all threads managed by the scheduler have their own registers in the register-file. The more registers a shader program needs, the fewer threads/warps fit into the register-file. The fewer warps we can switch between, the less useful work we can do while waiting for instructions to complete (foremost memory fetches).

  9. Once the warp has completed all instructions of the vertex-shader, its results are being processed by Viewport Transform. The triangle gets clipped by the clipspace volume and is ready for rasterization. We use L1 and L2 Caches for all this cross-task communication data.

  10. Now it gets exciting, our triangle is about to be chopped up and potentially leaving the GPC it currently lives on. The bounding box of the triangle is used to decide which raster engines need to work on it, as each engine covers multiple tiles of the screen. It gets sent out to one or multiple GPCs via the Work Distribution Crossbar. We effectively split our triangle into lots of smaller jobs now.

  11. Attribute Setup at the target SM will ensure that the interpolants (for example the outputs we generated in a vertex-shader) are in a pixel shader friendly format.
  12. The Raster Engine of a GPC works on the triangle it received and generates the pixel information for those sections that it is responsible for (also handles back-face culling and Z-cull).
  13. Again we batch up 32 pixel threads, or better said 8 times 2x2 pixel quads, which is the smallest unit we will always work with in pixel shaders. This 2x2 quad allows us to calculate derivatives for things like texture mip map filtering (a big change in texture coordinates within the quad selects a higher, i.e. smaller, mip level). Those threads within the 2x2 quad whose sample locations are not actually covering the triangle are masked out (gl_HelperInvocation). One of the local SM's warp schedulers will manage the pixel-shading task.
  14. The same warp scheduler instruction game, that we had in the vertex-shader logical stage, is now performed on the pixel-shader threads. The lock-step processing is particularly handy because we can access the values within a pixel quad almost for free, as all threads are guaranteed to have their data computed up to the same instruction point (NV_shader_thread_group).

  15. Are we there yet? Almost, our pixel-shader has completed the calculation of the colors to be written to the rendertargets and we also have a depth value. At this point we have to take the original api ordering of triangles into account before we hand that data over to one of the ROP (render output unit) subsystems, which in itself has multiple ROP units. Here depth-testing, blending with the framebuffer and so on is performed. These operations need to happen atomically (one color/depth set at a time) to ensure we don't have one triangle's color and another triangle's depth value when both cover the same pixel.
    NVIDIA typically applies memory compression, to reduce memory bandwidth requirements, which increases "effective" bandwidth (see GTX 980 pdf).
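The cost of branch divergence described in step 6 can be made concrete with a toy model. The sketch below is not how the hardware is implemented; it only assumes, as the text states, that a warp of 32 threads issues each instruction once for the whole warp and that a branch side has to be issued if at least one thread takes it. The function name and the instruction counts are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def issued_instructions(conditions, then_cost, else_cost):
    """Count instruction issues for one warp running `if (c) A else B`.

    conditions: one boolean per thread of the warp.
    then_cost / else_cost: hypothetical instruction counts of both sides.
    A side is issued for the entire warp as soon as one thread needs it;
    the threads that did not take it are masked out but still wait.
    """
    assert len(conditions) == WARP_SIZE
    issues = 0
    if any(conditions):        # at least one thread takes the "then" side
        issues += then_cost
    if not all(conditions):    # at least one thread takes the "else" side
        issues += else_cost
    return issues


uniform = [True] * WARP_SIZE                        # all threads agree
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # every other thread differs

print(issued_instructions(uniform, 10, 20))    # 10: only one side is issued
print(issued_instructions(divergent, 10, 20))  # 30: both sides are issued
```

In this model a single disagreeing thread is enough to make the whole warp pay for both sides, which is why coherent branching within a warp matters more than the total number of branches.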
Phew! We are done, we have written some pixels into a rendertarget. I hope this information was helpful to understand some of the work/data flow within a GPU. It may also help to understand why synchronization with the CPU really hurts: one has to wait until everything is finished and no new work is submitted (all units become idle), which means that when new work is sent it takes a while until everything is fully under load again, especially on the big GPUs.
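Step 8 above links register usage to the number of warps the scheduler can switch between. A rough back-of-the-envelope version of that relation is sketched below. The register-file size and the warp limit are illustrative assumptions, not figures from this post; real chips also allocate registers in coarser granularity and have further limits, so check the documentation of the GPU you target.

```python
def resident_warps(regs_per_thread, regfile_size=65536, max_warps=64, warp_size=32):
    """Estimate how many warps fit on one SM, limited only by registers.

    regfile_size and max_warps are assumed example values for one SM.
    Every thread of every resident warp keeps its own registers, so a
    warp occupies regs_per_thread * warp_size entries of the register file.
    """
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps, regfile_size // regs_per_warp)


for regs in (32, 64, 128):
    print(regs, "registers per thread ->", resident_warps(regs), "warps")
```

With these assumed numbers, doubling the register count of a shader halves the warps available for latency hiding once the warp limit is no longer the bottleneck.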

In the image below you can see how we rendered a CAD model and colored it by the different SMs or warp ids that contributed to the image (NV_shader_thread_group). The result would not be frame-coherent, as the work distribution will vary frame to frame. The scene was rendered using many drawcalls, of which several may also be processed in parallel (using NSIGHT you can see some of that drawcall parallelism as well).

Further reading

Next to the white papers mentioned at the beginning, the article series "A trip through the graphics-pipeline" by Fabian Giesen is worth a read and there is also a quite in-depth talk on the details of the memory and instruction processing on the SM by Paulius Micikevicius. Pomegranate: A Fully Scalable Graphics Architecture describes the concept of parallel stages and work distribution between them.
This post here was motivated to help clear up some "serial issues" of version 1.1 of the very nicely-illustrated Render Hell by Simon Trümpler, looking forward to a new revision of that :)

Thursday, February 5, 2015

New OpenGL samples and techniques

It's been a while :) Though many cool things happened in the last year. Next to a very exciting roadtrip visiting several national parks in California with friends, I was glad to present at GTC Order Independent Transparency in San Jose and at SIGGRAPH rendering techniques in Vancouver. More recent work has been surfacing lately as well.

The NV_command_list extension has been disclosed at SIGGRAPH Asia, and I am very happy to work on it with Pierre Boudier and Tristan Lorach.

Several samples I've worked on can now be found at GitHub. More are to come (oh well readme writing and documentation...).




Thursday, January 16, 2014

Alles Im Fluss - open beta :)

For the start of the new year I am happy to announce that a long-time project (dating back to 2008) is finally ready for the public. "Alles Im Fluss" (everything flows), a 3dsmax plugin to aid poly modelling, is available as open beta.

Alles Im Fluss provides the ability to quickly and easily draw polygon strips, connections or extrusions, and cap holes while maintaining clean, mostly quad-based topology.

One single tool provides you with all functionality depending on the sub-object type you are in, or keyboard modifiers used.

The tool provides you with control to refine the surface flow of connections or caps and replay drawn paths on other geometry.

Head over to and grab a copy for evaluation (fully-featured)!
Pricing is yet to be determined, however, you can expect it to be the cost of one game.

Hope to be able to update it for a bit, it's a nice "topic change" from my regular job around graphics programming at NVIDIA, back to the artist roots. Next goal is to be able to "pick up" paths from existing geometry and then replay.

Saturday, March 30, 2013

Simple GLSL compilation checker

As NVIDIA's Cgc is getting kinda dated (it is able to compile GLSL as well), threw together a simple commandline tool for basic offline compilation of GLSL shaders. Find it at GitHub repository

Sunday, June 3, 2012

tangent space can cost extra money

Although tangent space normal mapping has been used in games for a while now, there is still often one major flaw left in the asset pipeline: unsynchronized tangent space (TS).

While TS as such is defined mathematically, and most people end up using a similar (but not necessarily the same) definition, it is a per-triangle feature. Therefore, the actual per-vertex storage can vary as well. There are different ways to smooth the vectors to a per-vertex attribute (just like vertex-normal smoothing sometimes may break geometric vertices open for hard edges). Furthermore there are some typical optimizations for actual display, such as reconstructing one of the vectors as the cross product of the others, or avoiding per-pixel normalization of the matrix.
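To illustrate why such display-side choices have to match the baker, here is a minimal sketch of the "reconstruct one vector as cross product" optimization. It does not reproduce any particular tool's tangent space; the function names, the handedness sign and the example vectors are assumptions made for illustration.

```python
def cross(a, b):
    """Cross product of two 3-component vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def reconstruct_bitangent(normal, tangent, handedness):
    """Rebuild the bitangent instead of storing it per vertex.

    handedness is +1 or -1; which sign (and which cross product order)
    is correct depends entirely on the convention the baker used.
    """
    b = cross(normal, tangent)
    return (handedness * b[0], handedness * b[1], handedness * b[2])


def tangent_to_object(ts_normal, tangent, bitangent, normal):
    """Transform a sampled tangent-space normal by the (T, B, N) basis."""
    return tuple(ts_normal[0] * tangent[i] +
                 ts_normal[1] * bitangent[i] +
                 ts_normal[2] * normal[i] for i in range(3))


n, t = (0, 0, 1), (1, 0, 0)
sample = (0.0, 0.6, 0.8)  # value read from a normal map

baker_view = tangent_to_object(sample, t, reconstruct_bitangent(n, t, +1), n)
other_view = tangent_to_object(sample, t, reconstruct_bitangent(n, t, -1), n)
print(baker_view)  # what the baker intended
print(other_view)  # same texel decoded with the opposite convention
```

The two results differ in the sign of one component, so the same normal map lights differently depending on which convention the renderer picks.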

Major applications such as 3dsmax have suffered from this problem in past versions, the realtime display was not matched to the baker (only the offline renderer was perfect). Some developers such as id software had tools for this in the doom3 days, or CryTek (who document their tangent-space math quite well on the web). For a lot of other, even big players, there is no public information on the tangent space used in rendering.

This mismatch of "encoder/decoder" costs money. Artists spend extra time fixing interpolation issues, adding geometry, tweaking UV layouts... to get visually acceptable results. And yet often their preview (e.g. inside the modeller) might still be "off" in the end (but close enough). As a coder I might think "I know my math" but be unaware of the different baking tools and import/export issues. As an artist I work with what I was given and am used to "working with limitations". This causes unnecessary frustration and can lead to disputes, if "one" side actually knows better.

And knowing better should be no problem today. Popular baking tools, such as xnormal, allow custom tangent space definitions. I've worked on enhancing the 3dsmax pipeline myself. The 3pointshader fixed the mismatch in old max versions, simply by encoding the "correct" tangent-space (synced to 3dsmax's default baker) as 3 UVW-channels. That way the realtime shader was matched to the baker. Accessing UVW data is also not too hard for import/export. Furthermore 3dsmax allows modifying the bake process through a plugin, and one could use this to apply the same UVW-channel trick, or to disable per-pixel normalization during baking (sample project with sources here).

So please for the sake of saving time (and money) and billions of "my normalmap looks wrong" worries by artists, all sides spend one day to talk it through, educate the artists what "bakers" they can use, educate the coders that their TS choice (all the nitty gritty details) matters for the asset pipeline. It might not have mattered in the bump-map days or when testing simple geometry, but once you bake complex high-res to low-res it does!

Friday, September 9, 2011

mini lua primer

-- tables are general containers that can be indexed by anything (numbers,
-- tables, strings, functions...)

tab = {}
local function blah() end
tab[blah] = blubb
tab.name  = blubb -- is same as
tab["name"] = blubb

-- tables are always passed as "pointers/references" never copied
-- array index starts with 1 !!
-- they become garbage collected when not referenced anymore

pos = {1,2,3}
a = { pos = pos }
pos[3] = 4
pos = {1,1,1} -- rebinds the variable pos, a.pos still refers to the old table
print(a.pos[3]) -- is still 4

--[[ multiline 
comment ]]

blah = [==[ multiline string and comment 
can use multiple = for bracketing to nest  ]==]

--- multiple return values allow easy swapping
a,b = b,a

-- object oriented stuff
-- the : operator passes the object itself as first arg
a.func(a,blah) -- is same as
a:func(blah)

-- metatables allow to index class tables
myclass = {}
myclassmeta = {__index = myclass}
function myclass:func()
  -- self is an automatic variable through the : definition, it names
  -- the first arg passed to func
end
-- above is equivalent to
myclass.func = function (self)
end

object = {}
setmetatable(object, myclassmeta)
object:func() -- is now same as
myclass.func(object)

-- until func gets specialized per object
function object:func()
end
-- lua will look up first in the object table, then in the metatable
-- it will only write to the object table

--- upvalues for function specialization

function func(obj)
  return function ()
    return obj * 2
  end
end


a = func(1)
b = func(2)

a() -- returns 2
b() -- returns 4

--- non passed function arguments become nil automatically
function func (a,b)
  return a,b
end
a,b = func(1) -- b is "nil"

--- variable args
function func(...)
  local a,b = ...
  --- a,b would be first two args
  --- you can also put args in a table
  local t = {...}
end

--- conditional assignment chaining
--- 0 is not "false", only "false" or "nil" are

a = 0
b = a or 1 -- b is 0, if a was false/nil it would be 1

c = (a == 0) and b or 2 -- c is 0 (b's value)

-- the first time a value is "valid" (non-false/nil) that value is taken
-- that way you can do default values

function func(a,b)
  a = a or 1
  b = b or 1
end

--- sandboxing

function sandboxedfunc()
  -- after setfenv below we can only call what is enabled in the environment
  -- so calling a global that is not in that table (print, for example)
  -- wouldn't work here
  -- blubb becomes created in the current environment
  blubb = doit()
end

local enva = {
  doit = function () return 1 end
}

local envb = {
  doit = function () return 2 end
}

-- setfenv is part of Lua 5.1
setfenv(sandboxedfunc, enva)
sandboxedfunc()
--enva.blubb is now 1

setfenv(sandboxedfunc, envb)
sandboxedfunc()
--envb.blubb is now 2

-- sandboxedfunc could also come from a file, which makes creating fileformats
-- quite easy, as they can internally be lua code

--- functions without () and function chaining

-- to make ini/config files quite easy, lua allows omitting () for function 
-- calls when the argument is either a string or a table

function testfunc( a )
end

-- valid calls to above function
testfunc "blah"
testfunc {1,2,3}

-- we can even expand this to create fileformat like structures

function group( name)
  return function (content)
    local grp = {
      name = name,
      content = content,
    }
    return grp
  end
end

local grp = group "test" {1,3,5}
-- equivalent to: group("test")({1,3,5})
-- grp.content[2] would be 3

-- could also build a hierarchy
local grp = group "root" {
  group "child a" {},
  group "child b" {},
}

-- grp.content[1].name would be "child a"

Saturday, January 1, 2011

estrela as shader editor

Recently doing more work with Lua and Cg/GLSL again, hence added a couple features to estrela editor.

Lua-wise I added some experimental type-guessing, mostly meant to aid auto-completion for luxinia classes. Also the lua-apis that get loaded can now be specified per interpreter, so that no luxinia functions get suggested when you are using a "normal" lua interpreter. Getting useful auto-completion and api help is still a big task, though. Especially getting user-created functions/classes in somehow would be great. Maybe a static tool that generates files from a lua project or so.

The main problem with "dynamic" text analysis is that when the user edits old stuff, you also have to somehow check whether keywords were changed, added, removed... hence I am avoiding that complexity for now. I'd prefer a static solution that the user triggers, that way it's hopefully simpler and more robust.

Another focus lately was the Cg tool. I've added support for nvShaderPerf and an ARB/NV program beautifier (indenting branches/flow, and inserting comments as to which constants map to what variable). That makes it a bit easier to see what stuff triggers branching and so on.
I've also added automatic setting of GLSL input flag for cgc and some automatic defines such as "_VERTEX_"... so that one can use #ifdef _VERTEX_ and still have all GLSL shader code in one file. A GLSL spec and api description is now also part of estrela. I took the nice opengl 4.1 quick reference card as base. So much for now.
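The single-file idea with stage defines can be sketched outside of estrela as well. The helper below is a hypothetical illustration, not estrela's code: it prepends a define to the shader source before compilation, after a leading #version line if there is one, since GLSL expects that directive first. Only "_VERTEX_" is taken from the text above; "_FRAGMENT_" is an assumed counterpart.

```python
def with_stage_define(source, stage_define):
    """Return `source` with `#define <stage_define>` added near the top.

    The define is placed right after the first #version line, or at the
    very beginning when the source has no #version directive.
    """
    lines = source.splitlines()
    insert_at = 0
    for i, line in enumerate(lines):
        if line.lstrip().startswith("#version"):
            insert_at = i + 1
            break
    lines.insert(insert_at, "#define " + stage_define)
    return "\n".join(lines) + "\n"


combined = """#version 410
#ifdef _VERTEX_
in vec4 position;
void main() { gl_Position = position; }
#endif
#ifdef _FRAGMENT_
out vec4 color;
void main() { color = vec4(1.0); }
#endif
"""

vertex_source = with_stage_define(combined, "_VERTEX_")
fragment_source = with_stage_define(combined, "_FRAGMENT_")
print(vertex_source)
```

Each resulting string can then be handed to the compiler for its stage, while the author keeps editing one file.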

Still haven't found time to push the open-sourcing of luxinia further and add GLSL shader management (but will require ARB_separate_shader) to it for PhD work. But anyway new year now ;)