Simulations by Maciej Matyka: Compute Shaders in Open Frameworks (tutorial)

Introduction

This tutorial is the first part of the mini compute shader serie. See part 2 on continuation with more specific fluid flow 3D simulation in compute shaders.

See lecture in which I describe the tutorial in details.

I was asked by my friend to prepare short lecture for students about doing LBM (fluid flow simulation) in compute shaders. I've done it in the past using OpenGL low level code and I must say I didn't like it. It worked but adding anything was a headache (a lot of binding, buffers, and other machine state related unpleasant stuff). Thus, although it was quite cool to see large simulations run in realtime with proper visualization, I didn't continue this.

Now, as we have 2022 and OpenGL has progressed a bit as well as myself learned to use third-party libraries I found another way. This is Open Frameworks (OF), really cool and nice C++ library that mainain much of the work that is unrelated to the core simulation algorithm but must be done anyway (graphics, window, shader loading etc.). Although it is now easier to work with (you will see, compute shaders are working like a charm) I found it rather difficult to start with OF from scratch, even though I learned compute shaders before and I knew 80% of the stuff here. This is mostly because OF documentation is weak and very often point to OF forum. This is why I decided to write this tutorial - I write it hoping that it will be useful to beginners who want to make first steps into compute shaders.

The aim of this tutorial is to provide a quick introduction to compute shaders in Open Frameworks. I will not introduce OF itself (see forum and documentation for that - it covers it nicely) nor I will help you with initial configuration (again forum + documentation) but you will get actually everything to set up your first compute shader. In the background you will learn how to simulate Gray Scott reaction-diffusion system as I use it to ilustrate the thing.

Gray Scott Reaction-Diffusion using Compute Shader with Open Frameworks

Requirements

To start with you need to have at least OpenGL 4.3 installed on your machine. I tested all in Windows and Linux and all examples work here. Please download the latest OpenFrameworks from the openframeworks.cc website for your system and make sure you can compile the gl/computeShaderTextureExample project from the library. If you can't run it - that means your system is not compute shader ready (too low opengl, drivers problem, etc.). Here I used of_v0.9.8_vs_release for this tutorial.

What is compute shader?

In simple words - this is a program that runs entirely on GPU. GPU is a vector processor means it can run thousands of same processes at the same time. We will use that and run some simple procedures in the NX x NY image. This may be i.e. simple simulation.

Fig: So, this is GPU - another procesor, separated from main CPU.

Starting point

We start using three basic and empty OF files: main.cpp, ofApp.cpp and ofApp.h. Note I even removed some ususal OF standard methods to make it clear. Our ofApp class will consist of three crucial functions - setup, update and draw.

---------------------------------- ofApp.h
#pragma once
#include "ofMain.h"

class ofApp : public ofBaseApp
{
   public:
        void setup();
        void update();
        void draw();
};
---------------------------------- ofApp.h (end)

---------------------------------- ofApp.cpp
#include "ofApp.h"
#include "ofConstants.h"

void ofApp::setup()
{
}

void ofApp::update()
{
}

void ofApp::draw()
{
}
---------------------------------- ofApp.cpp (end)

The main file will just open the proper openGL window (thus we call ofGLWindowSettings to set its properties).

---------------------------------- main.cpp
#include "ofMain.h"
#include "ofApp.h"

int main( )
{
   ofGLWindowSettings settings;
    //settings.setSize(W,H);        // this is needed if you work in linux
   settings.width = W;
   settings.height = H;
   ofCreateWindow(settings);
   ofRunApp(new ofApp());
}
---------------------------------- main.cpp (end)

What do we want to write?

Let us write some simple computer simulation to ilustrate how shaders work. We will implement the Gray Scott model of diffusion-reaction system that may be implemented in discrete world easily. See some web resources to study the subject in more details:
1. https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_system or
2. https://www.karlsims.com/rd.html
3. https://www.lanevol.org/resources/gray-scott

In brief: we will work on four rectangular fields A1,A2,B1 and B2 which are all of the (NX,NY) size. We need four, because we will use the ping-pong strategy where fields "1" will be source for changes in fields "2" in even and vice-versa in odd time steps. The algorithm seeks for the solution to a set of equations (by discretizing them). The equations are rather simple and you may find them elsewhere. Here I will just type them as written in the code:

A2 = A1 + (DA * laplA - A1 * B1 * B1 + f * (1 - A1)) * dt;
B2 = B1 + (DB * laplB + A1 * B1 * B1 - (k+f) * B1) * dt;

where laplA and laplB are discrete versions of laplace operators, 1 and 2 interchange depends on the odd/even time step and f, k are some model parameters that you may change. It's up to you what you choose but be aware that changing too much may blow your model and simulation. Some safe values chosen by me you will find below in the code.

That is - the algorithm is simple. After you initialize your A1 and A2 by value 1.0 plus B2 by value 0.0 and B1 randomly to 0.0 and somewhere by 1.0 you simply iterate the above two equations (remember to swab 1 and 2 from step to step) and visualize A and B fields.

Prepare buffers for shader implementation

As it was said our shader will work entirely on GPU. That means one need to prepare memory to keep A1,A2,B1 and B2 there. What I do below is that I create buffers on CPU first, fill them with some initial values and then allocate (with CPU->GPU copy) the same memory on GPU using GL_SHADER_STORAGE_BUFFER structure. First, we need to declare CPU arrays and GPU buffers, which in Open Frameworks are called ofBufferObject.

---------------------------------- ofApp.h
#pragma once
#include "ofMain.h"

#define W 1280
#define H 720

class ofApp : public ofBaseApp
{
   public:
        void setup();
        void update();
        void draw();

        ofBufferObject A1, B1, A2, B2;

        float A1cpu[W*H];
        float A2cpu[W*H];
        float B1cpu[W*H];
        float B2cpu[W*H];
};
---------------------------------- ofApp.h (end)

Now we can initialize A1cpu...B2cpu arrays and allocate GPU memory.

---------------------------------- ofApp.cpp
#include "ofApp.h"
#include "ofConstants.h"

void ofApp::setup()
{
    // initialize
   for(int x=0; x<W; x++)
   for(int y=0; y<H; y++)
   {
        int idx = x+y*W;
        A1cpu[idx] = 1.0;
        A2cpu[idx] = 1.0;
        if(rand()/float(RAND_MAX)<0.000021)
           B1cpu[idx] = 1.0; else B1cpu[idx] = 0.0;
        B2cpu[idx] = 0.0;
   }

    // allocate buffers on GPU side
   A1.allocate(W*H*sizeof(float),A1cpu,GL_STATIC_DRAW);
   A2.allocate(W*H*sizeof(float),A2cpu,GL_STATIC_DRAW);
   B1.allocate(W*H*sizeof(float),B1cpu,GL_STATIC_DRAW);
   B2.allocate(W*H*sizeof(float),B2cpu,GL_STATIC_DRAW);
}

void ofApp::update()
{
}

void ofApp::draw()
{
}
---------------------------------- ofApp.cpp (end)

Note, that we used one dimensional array mapping of 2D array (means we recalculate idx each time we access the array).

Buffer bindings - let the shader know which buffer is which

As we are done with allocating buffers on GPU, we need to let the shader know which buffer is which.

Fig: Binding may be understood as wire to slot connections.

This is called "binding" and in simple words it is just assigning id numbers to allocated buffers. Why do we need this? In CPU world buffers have names but those are not visible on GPU side, thus, it is much easier to operate on numbers. First buffer in CPU world will be first in GPU and so on. OK, stop talking, let's bind the stuff now.

---------------------------------- ofApp.cpp
#include "ofApp.h"
#include "ofConstants.h"

That was easy right? I promise this is not even a bit harder, really, just stay with me.

We want to see things - visualize results as a texture

Our simulation will perform steps on A1...B2 arrays in GPU world but what we want at the end is to visualize results and put some fancy col4 results on the screen. For that we will use simply a texture which has to be allocated, binded (again!) and drawn in our program. All steps are trivial and I will just illustrate them by coloring changes in code below.

---------------------------------- ofApp.h
#pragma once
#include "ofMain.h"

#define W 1280
#define H 720

class ofApp : public ofBaseApp
{
   public:
        void setup();
        void update();
        void draw();

ofBufferObject A1, B1, A2, B2;

ofTexture tekstura;

        float A1cpu[W*H];
        float A2cpu[W*H];
        float B1cpu[W*H];
        float B2cpu[W*H];
};
---------------------------------- ofApp.h (end)

Please note that binding differ a bit from binding of the buffers - we will now bind the texture as an image. Also allocation needs to be done such that our image will consist of colors (GL_RGBA8 for the pixel format is used here).
---------------------------------- ofApp.cpp
#include "ofApp.h"
#include "ofConstants.h"

void ofApp::setup()
{
tekstura.allocate(W,H,GL_RGBA8);
tekstura.bindAsImage(4,GL_WRITE_ONLY);

    // initialize
   for(int x=0; x<W; x++)
   for(int y=0; y<H; y++)
   {
        int idx = x+y*W;
        A1cpu[idx] = 1.0;
        A2cpu[idx] = 1.0;
        if(rand()/float(RAND_MAX)<0.000021)
           B1cpu[idx] = 1.0; else B1cpu[idx] = 0.0;
        B2cpu[idx] = 0.0;
   }

Culmination - now we actually implement the shader

To get the shader we need to, again, declare it (shader in OF is the ofShader object), load it's source and comple it on GPU side. Now, with OF this is easy. I just add the proper object in ofApp.h class definition:

---------------------------------- ofApp.h
#pragma once
#include "ofMain.h"

#define W 1280
#define H 720

class ofApp : public ofBaseApp
{
   public:
        void setup();
        void update();
        void draw();

        ofBufferObject A1, B1, A2, B2;

       ofTexture tekstura;
       ofShader shader;

        float A1cpu[W*H];
        float A2cpu[W*H];
        float B1cpu[W*H];
        float B2cpu[W*H];
};
---------------------------------- ofApp.h (end)

Next, I implement it's initialization. We will load the source code of the shader (setupShaderFromFile method of ofShader object) and then compile it and link to shader program (linking in OpenGL words).

---------------------------------- ofApp.cpp
#include "ofApp.h"
#include "ofConstants.h"

void ofApp::setup()
{
    shader.setupShaderFromFile(GL_COMPUTE_SHADER,"computeshader.cs");
   shader.linkProgram();

    tekstura.allocate(W,H,GL_RGBA8);
   tekstura.bindAsImage(4,GL_WRITE_ONLY);

Now, we will implement the update function. The aim of this function is to run shader step by step and swap the buffer binding 1->2 and 2->1 depends on the time step. The technique is simple, we will have some semaphor kind of variable "c" which will be 0 or 1 and do proper binding depend on its value. I use c = 1-c each time step to change 0 to 1 and opposite.

Fig: We use standard ping-pong scheme in which buffers A1/B1 swap with A2/B2 to become source or destination from step to step.

Then, after binding is done, we simply run the shader. This is called "dispatchCompute" (why not run?.. don't ask me). And three parameters we provide are just to divide our taks into parallel, independent threads that will run simultaneously. This is, where all the magic happens - I mean this is where parallel computing is done.

Fig: Here we used DispatchCompute using W/20 x H/20 size for single working group. Just remember, all threads in working group run in parallel and single working group cannot be larger than allowed by your drivers. More details i.e. on Khronos pages here.

Side note: I used W/20 and H/20 for horizontal and vertical number of threads in the run grid. Details on how GPU works and organize threads may be found elsewhere (Khronos, nVidia, AMD or CUDA documentation) but what you need to know is that you can't push too many threads if you'r card doesn't allow for that, so if your simulation doesn't work or work too slow try to decrease or increase this "20" hardcoded there to some other values.

---------------------------------- ofApp.cpp
#include "ofApp.h"
#include "ofConstants.h"

void ofApp::setup()
{
    shader.setupShaderFromFile(GL_COMPUTE_SHADER,"computeshader.cs");
   shader.linkProgram();

    tekstura.allocate(W,H,GL_RGBA8);
   tekstura.bindAsImage(4,GL_WRITE_ONLY);

    A1.bindBase(GL_SHADER_STORAGE_BUFFER, 0);
   A2.bindBase(GL_SHADER_STORAGE_BUFFER, 1);
   B1.bindBase(GL_SHADER_STORAGE_BUFFER, 2);
   B2.bindBase(GL_SHADER_STORAGE_BUFFER, 3);
}

void ofApp::update()
{
   // bindings (ping pong)
   static int c=1;
   c = 1-c;
   A1.bindBase(GL_SHADER_STORAGE_BUFFER, 0+c);
   A2.bindBase(GL_SHADER_STORAGE_BUFFER, 0+1-c);
   B1.bindBase(GL_SHADER_STORAGE_BUFFER, 2+c);
   B2.bindBase(GL_SHADER_STORAGE_BUFFER, 2+1-c);

   // run shader (Gray Scott)
   shader.begin();
   shader.dispatchCompute(W/20, H/20, 1);
   shader.end();
}

void ofApp::draw()
{
   tekstura.draw(0,0);
}
---------------------------------- ofApp.cpp (end)

Last but not least - compute shader

Up to now whar we were doing was the most tedious part. Now we can eat the candy and have fun with compute shader implementation. We will use GLSL languague for that. Don't worry it is very simple. Our starting point is empty shader that says which version of OpenGL we require and empty "main" function. Btw. the ".cs" name is not the only option, I've seen people use ".glsl" as well.

---------------------------------- computeshader.cs
#version 440

void main()
{
}
---------------------------------- computeshader.cs (end)

Now we progress by adding binding layout indications. We need to "receive bindings from the CPU side" so that we can use them later in our code and main function.
In addition layout of the execution grid is specified, I use the same integer number of grid cells as used in dispatch command while running our shader (means 20x20 block size).

---------------------------------- computeshader.cs
#version 440

layout(binding = 0) buffer dcA1 { float A1 [ ]; };
layout(binding = 1) buffer dcA2 { float A2 [ ]; };
layout(binding = 2) buffer dcB1 { float B1 [ ]; };
layout(binding = 3) buffer dcB2 { float B2 [ ]; };
layout(rgba8,binding=4) uniform writeonly image2D img;
layout(local_size_x = 20, local_size_y = 20, local_size_z = 1) in;

void main()
{

}
---------------------------------- computeshader.cs (end)

I use a little "per" help function to deal with periodicity. This is a function that will be called every time our index want to point into out of domain cell. So, our solution will be actually 2D like wraped on the torus (periodic boundary conditions).

---------------------------------- computeshader.cs
#version 440

int per(int x, int nx)
{
   if (x < 0) x += nx;
   if (x >= nx) x -= nx;
   return x;
}

void main()
{

}
---------------------------------- computeshader.cs (end)

Another helpful function will be colour mapping one that I base on the article from the IQ's website (https://iquilezles.org/www/articles/palettes/palettes.htm). It is simple and returns vec4 color mapped to some nice palette based on the float 0-1 input. Adjust coltab[] if you want to get another palette. Nice examples can be found on IQ's website.

---------------------------------- computeshader.cs
#version 440

int per(int x, int nx)
{
   if (x < 0) x += nx;
   if (x >= nx) x -= nx;
   return x;
}

vec4 color(float t)
{
float coltab[]={0.5,0.5,0.5,0.5,0.5,0.5,1.0,0.7,0.4,0.00,0.15,0.20};

vec4 col;
col.r = coltab[0]+coltab[3]*cos(2*3.1416*(coltab[6]*t+coltab[9]));
col.g = coltab[1]+coltab[4]*cos(2*3.1416*(coltab[7]*t+coltab[10]));
col.b = coltab[2]+coltab[5]*cos(2*3.1416*(coltab[8]*t+coltab[11]));
col.a = 1;
return col;
}

void main()
{

}
---------------------------------- computeshader.cs (end)

To implement the main function we need to know which cell do we compute. For this, a standard variable gl_GlobalInvocationID (which is vector with x/y components) is utilized. So we read the x and y and for this cell compute everything that is needed. Note that we use buffers binded from CPU side so this way we can control our shader (this is what we do here to do ping-pong double buffered computation using binding from the cpu side).

Note I hardcode the "W" and "H" which coud be passed by uniform variables from CPU to shaders. But I didn't want to complicate the code further.

I compute the grid index idx and then set all the model constats.

As you may notice I put 5 cents to make the Gray Scott model look better and change the "k" and "f" parameters smoothly with the x index in the grid. This is because I found it quite boring if single values were used. But it is up to you to play with them. One thing I didn't try and may turn to be nice would be to change them in time (i.e. using some sinuses or even impulses).

Next I caluclate indices for all neighbours (those idx0-idx8 variables) and then compute laplacians and final Gray Scott model that consists of... two lines of code. This is funny, so much work to prepare for those two lines.

Finally visualization is done by storing rgb color in the texture that is calculated directly from current A and B data.

---------------------------------- computeshader.cs
#version 440

int per(int x, int nx)
{
   if (x < 0) x += nx;
   if (x >= nx) x -= nx;
   return x;
}

vec4 color(float t)
{
float coltab[]={0.5,0.5,0.5,0.5,0.5,0.5,1.0,0.7,0.4,0.00,0.15,0.20};

void main()
{
   int i, j;

   i = int(gl_GlobalInvocationID.x);
   j = int(gl_GlobalInvocationID.y);

   const int W = 1280; // ugly
   const int H = 720;
   int idx = i+j*W;    // grid index

   float DA = 1.0;    // constants
   float DB = 0.4;
   float f = 0.04;
   float k = 0.065;

   // f and k dependent on "x" position
   float h = 0.5*i / float(W);
   f = 0.02 * h + (1 - h) * 0.018;
   k = 0.035 * h + (1 - h) * 0.051;

   float dt = 1.0;
   int idx0, idx1, idx2, idx3, idx4, idx5, idx6, idx7, idx8;
   int ip, jp, im, jm;

   ip = per(i+1, W);        // periodicity and neighbours
   im = per(i-1, W);
   jp = per(j+1, H);
   jm = per(j-1, H);
   idx0 = i+W*j;            // neighbours
   idx1 = ip+W*(jp);
   idx2 = ip+W*(j);        // i+1,j
   idx3 = ip+W*(jm);
   idx4 = i+W*(jm);        // i,j-1
   idx5 = im+W*(jm);
   idx6 = im+W*(j);        // i-1, j
   idx7 = im+W*(jp);
   idx8 = i+W*(jp);        // i, j+1

   // laplacians
   float laplA = -1.0 * A1[idx0] + .2 * (A1[idx6] + A1[idx2] +                        A1[idx4] + A1[idx8]) + 0.05 * (A1[idx1] + A1[idx3] +                A1[idx5] + A1[idx7]);
   float laplB = -1.0 * B1[idx0] + .2 * (B1[idx6] + B1[idx2] +                        B1[idx4] + B1[idx8]) + 0.05 * (B1[idx1] + B1[idx3] +                B1[idx5] + B1[idx7]);

   // Gray Scott model
   A2[idx0] = A1[idx0] + (DA * laplA - A1[idx0] * B1[idx0] * B1[idx0]                + f * (1 - A1[idx0])) * dt;
   B2[idx0] = B1[idx0] + (DB * laplB + A1[idx0] * B1[idx0] * B1[idx0]                - (k+f) * B1[idx0]) * dt;

   // visualization
   float a = A2[idx];
   float b = B1[idx];
   vec4 col = color(1.51*a+1.062*b);//b*0.8+a*1.3);
   imageStore(img,ivec2(gl_GlobalInvocationID.xy),col);
}

---------------------------------- computeshader.cs (end)

The final code

For those of you that are too lazy to copy from above or for those that jumped from the title to the end section - the final project consists of 4 files listed below.