Monday, January 05, 2009

How to use Brook+ for GPU computing

AMD Stream Computing allows developers to use the GPU to perform parallel computations for HPC applications. This guide is meant to complement the AMD Stream Computing User Guide. Read the official User Guide first to gain a basic understanding before following the notes below.

Ref: http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf

System - The notes below were compiled on the following system.
Intel CPU
ATI Radeon (Check cards for GPU computing capability)
Microsoft Visual Studio .Net with C/C++ compilers - for C/C++ code
Intel Visual Fortran Compilers - for Fortran code
Brook+ SDK by AMD - to compile Brook code


*.br source file
================
A Brook+ source file contains code that follows C/C++ syntax and is pre-processed by the Brook+ compiler into C/C++ files. Both Brook+ functions and C/C++ functions can exist within the same *.br file. The Brook+ functions are the ones that use the GPU hardware.

An example of a Brook+ function is given below:
kernel void sumaa(float a<>, float b<>, out float c<>){
    c = a + b;
}

1. Special Brook+ keywords (i.e. not C/C++ keywords): kernel, out
2. Note the template-like structures "float a<>", which are recognized by the Brook+ compiler. They indicate a stream / GPU data type and are not the same as C++ templates.
3. Multiple functions like the above can exist in the same *.br file. Other normal C/C++ functions can also exist inside the *.br file.
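
To make the element-wise meaning of the kernel concrete, here is a minimal CPU-only sketch in plain C++ of what sumaa computes: the kernel body runs once for every element of the streams. The function name sumaa_reference is my own and is not part of Brook+; it assumes both inputs have the same length. Something like this can also be handy for checking the GPU results later.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the element-wise behaviour of the sumaa kernel.
// Hypothetical helper for illustration only; it is not generated by Brook+.
std::vector<float> sumaa_reference(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    assert(a.size() == b.size());  // this sketch assumes equal-length inputs
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];  // the kernel body, applied once per element
    }
    return c;
}
```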


Compiling Brook+ Code (*.br)
=============================
0. Open up a Command Console and go to the directory where the *.br file is located.

1. To compile code called sum.br:
\sdk\bin\brcc_d -k sum.br
where the leading \sdk directory sits under the installation directory of the Brook+ SDK from AMD.

2. The Brook+ compiler / preprocessor creates the following files in the same directory:
sum.cpp
sum.h
sum_gpu.h

3. A few notes to consider
i) There are two compilers: brcc and brcc_d. They correspond to brook.lib/dll and brook_d.lib/dll respectively.
Using the wrong combination may crash the program during execution.
ii) The -k option generates intermediate code that may be useful with AMD's Stream Kernel Analyzer.
iii) The generated C/C++ code needs to be compiled with a standard C/C++ compiler and linked against the proper libraries and DLLs, which is covered in the next section.
iv) Before v1.3, the C/C++ wrapper functions, also known as host-side code, lived within the *.br source file. As of v1.3, the host-side code can be written in C++ and kept in a separate, ordinary C++ file, provided the project is configured with the proper include and lib directories.



Compiling the C/C++ code
==========================
This step produces a Win32 DLL from the C/C++ code generated by Brook+. The resulting DLL can then be
used by other Win32 applications (e.g. C++ or Fortran).

1. From Visual Studio .Net, open a new solution / project by:
Add Project -> Visual C++ -> Win32 -> Win32 project.
In the Application Settings dialog, select DLL, Export Symbols

2. Add the *.br file and the files generated by the Brook+ compiler to the current project by using
"Add existing file".

3. Under the Project Property configuration pages, add the following settings:
C++ -> Additional Include Directories: \sdk\include
C++ -> Code Generation -> Runtime Library: Multi-threaded Debug DLL (/MDd)
C++ -> Advanced -> Calling Convention: __cdecl (/Gd)
Linker -> Additional Library Directories: \sdk\lib
Linker -> Input -> Additional Dependencies: \sdk\lib\brook_d.lib

4. Whenever the *.br file is modified, recompile it in the Command Console, then rebuild the generated C/C++ code from within Visual Studio .Net.

Some Notes:
i) One can configure Visual Studio .Net to accept *.br files and compile them with the Brook+ compiler. However, I find
that the user still has to manually start compilation for the Brook+ files and then for the C/C++ files. Hence,
I don't find it any more efficient than compiling from the command line.
ii) The *.br source files can be added to the project and edited in the Visual Studio .Net environment.


The C/C++ driver or library wrapper
====================================
The Brook+ functions need to be wrapped or called directly from C/C++ functions. For the purpose of creating DLL functions, we will put C/C++ wrappers over the Brook+ functions.

Using the Brook+ functions involves 3 steps. Each step is described with examples here:

Declaring and sizing variables - the meaning and reason for the declarations will become clear in the following sections.
// Normal C/C++ variables
float input_a[10][10];
float input_b[10][10];
float input_c[10][10];
float input_a1[10];
float input_b1[10];
float input_c1[10];

// For dimensioning Brook+ variables
unsigned int ileng = 10;
unsigned int dims[2] = {10,10};
unsigned int dim1[1] = {10};

// Equivalent Brook+ variables (element type float, matching the arrays above)
brook::Stream<float> a(2, dims);
brook::Stream<float> b(2, dims);
brook::Stream<float> a1(1, dim1);
brook::Stream<float> b1(1, dim1);
brook::Stream<float> *c = new brook::Stream<float>(1, &ileng);
brook::Stream<float> *d = new brook::Stream<float>(2, dims);
brook::Stream<float> d1(1, dim1);


// Assign values to normal C/C++ vectors and matrices for:
// input_a1, input_b1, input_a, input_b
..................

1. Reading normal C/C++ variables into Brook+ variables
a.read(input_a);
b.read(input_b);
a1.read(input_a1);
b1.read(input_b1);
This step transforms a normal C/C++ variable into a Brook+ variable that the GPU can work with. No other manipulation of the Brook+ variable is needed.

2. Performing the computation by calling the Brook+ function
sumaa(a,b,*d); // operating on a matrix
sumaa(a1,b1,d1); // operating on a vector

3. Writing the output from Brook+ into normal C/C++ variables
// old method
streamWrite(*d, input_c);
streamWrite(d1, input_c1);
// new method
d->write(input_c);
d1.write(input_c1);
Once the Brook+ variable has been copied back to a normal C/C++ variable, one can perform any other standard operations on the normal C/C++ variable as desired.

Note that the pointer d (dereferenced as *d) and the non-pointer d1 are used only to show that both ways are possible.
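
The three steps above can be packed into one wrapper function. The sketch below shows only the shape of such a wrapper and compiles without the Brook+ SDK: the standin::Stream class and standin::sumaa function are simplified, non-template CPU stand-ins that I made up to mimic just the calls used in this post (construction from a rank and dims, read, the kernel call, write). They are not the real Brook+ runtime, and the wrapper name add_vectors is hypothetical too.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace standin {
// Hypothetical CPU stand-in mimicking only the calls used in this post.
// It is NOT the real brook::Stream.
class Stream {
public:
    Stream(unsigned int rank, const unsigned int* dims)
        : data_(count(rank, dims)) {}
    // Copy from a normal C/C++ array into the stream.
    void read(const float* src) {
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = src[i];
    }
    // Copy from the stream back into a normal C/C++ array.
    void write(float* dst) const {
        for (std::size_t i = 0; i < data_.size(); ++i) dst[i] = data_[i];
    }
    std::vector<float>& data() { return data_; }
    const std::vector<float>& data() const { return data_; }
private:
    // Total number of elements = product of all dimensions.
    static std::size_t count(unsigned int rank, const unsigned int* dims) {
        std::size_t n = 1;
        for (unsigned int i = 0; i < rank; ++i) n *= dims[i];
        return n;
    }
    std::vector<float> data_;
};

// Stand-in for the function that brcc generates from the sumaa kernel.
void sumaa(const Stream& a, const Stream& b, Stream& c) {
    assert(a.data().size() == b.data().size());
    assert(a.data().size() == c.data().size());
    for (std::size_t i = 0; i < c.data().size(); ++i) {
        c.data()[i] = a.data()[i] + b.data()[i];
    }
}
}  // namespace standin

// Wrapper following the three steps: read, call the kernel, write.
void add_vectors(const float* in_a, const float* in_b, float* out_c,
                 unsigned int n) {
    unsigned int dim1[1] = {n};
    standin::Stream a(1, dim1), b(1, dim1), c(1, dim1);
    a.read(in_a);             // step 1: C/C++ arrays -> streams
    b.read(in_b);
    standin::sumaa(a, b, c);  // step 2: run the kernel
    c.write(out_c);           // step 3: stream -> C/C++ array
}
```

In a real project the stand-ins would be replaced by the Brook+ stream type and the sumaa function from the generated sum.h, and the caller would only ever see plain float arrays.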


Using with Fortran
====================
Brook+, being essentially an extension to C/C++, is best called from C/C++ functions. However, provided that C/C++ wrappers are built for the Brook+ functions and packaged into a DLL, any other language, e.g. Fortran, can use the GPU by calling the C/C++ wrappers in the DLL.
Brook+ functions <--- C/C++ wrappers <--- Windows DLL / Unix shared objects <--- Fortran
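
As a rough sketch of what such a wrapper's interface might look like, the C++ below exports one function with C linkage. The names gpu_sum and WRAPPER_API are hypothetical. The length is taken by pointer because Fortran passes arguments by reference by default, and extern "C" keeps the symbol name free of C++ name mangling. To stay compilable without the Brook+ SDK, the body is a plain CPU loop; in a real DLL the read / kernel call / write steps would go there instead.

```cpp
#include <cassert>

// Export the symbol from a Windows DLL; expands to nothing elsewhere.
#if defined(_WIN32)
#define WRAPPER_API __declspec(dllexport)
#else
#define WRAPPER_API
#endif

// C-linkage wrapper intended to be callable from Fortran.
// All arguments are pointers, including the scalar length n.
extern "C" WRAPPER_API void gpu_sum(const float* a, const float* b,
                                    float* c, const int* n) {
    assert(a && b && c && n && *n >= 0);
    // Placeholder body: a real wrapper would read a and b into streams,
    // call the Brook+ kernel, and write the result stream into c.
    for (int i = 0; i < *n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

On the Fortran side, a matching interface is still needed (for example via ISO_C_BINDING with bind(C, name="gpu_sum") on compilers that support it), and the calling convention and name decoration must agree with the DLL. Keep in mind that Fortran arrays are column-major, which matters once the wrapper handles 2D data.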