Research
Variadic Macros and CUDA
0Again, a relatively obvious tool for those who are more experienced with C than I, but this is something I recently stumbled across and found very useful.
One of my major annoyances with CUDA is the way that device emulation works — you go through your code, writing printf statements here and there, compile for device emulation and everything’s fine. But remove your -deviceemu and everything goes horribly wrong, as device functions cannot call host functions. Until now, my only way around the error has been to comment out all of my print statements, which is pretty arduous.
The answer lies with a variadic macro. Define something like this at the top of your CUDA files, or the top of a generic header file included everywhere:
#ifdef DEVICEEMU
#define debug(format, ...) printf(format, ## __VA_ARGS__)
#else
#define debug(format, ...)
#endif
With this in place, where you would have used printf("Some output = %d.\n", variable), debug("Some output = %d.\n", variable) will do exactly the same thing. If -D DEVICEEMU is passed as an argument to nvcc, all calls to debug will be replaced by suitable (and working) printf statements. If it isn’t, they are all replaced by empty lines and the compiler just skips over them.
A few quick changes to your Makefile and everything’s pretty much automatic. Thanks, variadic macros!
Good Results, Bad Timing
0As is often the way with these things, as soon as I had presented a clearly broken and spiky graph at the CUDA conference the solution became obvious. With relatively little work this morning, I produced a much prettier graph:
I remain skeptical of the numbers, of course, but I’m fairly certain that my results are correct. It’s re-taught me a pretty valuable lesson about CUDA, too; the memory access pattern is very, very important. Though my error seems obvious in hindsight, it was not so obvious at the time!
When dividing a two (or three) dimensional array into blocks, it is tempting to simply calculate one’s indices using something like this:
(threadIdx.z*N + threadIdx.y)*N +threadIdx.x.
Perhaps surprisingly, this significantly increases the number of memory accesses for a given warp. Though the entire array has been placed in contiguous memory, cells that appear close to one another within a block are often quite far apart in reality. For example, in the first “tile” of a 2D 128 x 128 grid of floats, cell 0′s south “neighbour” is a massive 512 bytes away. If we assign one thread to each cell in a 4 x 4 tile, then, fetching the four rows requires at least four memory transactions, even on a Tesla.
If we think about this a little bit, we see that rearranging memory such that blocks lie in contiguous memory is relatively easy and, given the impressive speedup, definitely worth considering. An alternative solution is to ensure that tiles are at least 16 x 16, but this may not always be possible due to shared memory constraints.
Since this is something that I somehow managed to overlook for so long, I thought it was worth writing about. Hopefully it might help somebody with similar issues.
1st UK CUDA Developer’s Conference
2Today I attended my first academic conference; the 1st UK CUDA Developer’s Conference, in Oxford. I also presented for the first time at an academic conference — the same one, in fact.
The topic of my presentation was “Parallelising Pipelined Wavefront Computations on the GPU”, and I’ve made the slides available on my e-Portfolio (head to Research > Conferences in the side-bar). Since this work is still very much in-progress, the subject-matter isn’t too complicated, really; it is mainly an explanation of what wavefront computations are, and how we have attempted to optimize them for the GPU. That said, it probably serves quite well as an introduction to my research.
The last few days I was filled with a mild panic, unsure of what to expect, but luckily everything seems to have turned out okay. It was particularly interesting to see what other people are using CUDA for, and I think I’ve picked up a few interesting tricks and tips. If nothing else, it’s been a brilliant start to my PhD.
I believe that the other presentations will eventually be made available publicly… If and when this happens, I’ll be sure to update this post.
