Friday 15 August 2014

gpgpu - Understanding the pattern of execution of CUDA programs with multiple threads in a GPU

Here I am trying to understand the execution pattern of CUDA threads. My understanding was that if a particular thread meets a condition, it executes the kernel body, and that the indexing and memory accesses of each thread are done using the thread and block IDs. But when I came across the next piece of code, I stumbled. As far as correctness goes, the code gives the right result.

__global__ void kernel0(int *a)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    __shared__ int shared_a[32][33];

    for (int g5 = 0; g5 <= 96; g5 += 32) {
        for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
            for (int c1 = t0; c1 <= min(32, -g5 + 99); c1 += 32)
                shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)];
        __syncthreads();
        if (32 * b0 + t0 <= 99)
            for (int c2 = 0; c2 <= min(31, -g5 + 98); c2 += 1)
                shared_a[t0][c2 + 1] = shared_a[t0][c2] + 5;
        __syncthreads();
        if (((t0 + 31) % 32) + g5 <= 98)
            for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
                a[(32 * b0 + c0) * 100 + (((t0 + 31) % 32) + g5 + 1)] = shared_a[c0][((t0 + 31) % 32) + 1];
        __syncthreads();
    }
}

My question is: which thread IDs within the block size of 32 execute the first three for-loops?

Short answer

Every thread executes the loops, but only the threads whose indices fall in the intervals c0 in [0, min(31, -32 * b0 + 99)] and c1 in [t0, min(32, -g5 + 99)] (with c1 stepping by 32) do the work at the inner statement, namely

shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)]
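
If it helps to see this concretely, below is a small host-side sketch of my own (not from the original answer) that enumerates, for an example block b0 = 0 and tile offset g5 = 0, which shared_a elements each of the 32 threads would load; it simply mirrors the loop bounds of kernel0:

#include <stdio.h>

/* Hypothetical host-side sketch: which (c0, c1) elements does thread t0
   of block b0 load for one tile iteration g5? Mirrors kernel0's bounds. */
static int min_int(int a, int b) { return a < b ? a : b; }

int main(void)
{
    int b0 = 0, g5 = 0;                 /* example block and tile offset */
    for (int t0 = 0; t0 < 32; t0++)     /* the 32 threads of the block */
        for (int c0 = 0; c0 <= min_int(31, -32 * b0 + 99); c0 += 1)
            for (int c1 = t0; c1 <= min_int(32, -g5 + 99); c1 += 32)
                printf("thread %2d loads shared_a[%2d][%2d]\n", t0, c0, c1);
    return 0;
}

Since c1 starts at t0 and steps by 32 while its upper bound is at most 32, each thread essentially loads the column matching its own thread ID (thread 0 additionally loads column 32 when the bound allows it).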

About the mapping mechanism

The way you assign each thread its corresponding work is through indexing. As an illustration, the next statement is executed only by thread 0 of each block:

if (threadIdx.x == 0) { /* code */ }

while this one is executed only by the thread whose global index is 0 in a one-dimensional grid:

if (threadIdx.x + blockIdx.x * blockDim.x == 0) { /* code */ }
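
To make the mapping concrete, here is a minimal vector-addition sketch built on that global index; the names vecAdd, a, b, c, and n are my own illustration, not from the original question:

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    /* one global index per thread across the one-dimensional grid */
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)  /* threads whose index falls past the end stay idle */
        c[i] = a[i] + b[i];
}

A typical launch rounds the grid size up so every element is covered, e.g. vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); this over-provisioning is exactly why the i < n guard is needed.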

This code (from a simple array reduction) is useful to illustrate such behavior:

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}

All threads in the block execute the loop, and each of them has its own value of the index variable. The if statement then prevents some threads from executing the addition: it is performed only by the threads whose computed index falls within the block.

As you can see, this leaves some threads idle while others do a lot of work (load imbalance); a homogeneous workload across the grid is desirable to maximize performance.
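
For comparison, here is a sketch of the commonly used "sequential addressing" variant of the same reduction (assuming blockDim.x is a power of two and that tid and sdata are defined as above); it keeps the active threads contiguous, so whole warps go idle together instead of scattering idle lanes through the block:

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)                       /* only the first s threads stay active */
        sdata[tid] += sdata[tid + s];  /* fold the upper half onto the lower half */
    __syncthreads();
}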

Learning material

This is confusing at first, so I encourage you to read the CUDA C Programming Guide included in the CUDA Toolkit, and to play around with matrix-matrix multiplication, vector addition, and vector reduction.

A comprehensive guide is the book "Programming Massively Parallel Processors" by David B. Kirk and Wen-mei W. Hwu.

cuda gpgpu
