c - How to generalize square matrix multiplication to handle arbitrary dimensions


I have written this program, and I am having trouble understanding how to use multiple blocks via the dim3 variables in the kernel call line. The code works fine for a 1000*1000 matrix multiplication, but I am not getting the correct answer for lower dimensions such as 100*100 and 200*200.

    #include <stdio.h>
    #include <cuda.h>
    #define width 1000

    __global__ void kernel(int *a, int *b, int *c)
    {
        int tx = threadIdx.x + blockIdx.x*blockDim.x;
        int ty = threadIdx.y + blockIdx.y*blockDim.y;

        int sum=0,k;

        for(k=0;k<(width);++k)
        {
            sum += a[ty*width +k]*b[k*width + tx];
        }
        c[ty*width + tx] = sum;
    }

    int main()
    {
        int a[width*width],c[width*width],b[width*width];
        int *dev_a,*dev_b,*dev_c;
        int i,count=0;
        int size = (width*width)*sizeof(int);

        for(i=0;i<(width*width);i++)
        {
            a[i] = 1;
            b[i] = 1;
        }

        cudaMalloc((void **)&dev_a,size);
        cudaMalloc((void **)&dev_b,size);
        cudaMalloc((void **)&dev_c,size);

        cudaMemcpy(dev_a,&a,size,cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b,&b,size,cudaMemcpyHostToDevice);

        dim3 dimblock(20,20);
        dim3 blockid(50,50);

        kernel<<<blockid,dimblock>>>(dev_a,dev_b,dev_c);

        cudaMemcpy(&c,dev_c,size,cudaMemcpyDeviceToHost);

        /* Print the result, one matrix row per output line. */
        for(i=0;i<(width*width);i++)
        {
            count++;
            if(count == (width+1))
            {
                count = 1;
                printf("\n");
            }
            printf("%d ",c[i]);
        }
        printf("\n");
        return 0;
    }

This code will work for very specific dimensions but not for others.

It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads per block side: 20 in the code you have shown) and your grid dimension (number of blocks per grid side: 50 in the code you have shown).

So when width is 20*50 (1000) it will work as shown. But if I change width to some other value (say 800) and make no other changes, your code won't work. In the case of 800, however, I could get your code working by changing the grid dimension from 50 to 40, since then width = 800 = 20*40.

But what if I need to multiply two matrices of width 799? I can't come up with a product of grid and block dimension that will match that width exactly.

This is a fairly standard problem in CUDA programming: I cannot come up with convenient block and grid dimensions that exactly match my work (i.e. data) size, and if I launch too many threads or blocks, things don't seem to work.

To fix this problem we must do two things:

  1. Be sure to launch at least enough, and maybe more than enough, threads (blocks of threads) to cover the entire data set.
  2. Add conditional code in the kernel, so that only threads corresponding to valid data do any real work.

To address item 1 above, we modify our grid dimension calculations to something like this:

    dim3 dimblock(16,16);
    dim3 blockid((width+dimblock.x-1)/dimblock.x, (width+dimblock.y-1)/dimblock.y);
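The grid calculation above is integer ceiling division: it rounds width / block size up so that the last, partially filled block is still launched. As a rough host-side sketch of that arithmetic (the helper names `blocks_needed` and `idle_threads` are hypothetical and not part of the original post), it could be written as:

```c
/* Hypothetical helper, not from the original post: number of blocks of
   side block_dim needed to cover n elements along one dimension,
   i.e. ceil(n / block_dim) using only integer arithmetic. */
unsigned int blocks_needed(unsigned int n, unsigned int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}

/* Hypothetical helper: threads launched along one dimension that fall
   past the end of the data and therefore must do no work. */
unsigned int idle_threads(unsigned int n, unsigned int block_dim)
{
    return blocks_needed(n, block_dim) * block_dim - n;
}
```

For a width of 799 and a block side of 16, `blocks_needed(799, 16)` gives 50 blocks per dimension, which launches 800 threads per dimension, so one extra row and one extra column of threads have no valid data to work on.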

To address item 2 above, we modify our kernel code to condition thread behavior on whether or not the thread corresponds to valid data:

    __global__ void kernel(int *a, int *b, int *c, int mwidth)
    {
        int tx = threadIdx.x + blockIdx.x*blockDim.x;
        int ty = threadIdx.y + blockIdx.y*blockDim.y;
        /* Only threads that map to a valid matrix element do any work. */
        if ((tx<mwidth)&&(ty<mwidth))
        {
            int sum=0,k;
            for(k=0;k<(mwidth);++k)
            {
                sum += a[ty*mwidth +k]*b[k*mwidth + tx];
            }
            c[ty*mwidth + tx] = sum;
        }
    }

And since we've modified the kernel with a new parameter, we have to pass that parameter on invocation:

    kernel<<<blockid,dimblock>>>(dev_a,dev_b,dev_c, width); 
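To see how the rounded-up grid and the bounds check in the kernel work together, the launch can be emulated on the host without a GPU. The following is only a sketch under stated assumptions: `emulated_thread` and `check_all_ones` are hypothetical names, the two nested loops stand in for the global thread indices of a square grid of square blocks, and the inputs are all ones as in the question's `main`, so every element of the result should equal the matrix width.

```c
#include <stdlib.h>

/* Hypothetical host-side stand-in for one GPU thread with global
   indices (tx, ty); the body mirrors the guarded kernel. */
static void emulated_thread(const int *a, const int *b, int *c,
                            int tx, int ty, int mwidth)
{
    if (tx < mwidth && ty < mwidth) {
        int sum = 0;
        for (int k = 0; k < mwidth; ++k)
            sum += a[ty * mwidth + k] * b[k * mwidth + tx];
        c[ty * mwidth + tx] = sum;
    }
}

/* Hypothetical check: emulate a launch whose grid is sized by ceiling
   division and verify the product of two all-ones matrices.
   Returns 1 if every element equals mwidth, 0 if any element differs,
   and -1 if memory allocation fails. */
int check_all_ones(int mwidth, int block_dim)
{
    size_t n = (size_t)mwidth * (size_t)mwidth;
    int *a = malloc(n * sizeof *a);
    int *b = malloc(n * sizeof *b);
    int *c = malloc(n * sizeof *c);
    int ok = -1;

    if (a && b && c) {
        for (size_t i = 0; i < n; ++i) {
            a[i] = 1;
            b[i] = 1;
            c[i] = 0;
        }

        /* Threads launched per dimension: may exceed mwidth. */
        int grid_dim = (mwidth + block_dim - 1) / block_dim;
        int launched = grid_dim * block_dim;

        for (int ty = 0; ty < launched; ++ty)
            for (int tx = 0; tx < launched; ++tx)
                emulated_thread(a, b, c, tx, ty, mwidth);

        ok = 1;
        for (size_t i = 0; i < n; ++i)
            if (c[i] != mwidth)
                ok = 0;
    }

    free(a);
    free(b);
    free(c);
    return ok;
}
```

Calling `check_all_ones(33, 16)` emulates 48 threads per dimension for a 33*33 problem; the guard makes the 15 surplus rows and columns of threads return without touching memory, which is the behavior the modified kernel relies on.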

That should be all that is needed to logically extend the code you have shown to handle "arbitrary" dimensions. I would also suggest adding proper CUDA error checking any time you are having trouble with CUDA code.

