c - How to generalize square matrix multiplication to handle arbitrary dimensions


I have written this program, and I am having trouble understanding how to use multiple blocks through the dim3 variable in the kernel launch line. The code works fine for a 1000*1000 matrix multiplication, but I do not get the correct answer for lower dimensions such as 100*100 or 200*200.

    #include <stdio.h>
    #include <cuda.h>
    #define width 1000

    __global__ void kernel(int *a,int *b,int *c)
    {
        // global column (tx) and row (ty) handled by this thread
        int tx = threadIdx.x + blockIdx.x*blockDim.x;
        int ty = threadIdx.y + blockIdx.y*blockDim.y;

        int sum=0,k;

        for(k=0;k<(width);++k)
        {
            sum += a[ty*width +k]*b[k*width + tx];
        }
        c[ty*width + tx] = sum;
    }

    int main()
    {
        int a[width*width],c[width*width],b[width*width];
        int *dev_a,*dev_b,*dev_c;
        int i,count=0;
        int size = (width*width)*sizeof(int);

        for(i=0;i<(width*width);i++)
        {
            a[i] = 1;
            b[i] = 1;
        }

        cudaMalloc((void **)&dev_a,size);
        cudaMalloc((void **)&dev_b,size);
        cudaMalloc((void **)&dev_c,size);

        cudaMemcpy(dev_a,&a,size,cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b,&b,size,cudaMemcpyHostToDevice);

        // 20x20 threads per block and 50x50 blocks: exactly 1000x1000 threads
        dim3 dimblock(20,20);
        dim3 blockid(50,50);

        kernel<<<blockid,dimblock>>>(dev_a,dev_b,dev_c);

        cudaMemcpy(&c,dev_c,size,cudaMemcpyDeviceToHost);

        // print the result, one matrix row per output line
        for(i=0;i<(width*width);i++)
        {
            count++;
            if(count == (width+1))
            {
                count = 1;
                printf("\n");
            }

            printf("%d ",c[i]);
        }
        printf("\n");
        return 0;
    }

This code will work for very specific dimensions but not for others.

It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads: 20 in the code you have shown) and your grid dimension (number of blocks: 50 in the code you have shown).

So when width is 20*50 (1000), it will work as shown. If I change width to some other value (say 800) and make no other changes, your code won't work. In the case of 800, however, I could get your code working by changing the grid dimension from 50 to 40, so that width = 800 = 20*40.

But what if I need to multiply two matrices of width 799? I can't come up with a product of grid and block dimensions that will match that width exactly.

This is a fairly standard problem in CUDA programming: I cannot come up with convenient block and grid dimensions that exactly match my work (i.e. data) size, and if I launch too many threads or blocks, things don't seem to work.
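To make the mismatch concrete, here is a small host-only C sketch (no GPU required) that compares the number of threads launched along one axis with the matrix width. The function names are hypothetical helpers introduced for illustration; they are not part of CUDA or of the code in the question.

```c
#include <assert.h>

/* Threads launched along one axis: number of blocks times threads per block. */
int threads_per_axis(int grid_dim, int block_dim)
{
    assert(grid_dim > 0 && block_dim > 0);
    return grid_dim * block_dim;
}

/* Returns 1 only when the launch covers the width exactly, which is the
   only case the unguarded kernel from the question handles correctly. */
int launch_matches_width(int width, int grid_dim, int block_dim)
{
    return threads_per_axis(grid_dim, block_dim) == width;
}

/* Threads along one axis whose global index is >= width (0 if none).
   In the unguarded kernel these threads index past the end of a row. */
int excess_threads(int width, int grid_dim, int block_dim)
{
    int launched = threads_per_axis(grid_dim, block_dim);
    return launched > width ? launched - width : 0;
}
```

With the question's launch of 50 blocks of 20 threads per axis, a width of 100 leaves 900 excess threads per axis, and each of them reads and writes outside the intended row or outside the arrays entirely.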

To fix this problem we must do two things:

  1. Be sure to launch at least enough, and maybe more than enough, threads (blocks of threads) to cover the entire data set.
  2. Add conditional code in the kernel, so that only the threads corresponding to valid data do any real work.

To address item 1 above, we modify our grid dimension calculations to something like this:

    dim3 dimblock(16,16);
    dim3 blockid((width+dimblock.x-1)/dimblock.x,(width+dimblock.y-1)/dimblock.y);
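The expression `(width+dimblock.x-1)/dimblock.x` is an integer division that rounds up. As a minimal host-side sketch of the same arithmetic in plain C (the helper name `blocks_needed` is an assumption made for illustration, not a CUDA function):

```c
#include <assert.h>

/* Smallest number of blocks of block_dim threads that covers width elements.
   Equivalent to ceil(width / block_dim) using only integer arithmetic. */
int blocks_needed(int width, int block_dim)
{
    assert(width > 0 && block_dim > 0);
    return (width + block_dim - 1) / block_dim;
}
```

For a width of 799 and 16 threads per block this gives 50 blocks, so 800 threads are launched per axis and exactly one of them per axis has no valid data to work on.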

To address item 2 above, we modify our kernel code to condition thread behavior on whether or not the thread corresponds to valid data:

    __global__ void kernel(int *a,int *b,int *c, int mwidth)
    {
        int tx = threadIdx.x + blockIdx.x*blockDim.x;
        int ty = threadIdx.y + blockIdx.y*blockDim.y;
        // only threads that map to a valid matrix element do any work
        if ((tx<mwidth)&&(ty<mwidth)){
            int sum=0,k;
            for(k=0;k<(mwidth);++k)
            {
                sum += a[ty*mwidth +k]*b[k*mwidth + tx];
            }
            c[ty*mwidth + tx] = sum;
        }
    }
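The effect of the guard can be checked without a GPU. The following is a rough CPU simulation in plain C, not CUDA code: it walks every (block, thread) index pair that a rounded-up launch would create and runs the guarded kernel body for each one sequentially. The names `guarded_thread` and `simulate_launch` are hypothetical, and the all-ones inputs mirror the question's test data, so every output element should equal the width.

```c
#include <assert.h>
#include <stdlib.h>

/* CPU stand-in for one GPU thread executing the guarded kernel body. */
void guarded_thread(const int *a, const int *b, int *c,
                    int mwidth, int tx, int ty)
{
    if (tx < mwidth && ty < mwidth) {
        int sum = 0;
        for (int k = 0; k < mwidth; ++k)
            sum += a[ty * mwidth + k] * b[k * mwidth + tx];
        c[ty * mwidth + tx] = sum;
    }
}

/* Simulates a square launch with a rounded-up grid size.
   Returns 1 if every element of c equals mwidth, 0 if any element
   differs, and -1 if memory allocation fails. */
int simulate_launch(int mwidth, int block_dim)
{
    assert(mwidth > 0 && block_dim > 0);
    int grid_dim = (mwidth + block_dim - 1) / block_dim;
    size_t n = (size_t)mwidth * (size_t)mwidth;
    int *a = malloc(n * sizeof *a);
    int *b = malloc(n * sizeof *b);
    int *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1;
    }
    for (size_t i = 0; i < n; ++i) {
        a[i] = 1;
        b[i] = 1;
        c[i] = 0;
    }

    /* Visit every block and every thread within it, as the launch would. */
    for (int by = 0; by < grid_dim; ++by)
        for (int bx = 0; bx < grid_dim; ++bx)
            for (int ty = 0; ty < block_dim; ++ty)
                for (int tx = 0; tx < block_dim; ++tx)
                    guarded_thread(a, b, c, mwidth,
                                   tx + bx * block_dim,
                                   ty + by * block_dim);

    int ok = 1;
    for (size_t i = 0; i < n; ++i)
        if (c[i] != mwidth)
            ok = 0;

    free(a); free(b); free(c);
    return ok;
}
```

Calling `simulate_launch(37, 16)` exercises a width that is not a multiple of the block size: 48 threads per axis are visited, and the 11 per axis that fall outside the matrix skip the body instead of touching memory.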

And since we've modified the kernel with a new parameter, we have to pass that parameter on invocation:

    kernel<<<blockid,dimblock>>>(dev_a,dev_b,dev_c, width); 

That should be all that is needed to logically extend the code you have shown to handle "arbitrary" dimensions. I would also suggest adding proper CUDA error checking any time you are having trouble with CUDA code.

