C - How to generalize square matrix multiplication to handle arbitrary dimensions
I have written a program, but I am having trouble understanding how to use multiple blocks via the dim3 variable in the kernel call line. The code works fine when doing a 1000*1000 matrix multiplication, but I am not getting the correct answer for lower dimensions such as 100*100 or 200*200.
#include <stdio.h>
#include <cuda.h>
#define width 1000

__global__ void kernel(int *a, int *b, int *c)
{
    int tx = threadIdx.x + blockIdx.x*blockDim.x;
    int ty = threadIdx.y + blockIdx.y*blockDim.y;
    int sum = 0, k;
    for (k = 0; k < (width); ++k) {
        sum += a[ty*width + k] * b[k*width + tx];
    }
    c[ty*width + tx] = sum;
}

int main()
{
    int a[width*width], c[width*width], b[width*width];
    int *dev_a, *dev_b, *dev_c;
    int i, count = 0;
    int size = (width*width)*sizeof(int);
    for (i = 0; i < (width*width); i++) {
        a[i] = 1;
        b[i] = 1;
    }
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);
    cudaMemcpy(dev_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, size, cudaMemcpyHostToDevice);
    dim3 dimblock(20,20);
    dim3 blockid(50,50);
    kernel<<<blockid,dimblock>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(&c, dev_c, size, cudaMemcpyDeviceToHost);
    for (i = 0; i < (width*width); i++) {
        count++;
        if (count == (width+1)) {
            count = 1;
            printf("\n");
        }
        printf("%d ", c[i]);
    }
    printf("\n");
    return 0;
}
This code will work for specific dimensions but not for others. It will work for square matrix multiplication when width is equal to the product of the block dimension (number of threads per block, 20 in the code you have shown) and the grid dimension (number of blocks, 50 in the code you have shown). So when width is 20*50 (1000), it works as shown. If you change width to some other value (say 800) and make no other changes, the code won't work. In the case of 800, however, you can get the code working by changing the grid dimension from 50 to 40, so that width = 800 = 20*40.
But what if you need to multiply two matrices of width 799? You can't come up with a product of grid and block dimensions that matches the width exactly.
This is a standard problem in CUDA programming: you cannot come up with convenient block and grid dimensions that exactly match your work (i.e. data) size, and if you launch too many threads (or blocks), things don't seem to work.
To fix this problem you must do two things:
- Be sure to launch at least enough (maybe more than enough) threads, i.e. blocks of threads, to cover the entire data set
- Add conditional code in the kernel, so that only the threads corresponding to valid data do the real work.
To address item 1 above, we modify our grid dimension calculation like this:
dim3 dimblock(16,16);
dim3 blockid((width+dimblock.x-1)/dimblock.x, (width+dimblock.y-1)/dimblock.y);
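If it helps to see the arithmetic, here is a small standalone host-side sketch (not part of the code above; the values width = 799 and a 16x16 block are just taken from the discussion) showing what the round-up calculation produces:

#include <stdio.h>

/* standalone illustration: verify that the round-up grid calculation
   covers a width of 799 with a 16x16 block */
int main()
{
    int width = 799;
    int block = 16;
    int grid  = (width + block - 1) / block;   /* integer ceiling division -> 50 */
    printf("grid = %d, threads per dimension = %d\n", grid, grid * block); /* 50, 800 */
    return 0;
}

With 50 blocks of 16 threads we launch 800 threads per dimension, one more than needed, which is exactly why the kernel guard in item 2 is required.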
To address item 2 above, we modify our kernel code to condition the thread behavior on whether or not the thread corresponds to valid data:
__global__ void kernel(int *a, int *b, int *c, int mwidth)
{
    int tx = threadIdx.x + blockIdx.x*blockDim.x;
    int ty = threadIdx.y + blockIdx.y*blockDim.y;
    if ((tx < mwidth) && (ty < mwidth)) {
        int sum = 0, k;
        for (k = 0; k < (mwidth); ++k) {
            sum += a[ty*mwidth + k] * b[k*mwidth + tx];
        }
        c[ty*mwidth + tx] = sum;
    }
}
And since we've modified the kernel to take a new parameter, we have to pass that parameter on invocation:
kernel<<<blockid,dimblock>>>(dev_a,dev_b,dev_c, width);
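As a quick sanity check, here is a sketch (relying only on the all-ones initialization already in your code) that could go after the final cudaMemcpy in main: with a[i] = b[i] = 1, every entry of the result should equal width, so the host can verify the output instead of printing the whole matrix.

/* sketch: with a[i] = b[i] = 1, every entry of c should equal width */
int errors = 0;
for (i = 0; i < width*width; i++) {
    if (c[i] != width) errors++;
}
printf("%s (%d mismatches)\n", errors ? "FAIL" : "PASS", errors);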
That should be all that is needed to logically extend the code you have shown to handle "arbitrary" dimensions. I also suggest adding proper CUDA error checking any time you are having trouble with a CUDA code.
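For reference, a minimal error-checking sketch could look like the following (the CUDA_CHECK macro name is just illustrative, not something from the code above): wrap each runtime API call, and check the kernel launch itself with cudaGetLastError plus cudaDeviceSynchronize.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* illustrative helper: abort with a readable message if a CUDA call fails */
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

/* usage sketch around the launch in main:                                 */
/*   CUDA_CHECK(cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice));       */
/*   kernel<<<blockid,dimblock>>>(dev_a, dev_b, dev_c, width);             */
/*   CUDA_CHECK(cudaGetLastError());        // launch configuration errors */
/*   CUDA_CHECK(cudaDeviceSynchronize());   // errors during execution     */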