python - Numba function slower than C++ and loop re-order further slows down x10
The following code simulates extracting binary words from different locations within a set of images.
The Numba-wrapped function, wordcalc in the code below, has 2 problems:
- It is 3 times slower compared to a similar implementation in C++.
- Most strangely, if you switch the order of the "ibase" and "ibit" for-loops, speed drops by a factor of 10 (!). This does not happen in the C++ implementation, which remains unaffected.
I'm using Numba 0.18.2 with WinPython 2.7.
What could be causing this?
import numpy as np
import numba as nb

imdim = 80
numinsts = 10**4
numinstssub = 10**4/4
bitsnum = 13

# random "images", one flattened imdim x imdim image per row
xs = np.random.rand(numinsts, imdim**2)
# use every 4th image
iinstinds = np.array(range(numinsts)[::4])
baseinds = np.arange(imdim**2 - imdim*20 + 1)
# two pixel offsets per bit; each bit compares the two pixels
ofst1 = np.random.randint(0, imdim*20, bitsnum)
ofst2 = np.random.randint(0, imdim*20, bitsnum)

@nb.jit(nopython=True)
def wordcalc(xs, iinstinds, baseinds, ofst, bitsnum, newxz):
    count = 0
    for i in iinstinds:
        xi = xs[i]
        for ibit in range(bitsnum):
            for ibase in range(baseinds.shape[0]):
                u = xi[baseinds[ibase] + ofst[0, ibit]] > xi[baseinds[ibase] + ofst[1, ibit]]
                # set bit number ibit of the word when the comparison is true
                newxz[count, ibase] = newxz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newxz

ret = wordcalc(xs, iinstinds, baseinds, np.array([ofst1, ofst2]), bitsnum,
               np.zeros((iinstinds.size, baseinds.size), dtype=np.uint16))
I get a 4x speed-up by changing np.uint16(u * (2**ibit)) to np.uint16(u << ibit), i.e. replacing the power of 2 with a bitshift, which should be equivalent (for integers).
It seems reasonably likely that your C++ compiler is making this substitution itself.
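The equivalence between multiplying by a power of two and shifting can be checked in plain Python, without Numba. This is only a minimal sketch: the helper names pack_with_power and pack_with_shift are made up for illustration and do not appear in the question's code.

```python
def pack_with_power(bits):
    """OR comparison results into a word using u * (2**ibit)."""
    word = 0
    for ibit, u in enumerate(bits):
        word |= int(u) * (2 ** ibit)
    return word


def pack_with_shift(bits):
    """Same packing, but using the bit shift u << ibit."""
    word = 0
    for ibit, u in enumerate(bits):
        word |= int(u) << ibit
    return word


# 13 comparison results, matching bitsnum = 13 in the question
bits = [True, False, True, True, False, False, True,
        False, True, False, False, True, True]

# Both packings must agree for boolean/integer inputs
assert pack_with_power(bits) == pack_with_shift(bits)
```

Both versions produce the same word for integer inputs, so the change should only affect speed, not the result.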
Swapping the order of the 2 loops makes a small difference for me for both your original version (5%) and my optimized version (15%), so I don't think I can make a useful comment on that.
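For anyone who wants to repeat the loop-order comparison, a small harness along these lines may help. It is a sketch in pure Python on a much smaller problem (imdim = 8 is an assumption, as are the function names), so the absolute timings say nothing about Numba. The point is only to confirm that both loop orders give identical words and to show one way of timing them; the same structure could be applied to two jitted variants.

```python
import random
import timeit


def words_bit_outer(xi, baseinds, ofst, bitsnum):
    """Loop order from the question: ibit outside, ibase inside."""
    out = [0] * len(baseinds)
    for ibit in range(bitsnum):
        for ibase in range(len(baseinds)):
            b = baseinds[ibase]
            u = xi[b + ofst[0][ibit]] > xi[b + ofst[1][ibit]]
            out[ibase] |= u << ibit
    return out


def words_base_outer(xi, baseinds, ofst, bitsnum):
    """Swapped loop order: ibase outside, ibit inside."""
    out = [0] * len(baseinds)
    for ibase in range(len(baseinds)):
        b = baseinds[ibase]
        for ibit in range(bitsnum):
            u = xi[b + ofst[0][ibit]] > xi[b + ofst[1][ibit]]
            out[ibase] |= u << ibit
    return out


random.seed(0)
imdim = 8          # much smaller than the question's 80, to keep this quick
bitsnum = 13
xi = [random.random() for _ in range(imdim ** 2)]
baseinds = list(range(imdim ** 2 - imdim * 2 + 1))
ofst = [[random.randrange(imdim * 2) for _ in range(bitsnum)] for _ in range(2)]

# The two loop orders must produce the same words
assert words_bit_outer(xi, baseinds, ofst, bitsnum) == \
    words_base_outer(xi, baseinds, ofst, bitsnum)

t_bit = timeit.timeit(lambda: words_bit_outer(xi, baseinds, ofst, bitsnum), number=200)
t_base = timeit.timeit(lambda: words_base_outer(xi, baseinds, ofst, bitsnum), number=200)
print("ibit outer: %.4f s, ibase outer: %.4f s" % (t_bit, t_base))
```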
If you really wanted to compare the Numba and C++ versions, you can look at the compiled Numba function by doing os.environ['NUMBA_DUMP_ASSEMBLY']='1' before you import numba. (That's clearly quite involved though.)
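Since the variable has to be set before numba is imported, a small guard like the following can catch the ordering mistake. This is a sketch only: enable_numba_asm_dump is a hypothetical helper, not part of Numba, and the numba import itself is left commented out so the snippet runs even where Numba is not installed.

```python
import os
import sys


def enable_numba_asm_dump():
    """Set NUMBA_DUMP_ASSEMBLY and report whether numba was still unimported.

    Returns True if numba had not been imported yet, i.e. the setting
    was made early enough to be picked up at import time.
    """
    os.environ['NUMBA_DUMP_ASSEMBLY'] = '1'
    return 'numba' not in sys.modules


fresh = enable_numba_asm_dump()
if not fresh:
    print("numba was already imported; the setting may not take effect")

# import numba as nb   # import only after the variable has been set
```

With the variable set, the generated assembly should be printed when the jitted function is first compiled, which can then be compared against the C++ compiler's output.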
For reference, I'm using Numba 0.19.1.