Reading files from disk in Python in parallel
I am in the process of migrating from MATLAB to Python because of the vast number of interesting machine learning packages available in Python. One of the issues that has been a source of confusion for me is parallel processing. In particular, I want to read thousands of text files from disk in a for loop, and I want to do it in parallel. In MATLAB, using parfor instead of for does the trick, but so far I haven't been able to figure out how to do the same in Python.

Here is an example of what I want to do. I want to read n text files, shape each of them into an n1 x n2 array, and save each one into an n x n1 x n2 NumPy array, which is then returned by a function. Assuming the file names are file_0001.dat, file_0002.dat, etc., the code I would like to parallelise is as follows:
import numpy as np

n = 10000
n1 = 200
n2 = 100

result = np.empty([n, n1, n2])
for counter in range(n):
    # build the file name, e.g. file_0000.dat, file_0001.dat, ...
    t_str = "%.4d" % counter
    filename = 'file_' + t_str + '.dat'
    temp_array = np.loadtxt(filename)
    temp_array.shape = [n1, n2]
    result[counter, :, :] = temp_array
I run my codes on a cluster, so I can use many processors for the job. Hence, any comments on which of the parallelisation methods is more suitable for this task (if there is more than one) are most welcome.
Note: I am aware of this post; however, there, the only things to worry about were the out1, out2, and out3 variables, which were used explicitly as arguments of the function to be parallelised. Here, I have many 2D arrays that should be read from file and saved into a 3D array. So, the answer to that question is not general enough for my case (or that is how I understood it).
You still want to use multiprocessing, just structure it a bit differently:
from multiprocessing import Pool
import numpy as np

n = 10000
n1 = 200
n2 = 100

def myshaper(fname):
    # load one file and reshape its contents to n1 x n2
    return np.loadtxt(fname).reshape([n1, n2])

if __name__ == '__main__':  # required on platforms that spawn worker processes
    result = np.empty([n, n1, n2])
    # generator expression: file names are produced lazily, not stored in memory
    filenames = ('file_%.4d.dat' % i for i in range(n))
    pool = Pool()
    for i, temp_array in enumerate(pool.imap(myshaper, filenames)):
        result[i, :, :] = temp_array
    pool.close()
    pool.join()
What this does first is create a generator for the file names in filenames. This means the file names are not all stored in memory, but you can still loop over them. Next, it defines a small function that loads and reshapes one file. A lambda (the equivalent of an anonymous function in MATLAB) would be more compact, but multiprocessing has to pickle the worker function and cannot pickle lambdas, so an ordinary named function is used here. It then applies that function to each file name using multiple processes and puts the results into the overall array. Finally, it closes the processes.
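One knob worth knowing about here: pool.imap hands tasks to the workers one at a time by default, which can add noticeable inter-process overhead when there are thousands of small files. Its optional chunksize argument batches the tasks; as a sketch, the loop above could become the following (chunksize=100 is an illustrative value, not a tuned one):

# same loop as above, but dispatch file names to the workers in batches
# of 100 to cut communication overhead (the batch size is illustrative)
for i, temp_array in enumerate(pool.imap(myshaper, filenames, chunksize=100)):
    result[i, :, :] = temp_array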
This version uses more idiomatic Python. However, the following approach is more similar to your original one (although less idiomatic), so you might understand it a bit better:
from multiprocessing import Pool
import numpy as np

n = 10000
n1 = 200
n2 = 100

def proccounter(counter):
    # the body of the original for loop: load one file and reshape it
    t_str = "%.4d" % counter
    filename = 'file_' + t_str + '.dat'
    temp_array = np.loadtxt(filename)
    temp_array.shape = [n1, n2]
    return counter, temp_array

if __name__ == '__main__':  # required on platforms that spawn worker processes
    result = np.empty([n, n1, n2])
    pool = Pool()
    for counter, temp_array in pool.imap(proccounter, range(n)):
        result[counter, :, :] = temp_array
    pool.close()
    pool.join()
This splits the body of your for loop into a function, applies that function to each element of the range using multiple processors, and then puts the results into the array. It is essentially your original code with the for loop split into two parts: a function that does the per-file work, and a loop that collects the results.
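As an aside, since you mention running on a cluster: multiprocessing parallelises across the cores of a single node only; spreading work over several nodes would need something like mpi4py. For the single-node case, the standard library's concurrent.futures offers the same pattern behind a slightly higher-level interface. Here is a minimal sketch of the same job (the helper name load_file is made up for this example):

from concurrent.futures import ProcessPoolExecutor
import numpy as np

n = 10000
n1 = 200
n2 = 100

def load_file(counter):
    # hypothetical helper doing the same work as proccounter above
    return np.loadtxt('file_%.4d.dat' % counter).reshape([n1, n2])

if __name__ == '__main__':
    result = np.empty([n, n1, n2])
    with ProcessPoolExecutor() as executor:
        # executor.map yields results in input order, so indices line up
        for counter, temp_array in enumerate(executor.map(load_file, range(n))):
            result[counter, :, :] = temp_array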