for loop - Reading files from disk in Python in Parallel


I am in the process of migrating from MATLAB to Python because of the vast number of interesting machine learning packages available in Python. One of the issues that has been a source of confusion for me is parallel processing. In particular, I want to read thousands of text files from disk in a for loop, and I want to do it in parallel. In MATLAB, using parfor instead of for does the trick, but so far I haven't been able to figure out how to do this in Python. Here is an example of what I want to do: I want to read n text files, shape each of them into an n1 x n2 array, and save each one into an n x n1 x n2 NumPy array. This array is what the function will return. Assuming the file names are file_0001.dat, file_0002.dat, etc., the code I would like to parallelise is as follows:

    import numpy as np

    n = 10000
    n1 = 200
    n2 = 100
    result = np.empty([n, n1, n2])

    for counter in range(n):
        t_str = "%.4d" % counter
        filename = 'file_' + t_str + '.dat'
        temp_array = np.loadtxt(filename)
        temp_array.shape = [n1, n2]
        result[counter, :, :] = temp_array

I run these codes on a cluster, so I can use many processors for the job. Hence, any comments on which of the parallelisation methods is more suitable for this task (if there is more than one) are welcome.

Note: I am aware of this post, but in that post there are only the out1, out2, out3 variables to worry about, and they have been used explicitly as arguments of the function to be parallelised. Here, I have many 2D arrays that should be read from file and saved into a 3D array. So, the answer to that question is not general enough for my case (or that is how I understood it).

You will still want to use multiprocessing, just structure it a bit differently:

    from multiprocessing import Pool

    import numpy as np

    n = 10000
    n1 = 200
    n2 = 100
    result = np.empty([n, n1, n2])

    # generator expression: file names are produced lazily, never all held in memory
    filenames = ('file_%.4d.dat' % i for i in range(n))

    # a named, module-level function: multiprocessing must pickle the worker
    # callable, and lambdas cannot be pickled
    def myshaper(fname):
        return np.loadtxt(fname).reshape([n1, n2])

    pool = Pool()
    for i, temp_array in enumerate(pool.imap(myshaper, filenames)):
        result[i, :, :] = temp_array
    pool.close()
    pool.join()

What this does is first create a generator for the file names in filenames. This means the file names are not all stored in memory, but you can still loop over them. Next, it creates a small helper function (conceptually the counterpart of an anonymous function in MATLAB) that loads and reshapes a single file; it is written as a named, module-level function rather than a lambda because multiprocessing has to pickle the callable it sends to the workers, and lambdas cannot be pickled. It then applies that function to each file name using multiple processes, and puts the results in the overall array. Finally, it closes the processes.
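For comparison, the same pattern can also be written with the newer concurrent.futures interface from the standard library. The sketch below assumes the same file_0000.dat-style naming as above, and the load_and_shape helper is just an illustrative name; the __main__ guard makes it safe on platforms that spawn rather than fork worker processes:

    from concurrent.futures import ProcessPoolExecutor

    import numpy as np

    n, n1, n2 = 10000, 200, 100

    def load_and_shape(fname):
        # module-level so it can be pickled and shipped to the workers
        return np.loadtxt(fname).reshape([n1, n2])

    if __name__ == '__main__':
        result = np.empty([n, n1, n2])
        filenames = ('file_%.4d.dat' % i for i in range(n))
        with ProcessPoolExecutor() as executor:
            # executor.map, like pool.imap, yields results in input order
            for i, temp_array in enumerate(executor.map(load_and_shape, filenames)):
                result[i, :, :] = temp_array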

This version uses more idiomatic Python. However, an approach that is more similar to your original one (although less idiomatic) might be easier to understand:

    from multiprocessing import Pool

    import numpy as np

    n = 10000
    n1 = 200
    n2 = 100
    result = np.empty([n, n1, n2])

    def proccounter(counter):
        t_str = "%.4d" % counter
        filename = 'file_' + t_str + '.dat'
        temp_array = np.loadtxt(filename)
        temp_array.shape = [n1, n2]
        return counter, temp_array

    pool = Pool()
    for counter, temp_array in pool.imap(proccounter, range(n)):
        result[counter, :, :] = temp_array
    pool.close()
    pool.join()

This splits the body of your for loop out into a function, applies that function to each element of the range using multiple processors, and then puts the results into the array. It is basically your original function with the for loop split into two for loops.
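Since proccounter returns the counter together with the data, each result can be written into the right slot of result even when it arrives out of order. That means you could also use imap_unordered, which hands results back as soon as any worker finishes; a minimal sketch of that variant (the chunksize of 50 is only an illustrative guess, not a tuned value):

    pool = Pool()
    # imap_unordered yields results in whatever order the workers finish;
    # the returned counter tells us where each array belongs in `result`
    for counter, temp_array in pool.imap_unordered(proccounter, range(n), chunksize=50):
        result[counter, :, :] = temp_array
    pool.close()
    pool.join()

Batching the tasks with a chunksize reduces the amount of inter-process communication, which can matter when each individual file is quick to load.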

