python - Splitting a CSV file into equal parts?
I have a large CSV file that I would like to split into a number of pieces equal to the number of CPU cores in the system. I want to use multiprocessing and have all the cores work on the file together. However, I'm having trouble splitting the file into pieces. I've looked around on Google and found some sample code that appears to do what I want. Here is what I have so far:
import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_cpus=multiprocessing.cpu_count()):
    read_buffer = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(read_buffer))
                this_file_size += read_buffer
            files[-1].write(infile.readline())  # grab the possible remainder of a row
            files[-1].seek(0, 0)
    return files

files = split("sample_simple.csv")
print len(files)

for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row
The two prints show me the correct file size and that it was split into 4 pieces (my system has 4 CPU cores).
However, the last section of the code, which prints all the rows in each of the pieces, gives the error:
for row in reader:
_csv.Error: line contains NULL byte
I tried printing the rows without running the split function and it prints all the values correctly. I suspect the split function has added NULL bytes to the resulting 4 file pieces, but I'm not sure why.
Does anyone know if this is a correct and fast method to split the file? I just want the resulting pieces to be readable by csv.reader.
As I said in a comment, CSV files need to be split on row (or line) boundaries. Your code doesn't do that and potentially breaks rows up somewhere in the middle of one, which I suspect is the cause of your _csv.Error.
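To make the boundary issue concrete, here's a tiny standalone illustration (the data and the byte offset are made up for the example, not taken from the question):

data = 'name,score\nalice,10\nbob,7\n'
half = len(data) // 2                  # a byte-count split point that ignores line boundaries
first, second = data[:half], data[half:]
print repr(first)    # prints 'name,score\nal'   -- alice's row is cut in two
print repr(second)   # prints 'ice,10\nbob,7\n'  -- the second piece starts mid-row

csv.reader would then see 'al' and 'ice,10' as mangled rows instead of the single row 'alice,10'.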
The following avoids that problem by processing the input file as a series of lines. I've tested it and it seems to work standalone, in the sense that it divided the sample file up into approximately equally sized chunks (only approximately, because it's unlikely that a whole number of rows will fit exactly into a chunk).
Update

This is a substantially faster version of the code I originally posted. The improvement comes from using the temp file's own tell() method to determine the continually changing length of the file as it's being written, instead of calling os.path.getsize(), which eliminated the need to flush() the file and call os.fsync() on it after each row is written.
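To see why that helps, here's a minimal standalone sketch (not part of the split code itself): tell() reads the current write position straight off the file object, so no flush(), os.fsync(), or os.path.getsize() round trip is needed to know how much has been written.

import tempfile

tf = tempfile.TemporaryFile()
tf.write('a,b,c\n')
print tf.tell()  # 6 -- bytes written so far, known without flushing or touching the filesystem

The updated version: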
import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    read_buffer = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', read_buffer) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''
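For completeness, since the goal was to have all the cores work on the pieces: here is one hypothetical way to wire the chunks into multiprocessing. This is only a sketch under assumptions; process_rows is a placeholder worker (not from the question), and the rows are read in the parent because open file objects can't be pickled and sent to child processes.

import csv
import multiprocessing

def process_rows(rows):
    # placeholder worker -- the real per-chunk work would go here
    return len(rows)

if __name__ == '__main__':
    chunk_files = split("sample_simple.csv")  # split() as defined above
    # read each chunk's rows up front; file objects don't survive pickling
    row_chunks = [list(csv.reader(f)) for f in chunk_files]
    pool = multiprocessing.Pool(processes=len(row_chunks))
    results = pool.map(process_rows, row_chunks)
    pool.close()
    pool.join()
    print results  # here, a row count per chunk

Reading the rows in the parent keeps the sketch simple, but it means the parent touches all the data once; writing the chunks to named temp files and passing paths to the workers would avoid that.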