python - Splitting a CSV file into equal parts?


I have a large CSV file that I would like to split into a number of parts equal to the number of CPU cores in the system. I then want to use multiprocessing to have all the cores work on the file together. However, I am having trouble splitting the file into parts. I've looked all over Google and found some sample code that appears to do what I want. Here is what I have so far:

import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_cpus=multiprocessing.cpu_count()):
    read_buffer = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(read_buffer))
                this_file_size += read_buffer
            files[-1].write(infile.readline())  # possible remainder
            files[-1].seek(0, 0)  # rewind
    return files

files = split("sample_simple.csv")
print len(files)

for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row

The two print statements show the correct file size and that it is split into 4 pieces (my system has 4 CPU cores).

However, the last section of the code, which prints the rows in each of the pieces, gives this error:

    for row in reader:
_csv.Error: line contains NULL byte

I tried printing the rows without running the split function first, and it prints the values correctly. So I suspect the split function has added NULL bytes to the resulting 4 file pieces, but I'm not sure why.

Does anyone know whether this is a correct and fast way to split the file? I want the resulting pieces to be readable by csv.reader.

As I said in a comment, CSV files need to be split on row (or line) boundaries. Your code doesn't do that, and it potentially breaks a row somewhere in the middle of one of the pieces, which I suspect is the cause of the _csv.Error.
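To see the problem, here is a tiny illustration, using made-up in-memory data rather than anything from the question, of how splitting purely by byte count can land in the middle of a row:

# A made-up two-row CSV held in a string, purely for illustration.
data = 'name,value\nalpha,1\nbravo,2\n'
midpoint = len(data) // 2       # a byte-count boundary, like the read_buffer chunks
print repr(data[:midpoint])     # 'name,value\nal'    <- the 'alpha,1' row is cut mid-field
print repr(data[midpoint:])     # 'pha,1\nbravo,2\n'  <- this piece starts in the middle of a row

Once those two halves end up in separate files, csv.reader has no way to stitch the broken row back together.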

The following avoids that by processing the input file as a series of lines. I've tested it, and it seems to work standalone in the sense that it divided the sample file into approximately equal-size chunks; the chunks can't be exactly equal because it's unlikely that a whole number of rows will fit exactly into a chunk.

Update

This is a substantially faster version of the code I originally posted. The improvement comes from using the temp file's own tell() method to determine the continually changing length of the file as it is being written, instead of calling os.path.getsize(), which eliminated the need to flush() the file and call os.fsync() on it after each row is written.
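For contrast, here is a rough sketch of the per-row bookkeeping that tell() replaces. This is my reconstruction from the description above, not the code that was originally posted, and fill_chunk_slow is a made-up name:

import os
import tempfile

def fill_chunk_slow(infile, chunk_size):
    # Reconstruction (an assumption): the size check goes through the
    # filesystem, so after every row the file is flushed and fsync'd and
    # os.path.getsize() is asked for the current size.
    temp_file = tempfile.NamedTemporaryFile()   # needs a real name for getsize()
    while os.path.getsize(temp_file.name) < chunk_size:
        try:
            temp_file.write(infile.next())
        except StopIteration:                   # end of infile
            break
        temp_file.flush()                       # push Python's buffer to the OS after every row
        os.fsync(temp_file.fileno())            # and force it out to disk
    temp_file.seek(0)                           # rewind
    return temp_file

The faster version below does the same job with a single in-memory tell() check per row: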

import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    read_buffer = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', read_buffer) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    # note: ifile.name relies on TemporaryFile being backed by a named file (true on Windows)
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''
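Since the original goal was to have multiple cores work on the pieces, here is a minimal sketch of one way to hand the chunks to worker processes. It assumes a fork-based platform (Linux or macOS under Python 2), where child processes inherit the already-open temp files; the count_rows worker and the Queue bookkeeping are my own illustration, not part of the answer above.

import csv
import multiprocessing

def count_rows(chunk_file, result_queue):
    # Example worker: each process reads only its own chunk with csv.reader.
    chunk_file.seek(0)   # rewind, in case the chunk was already read above
    reader = csv.reader(chunk_file)
    result_queue.put(sum(1 for _ in reader))

results = multiprocessing.Queue()
workers = [multiprocessing.Process(target=count_rows, args=(chunk, results))
           for chunk in files]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print 'rows per chunk:', sorted(results.get() for _ in workers)

On Windows the open file objects can't be passed to child processes this way, so you would need to write the chunks to named files and pass the file names instead.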
