python - Improving a sort program's efficiency; displaying column sort % completion in terminal


I have a big pipe-delimited input file of approximately 6 million lines, like the sample below:

24|bbg000sjfvb0|eq0000000009296012|oi sa-adr|oibr/c|us|adr|equity
16|bbg002phvb83|eq0000000022353186|bloom select income fund|blb-u|ct|closed-end fund|equity
-50|bbg000v0tn75|eq0000000010271114|mechel-pref spon adr|mtl/p|us|adr|equity
20|bbg002s0zr60|eq0000000022739316|dividend 15 split corp ii-rt|df-r|ct|closed-end fund|equity
-20|bbg001r3lgm8|eq0000000017879513|ing floating rate senior loa|isl/u|ct|closed-end fund|equity
0|bbg006m6sxl2|eq0000000006846232|aa plc|aa/|ln|common stock|equity

The requirements are:
1. Sort the input file by the 1st column, then the 2nd column, then the 2nd-to-last column, in that order.
2. Display the % of sort completion in the terminal/console, e.g. "column 2 75% sort done".
3. Write the output to a separate file.

I have written the program below, and it sorts the 1st column perfectly. How do I incorporate the other conditions? It is also taking a little more time to run than I would like. Is there a more efficient and cleaner way to do it? The only restriction is that I can't use any additional outside packages from CPAN. Unix solutions using sed/awk are OK, but Perl is preferable. I came to know that this is built in to Python, so a Python solution is welcome too.

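    my %link_strength;
    my $data = "datascope_input.txt";
    my $out  = "sort_file.txt";

    open(my $indata,  '<', $data) or die "could not open $data:\n$!";
    open(my $outdata, '>', $out)  or die "could not open $out:\n$!";
    select $outdata;    # plain print now writes to the output file

    my @array = <$indata>;    # slurp the whole input file
    for (@array) {
        # key each line by its 1st field; a later line with the same key replaces an earlier one
        $link_strength{$1} = $_ if /(?:[^|]+\|){0}([^|]+)/;
    }

    # print the lines in numeric order of the 1st field
    print $link_strength{$_} for sort { $a <=> $b } keys %link_strength;

    close($outdata);
    close($indata);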

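Judging from the sample data, you are going to sort approximately 950 MB. Just reading that from a normal HD (100 MB/s) will take about 9.5 s. I do not know exactly how fast it can be sorted with the standard sort, but in my experience it can handle 1-3 million records per second per CPU core. Let's say 1 million. Then 6 million records will take about 3 s on a dual core, and less on a server with more CPU cores. I think most of the time will be taken by reading and parsing the data. So a simple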

pv -p your_file.dat | sort -t'|' -k '1n,1' -k '2d,2' -k '14,14' 

should provide most of the required functionality.
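
Since a Python solution was also welcome, here is a minimal sketch of the same idea using only the standard library. It is not a drop-in replacement for the `sort` command above. The names `sort_key` and `sort_file` are made up for illustration. It assumes the 1st column always holds an integer, that the whole file fits in memory, and that plain string comparison is acceptable for the text columns (the `d` dictionary-order flag of `sort` is not reproduced). Like `pv`, it only reports how much of the input has been read. Python's built-in sort does not expose its progress, so a per-column "% sort done" message is not covered here.

```python
import os
import sys


def sort_key(line):
    """Sort key: 1st column as an integer, then 2nd and 2nd-to-last columns as text."""
    fields = line.rstrip("\r\n").split("|")
    # fields[-2] picks the 2nd-to-last column whatever the total column count is
    return (int(fields[0]), fields[1], fields[-2])


def sort_file(in_path, out_path, report=sys.stderr):
    """Read in_path, sort its lines by sort_key, and write them to out_path.

    Progress is the percentage of input bytes read so far, written to `report`.
    """
    total = os.path.getsize(in_path) or 1  # avoid dividing by zero on an empty file
    done = 0
    last_pct = -1
    records = []
    with open(in_path, "rb") as src:
        for raw in src:
            done += len(raw)
            records.append(raw.decode("utf-8"))
            pct = done * 100 // total
            if pct != last_pct:  # only redraw when the percentage changes
                print(f"\rreading: {pct}%", end="", file=report, flush=True)
                last_pct = pct
    print(file=report)

    # One sort call handles all three columns, because tuples compare element by element
    records.sort(key=sort_key)

    with open(out_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(records)
```

Calling `sort_file("datascope_input.txt", "sort_file.txt")` would read the input file named in the question and write the sorted lines to the separate output file. Because the key is a tuple, rows with the same 1st column are ordered by the 2nd column, and then by the 2nd-to-last column.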

