python - Improving a sort program's efficiency; displaying column sort % completion in terminal
I have a big pipe-delimited input file of approximately 6 million lines, like the one below:
24|bbg000sjfvb0|eq0000000009296012|oi sa-adr|oibr/c|us|adr|equity
16|bbg002phvb83|eq0000000022353186|bloom select income fund|blb-u|ct|closed-end fund|equity
-50|bbg000v0tn75|eq0000000010271114|mechel-pref spon adr|mtl/p|us|adr|equity
20|bbg002s0zr60|eq0000000022739316|dividend 15 split corp ii-rt|df-r|ct|closed-end fund|equity
-20|bbg001r3lgm8|eq0000000017879513|ing floating rate senior loa|isl/u|ct|closed-end fund|equity
0|bbg006m6sxl2|eq0000000006846232|aa plc|aa/|ln|common stock|equity
The requirements are as below:
1. Sort the input file by the 1st column, then the 2nd column, then the 2nd-to-last column, in that order.
2. Display the % of sort completion in the terminal/console, e.g. "column 2 75% sort done".
3. Write the output to a separate file.
I have written the program below, which sorts on the 1st column perfectly. How do I incorporate the other conditions? It is also taking a little more time to run. Is there a more efficient and cleaner way to do it? The only thing is that I can't use any additional outside package from CPAN. Unix solutions using sed/awk are OK, but Perl is preferable. I came to know that Python is built in there too, so a solution in it is welcome.
    my (%link_strength);
    {
        $data = "datascope_input.txt";
        $out  = "sort_file.txt";
        open (my $indata,  '<', $data) || die "could not open $data :\n$!";
        open (my $outdata, '>', $out)  || die "could not open $out :\n$!";
        select $outdata;    # print goes to the output file from here on
        @array = (<$indata>);
        # key each line by its first column
        foreach (@array) {
            $link_strength{$1} = $_ if /(?:[^|]+\|){0}([^|]+)/;
        }
        # print the lines in numeric order of the first column
        print $link_strength{$_} foreach (sort {$a<=>$b} keys %link_strength);
        close ($outdata);
        close ($indata);
    }
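Judging from the sample data, you are going to sort approximately 950 MB. That will take about 9.5 s just to read from a normal HD (100 MB/s). I do not know how fast it can be sorted by the standard sort, but in my experience it can do 1-3 million records per second per CPU core. Let's say 1 million. Then it will take 3 s on a dual core and less on a server with more CPU cores. I think most of the time will be taken by reading and parsing the data. A simple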
pv -p your_file.dat | sort -t'|' -k '1n,1' -k '2d,2' -k '14,14'
should do most of the required functionality.
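Since the question says a Python solution is welcome, here is a minimal sketch of the same three-key sort using only the standard library. The names `sort_key`, `sort_lines`, `sort_file` and `report_every` are made up for illustration. It assumes the whole file fits in memory, that the first column is always an integer, and that every line has at least two fields. The built-in sort has no progress hook, so the percentage is reported while the keys are being built rather than per column. Note that, unlike a hash keyed on the first column (which keeps only one line per distinct key), this keeps every input line.

```python
import sys


def sort_key(line):
    """Key: 1st column as an integer, 2nd column, second-to-last column."""
    fields = line.rstrip("\n").split("|")
    return (int(fields[0]), fields[1], fields[-2])


def sort_lines(lines, report_every=100_000, log=sys.stderr):
    """Return the lines sorted by sort_key, reporting progress on `log`."""
    total = len(lines)
    keyed = []
    for count, line in enumerate(lines, start=1):
        keyed.append((sort_key(line), line))
        # Progress covers the key-building pass only (assumption: the
        # sort itself cannot be instrumented with the built-in sort).
        if count % report_every == 0 or count == total:
            print(f"\rkeys built: {100 * count // total}%", end="", file=log)
    print(file=log)
    # Sort on the key only; the sort is stable, so ties keep input order.
    keyed.sort(key=lambda pair: pair[0])
    return [line for _, line in keyed]


def sort_file(in_path, out_path):
    """Read in_path, sort it, and write the result to out_path."""
    with open(in_path, encoding="utf-8") as src:
        lines = src.readlines()
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(sort_lines(lines))
```

With the file names from the question this would be called as `sort_file("datascope_input.txt", "sort_file.txt")`. Building the key once per line, instead of re-parsing inside the comparison, is what keeps the sort itself cheap.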