bash - Why is Unix/Terminal faster than R?


I'm new to Unix, but I've realized that simple Unix commands can do simple things to a large data set very quickly. My question is: why are these Unix commands so fast relative to R?

Let's begin by assuming the data are big, but not larger than the amount of RAM on the computer.

Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this explains the entire time difference. After all, basic R functions, like Unix commands, are written in low-level languages like C/C++.

I therefore suspect the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it must first be read from disk (assuming the data are local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate the data, both must obtain the data from disk.

I therefore suspect that how the data are read from disk, if that makes sense, is what's driving the time difference. Is that intuition correct?

thanks!

Update: sorry for being vague. That was done on purpose, hoping to discuss the idea in general rather than focus on a specific example.

Regardless, I'll generate an example: counting the number of rows.

First I'll generate a big data set.

row = 1e7
col = 50
df <- matrix(rpois(row*col, 1), row, col)
write.csv(df, "df.csv")

Doing this with Unix:

time wc -l df.csv

real    0m12.261s
user    0m1.668s
sys     0m2.589s

Doing this with R:

library(data.table)
system.time({ nrow(fread("df.csv")) })
...
   user  system elapsed
  26.77    1.67   47.07

Notice that elapsed/real > user + system. This suggests that the CPU is waiting on the disk.

I suspected the slow speed of R has to do with reading the data in. It appears I'm right:

system.time(fread("df.csv"))
   user  system elapsed
  34.69    2.81   47.41

My question is: how is the I/O different between Unix and R, and why?

I'm not sure which operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.

Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading it all into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, even if you have enough RAM, allocating and organizing all the necessary memory can be very expensive.
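
To make that concrete, here is a rough R sketch of the two styles (my own illustration, not from the question; the file name is hypothetical). The streaming version holds only one line in memory at a time; the read-it-all version allocates storage for every line before any processing starts.

# "Streaming mode": read one line, do something with it, forget it, move on.
con <- file("data.txt", open = "r")      # hypothetical input file
longest <- 0L
while (length(line <- readLines(con, n = 1L)) > 0) {
  longest <- max(longest, nchar(line))   # some per-line work
}
close(con)

# Read-everything-into-memory mode: simpler to write, but every line is
# allocated at once before any work happens.
lines <- readLines("data.txt")
longest_in_memory <- max(nchar(lines))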

[Updated in response to the additional information]

Aha. So you were asking R to read the whole file into memory all at once. That accounts for a lot of the difference. Let's talk about a few more things.

I/O. We can think about three ways of reading characters from a file, especially since the style of processing we're doing affects the way of reading that's most convenient.

  1. Unbuffered small, random reads. We ask the operating system for one or a few characters at a time, and process them as we read them.
  2. Unbuffered large, block-sized reads. We ask the operating system for big chunks of memory -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next chunk.
  3. Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.

Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than 2 and 3. (I've seen factors of 10 or 100.) So no well-written programs use #1, and we can pretty much forget about it. As long as you're using 2 or 3, the I/O speed will be about the same. (In extreme cases, if you know what you're doing, you can get a little efficiency increase by using 2 instead of 3, if you can.)
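
If you want to feel the difference between #1-style and #2/#3-style reads from inside R, here is a rough sketch (my own, on a hypothetical small file -- the byte-at-a-time version is painfully slow, so don't point it at anything big). One caveat: R's file connections are buffered, so the first function isn't making truly unbuffered system calls; it mostly shows per-request overhead, but the shape of the result is the same.

one_byte_at_a_time <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  # Method-1 flavour: ask for a single byte per call.
  while (length(readBin(con, what = "raw", n = 1L)) > 0) n <- n + 1
  n
}

block_reads <- function(path, block = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  # Method-2/3 flavour: ask for big, block-sized chunks.
  repeat {
    chunk <- readBin(con, what = "raw", n = block)
    if (length(chunk) == 0) break
    n <- n + length(chunk)
  }
  n
}

system.time(one_byte_at_a_time("small.txt"))   # hypothetical small file
system.time(block_reads("small.txt"))          # expect this to be far faster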

Now let's talk about the way each program processes its data. wc has 5 steps:

  1. Read characters one at a time. (I can assure you it uses method 3.)
  2. For each character read, add one to the character count.
  3. If the character read was a newline, add one to the line count.
  4. If the character read was or wasn't a word-separator character, update the word count accordingly.
  5. At the end, print out the counts of lines, words, and/or characters, as requested.

So as you can see, it's all I/O and simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run faster if you invoked wc -c or wc -l. But the code was more complicated.)
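
For comparison, here is a hedged sketch of steps 2 and 3 written in R itself (my illustration of the idea, not how wc or fread is actually implemented): it streams the file in block-sized chunks, counting characters and newlines as it goes, and never holds more than one chunk in memory.

count_chars_and_lines <- function(path, block = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  newline <- as.raw(10L)                    # '\n'
  chars <- 0; lines <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = block)
    if (length(chunk) == 0) break
    chars <- chars + length(chunk)          # step 2: character count
    lines <- lines + sum(chunk == newline)  # step 3: line count
  }
  c(lines = lines, chars = chars)
}

count_chars_and_lines("df.csv")   # should roughly agree with wc -lc df.csv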

In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing wc has to do. But then, for each number it finds, it has to convert it into an internal number it can work with efficiently. For example, if somewhere in the CSV file the sequence

...,12345,... 

occurs, R is going to have to read those digits (as individual characters) and do the equivalent of the math problem

 1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1 

to get the value 12345.
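
Just to spell that out, here is a toy R version of that per-field conversion (purely illustrative; a real parser does this in C, one character at a time):

field  <- "12345"
digits <- as.integer(strsplit(field, "")[[1]])           # 1 2 3 4 5
value  <- sum(digits * 10^(rev(seq_along(digits)) - 1))  # 1*10000 + 2*1000 + ... + 5*1 = 12345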

But there's more. You also asked R to build a table. A table is a specific, highly regular data structure which orders its data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a somewhat far-fetched but hypothetical real-world example.

Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind the inconvenience.)

But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people have filed in you notice there's a 31st. What do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you call the builders and ask them to build a second 30-person classroom right next to the first. Now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.

So you ask him to wait, and you call the builders again, and have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms. But the people keep coming, and soon enough the 121st person shows up, and there's not enough room, and you still haven't even started asking your survey questions yet.

So you call some fancier builders who know how to do steelwork and have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks. You have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle the materials. But the people keep coming, and pretty soon there are 1,250 people in your new building waiting to be surveyed and the 1,251st has just shown up.

So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming. And how big did you say your big data set was? 1e7 x 50? I don't think a 100-story building is going to be big enough, either. (And when you're done with all this, the only "survey question" you're going to ask is "How many rows are there?")

Contrived as it may seem, this is actually not too bad an analogy for what R has to do internally to build the table to store that data set in.
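
If you want to see the classroom-rebuilding effect directly in R, here is a small hypothetical experiment (my own sketch, scaled down so it finishes quickly): growing a matrix row by row forces repeated reallocation and copying, while allocating it once up front does not.

n <- 5e3; k <- 50

grow <- function() {
  m <- NULL
  for (i in seq_len(n)) m <- rbind(m, rep(1L, k))   # rebuilds the "classroom" every time
  m
}

prealloc <- function() {
  m <- matrix(0L, nrow = n, ncol = k)               # one allocation up front
  for (i in seq_len(n)) m[i, ] <- rep(1L, k)
  m
}

system.time(grow())        # typically orders of magnitude slower
system.time(prealloc())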

Meanwhile, Bob's discount survey company, which can only tell you how many people it surveyed and how many were men and women and in which age brackets, is down there on the street corner, with the people filing by, and Bob jotting down tally marks on his clipboards, and the people, once surveyed, walking away and going about their business, and Bob isn't wasting any time or money building classrooms at all.

I don't know R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find it quicker. R will still have to do some building, but at least it won't have any false starts.
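
For instance, something along these lines might be worth a try (a hedged sketch -- I haven't tested it on your file, and the column layout below is my assumption about what write.csv produced, namely a header line plus a quoted row-name field at the start of every row):

# Read all the values into flat vectors with scan(), then pour them into a
# matrix whose size you already know.
cols <- c(list(character()), rep(list(integer()), 50))   # row-name column + 50 integer columns
vals <- scan("df.csv", what = cols, sep = ",", skip = 1, quote = "\"")
m    <- do.call(cbind, vals[-1])   # drop the row names; gives a 1e7 x 50 integer matrix

# Or let read.csv allocate once by declaring the sizes and types up front:
df2 <- read.csv("df.csv", nrows = 1e7,
                colClasses = c("character", rep("integer", 50)))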

