R - Removing duplicate records from .xdf file
I want to remove duplicate records from a large .xdf file, trans.xdf. Here are the file details:
File name: /poc/revor/data/trans.xdf
Number of observations: 1000000000
Number of variables: 5
Number of blocks: 40
Compression type: zlib
Variable information:
Var 1: card_id, Type: character
Var 2: se_no, Type: character
Var 3: r12m_cv, Type: numeric, Low/High: (-2348.7600, 40587.3900)
Var 4: r12m_roc, Type: numeric, Low/High: (0.0000, 231.0000)
Var 5: prod_grp_cd, Type: character
Below is a sample of the data in the file:
card_id             se_no       r12m_cv  r12m_roc  prod_grp_cd
900000999000000000  1045815024  110      1         1
900000999000000000  1052487253  247.52   2         1
900000999000000000  9999999999  38.72    1         1
900000999000000000  1090389768  1679.96  16        1
900000999000000000  1091226035  0        1         1
900000999000000000  1091241208  538.68   4         1
900000999000000000  9999999999  83       1         1
900000999000000000  1091468041  148.4    3         1
900000999000000000  1092640358  3.13     1         1
900000999000000000  1093468692  546.29   1         1
I have tried using the rxDataStep function with a transform function that calls unique() on the .xdf file. Below is the code:
uniq_dat <- function(datalist) {
  datalist <- unique(datalist)
  return(datalist)
}

rxDataStepXdf(inFile = "/poc/revor/data/trans.xdf",
              outFile = "/poc/revor/data/trans.xdf",
              transformFunc = uniq_dat,
              overwrite = TRUE)
But I am getting the error below:
Error in unique(datalist) : object 'datalist' not found
Error in transformation function: Error in unique(datalist) : object 'datalist' not found
Error in rxCall("rxDataStep", params) :
Could someone point out the mistake I am making here, or tell me if there is a better way to remove duplicate records from an .xdf file? I am avoiding loading the data into an in-memory data frame because the data is pretty huge.
I am running the above code in a Revolution R environment on HDFS.
If the same result can be obtained with another approach, an example of it would be appreciated.
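One point to keep in mind with the transform-function approach: rxDataStep applies the transform to one chunk of data at a time, so even a working unique() call would only drop duplicates within each chunk, not across the whole file. Here is a minimal base R sketch of that effect. The toy data frame and the manual two-way split are made up for illustration and are not taken from trans.xdf or from RevoScaleR itself.

```r
# Hypothetical toy data standing in for the real file (values are invented).
dat <- data.frame(
  card_id = c("A", "A", "B", "A", "B", "C"),
  se_no   = c("1", "1", "2", "1", "2", "3"),
  stringsAsFactors = FALSE
)

# Split into two "blocks" of three rows each, mimicking chunk-wise reads.
chunks <- split(dat, rep(1:2, each = 3))

# Apply unique() to each chunk independently, as a per-chunk transform would.
per_chunk <- do.call(rbind, lapply(chunks, unique))

# De-duplicate the whole data set at once for comparison.
global <- unique(dat)

nrow(per_chunk)  # 5: rows (A,1) and (B,2) survive once in each chunk
nrow(global)     # 3
```

The per-chunk result still contains rows repeated across chunks, so a chunk-wise unique() alone would presumably not be enough for a file stored in 40 blocks.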
Thanks in advance :)
Cheers,
Amit