r - dplyr sample_n where n is the value of a grouped variable

April 15, 2011

I have the following grouped data frame, and I would like to use the function dplyr::sample_n to extract rows from this data frame for each group. I want to use the value of the grouped variable ndg in each group as the number of rows to extract from that group.

> dg.tmp <- structure(list(gene = c("camk1", "ghrl", "timp4", "camk1", "ghrl",
      "timp4", "arl8b", "arpc4", "sec13", "arl8b", "arpc4", "sec13"),
      glb = c(3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10),
      ndg = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2)),
      class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L),
      .names = c("gene", "glb", "ndg"))
> dg <- dg.tmp %>%
      dplyr::group_by(glb, ndg)
> dg
Source: local data frame [12 x 3]
Groups: glb, ndg

      gene glb ndg
1    a4gnt   3   1
2    abtb1   3   1
3     ahsg   3   1
4    a4gnt   3   2
5    abtb1   3   2
6     ahsg   3   2
7    aadac  10   1
8  abhd14b  10   1
9   acvr2b  10   1
10   aadac  10   2
11 abhd14b  10   2
12  acvr2b  10   2

For example, assuming a correct random selection, I want the code

> dg %>% dplyr::sample_n(ndg)

to output:

Source: local data frame [6 x 3]
Groups: glb, ndg

     gene glb ndg
1   a4gnt   3   1
2   a4gnt   3   2
3   abtb1   3   2
4   aadac  10   1
5   aadac  10   2
6 abhd14b  10   2

However, this gives the following error:

Error in eval(expr, envir, enclos) : object 'ndg' not found

By way of comparison, dplyr::slice gives the correct output when I use the code

> dg %>% dplyr::slice(1:unique(ndg))

It is somewhat hackish to use unique in this context; however, the code

> dg %>% dplyr::slice(1:ndg)

returns the following warning messages

Warning messages:
1: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used
2: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used
3: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used
4: In slice_impl(.data, dots) :
  numerical expression has 3 elements: only the first used

This is clearly because ndg is being evaluated (in the appropriate environment) as c(1,1,1) or c(2,2,2), and hence 1:ndg returns the above warning.
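The same behaviour can be reproduced in plain R, without dplyr. The sketch below is only an illustration: `ndg_in_group` is a made-up stand-in for what ndg looks like inside one group, and `seq_len()` on the first element is just one possible way to avoid the warning.

```r
# Stand-in for the ndg column within a single group where ndg == 2.
ndg_in_group <- c(2, 2, 2)

# `:` expects a length-one right operand; with three elements it signals
# a condition (a warning by default), whose message is captured here.
msg <- tryCatch(1:ndg_in_group, condition = function(cnd) conditionMessage(cnd))

# Reducing the vector to a single value first avoids the warning.
idx <- seq_len(ndg_in_group[1])
```

Here `msg` holds the text of the warning and `idx` is the index vector that `1:unique(ndg)` would also produce for this group.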

Regarding why I obtain the error, I know that the code Hadley uses for the method sample_n.grouped_df is

sample_n.grouped_df <- function(tbl, size, replace = FALSE, weight = NULL,
  .env = parent.frame()) {

  assert_that(is.numeric(size), length(size) == 1, size >= 0)
  weight <- substitute(weight)

  index <- attr(tbl, "indices")
  sampled <- lapply(index, sample_group, frac = FALSE,
    tbl = tbl, size = size, replace = replace, weight = weight, .env = .env)
  idx <- unlist(sampled) + 1

  grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}

which can be found on the relevant GitHub page. So I obtain the error because sample_n.grouped_df cannot find the variable ndg: size is evaluated as an ordinary argument, so R is not looking in the correct environment (the data frame).
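The difference between the two lookup rules can be shown with a toy example in plain R. This is not dplyr's code: `takes_value` and `takes_expr` are hypothetical functions written only to contrast an argument that is evaluated normally (as size is above) with one that is captured by substitute() and evaluated inside the data (as weight is).

```r
# Toy data frame with a column named ndg; no variable called ndg exists
# outside of it.
df <- data.frame(ndg = c(2, 2, 2))

# Standard evaluation: `size` is looked up where the function was called,
# so a bare column name is not found.
takes_value <- function(data, size) {
  stopifnot(is.numeric(size), length(size) == 1)
  size
}

# Non-standard evaluation: capture the expression and evaluate it with the
# columns of `data` in scope.
takes_expr <- function(data, size) {
  size_expr <- substitute(size)
  eval(size_expr, data, parent.frame())
}

# Fails, because ndg is searched for in the calling environment.
res_value <- tryCatch(takes_value(df, ndg), error = function(e) conditionMessage(e))

# Succeeds, because ndg is resolved as a column of df.
res_expr <- takes_expr(df, ndg)
```

`res_value` ends up holding an error message about ndg not being found, while `res_expr` is the ndg column itself.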

Consequently, is there a neat way of using sample_n on dg to obtain

Source: local data frame [6 x 3]
Groups: glb, ndg

     gene glb ndg
1   a4gnt   3   1
2   a4gnt   3   2
3   abtb1   3   2
4   aadac  10   1
5   aadac  10   2
6 abhd14b  10   2

by using random sampling on each group?

One possible answer, though I'm not convinced it's the optimal answer: permute the rows of the data frame with dplyr::sample_frac (using a fraction of 1), then slice the required number of rows:

> set.seed(1)
> dg %>%
      dplyr::sample_frac(1) %>%
      dplyr::slice(1:unique(ndg))

This gives the correct output.

Source: local data frame [6 x 3]
Groups: glb, ndg

    gene glb ndg
1  a4gnt   3   1
2   ahsg   3   2
3  a4gnt   3   2
4 acvr2b  10   1
5  aadac  10   2
6 acvr2b  10   2

And I suppose I can write a function to do this in one line if necessary.
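A minimal sketch of such a function, assuming dplyr is installed and that the size column is always called ndg. The name `sample_ndg` is made up for this example, the data are rebuilt with the gene names from the printed table above, and `seq_len(ndg[1])` is swapped in for `1:unique(ndg)` so that a group with ndg equal to 0 would return no rows rather than counting down from 1.

```r
library(dplyr)

# Rebuild the grouped data frame from the printed table.
dg <- tibble(
  gene = c("a4gnt", "abtb1", "ahsg", "a4gnt", "abtb1", "ahsg",
           "aadac", "abhd14b", "acvr2b", "aadac", "abhd14b", "acvr2b"),
  glb  = rep(c(3, 10), each = 6),
  ndg  = rep(c(1, 2, 1, 2), each = 3)
) %>%
  group_by(glb, ndg)

# Hypothetical helper: shuffle the rows within each group, then keep the
# first ndg rows of each group.
sample_ndg <- function(tbl) {
  tbl %>%
    sample_frac(1) %>%
    slice(seq_len(ndg[1]))
}

set.seed(1)
out <- sample_ndg(dg)
```

Each group in `out` then contains as many rows as its own value of ndg; which rows are picked depends on the seed and the dplyr version.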
