r - dplyr sample_n where n is the value of a grouped variable
I have the following grouped data frame, and I would like to use the function dplyr::sample_n to extract rows from this data frame for each group. I want to use the value of the grouped variable ndg in each group as the number of rows to extract from that group.
> dg.tmp <- structure(list(gene = c("camk1", "ghrl", "timp4", "camk1", "ghrl", "timp4",
                                    "arl8b", "arpc4", "sec13", "arl8b", "arpc4", "sec13"),
                           glb = c(3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10),
                           ndg = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2)),
                      class = c("tbl_df", "tbl", "data.frame"),
                      row.names = c(NA, -12L),
                      .names = c("gene", "glb", "ndg"))
> dg <- dg.tmp %>% dplyr::group_by(glb, ndg)
> dg
Source: local data frame [12 x 3]
Groups: glb, ndg

      gene glb ndg
1    a4gnt   3   1
2    abtb1   3   1
3     ahsg   3   1
4    a4gnt   3   2
5    abtb1   3   2
6     ahsg   3   2
7    aadac  10   1
8  abhd14b  10   1
9   acvr2b  10   1
10   aadac  10   2
11 abhd14b  10   2
12  acvr2b  10   2
For example, assuming a correct random selection, I want the code
> dg %>% dplyr::sample_n(ndg)
to output:
Source: local data frame [6 x 3]
Groups: glb, ndg

     gene glb ndg
1   a4gnt   3   1
2   a4gnt   3   2
3   abtb1   3   2
4   aadac  10   1
5   aadac  10   2
6 abhd14b  10   2
However, it gives the following error:
Error in eval(expr, envir, enclos) : object 'ndg' not found
By way of comparison, dplyr::slice gives the correct output when I use the code
> dg %>% dplyr::slice(1:unique(ndg))
It is hackish to use unique in this context; however, the code
> dg %>% dplyr::slice(1:ndg)
returns the following warning messages:
Warning messages:
1: In slice_impl(.data, dots) :
  numerical expression has 3 elements: first used
2: In slice_impl(.data, dots) :
  numerical expression has 3 elements: first used
3: In slice_impl(.data, dots) :
  numerical expression has 3 elements: first used
4: In slice_impl(.data, dots) :
  numerical expression has 3 elements: first used
This is clearly because ndg is being evaluated (in the appropriate environment) as c(1,1,1) or c(2,2,2), and hence 1:ndg returns the above warning.
Regarding why I obtain the error, I know that the code Hadley uses for the method sample_n.grouped_df is
sample_n.grouped_df <- function(tbl, size, replace = FALSE, weight = NULL,
                                .env = parent.frame()) {
  assert_that(is.numeric(size), length(size) == 1, size >= 0)
  weight <- substitute(weight)

  index <- attr(tbl, "indices")
  sampled <- lapply(index, sample_group, frac = FALSE,
    tbl = tbl, size = size, replace = replace, weight = weight, .env = .env)
  idx <- unlist(sampled) + 1

  grouped_df(tbl[idx, , drop = FALSE], vars = groups(tbl))
}
which can be found on the relevant GitHub page. I therefore obtain the error because sample_n.grouped_df cannot find the variable ndg, because it's not looking in the correct environment.
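The lookup failure can be illustrated without dplyr. The sketch below uses a hypothetical function check_size (not part of dplyr) that, like the method quoted above, checks is.numeric(size) before touching the data. The argument is therefore evaluated where the function was called from, and no object called ndg exists there.

```r
# Hypothetical stand-in for the argument check in sample_n.grouped_df.
check_size <- function(tbl, size) {
  # Forcing `size` evaluates it in the caller's environment, not inside
  # `tbl`, so the column names of `tbl` are not visible at this point.
  stopifnot(is.numeric(size), length(size) == 1, size >= 0)
  size
}

df <- data.frame(gene = c("a", "b", "c"), ndg = c(2, 2, 2))

# `ndg` exists only as a column of df, so this call fails; the error
# message is captured instead of stopping the script.
msg <- tryCatch(check_size(df, ndg), error = function(e) conditionMessage(e))
print(msg)

# A value that is reachable from the calling environment passes the check.
ok <- check_size(df, df$ndg[1])
print(ok)
```

This sketch assumes a fresh session in which no object named ndg is defined outside the data frame.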
Consequently, is there a neat way of using sample_n on dg to obtain
Source: local data frame [6 x 3]
Groups: glb, ndg

     gene glb ndg
1   a4gnt   3   1
2   a4gnt   3   2
3   abtb1   3   2
4   aadac  10   1
5   aadac  10   2
6 abhd14b  10   2
by using random sampling on each group?
One possible answer, though I'm not convinced it's the optimal one, is to permute the rows of the data frame with dplyr::sample_frac (and a fraction of 1), and then slice the required number of rows:
> set.seed(1)
> dg %>% dplyr::sample_frac(1) %>% dplyr::slice(1:unique(ndg))
This gives the correct output.
Source: local data frame [6 x 3]
Groups: glb, ndg

    gene glb ndg
1  a4gnt   3   1
2   ahsg   3   2
3  a4gnt   3   2
4 acvr2b  10   1
5  aadac  10   2
6 acvr2b  10   2
And I suppose I can write a function to do this in one line if necessary.
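Such a function might look like the sketch below. The name sample_n_by is hypothetical. The sketch assumes a dplyr version that supports the {{ }} operator, a count column that is constant within each group, and counts that never exceed the group size. Instead of sample_frac followed by slice, it keeps the rows whose random rank within the group is at most the requested count, which selects the same number of rows per group.

```r
library(dplyr)

# Hypothetical helper: within each group of a grouped data frame, keep as
# many randomly chosen rows as the (group-constant) column `size_col` says.
sample_n_by <- function(tbl, size_col) {
  # sample(n()) is a random permutation of 1..n() for the current group;
  # keeping ranks <= the requested size keeps that many random rows.
  filter(tbl, sample(n()) <= first({{ size_col }}))
}

dg.tmp <- tibble(
  gene = c("camk1", "ghrl", "timp4", "camk1", "ghrl", "timp4",
           "arl8b", "arpc4", "sec13", "arl8b", "arpc4", "sec13"),
  glb  = c(3, 3, 3, 3, 3, 3, 10, 10, 10, 10, 10, 10),
  ndg  = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2)
)
dg <- group_by(dg.tmp, glb, ndg)

set.seed(1)
res <- sample_n_by(dg, ndg)
print(res)
```

The result stays grouped by glb and ndg, so it can be piped into further grouped operations.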