R - Subset sequence data in a FASTA file based on IDs stored in listed data frames
I am trying to subset one FASTA file (containing multiple sequences) into several smaller ones, based on IDs stored in a list of data frames.
I have a FASTA file called fastafile that looks like this:
# Reconstructed from the dput() output of the loaded object
fastafile <- structure(list(
  r1 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac",
  r2 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag",
  r3 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca",
  r4 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg",
  r5 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg",
  r6 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"),
  .Names = c("r1", "r2", "r3", "r4", "r5", "r6"))
It was loaded using the seqinr package like this:
library(seqinr)
# as.string = TRUE returns each sequence as a single character string
fastafile <- read.fasta(file = "fastafile.fasta", seqtype = c("DNA", "AA"), as.string = TRUE, set.attributes = FALSE)
I also load a table of IDs and expression values:
goi <- read.table(header = TRUE, text = "
  id  t1  t2
1 r1 1.1 2.1
2 r2 1.2 2.2
3 r3 1.1 2.2
4 r4 1.2 2.1
5 r5 1.1 2.1
6 r6 1.2 2.2")
and split it into manageable subsets:
goi.split <- split(goi,rep(1:3,each=2))
giving me
> goi.split
$`1`
  id  t1  t2
1 r1 1.1 2.1
2 r2 1.2 2.2

$`2`
  id  t1  t2
3 r3 1.1 2.2
4 r4 1.2 2.1

$`3`
  id  t1  t2
5 r5 1.1 2.1
6 r6 1.2 2.2
Now I would like to subset the sequences based on the IDs in the goi.split data frames. In this mock example there should be 2 sequences per list item. To subset for the first of the listed data frames I can say:
fasta.1 <- fastafile[c(which(names(fastafile) %in% goi.split[[1]][,1]))]
# $r1
# [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
#
# $r2
# [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"
(and so on), but I would like to subset for all data frames in one swift move and end up with a list of the desired FASTAs (3 list items containing, in this case, 2 sequences each). I tried:
fastas <- lapply(fastafile, function(i) {fastafile[c(which(names(fastafile) %in% goi.split[[i]][ ,1]))]})
Could you please tell me why this is not working, and what I should use instead?
Thanks.
This can be done with the following code:
split(fastafile[goi$id], rep(1:3, each = 2))
$`1`
$`1`$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

$`1`$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


$`2`
$`2`$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

$`2`$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


$`3`
$`3`$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

$`3`$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"
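One caveat with indexing by goi$id, shown here as a small sketch on hypothetical toy data (shortened sequences, not the original file): if the id column is a factor, which was the read.table default before R 4.0, then `[` indexes by the factor's integer codes instead of by name. That happens to give the right answer when the IDs are in the same order as the list, but not in general. Wrapping the IDs in as.character() is a simple way to be safe.

```r
# Hypothetical toy data, for illustration only
seqs <- list(r1 = "acat", r2 = "ctaa", r3 = "attt", r4 = "acag")

# IDs stored as a factor: levels are "r1", "r3", so the codes are 2, 1
ids <- factor(c("r3", "r1"))

by_code <- seqs[ids]                # indexes by integer codes 2 and 1
by_name <- seqs[as.character(ids)]  # indexes by name

names(by_code)  # "r2" "r1"  (not what was asked for)
names(by_name)  # "r3" "r1"
```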
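As for why your lapply code is not working: one reason is that you are passing in fastafile, when you should be passing in indices. lapply hands each element of fastafile (a sequence string) to the function as i, so goi.split[[i]] looks up a list item named after the sequence, which does not exist.
So for the first element you are effectively trying this: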
fastafile[c(which(names(fastafile) %in% goi.split[[fastafile[[1]]]][ ,1]))]
#named list()
when it should be this:
fastafile[c(which(names(fastafile) %in% goi.split[[1]][ ,1]))]
#$r1
#[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
#
#$r2
#[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"
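To make the difference concrete, here is a minimal sketch on hypothetical toy data (the names seqs and groups are made up for illustration). It shows what lapply passes to the function in each case, and why looking up a list by a sequence string silently returns NULL instead of raising an error.

```r
# Hypothetical toy data, for illustration only
seqs   <- list(r1 = "acat", r2 = "ctaa")
groups <- list(`1` = data.frame(id = "r1"), `2` = data.frame(id = "r2"))

# Iterating over the list itself: i is each sequence string
seen_elements <- lapply(seqs, function(i) i)

# Iterating over positions: i is 1, 2
seen_indices <- lapply(seq_along(seqs), function(i) i)

# Using a sequence string as a list name finds nothing
lookup <- groups[[seqs[[1]]]]
is.null(lookup)  # TRUE
```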
So, to fix it, pass in 1:length(goi.split) instead of fastafile:
lapply(1:length(goi.split), function(i) {fastafile[c(which(names(fastafile) %in% goi.split[[i]][ ,1]))]})
[[1]]
[[1]]$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

[[1]]$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


[[2]]
[[2]]$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

[[2]]$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


[[3]]
[[3]]$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

[[3]]$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"
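As an alternative sketch, you could also iterate over the data frames in goi.split directly and index fastafile by name. This is not the code from the question, just one possible variant; the sequences below are shortened stand-ins for the real ones, and as.character() is there in case the id column is a factor.

```r
# Shortened stand-in sequences, for illustration only
fastafile <- list(r1 = "acat", r2 = "ctaa", r3 = "attt",
                  r4 = "acag", r5 = "ctag", r6 = "attg")

goi <- data.frame(id = c("r1", "r2", "r3", "r4", "r5", "r6"),
                  stringsAsFactors = FALSE)
goi.split <- split(goi, rep(1:3, each = 2))

# One list item per data frame, each holding the matching sequences
fastas <- lapply(goi.split, function(df) fastafile[as.character(df$id)])

names(fastas)         # "1" "2" "3"
names(fastas[["2"]])  # "r3" "r4"
```

Because lapply keeps the names of goi.split, the result is labelled by group. Note that an ID with no matching sequence would show up as a NULL entry with an NA name rather than being dropped, so the %in% approach may be preferable if some IDs can be missing.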