r - Subset sequence data in fasta file based on IDs stored in listed data frames


I am trying to subset one fasta file (containing multiple sequences) into several smaller ones, based on IDs stored in a list of data frames.

I have a fasta object called fastafile that looks like this:

# output of dput(fastafile)
fastafile <- structure(list(
    r1 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac",
    r2 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag",
    r3 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca",
    r4 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg",
    r5 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg",
    r6 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"),
    .Names = c("r1", "r2", "r3", "r4", "r5", "r6"))

It was loaded using the seqinr package like this:

library(seqinr)
fastafile <- read.fasta(file = "fastafile.fasta",
                        seqtype = c("DNA", "AA"),
                        as.string = TRUE, set.attributes = FALSE)

I also load a table of IDs and expression values:

goi <- read.table(header = TRUE, text = "id        t1        t2
1 r1 1.1 2.1
2 r2 1.2 2.2
3 r3 1.1 2.2
4 r4 1.2 2.1
5 r5 1.1 2.1
6 r6 1.2 2.2")

and split them into manageable subsets:

goi.split <- split(goi,rep(1:3,each=2)) 

giving me:

> goi.split
$`1`
  id  t1  t2
1 r1 1.1 2.1
2 r2 1.2 2.2

$`2`
  id  t1  t2
3 r3 1.1 2.2
4 r4 1.2 2.1

$`3`
  id  t1  t2
5 r5 1.1 2.1
6 r6 1.2 2.2

Now I would like to subset the sequences based on the IDs in the goi.split data frames. In this mock example there should be 2 sequences per list item. To subset by the first of the listed data frames I can say:

fasta.1 <- fastafile[c(which(names(fastafile) %in% goi.split[[1]][,1]))]
# $r1
# [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
#
# $r2
# [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"

(and so on), but I would like to subset by all the data frames in one swift move and end up with a list of the desired fastas (3 list items containing, in this case, 2 sequences each). I tried:

fastas <- lapply(fastafile, function(i) {fastafile[c(which(names(fastafile) %in% goi.split[[i]][ ,1]))]}) 

Could you please tell me why this is not working and what I should use instead?

Thanks

This can be done with the following code:

split(fastafile[goi$id], rep(1:3, each = 2))

$`1`
$`1`$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

$`1`$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


$`2`
$`2`$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

$`2`$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


$`3`
$`3`$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

$`3`$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"
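One caveat about indexing with goi$id, offered as a side note rather than as part of the original answer: in R versions before 4.0.0, read.table returns character columns as factors by default, and indexing a list with a factor uses the integer codes, not the labels. In the example data the two happen to coincide because the IDs are already in sorted order. A minimal sketch with made-up toy data (the names toy and ids are hypothetical):

```r
# Toy list standing in for fastafile (hypothetical, shortened sequences).
toy <- list(r1 = "aaa", r2 = "ccc", r3 = "ggg")

# A factor whose levels are sorted as r1, r3, so its integer codes are 2, 1.
ids <- factor(c("r3", "r1"))

# Indexing by the factor uses the codes 2 and 1, which picks r2 and r1.
by_factor <- toy[ids]

# Converting to character first indexes by name, which picks r3 and r1.
by_name <- toy[as.character(ids)]

names(by_factor)
names(by_name)
```

If the id column might be a factor, wrapping it in as.character() before indexing, or reading the table with stringsAsFactors = FALSE, avoids the mismatch.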

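As for why the lapply code is not working: one reason is that you are passing in fastafile, when you should be passing in indices. lapply hands each element of its first argument to the function, so i is a sequence string rather than a position in goi.split.

So on the first iteration you are effectively trying this: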

fastafile[c(which(names(fastafile) %in% goi.split[[fastafile[[1]]]][ ,1]))]
# named list()

when you should be doing this:

fastafile[c(which(names(fastafile) %in% goi.split[[1]][ ,1]))]
# $r1
# [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
#
# $r2
# [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"

So, to fix it, pass in 1:length(goi.split) instead of fastafile:

lapply(1:length(goi.split), function(i) {
    fastafile[c(which(names(fastafile) %in% goi.split[[i]][ ,1]))]
})

[[1]]
[[1]]$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

[[1]]$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


[[2]]
[[2]]$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

[[2]]$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


[[3]]
[[3]]$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

[[3]]$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"
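A possible variation, not from the original answer: since lapply passes each list element to the function, it can iterate over goi.split directly, so no index is needed and the names of goi.split carry over to the result. The sketch below uses shortened, made-up sequences as stand-ins for the real data and assumes the id column holds character values.

```r
# Toy stand-ins for the objects in the question (sequences shortened).
fastafile <- list(r1 = "acat", r2 = "ctaa", r3 = "attt",
                  r4 = "acag", r5 = "ctag", r6 = "attg")
goi <- data.frame(id = paste0("r", 1:6),
                  t1 = c(1.1, 1.2, 1.1, 1.2, 1.1, 1.2),
                  stringsAsFactors = FALSE)
goi.split <- split(goi, rep(1:3, each = 2))

# Each call receives one data frame (df), not an index into goi.split.
fastas <- lapply(goi.split, function(df) {
    fastafile[names(fastafile) %in% df$id]
})

names(fastas)        # list names carried over from goi.split
names(fastas[[2]])   # sequence IDs in the second subset
```

If an index-based loop is preferred, seq_along(goi.split) is a slightly safer spelling of 1:length(goi.split), because it yields an empty sequence when the list is empty.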
