regex - Count Unique Occurrences In Text String In R -


I am using R and have a data frame containing strings made up of 4 unique letters (DNA). I am interested in counting the number of times unique combinations of letters occur in these strings. One of the possible scenarios is to detect how many times I see the same letter back to back.

I have come across several possible ways to achieve this using regex and packages like stringr, but I still have one problem.

These methods do not seem to iterate through the string letter by letter and consider the next letter in line when counting an occurrence. This is a problem when the same letter is repeated more than 2x.

Example (where I want to count the number of times "cc" occurs, and the true_count column is the desired output):

    sequence  stringr_count  true_count
    acctacgt  1              1
    cccccccc  4              7
    acccgcct  2              3

I recommend using stringi::stri_count_fixed as follows:

    > library(stringi)
    > seqs <- data.frame(sequence=c('acctacgt', 'cccccccc', 'acccgcct'))
    > opts <- stri_opts_fixed(overlap=TRUE)
    > seqs$true_count <- stri_count_fixed(str=seqs$sequence, pattern='cc', opts_fixed=opts)
    > seqs
      sequence true_count
    1 acctacgt          1
    2 cccccccc          7
    3 acccgcct          3
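To make explicit what counting with overlap means, here is a minimal base R sketch that checks every possible start position in each string. The helper name count_overlapping is hypothetical and not part of stringi; it is only meant to illustrate the idea, not to replace stri_count_fixed.

```r
# Hypothetical helper (not from stringi): count overlapping occurrences of
# `pattern` in each element of `x` by testing every possible start position.
count_overlapping <- function(x, pattern) {
  x <- as.character(x)  # also accept factors, e.g. data frame columns in older R
  k <- nchar(pattern)
  vapply(x, function(s) {
    n <- nchar(s)
    if (n < k) return(0L)  # string shorter than the pattern: no match possible
    starts <- seq_len(n - k + 1L)
    # Compare each length-k window with the pattern and count the hits
    sum(substring(s, starts, starts + k - 1L) == pattern)
  }, integer(1), USE.NAMES = FALSE)
}

count_overlapping(c("acctacgt", "cccccccc", "acccgcct"), "cc")
# [1] 1 7 3
```

Because each window is tested independently, a run such as "cccccccc" contributes one match per start position (7), which is the true_count the question asks for.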

With a fixed pattern, stringi is about an order of magnitude faster than using gregexpr:

    library(microbenchmark)

    # answer provided by @user20650 in the comments
    f1 <- function(x) sapply(gregexpr('(?=cc)', x, perl=TRUE), function(i) sum(i>0))

    f2 <- function(x) stri_count_fixed(
        str=x, pattern='cc',
        opts_fixed=stri_opts_fixed(overlap=TRUE))

    # generate random sequences
    sequence <- stri_rand_strings(n=10000, length=1000, pattern='[atgc]')

microbenchmark results:

    > microbenchmark(f1(sequence), f2(sequence))
    Unit: milliseconds
             expr       min        lq      mean    median       uq       max neval
     f1(sequence) 290.90393 304.87107 329.11392 313.39819 327.9860 437.10229   100
     f2(sequence)  20.99733  21.12559  21.39206  21.26017  21.4377  27.68867   100
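The gregexpr variant (f1) works because '(?=cc)' is a zero-width lookahead: it matches at a position without consuming any characters, so the next search starts one character later instead of after the whole match. A small base R sketch of the difference, where count_matches is a hypothetical helper introduced only for this comparison:

```r
x <- c("acctacgt", "cccccccc", "acccgcct")

# Hypothetical helper: number of matches per string; gregexpr() returns -1
# when there is no match, so only positive start positions are counted.
count_matches <- function(pattern, x, ...) {
  sapply(gregexpr(pattern, x, ...), function(m) sum(m > 0))
}

plain     <- count_matches("cc", x)                    # consumes both letters
lookahead <- count_matches("(?=cc)", x, perl = TRUE)   # consumes nothing

plain      # [1] 1 4 2
lookahead  # [1] 1 7 3
```

The plain pattern reproduces the stringr_count column from the question, while the lookahead reproduces true_count.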

You may also take a look at the Biostrings library. In my experience it is slower than working with stringi and requires additional steps, but it provides many useful functions designed to work with biological sequences, including countPattern (vcountPattern is its vectorized counterpart):

    library(Biostrings)

    bsequence <- DNAStringSet(sequence)
    f3 <- function(x) vcountPattern('cc', x)

microbenchmark results:

    > microbenchmark(f2(sequence), f3(bsequence))
    Unit: milliseconds
              expr      min       lq     mean   median       uq      max neval
      f2(sequence) 20.83336 21.11473 21.36759 21.25088 21.45000 23.80708   100
     f3(bsequence) 86.95430 89.10023 89.51665 89.37103 89.87699 91.88203   100

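And just to be sure that all three give the same result (identical() compares two objects at a time, so the check is done pairwise):

    > identical(f1(seqs$sequence), f2(seqs$sequence)) &&
    +     identical(f2(seqs$sequence), f3(BStringSet(seqs$sequence)))
    [1] TRUE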
