regex - Count Unique Occurrences in Text String in R
I am using R and have a data frame containing strings made up of 4 unique letters (DNA). I am interested in counting the number of times unique combinations of letters occur in these strings. One possible scenario is to detect how many times I see the same letter back to back.
I have come across several possible ways to achieve this using regex and packages such as stringr, but I still have one problem.
These methods do not seem to iterate through the string letter by letter and consider the next letter in line when counting an occurrence. This is a problem when the same letter is repeated more than 2x.
Example (where I want to count the times "cc" occurs, and the true_count column is the desired output):
    sequence  stringr_count  true_count
    acctacgt  1              1
    cccccccc  4              7
    acccgcct  2              3
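To make the gap between the two columns concrete, here is a minimal base-R sketch. It assumes the stringr_count column comes from ordinary non-overlapping matching, which a fixed-pattern gregexpr reproduces here. The helper name count_overlapping is hypothetical and not part of any package; it simply slides a window over the string one letter at a time.

```r
seqs <- c("acctacgt", "cccccccc", "acccgcct")

# Non-overlapping matching: after a match, the search resumes past the match,
# so "cccc" yields 2 hits for "cc" rather than 3.
non_overlapping <- sapply(
  gregexpr("cc", seqs, fixed = TRUE),
  function(m) sum(m > 0)  # gregexpr returns -1 when nothing matches
)

# Hypothetical helper: compare every window of nchar(pattern) letters
# against the pattern, moving one letter at a time.
count_overlapping <- function(x, pattern) {
  k <- nchar(pattern)
  vapply(x, function(s) {
    n <- nchar(s)
    if (n < k) return(0L)
    starts <- seq_len(n - k + 1L)
    sum(substring(s, starts, starts + k - 1L) == pattern)
  }, integer(1), USE.NAMES = FALSE)
}

overlapping <- count_overlapping(seqs, "cc")

print(data.frame(sequence = seqs, non_overlapping, overlapping))
```

The sliding window is easy to reason about, but it is only meant as an illustration of the counting rule, not as a fast implementation.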
I recommend using stringi::stri_count_fixed with overlapping matches enabled, as follows:
    > library(stringi)
    > seqs <- data.frame(sequence=c('acctacgt', 'cccccccc', 'acccgcct'))
    > opts <- stri_opts_fixed(overlap=TRUE)
    > seqs$true_count <- stri_count_fixed(str=seqs$sequence, pattern='cc', opts_fixed=opts)
    > seqs
      sequence true_count
    1 acctacgt          1
    2 cccccccc          7
    3 acccgcct          3
With a fixed pattern, stringi is an order of magnitude faster than using gregexpr:
    library(microbenchmark)

    # answer provided by @user20650 in the comments:
    # a zero-width lookahead matches at every position where 'cc' starts
    f1 <- function(x) sapply(gregexpr('(?=cc)', x, perl=TRUE), function(i) sum(i > 0))

    # stringi with overlapping fixed-pattern matching
    f2 <- function(x) stri_count_fixed(
      str=x, pattern='cc', opts_fixed=stri_opts_fixed(overlap=TRUE))

    # generate 10000 random sequences of length 1000
    sequence <- stri_rand_strings(n=10000, length=1000, pattern='[atgc]')
Microbenchmark results:

    > microbenchmark(f1(sequence), f2(sequence))
    Unit: milliseconds
             expr       min        lq      mean    median       uq       max neval
     f1(sequence) 290.90393 304.87107 329.11392 313.39819 327.9860 437.10229   100
     f2(sequence)  20.99733  21.12559  21.39206  21.26017  21.4377  27.68867   100
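For readers wondering why the gregexpr variant in f1 counts overlapping occurrences at all, the following small sketch may help. The lookahead (?=cc) matches a position rather than any characters, so each match has width zero and the regex engine moves forward by a single letter before trying again. The variable names are illustrative only.

```r
# Inspect the raw gregexpr result for one sequence.
m <- gregexpr("(?=cc)", "acccgcct", perl = TRUE)[[1]]

starts <- as.integer(m)            # positions where "cc" begins
widths <- attr(m, "match.length")  # width of each match

print(starts)
print(widths)
```

Because nothing is consumed by a match, the second "cc" in "ccc" is still available, which is exactly the behaviour the question asks for.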
You may also take a look at the Biostrings library. In my experience it is slower than working with stringi and requires additional steps, but it provides many useful functions designed to work with biological sequences, including countPattern:
    library(Biostrings)

    # the additional step: convert the character vector to a DNAStringSet
    bsequence <- DNAStringSet(sequence)
    f3 <- function(x) vcountPattern('cc', x)
Microbenchmark results:

    > microbenchmark(f2(sequence), f3(bsequence))
    Unit: milliseconds
              expr      min       lq     mean   median       uq      max neval
      f2(sequence) 20.83336 21.11473 21.36759 21.25088 21.45000 23.80708   100
     f3(bsequence) 86.95430 89.10023 89.51665 89.37103 89.87699 91.88203   100
And just to be sure that all three approaches agree:

    > identical(f1(seqs$sequence), f2(seqs$sequence)) &&
    +   identical(f2(seqs$sequence), f3(BStringSet(seqs$sequence)))
    [1] TRUE