[R] alignment algorithm or pattern frequency calculation

jim holtman jholtman at gmail.com
Fri Oct 19 03:04:02 CEST 2007


Here is an approach that will work as long as there are less than 32
unique characters in the strings.  It uses logical operations to make
the comparison

>library(bitops)
> x <- scan(textConnection('a,b,c,d
+ a,b,c
+ b,c
+ a,b,c,d,e'), sep='\n', what='')
Read 4 items
> x <- strsplit(x, ',')
> x
[[1]]
[1] "a" "b" "c" "d"

[[2]]
[1] "a" "b" "c"

[[3]]
[1] "b" "c"

[[4]]
[1] "a" "b" "c" "d" "e"

> strings <- unique(unlist(x))  # get unique characters
> # convert to integers to use 'bit' operations
> x.i <- sapply(x, function(.str){
+     as.integer(sum(2^(match(.str, strings) - 1)))
+ })
> # create a vector of masks for testing
> mask <- 1:(2^(length(strings)) - 1)
> # 'remove' entries with single characters
> mask[2^(0:(length(strings) - 1))] <- -1
> #setup the strings corresponding to the mask vector
> char.mask <- character(length(mask))
> temp.mask <- mask  # we will be processing this vector
> for (i in seq_along(mask)){
+     j <- which(bitAnd(temp.mask, 1) == 1)
+     char.mask[j] <- paste(char.mask[j], strings[i])
+     temp.mask <- bitShiftR(temp.mask, 1)
+ }
> # now count the matches we have with the masks
> count <- sapply(mask, function(.mask){
+     # if 'and' result is the same as the mask => those characters
are in the string
+     sum(bitAnd(x.i, .mask) == .mask)
+ })
> # determine which counts >= 2
> gte.2 <- which(count >= 2)
> cbind(char.mask[gte.2], count[gte.2])
      [,1]       [,2]
 [1,] " a b"     "3"
 [2,] " a c"     "3"
 [3,] " b c"     "4"
 [4,] " a b c"   "3"
 [5,] " a d"     "2"
 [6,] " b d"     "2"
 [7,] " a b d"   "2"
 [8,] " c d"     "2"
 [9,] " a c d"   "2"
[10,] " b c d"   "2"
[11,] " a b c d" "2"
>
>


On 10/18/07, Weiwei Shi <helprhelp at gmail.com> wrote:
> The rule is search on any combinations whose size is more than 1, like
> ab, ac, abc, ...
>
> The storage can be like b2
> > b2
> [[1]]
> [1] "a" "b" "c" "d"
>
> [[2]]
> [1] "a" "b" "c"
>
> [[3]]
> [1] "b" "c"
>
> [[4]]
> [1] "a" "b" "c" "d" "e"
>
>
> On 10/18/07, Nordlund, Dan (DSHS/RDA) <NordlDJ at dshs.wa.gov> wrote:
> > > -----Original Message-----
> > > From: r-help-bounces at r-project.org
> > > [mailto:r-help-bounces at r-project.org] On Behalf Of Weiwei Shi
> > > Sent: Thursday, October 18, 2007 3:32 PM
> > > To: jim holtman
> > > Cc: r-help at stat.math.ethz.ch
> > > Subject: Re: [R] alignment algorithm or pattern frequency calculation
> > >
> > > for example,
> > >
> > > a,b: 3 which means a and b appear together 3 times in my input list.
> > >
> > > On 10/18/07, jim holtman <jholtman at gmail.com> wrote:
> > > > It would be helpful if you explained what the numbers mean.
> > > >
> > > > On 10/18/07, Weiwei Shi <helprhelp at gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > I am looking for an algorithm (better if it is
> > > implemented in R) which
> > > > > can do the following:
> > > > >
> > > > > from the following list:
> > > > > a,b,c,d
> > > > > a,b,c
> > > > > b,c
> > > > > a,b,c,d,e
> > > > >
> > > > > to calculate
> > > > > a,b,c,d: 2
> > > > > a,b,c: 3
> > > > > a,b: 3
> > > > > a,c: 3
> > > > > b,c: 4
> > > > > b,c,d:2
> > > > >
> > > > > here, the order is not important.
> > > > >
> > > > > Thanks.
> > > > >
> >
> > I guess I have more questions than answers.
> > 1. What are the rules for deciding on which patterns to search for? You don't list c,d or c,d,e, or single character patterns.
> > 2. Are these patterns stored in comma separated strings or as character vectors or...?
> >
> > Dan
> >
> > Daniel J. Nordlund
> > Research and Data Analysis
> > Washington State Department of Social and Health Services
> > Olympia, WA  98504-5204
> >
> >
> >
> >
>
>
> --
> Weiwei Shi, Ph.D
> Research Scientist
> GeneGO, Inc.
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?



More information about the R-help mailing list