[BioC] DNA motif sequence prediction - finding a method to compare with

Thu Jan 17 14:32:02 CET 2013

Hi,

I've developed a method for motif sequence search, and I'm trying to find 
a method to compare it with, because reviewers like to see how your method 
compares with what is out there. However, I am having some difficulty in 
finding such a method. To be clear, this is not a de novo motif discovery 
method, but is related. So, I'm asking the Bioconductor community for 
help. I'd like to know of methods implemented in software that I can use 
directly, either in Bioconductor or otherwise. Here are more details about 
what I have done.

I've analyzed two
[RSS](http://en.wikipedia.org/wiki/Recombination_signal_sequences)
data sets, each of which is a collection of RSS sequences. The fasta
files for these data sets are at [human 12
RSS](http://www.itb.cnr.it/rss/stats/HS12RSS.fasta) and [mouse 12
RSS](http://www.itb.cnr.it/rss/stats/MM12RSS.fasta).

The main purpose of the analysis is to predict whether sequences not
known to be in this family belong to it. So, I used a cross-validation
method. I divided each data set into 5 parts, and used 4 of the 5
parts as a training set in turn. (The number 5 here is a bit
arbitrary, but since I wanted to include the results per training set,
I didn't want the number to be too large.) After fitting a model to
the training set, I then used this model for prediction as follows.

The RSS data set is contained in gene segments, typically one or two
RSS per gene segment. The gene segments are often much larger than the
RSS. These are 12RSS, so each RSS is of length 28. I took all the gene
segments I could find that contained an RSS, and selected from them
all contiguous sequences of length 28. The current total number of
these sequences is 449905 for one data set and 624400 for the other.
The corresponding numbers of RSS are 118 and 201. Note that the
sequences in these sets are not necessarily all distinct.
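
A minimal sketch of this windowing step, assuming each gene segment is
held as a plain string. The function name `windows` is hypothetical and
is only meant to make the procedure concrete.

```python
def windows(segment, width=28):
    """Return every contiguous subsequence of `width` bases in `segment`.

    A segment of length n gives n - width + 1 windows (none if n < width).
    Duplicates are kept, so the result need not contain distinct values.
    """
    return [segment[i:i + width] for i in range(len(segment) - width + 1)]


# Toy segment of 30 bases: windows start at offsets 0, 1 and 2.
toy_segment = "ACGT" * 7 + "AC"
toy_windows = windows(toy_segment)
```

Applying this to every gene segment and concatenating the results gives
the pool of candidate sequences of length 28.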

I then used the model derived from the training set to calculate
p-values for all these approximately 500,000 sequences, omitting the RSS
sequences that were in the training set. (I'm leaving out some details
here, but I don't think it is important how exactly I calculated the
values.)
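
To make the bookkeeping concrete, here is a rough sketch of the split
into 5 parts and of omitting the training RSS from the sequences to be
scored. It assumes the RSS and the candidate sequences are lists of
strings; the name `cross_validation_rounds` and the random shuffle are
assumptions for illustration, and the model fitting itself is left out.

```python
import random


def cross_validation_rounds(rss, candidates, k=5, seed=0):
    """Yield (training, held_out, to_score) for each of the k rounds.

    `rss` is split into k parts; each part is held out once while the
    other k - 1 parts form the training set. `to_score` is the candidate
    list with every sequence that occurs in the training set removed.
    """
    shuffled = list(rss)
    random.Random(seed).shuffle(shuffled)
    parts = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(parts):
        training = [s for j, part in enumerate(parts) if j != i for s in part]
        in_training = set(training)
        to_score = [c for c in candidates if c not in in_training]
        yield training, held_out, to_score


# Toy data: 10 distinct "RSS" labels plus 2 non-RSS candidates.
toy_rss = ["R%d" % i for i in range(10)]
toy_candidates = toy_rss + ["X", "Y"]
rounds = list(cross_validation_rounds(toy_rss, toy_candidates))
```

Note that because the comparison is by value, a held-out RSS that is
identical to a training RSS would also be dropped from `to_score`.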

Then I ranked the sequences in order of decreasing p-values. The hope
was that the remaining RSS sequences would rank highly in this
ranking, and in the event they did.
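
One way to summarize such a ranking, sketched under the assumption that
the scores are held in a dict mapping each sequence to its value. The
function name `held_out_ranks` is hypothetical, and the sort direction
is a parameter since it depends on how the values are defined.

```python
def held_out_ranks(scores, held_out, descending=True):
    """Return the sorted 1-based ranks of the held-out sequences.

    `scores` maps sequence -> value. Sequences are ordered by value
    (largest first if `descending`); ties keep their insertion order.
    """
    ordered = sorted(scores, key=scores.get, reverse=descending)
    rank = {seq: i + 1 for i, seq in enumerate(ordered)}
    return sorted(rank[s] for s in held_out if s in rank)


# Toy scores: "A" and "C" play the role of held-out RSS.
toy_scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.7}
toy_ranks = held_out_ranks(toy_scores, ["A", "C"])
```

Computing the same ranks for any other method on the same candidates
would give a direct basis for comparison.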

Now, I'd like to find an algorithm which is already implemented in 
software, which can perform a similar procedure on the same data in a 
reasonable amount of time, so I can compare the results. Please let me
know if you are aware of any such implementation, either in Bioconductor
or in some other software package. Also, please CC me on any reply. Thanks.

                                                     Regards, Faheem Mitha