[BioC] DNA motif sequence prediction - finding a method to compare with

Thu Jan 17 14:57:21 CET 2013

Faheem,

Two places to look in Bioconductor, to get you acquainted with what we currently offer:

   http://www.bioconductor.org/packages/release/BiocViews.html#___MotifDiscovery
   http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/

 - Paul

On Jan 17, 2013, at 5:32 AM, Faheem Mitha wrote:

> 
> Hi,
> 
> I've developed a method for motif sequence search, and I'm trying to find a method to compare it with, because reviewers like to see how your method compares with what is out there. However, I am having some difficulty in finding such a method. To be clear, this is not a de novo motif discovery method, but is related. So, I'm asking the Bioconductor community for help. I'd like to know of methods implemented in software that I can use directly, either in Bioconductor or otherwise. Here are more details about what I have done.
> 
> I've analyzed two
> [RSS](http://en.wikipedia.org/wiki/Recombination_signal_sequences)
> data sets, each of which is a collection of RSS sequences. The fasta
> files for these data sets are at [human 12
> RSS](http://www.itb.cnr.it/rss/stats/HS12RSS.fasta) and [mouse 12
> RSS](http://www.itb.cnr.it/rss/stats/MM12RSS.fasta).
> 
> The main purpose of the analysis is to predict whether sequences not
> already known to be in this family belong to it. So, I used a
> cross-validation method. I divided each data set into 5 parts, and used
> each combination of 4 of the 5 parts as a training set in turn. (The
> number 5 here is a bit
> arbitrary, but since I wanted to include the results per training set,
> I didn't want the number to be too large.) After fitting a model to
> the training set, I then used this model for prediction as follows.
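
A minimal Python sketch of the 5-fold split described above, for anyone who wants to reproduce the setup. The name `k_fold_splits`, the shuffling, and the fixed seed are assumptions made for illustration; the message does not say how the 5 parts were chosen.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (training, held_out) pairs, one pair per fold.

    Assumption: items are shuffled once with a fixed seed and then
    dealt into k parts of near-equal size.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k parts, round-robin.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        # The training set is everything outside the held-out part.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out
```

Each RSS ends up in exactly one held-out part, so every sequence is used for prediction once and for training four times.
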
> 
> The RSS data set is contained in gene segments, typically one or two
> RSS per gene segment. The gene segments are often much larger than the
> RSS. These are 12RSS, so each RSS is of length 28. I took all the gene
> segments I could find that contained an RSS, and selected from them
> all contiguous sequences of length 28. The current total number of
> these sequences is 449905 for one, and 624400 for the other. The
> corresponding numbers of RSS are 118 and 201. Note that these sets did
> not necessarily contain all distinct values.
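
Selecting all contiguous sequences of length 28 from a gene segment amounts to a sliding window with step 1. A minimal Python sketch, where `windows` is a hypothetical helper name and segments are assumed to be plain strings:

```python
def windows(segment, width=28):
    """Return every contiguous substring of the given width.

    A segment of length n yields n - width + 1 windows, or none
    if the segment is shorter than the width.
    """
    return [segment[i:i + width] for i in range(len(segment) - width + 1)]


def all_windows(segments, width=28):
    """Collect windows from many segments, keeping duplicates."""
    collected = []
    for segment in segments:
        collected.extend(windows(segment, width))
    return collected
```

Duplicates are kept here on purpose, matching the remark that the sets need not contain all distinct values.
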
> 
> I then used the model derived from the training set to calculate
> p-values for all these approximately 500,000 sequences, omitting the RSS
> sequences that were in the training set. (I'm leaving out some details
> here, but I don't think it is important how exactly I calculated the
> values.)
> 
> Then I ranked the sequences in order of decreasing p-values. The hope
> was that the remaining RSS sequences would rank highly in this
> ranking, and in the event they did.
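
The ranking check could be sketched in Python roughly as follows. This is only an illustration: `ranks_of_held_out` is a hypothetical name, the scores are assumed to arrive as a dict from sequence to value (which collapses duplicate sequences), and how the values are computed is left out, as in the message.

```python
def ranks_of_held_out(scores, held_out):
    """Return the sorted 1-based ranks of the held-out sequences.

    scores: dict mapping each candidate sequence to its value
            (assumption: larger values rank first, as described).
    held_out: the RSS sequences that were not in the training set.
    """
    # Order candidate sequences by decreasing value.
    ordered = sorted(scores, key=scores.get, reverse=True)
    position = {seq: i + 1 for i, seq in enumerate(ordered)}
    # Held-out sequences missing from the candidates are skipped.
    return sorted(position[seq] for seq in held_out if seq in position)
```

Running the same function on the output of any competing method would give ranks that can be compared directly, fold by fold.
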
> 
> Now, I'd like to find an algorithm which is already implemented in software, which can perform a similar procedure on the same data in a reasonable amount of time, so I can compare the results. Please let me know if you know of any such things, either in Bioconductor or some other software package. Also, please CC me on any reply. Thanks.
> 
>                                                    Regards, Faheem Mitha
> 