[BioC] Rsubread vs. BWA, Bowtie, etc. and RPKM vs. normalized counts

Tue Oct 11 00:49:35 CEST 2011

Hi Tim,

For the reads of 36bp long, Subread extracts seven 16 mers from each read.
The default cutoff for calling a hit is to require at least three
consensus 16 mers (mapped to the same location). But this setting assumes
that ten 16 mers are extracted from the read. For your data, you should
use TH1=2 rather than the default setting, because the reads are very
short and you can not extract ten 16 mers from each of them.

On the other hand, you might check the sequencing quality of your data. If
the quality is poor, it will be hard to achieve a high mapping percentage
without losing mapping accuracy. The qualityScores function in Rsubread
can help you to retrieve the quality scores from your data.

Cheers,
Wei

> Oh, cool.  So I pulled the devel version, indexed hg19 with 7400MB of RAM,
> and went to align my reads...
>
> 29813949 reads were processed in 716.9 seconds.
> Percentage of mapped reads is 41.04%.
>
> That was with indels=1.  I am told that this is fairly low, as far as
> mapped
> counts go.  These are 36-base single-end reads.  One piece of advice was
> "run TopHat on it and see if it aligns any more of the reads", so I'm
> doing
> that, but as you pointed out, a very nice feature of Rsubread for me is
> featureCounts.  It's also really fast, so if I can get away with using it,
> I
> am thinking that I will do so from now on...
>
> Thanks for a fast and easy to use alignment package,
>
> --Tim
>
> On Mon, Oct 10, 2011 at 3:26 PM, Wei Shi <shi at wehi.edu.au> wrote:
>
>> Dear Tim,
>>
>> Yes, Subread performs read mapping by mapping a set of 16 mers extracted
>> from each read to the genome, counting the number of mapped 16 mers at
>> each candidate location and then choosing the one which is mapped to by
>> the majority of the 16 mers as the mapping location of the read. The
>> fundamental difference between Subread and other aligners is that it
>> uses
>> a voting method rather than a extension method to determine the mapping
>> locations of the reads, which makes it a lot faster and more sensitive.
>>
>> Subread has a both a C version and an R version. The C version is freely
>> available from sourceforge and the R version is included in the Rsubread
>> package in Bioc.
>>
>> The Rsubread package also includes a function called featureCounts,
>> which
>> can be used to count the number of reads for each exon or gene. So this
>> function will be useful for you to look at the differential expression
>> at
>> both gene level and exon level.
>>
>> Another function which might be useful for your data analysis is the
>> subjunc funtion, which is designed to discover exon junctions. Subjunc
>> uses an idea similar to that of Subread. Our preliminary results showed
>> that subjunc outperformed competing junction detectors in terms of
>> speed,
>> sensitivity and accuracy.
>>
>> The devel version of Rsubread package includes a lot of our recent
>> development for both Subread aligner and Subjunc junction detector, so I
>> would recommend using the devel version if you want to try the Rsubread
>> package.
>>
>>
>> Hope this helps.
>>
>> Cheers,
>> Wei
>>
>>
>>
>>
>>
>> > A professor sent me a bunch of raw RNA-seq reads (as FASTQ files) and
>> I
>> > want
>> > to align them, and I couldn't really make heads or tails of the
>> options,
>> > so
>> > I listened to what Phil Green told me at a conference and looked
>> around
>> > for
>> > a sensible word-nucleated aligner like he described.  It seems that
>> > Rsubread
>> > works this way?
>> >
>> > http://sourceforge.net/projects/subread/
>> >
>> > I would like BAM files as intermediate output, but my real interest is
>> > differential exon usage in differentiating cells.  Given that the
>> reads I
>> > have to align are relatively short (36bp, SE), is there an advantage
>> or
>> > disadvantage in using subread compared to other options?  And when I'm
>> > done
>> > trimming and aligning, I could choose raw counts, conditional quantile
>> > normalized counts, or something like RPKM to summarize how often a
>> given
>> > exon seems to have been transcribed.  I read this:
>> >
>> > http://seqanswers.com/forums/showthread.php?t=586
>> >
>> > and I see that packages using a Gamma prior for the dispersion of a
>> > Poisson
>> > count model benefit from having raw counts.  If I am after correlated
>> > changes in exon usage depending on other sequence features, is it
>> > reasonable
>> > to use (say) 'cqn' on the raw counts, then log-transform and work with
>> > those
>> > normalized counts?
>> >
>> > Thanks for any suggestions,
>> >
>> > --
>> > Tim Triche, Jr.
>> > USC Biostatistics
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> > http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >
>>
>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for
>> the
>> addressee.
>> You must not disclose, forward, print or use it without the permission
>> of
>> the sender.
>> ______________________________________________________________________
>>
>
>
>
> --
> If people do not believe that mathematics is simple,
> it is only because they do not realize how complicated life is. John von
> Neumann<http://www-groups.dcs.st-and.ac.uk/~history/Biographies/Von_Neumann.html>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}