[BioC] Rsubread alingner

Thu Apr 21 03:49:33 CEST 2011

Joao

Reading the composition bias you see in the beginning of the reads, read

1.	Hansen, K.D., Brenner, S.E. & Dudoit, S. Biases in Illumina
transcriptome sequencing caused by random hexamer priming. Nucleic
Acids Res 38, e131 (2010).

The reason why you map more reads by trimming them is the standard
reason: you have an error rate in your reads and depending on your
alignment strategy you will always align more reads by trimming them
(until you get to a phase shift where very few reads align).  However,
by trimming the reads you will loose genomic regions that suddenly
become non-unique while the additional reads you succeed in aligning
most likely will be randomly distributed across the expressed genes
(meaning most of the reads will come from the highly expressed genes).
 In my opinion, it is not clear that trimming is better (however,
sometimes the first couple of bases have high error rates as well in
which case you might consider trimming a few bases).

Generally, you don't align more reads because you chomp off the bit
with a composition bias, only because of error rates.

All of this is assuming you are doing RNA-seq using random hexamer
priming, which is the standard library prep for Illumina RNA-seq.

Kasper

On Wed, Apr 20, 2011 at 8:44 PM, João Moura <palerma at gmail.com> wrote:
> Dear Wei Shi,
>
> First of all, thanks a lot for you help. Maybe I was doing it wrong also
> with bowtie, but the truth is I'm a master student, thus quite armature in
> all of this.
>
> Well, the reason why I trimmed my reads in first place was that sequence
> content across all bases is biased (which i think is a common problem in
> RNAseq). As you can see in the picture I send you, the first 14/15 position
> are biased (which I think is due to sequencing adapters?) So, the reason I
> followed was to trim the reads in order to be able to map the 22 bases that
> otherwise would be unmapped because of the adapters or other artifacts in
> the first 14/15 of the reads of the reads. But, since you say that 16bp is
> the minimum requirement for Rsubread, it seems that this problem will be
> solved using your software, right?
>
> Do you know if bowtie also uses this approach of subseting the reads?
> Because if I trim the reads I get an improvement of ~15% on the alignments.
>
> Once again, thank you very much for your help.
>
> On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi at wehi.edu.au> wrote:
>
>> Dear João:
>>
>> The default setting of Rsubread alignment requires at least 3 subreads
>> (16bp substrings extracted from the read) to be mapped to the same location
>> to report a hit. Subreads are extracted at a gap of 2bp from the read
>> sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will
>> be performed for mapping three sets of subreads respectively, which are
>> extracted from different start positions of the read (1st base, 2nd base and
>> 3rd base).
>>
>> Therefore, for your trimmed reads which are 20bp long, there will be only
>> up to 2 subreads extracted from each read and they were used for mapping.
>> However, because the consensus cutoff (minimal number of consensus subreads
>> needed to report a hit) is 3 by default (which I believe is the value used
>> in your analysis), no hits can be reported because of insufficient number of
>> extracted subreads.
>>
>> DNA sequences of 20 bases are highly repetitive in human or mouse genome.
>> But I do not know if this is the case for yeast genome. However with shorter
>> read, you are more likely to get ambiguous mapping locations.
>>
>> I would suggest you to use the full read sequences for mapping to take
>> advantage of all the base information in the reads. When mapping 36bp reads,
>> Rsubread can extract 7 subreads from each read. This gives a lot more power
>> for finding mapping locations in a sensitive and accurate way. You can use a
>> consensus threshold of 2 (TH1=2) to call mapping locations, which we found
>> works quite well.
>>
>> Hope this helps.
>>
>> Cheers,
>> Wei
>>
>> On Apr 21, 2011, at 12:31 AM, João Moura wrote:
>>
>> Dear Wei Shi,
>>
>> Indeed I can run the example of the vignette but not my own trimmed reads.
>> With bowtie I get more than 80% but with RSubread i get 0%. I think I got
>> why. the thing is I'm trimming the reads in R and trying to realign them
>> afterwards. But when I'm triming, I trimmed the quality and DNA strings. So,
>> a line like this:
>>
>> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36
>>
>>
>> Will contain "length=36" all the time, even if my reads and length 20. Do
>> you think this might be a problem? Because if I run with untrimmed reads I
>> get some reasonable results.
>>
>> I sent you a untrimmed version and a trimmed version. Maybe you can try...
>>
>>
>> Thanks!
>>
>>
>> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi at wehi.edu.au> wrote:
>>
>>> Dear Joao:
>>>
>>> Can you run the example enclosed with Rsubread package?
>>>
>>> Cheers,
>>> Wei
>>>
>>> On Apr 20, 2011, at 3:23 AM, João Moura wrote:
>>>
>>> Dear Wei Shi,
>>>
>>> I'm trying out the Rsubread and I'm having strange results. That is I
>>> firstly built the reference genome, use a fasta file like this:
>>>
>>> *[joao at goedel genomes]$ grep ">" yeast.fna *
>>> *>Scchr01 1 - 230208*
>>> *>Scchr02 1 - 813178*
>>> *>Scchr03 1 - 316617*
>>> *>Scchr04 1 - 1531917*
>>> *>Scchr05 1 - 576869*
>>> *>Scchr06 1 - 270148*
>>> *>Scchr07 1 - 1090946*
>>> *>Scchr08 1 - 562643*
>>> *>Scchr09 1 - 439885*
>>> *>Scchr10 1 - 745667*
>>> *>Scchr11 1 - 666454*
>>> *>Scchr12 1 - 1078175*
>>> *>Scchr13 1 - 924429*
>>> *>Scchr14 1 - 784333*
>>> *>Scchr15 1 - 1091289*
>>> *>Scchr16 1 - 948062*
>>> *>Scmito 1 - 85779*
>>> *
>>> *
>>>
>>> In R I ran the following code:
>>> *
>>> *
>>>
>>> *ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")*
>>> *path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")*
>>> *buildindex(basename = file.path(path, "reference_index"), reference =
>>> ref)*
>>> *reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$",
>>> full=TRUE)*
>>> *align(index = file.path(path, "reference_index"), readfile1 =
>>> reads[[1]],output_file = file.path(path, "alignResults.SAM"))*
>>> *
>>>
>>> *
>>> *
>>> 11152683 reads were processed in 121.8 seconds.
>>> *
>>> *
>>> Percentage of successfully mapped reads is 0.00%.
>>> *
>>> *
>>>
>>> *
>>>  *
>>>
>>> *
>>> *
>>>
>>> *
>>> *
>>> Completed successfully.
>>> *
>>>
>>> *
>>>
>>> *
>>>
>>> If I do the same with bowtie I get more than 70%..What am I doing wrong
>>> here? Thanks a lot!
>>>
>>> --
>>> João Moura
>>>
>>>
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for the
>>> addressee.
>>> You must not disclose, forward, print or use it without the permission of
>>> the sender.
>>> ______________________________________________________________________
>>>
>>
>>
>>
>> --
>> João Moura
>>  <thousand.fastq><untri_thousand.fastq>
>>
>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and inte...{{dropped:15}}
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>