[BioC] read coverage per transcript for RNASeq

Wed Sep 2 20:09:57 CEST 2009

"Jessica Hunter" <hunter at ohsu.edu> writes:

> Thanks for your help.  The command for filtering for quality worked
> perfectly.
>
> As for the coverage, by 85%, I mean I want to figure out which
> transcript sequences I used as a reference have at least 85% of
> their sequence aligning to one or more reads.  And, yes,
> 'chromosome(aln)' gives me the list of transcripts that each read is
> aligned to.  There are over 7 million reads and
> 'unique(chromosome(aln))' gives me the list of 8497 unique
> transcript names aligned the reads to.  I tried, 'cvg <-
> coverage(aln)', but it didn't run, however 'cvg <-

it's not so helpful to say 'it didn't run'; better to cut-n-paste the
command and error. Also it's useful to provide the output of
sessionInfo() -- the next gen sequencing packages in particular are
moving quickly, and there are fairly substantial differences between
released and development versions, and between versions in the
development branch. So I'm using (for this reply, anyway...)

> sessionInfo()
R version 2.9.2 Patched (2009-09-02 r49533)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.2.1   lattice_0.17-25   BSgenome_1.12.3
Biostrings_2.12.8
[5] IRanges_1.2.3

loaded via a namespace (and not attached):
[1] Biobase_2.4.1 grid_2.9.2    hwriter_1.1

> coverage(aln,width=NULL) did run.  However, 'cvg' gives me:
>
> A GenomeData Instance
> Chromosomes(8497):  chr1.....chr8497

I did

> example(readAligned)
> aln = readAligned(ap, "s_2_export.txt", "SolexaExport")
> aln1 = aln[!is.na(position(aln))]
> cvg = coverage(aln1)

> I know there are 8497 unique reference transcripts I am aligning to,
> so this looks right, but there is actually no read data there, as
> far as number of reads aligned to the transcript or what the
> coverage is.  If I look at one transcript using 'cvg[1]' I just get:
>
> A GenomeData Instance
> Chromosomes(1): chr1

yes, '[' is subsetting; more insight comes from "[[", which extracts

> cvg[[1]] # or cvg[["chr10.fa"]]
 'integer' Rle instance of length 121262769 with 47 runs
 Lengths:  35 15600363 35 1003552 35 10075118 35 4665882 35 26227977 ...
 Values :  1 0 1 0 1 0 1 0 1 0 ...

The idea is that the coverage depth (i.e., 'Values') is '1' for a
length (i.e., 'Lengths') of 35 nt, then the coverage depth is '0' for
15600363 nt, then... Likely your 'Rle' will look much different from
this -- greater coverage values and shorter run lengths. Something
like

  sum(cvg[[1]])

is a short way of doing runValue(cvg[[1]]) * runLength(cvg[[1]]),
which is to say the total number of nucleotides covering cvg[[1]].

> As for the coverage, by 85%, I mean I want to figure out which
> transcript sequences I used as a reference have at least 85% of
> their sequence aligning to one or more reads.  And, yes,

Here I think you want to back up and arrange to call coverage in such
a way that the elements of cvg span the length of the transcript (as
written above, coverage(aln) just uses the limits of the range in
which reads are aligned). The 'start' might be easy, if the
transcripts all start at nt '1'. The 'end' needs to be the length of
the transcript along the lines of end = c(chr10.fa=12345, chr11.fa=54321,
...) and then

  cvg = coverage(aln, start=1, end=end)

> Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run
> and I get the error message: cannot coerce type 'S4' to vector of
> type 'integer'.  As a shot in the dark, I tried 'cvgnum <-
> as.integer(cvg)', but this gives me the same error message.

Sorry I was assuming you were using a recent development version,
where some of the kinks have been smoothed out.

The task is to write a filter that satisfies your criterion, e.g.,
returning TRUE with...

   f = function(elt) {
       sum(runValue(elt) != 0) > 0.85 * length(elt)
   }

You might then convert cvg to a list (of Rle's), use the Filter
function to remove elements not satisfying the criterion, and then
store it conveniently as a GenomeData object again

  cvgf = GenomeData(Filter(f, as.list(cvg)))

cvgf now contains those transcripts satisfying f. Returning to aln1,
you could select the reads aligning to these transcripts with

  aln2 = aln1[chromosome(aln1) %in% names(cvgf)]

Martin

> Any ideas?
>
> Jessica  
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org] 
> Sent: Tuesday, September 01, 2009 4:36 PM
> To: Jessica Hunter
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] read coverage per transcript for RNASeq
>
> Hi Jessica -- mixing up the order a bit...
>
>> Also, while I am writing, I was interested in only looking at reads
>> above a certain quality score.  When I input the data with
>> readAligned, I couldnâ€™t figure out how to filter on this variable,
>> so I ended up reading the data just into R, filtering that way, and
>> then reading the data into ShortRead.  Is there a better way to do
>> this?  All the transferring of the data back and forth took a lot of
>> time to process.
>
> readAligned takes a filter argument, see ?srFilter (maybe
> alignQualityFilter ? 'quality' is a bit ambiguous) and the final
> example on ?readAligned. Another route is along the lines of
>
>   aln[ quality(alignQuality(aln)) > certainQualityScore ]
>
>> I have short read sequences for RNASeq that I have then aligned to
>> the reference transcript sequences.  The reads are from a Solexa
>> platform and they were aligned using MAQ.  I have imported the data
>> using ShortReads (readAligned) and all looks good.  However, I am
>> interested in limiting the data to those transcripts with at least
>> 85% coverage.  I canâ€™t find any documentation to do this that
>> doesnâ€™t involved looking at each transcript individually.  Any way
>> to do this for all transcripts (30,000+)?
>
> Hmm. You say they're aligned to the reference transcript sequences, so
> it sounds like chromosome(aln) gives you the transcript each is
> aligned to and
>
>   cvg <- coverage(aln)
>
> gives you a list of 'run-length-encoded' coverage vectors, one for
> each transcript. I'm not sure what '85% coverage' means, but maybe
> you're after something like  
>
>   sum(cvg) > .85 * sapply(cvg, length)
>
> A little more detail might lead to a clearer solution. The
> bioc-sig-sequencing news groups
>
>   https://stat.ethz.ch/pipermail/bioc-sig-sequencing/
>
> has several posts recently that might be helpful.
>
> Hope that helps,
>
> Martin
>
>
> "Jessica Hunter" <hunter at ohsu.edu> writes:
>
>> Hello all.
>>
>>  
>>
>> I have short read sequences for RNASeq that I have then aligned to
>> the reference transcript sequences.  The reads are from a Solexa
>> platform and they were aligned using MAQ.  I have imported the data
>> using ShortReads (readAligned) and all looks good.  However, I am
>> interested in limiting the data to those transcripts with at least
>> 85% coverage.  I canâ€™t find any documentation to do this that
>> doesnâ€™t involved looking at each transcript individually.  Any way
>> to do this for all transcripts (30,000+)?
>>
>>  
>>
>>
>>  
>>
>> Lastly, any good online resources for RNASeq applications with
>> Bioconductor that people can recommend?
>>
>>  
>>
>> Thank you,
>>
>>  
>>
>> Jessica
>>
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793