[BioC] Finding my un-counted reads

Thu Aug 29 00:15:45 CEST 2013

Hi Sam,

## counting reads

If you haven't done this already I'd confirm counts on a single file. A 
bam can be read in with readGAlignments(). If you have paired-end data 
you can use ?readGAlignmentPairs or ?readGAlignmentsList.

     bf <- BamFile("pathToOneBam")
     reads <- readGAlignments(bf) ## devel
     reads <- readGappedAlignments(bf) ## release

Take a look at the makeTranscriptDbFromGFF() function in 
GenomicFeatures. It creates a TranscriptDb object from gff or gtf. The 
advantage of working with a TxDb is that you get the accessors such as 
cdsBy(..., "gene"), etc. for free (you may know this, just an fyi).

Get an idea of how many reads hit each list element of the GRangesList. 
This method does double count so it's possible for the sum of 'olap' to 
be greater than the length of 'reads'.

     olap <- countOverlaps(cds_gene, reads)

If the numbers in 'olap' are very small (and this is unexpected) you may 
have a problem with how you've created the 'cds_gene' annotation. If the 
numbers make sense then try summarizeOverlaps.

     se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty")

If the number of hits in assays(se)$counts are greatly reduced you may 
have many reads overlapping each other or a single feature in 
'cds_gene'. This can be checked by looking at the numbers in 'olap'. If 
you are using the devel version (GenomicRanges >= 1.13.9) the 
'inter.feature' argument allows reads to be counted for each feature 
they overlap.

## mapped vs unmapped reads

readGAlignments() only imports mapped reads; unmapped are dropped. If 
you are interested in the number of mapped vs unmapped and some details 
about them you can use scanBam(). If your file is large you can use the 
'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for 
param details.

All reads in the file, mapped and unmapped.
     scn <- scanBam(bf)
     names(scn[[1]])

Unmapped only.
     flag1 <- scanBamFlag(isUnmappedQuery=TRUE)
     param1 <- ScanBamParam(what=scanBamWhat(), flag=flag1)
     unmapped <- (bf, param1)

Mapped only.
     flag2 <- scanBamFlag(isUnmappedQuery=FALSE)
     param1 <- ScanBamParam(what=scanBamWhat(), flag=flag2)
     mapped <- (bf, param2)

Valerie

On 08/28/2013 01:50 PM, Sam McInturf wrote:
> Bioconductors,
>      I am working on an Arabidopsis RNA sequencing experiment and have run
> into some trouble.  I found that I was able to map ~94% of my ~35 million
> reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp
> single end reads).  After I counted my mapped reads I found that only about
> 47% of my mapped reads were counted. I found out there is a strong
> possibility that there may be genomic DNA contamination, and I would like
> to verify this hypothesis.
> I used GenomicRanges and rtracklayer to do my read counting:
>
> library(GenomicRanges)
> library(rtracklayer)
> library(Rsamtools)
> setwd("/myDir")
> myBamFiles <- list.files(pattern = ".bam$")
> myBamIndex <- list.files(pattern = ".bai$")
>
> bfl <- BamFileList(myBamFiles, index = myBamIndex)
> gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf",
> asRangedData = FALSE)
> cds <- subset(gff0, type == "CDS")
> cds_gene <- reduce(split(cds, cds$gene_id))
> olap <- summarizeOverlaps(cds_gene, bfl, mode = "IntersectionNotEmpty")
>
> My intention with subsetting for the CDS and then splitting on gene_id is
> to create a GRangesList where each element is a unique AGI number and has
> all of the features.
> To extract the number of reads counted I use:
>
> colSums(assays(olap)$counts)
>
> The output of colSums(...) is how I have found that less than half my reads
> mapped to transcripts.  The first question becomes "Is this how to count
> the number of reads mapped?"  The second question becomes "how do I figure
> out where the reads went?"
>
> My initial inclination would be to make a GRangesList that contained
> intervals between genes and then do the same thing as before and see if I
> can find my other reads.  Something in the vein of:
>
> maxLimits <- t(as.dataframe(lapply(cds_gene, function(x)
> c(min(start(ranges(x))),
>     max(end(  ranges(x)))
>     )
> )
> ))
> which is just an elaborate lapply call to get the smallest start coordinate
> and the largest end coordinate for each unique gene_id and then transform
> it to a pretty data frame.  From there I could find a way to get the
> 'in-betweens' of these features, but I am not sure that is the most prudent
> thing to do.  I looked at some of the ShortRead and GenomicRanges functions
> for a guiding hand and found the disjointBins() function may be useful in
> establishing features to map over, but am at a bit of a loss as I don't
> have the expertise to figure this out straight away.
>    Any suggestions or resources to find mapped but uncounted reads would be
> helpful!
>
>
> Thanks
>