[BioC] Finding my un-counted reads
vobencha at fhcrc.org
Thu Aug 29 00:15:45 CEST 2013
## counting reads
If you haven't done this already I'd confirm counts on a single file. A
bam can be read in with readGAlignments(). If you have paired-end data
you can use ?readGAlignmentPairs or ?readGAlignmentsList.
bf <- BamFile("pathToOneBam")
reads <- readGAlignments(bf) ## devel
reads <- readGappedAlignments(bf) ## release
Take a look at the makeTranscriptDbFromGFF() function in
GenomicFeatures. It creates a TranscriptDb object from gff or gtf. The
advantage of working with a TxDb is that you get the accessors such as
cdsBy(..., "gene"), etc. for free (you may know this, just an fyi).
Get an idea of how many reads hit each list element of the GRangesList.
This method does double count so it's possible for the sum of 'olap' to
be greater than the length of 'reads'.
olap <- countOverlaps(cds_gene, reads)
If the numbers in 'olap' are very small (and this is unexpected) you may
have a problem with how you've created the 'cds_gene' annotation. If the
numbers make sense then try summarizeOverlaps.
se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty")
If the number of hits in assays(se)$counts are greatly reduced you may
have many reads overlapping each other or a single feature in
'cds_gene'. This can be checked by looking at the numbers in 'olap'. If
you are using the devel version (GenomicRanges >= 1.13.9) the
'inter.feature' argument allows reads to be counted for each feature
## mapped vs unmapped reads
readGAlignments() only imports mapped reads; unmapped are dropped. If
you are interested in the number of mapped vs unmapped and some details
about them you can use scanBam(). If your file is large you can use the
'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for
All reads in the file, mapped and unmapped.
scn <- scanBam(bf)
flag1 <- scanBamFlag(isUnmappedQuery=TRUE)
param1 <- ScanBamParam(what=scanBamWhat(), flag=flag1)
unmapped <- (bf, param1)
flag2 <- scanBamFlag(isUnmappedQuery=FALSE)
param1 <- ScanBamParam(what=scanBamWhat(), flag=flag2)
mapped <- (bf, param2)
On 08/28/2013 01:50 PM, Sam McInturf wrote:
> I am working on an Arabidopsis RNA sequencing experiment and have run
> into some trouble. I found that I was able to map ~94% of my ~35 million
> reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp
> single end reads). After I counted my mapped reads I found that only about
> 47% of my mapped reads were counted. I found out there is a strong
> possibility that there may be genomic DNA contamination, and I would like
> to verify this hypothesis.
> I used GenomicRanges and rtracklayer to do my read counting:
> myBamFiles <- list.files(pattern = ".bam$")
> myBamIndex <- list.files(pattern = ".bai$")
> bfl <- BamFileList(myBamFiles, index = myBamIndex)
> gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf",
> asRangedData = FALSE)
> cds <- subset(gff0, type == "CDS")
> cds_gene <- reduce(split(cds, cds$gene_id))
> olap <- summarizeOverlaps(cds_gene, bfl, mode = "IntersectionNotEmpty")
> My intention with subsetting for the CDS and then splitting on gene_id is
> to create a GRangesList where each element is a unique AGI number and has
> all of the features.
> To extract the number of reads counted I use:
> The output of colSums(...) is how I have found that less than half my reads
> mapped to transcripts. The first question becomes "Is this how to count
> the number of reads mapped?" The second question becomes "how do I figure
> out where the reads went?"
> My initial inclination would be to make a GRangesList that contained
> intervals between genes and then do the same thing as before and see if I
> can find my other reads. Something in the vein of:
> maxLimits <- t(as.dataframe(lapply(cds_gene, function(x)
> max(end( ranges(x)))
> which is just an elaborate lapply call to get the smallest start coordinate
> and the largest end coordinate for each unique gene_id and then transform
> it to a pretty data frame. From there I could find a way to get the
> 'in-betweens' of these features, but I am not sure that is the most prudent
> thing to do. I looked at some of the ShortRead and GenomicRanges functions
> for a guiding hand and found the disjointBins() function may be useful in
> establishing features to map over, but am at a bit of a loss as I don't
> have the expertise to figure this out straight away.
> Any suggestions or resources to find mapped but uncounted reads would be
More information about the Bioconductor