[BioC] pileup coverage from BAM file

Thu Feb 28 18:30:56 CET 2013

Hi,

I have a bam file and I am trying to calculate the coverage at each exon
position (base) within a gene.  First, I find the exon boundaries for the
gene of interest:

library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

genesym <- c("CDKN2A")
geneid <- select(org.Hs.eg.db, keys=genesym,
keytype="SYMBOL",cols="ENTREZID")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ex <- exonsBy(txdb, "gene")
ex1 <- ex[[geneid$ENTREZID[1]]]
> ex1
GRanges with 6 ranges and 2 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr9 [21967751, 21968241]      - |    127318        <NA>
  [2]     chr9 [21968574, 21968770]      - |    127319        <NA>
  [3]     chr9 [21970901, 21971207]      - |    127320        <NA>
  [4]     chr9 [21974403, 21974826]      - |    127321        <NA>
  [5]     chr9 [21974677, 21975132]      - |    127322        <NA>
  [6]     chr9 [21994138, 21994490]      - |    127323        <NA>
  ---
  seqlengths:
                    chr1                  chr2 ...        chrUn_gl000249
               249250621             243199373 ...                 38502

Next, I would like to find the coverage at each of these positions and
this is where I am getting stuck.  I have tried to follow
example(applyPileups) but I get an error and am not sure that I am
calculating coverage at the base-level.

library(Rsamtools)

# create character list of BAM files (only use first file for now)
fls <- scan("SortedBams_tumor.txt",what='character')

calcInfo <- function(x) {
  ## information at each pile-up position
  info <- apply(x[["seq"]], 2, function(y) {
    y <- y[c("A", "C", "G", "T"),]
    y <- y + 1L
    # continuity
    cvg <- colSums(y)
    p <- y / cvg[col(y)]
    h <- -colSums(p * log(p))
    ifelse(cvg == 4L, NA, h)
  })
  list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info)
}

which <- ex1
param <- PileupParam(which=which, what=c("seq","qual"),
    maxDepth=1000L, yieldBy="position", yieldAll=TRUE)
fl <- PileupFiles(fls[1],param=param)
res <- applyPileups(fl,calcInfo,param=param)
>  Error: applyPileups: 'x' must be an array of at least two dimensions

Is this correct route that I should be taking, or should I be going about
this differently?

Thanks in advance,
Chris

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] Rsamtools_1.10.2
 [2] Biostrings_2.26.3
 [3] TxDb.Hsapiens.UCSC.hg19.knownGene_2.8.0
 [4] GenomicFeatures_1.10.1
 [5] GenomicRanges_1.10.6
 [6] IRanges_1.16.4
 [7] org.Hs.eg.db_2.8.0
 [8] RSQLite_0.9-4
 [9] DBI_0.2-5
[10] AnnotationDbi_1.20.3
[11] Biobase_2.18.0
[12] BiocGenerics_0.4.0

loaded via a namespace (and not attached):
 [1] BSgenome_1.26.1    RCurl_1.5-0        XML_3.2-0          biomaRt_2.14.0
 [5] bitops_1.0-4.1     parallel_2.15.2    rtracklayer_1.18.2 stats4_2.15.2
 [9] tools_2.15.2       zlibbioc_1.4.0

-- 
Christopher Cabanski, PhD
Postdoctoral Research Associate
Department of Medicine, Oncology Division / The Genome Institute
Washington University in St. Louis