[BioC] pileup function in ShortRead package

Martin Morgan mtmorgan at fhcrc.org
Thu Oct 8 21:33:36 CEST 2009


Hi Nora --

Nora Rieber wrote:
> Hi there,
> 
> I'm using the pileup function in the ShortRead package and it seems I
> call it with the wrong arguments, or that there is some kind of bug in
> the function. All my reads are piled up in forward direction only, even
> if they were actually mapped to the backward strand.
> 
> Here's a small (nonfunctional since I assume I cannot append data to
> this e-mail) example of how I call the pileup function:
> 
> #read in the MAQ alignment file:
> aln_s_7_MCIP4<-readAligned("/home","sequenceALN.txt", type="MAQMapview")
> aln_Pat_subset<-aln_Pat[1:10]
> pos_Patient_subset<-position(aln_Pat_subset)
> str_Patient_subset<-strand(aln_Pat_subset)
> #use only a small subset (all positions are on the same chromosome 1)
> 
> subset_pileuptest<-pileup(pos_Patient_subset, 36,
> max(pos_Patient,na.rm=TRUE)+37,dir=str_Patient_subset)
> 
> Is there anything wrong with my dir argument? I checked several places
> all over the genome for the complete alignment and my reads all seem to
> be considered on the + strand, even when they are not. Can anybody help?

To get something reproducible, I did

library(ShortRead)
example(readAligned)
aln <- readAligned(ap, "s_2_export.txt", "SolexaExport")
lane <- aln[chromosome(aln) == "chr5.fa"]
mlane <- lane[strand(lane)=="-"]

The first position is

> min(position(mlane))
[1] 6774915

Here's a pileup like the one you provide, with a convoluted way of
finding the first non-zero position in the pileup

> p <- pileup(position(mlane), 50, max(position(mlane)) + 50,
+             strand("-"))
> Rle(p)
  'integer' Rle instance of length 140154400 with 25 runs
  Lengths:  6774914 50 39858191 50 5640193 50 5762358 50 2355597 50 ...
  Values :  0 1 0 1 0 1 0 1 0 1 ...

which is to say that the first non-zero position occurs after 6774914
zeros, or at position 6774915. The read's position is interpreted to
mean 'the position is the leftmost nucleotide of the aligned 50mer'.

Here I add a 'readlength' argument


> p <- pileup(position(mlane), 50, max(position(mlane)) + 50,
+             strand("-"), readlength=35)
> Rle(p)
  'integer' Rle instance of length 140154400 with 25 runs
  Lengths:  6774899 50 39858191 50 5640193 50 5762358 50 2355597 50 ...
  Values :  0 1 0 1 0 1 0 1 0 1 ...

and the 'position' is interpreted in the same way -- the leftmost
position of the aligned _read_. The fragment has been extended so the
read is a total of 50 nt long, and the extension has been in the
appropriate (from 5' to 3' on the minus strand, decrementing the number
of zeros before the first pileup).

I introduced the Rle above in preparation for suggesting use of
'coverage' (see ?"AlignedRead-class") as an alternative to pileup; it's
a much more compact representation and ties in better with some of the
down-stream tools.

> coverage(mlane, start=1L, extend=15L)[[1]]
  'integer' Rle instance of length 140154384 with 24 runs
  Lengths:  6774899 50 39858191 50 5640193 50 5762358 50 2355597 50 ...
  Values :  0 1 0 1 0 1 0 1 0 1 ...

'coverage' can also be tricky to get correct.

Martin

> 
> This is my session info:
> 
>> sessionInfo()
> R version 2.9.2 (2009-08-24)
> x86_64-unknown-linux-gnu
> 
> locale:
> LC_CTYPE=de_DE.UTF-8;LC_NUMERIC=C;LC_TIME=de_DE.UTF-8;LC_COLLATE=de_DE.UTF-8;LC_MONETARY=C;LC_MESSAGES=de_DE.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] ShortRead_1.2.1   lattice_0.17-25   BSgenome_1.12.3   Biostrings_2.12.9
> [5] IRanges_1.2.3
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.4.1 grid_2.9.2    hwriter_1.1
> 
> Thanks,
> Nora
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list