[BioC] Possible bug in ShortRead pileup function

Martin Morgan mtmorgan at fhcrc.org
Fri Mar 20 14:11:37 CET 2009


Hi again -- 

David Rossell <david.rossell at irbbarcelona.org> writes:

> Hi, the pileup function seems to stop doing the pileup after the first few
> sequences. For instance, if I test the function with my first 10 sequences
> it works fine:
>
>> start <- position(aln1)[sel][1:10]; fraglength <- width(aln1)[sel][1:10];
> dir1 <- str1[1:10]
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 193
>
> I manually checked that the result is correct. Then I repeat the process
> with the first 50 sequences
>
>> sel <- chromosome(aln1)=="chr2L"
>> str1 <- factor(strand(aln1)[sel],levels=c('-','+'),labels=c('-','+'))
>> start <- position(aln1)[sel][1:50]; fraglength <- width(aln1)[sel][1:50];
> dir1 <- str1[1:50]
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 754
>
> So far so good, but now if I repeat the process with the first 100 sequences
> the pileup has not incorporated any new information.
>
>> start <- position(aln1)[sel][1:100]; fraglength <-
> width(aln1)[sel][1:100]; dir1 <- str1[1:100]
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 754
>
> Same if I use the full chromosome chr2L
>
>> start <- position(aln1)[sel]; fraglength <- width(aln1)[sel]; dir1 <- str1
>> puprova <-
> pileup(start=start,fraglength=fraglength,chrlength=chrlength["chr2L"],dir=dir1)
>> sum(puprova)
> [1] 754

If I run example(readAligned), and then do the following (to get some
data that is reproducible, though not meaningful)

> example(readAligned)
[snip]
> aln <- readAligned(ap, "s_2_export.txt", "SolexaExport")
> aln1 = aln[!is.na(position(aln))]

I end up with 406 reads, which pile up to

> pup = pileup(position(aln1), 35, max(position(aln1))+35,
+              dir=factor(strand(aln), levels=c("-", "+")))

and

> sum(pup) / width(aln2)
[1] 406

so something non-trivial is going on. Can you provide a reproducible
example (privately or otherwise) and your sessionInfo()? also...

> As you can see below there are many more sequences in chromosome chr2L
>
>> table(chromosome(aln1))
>
>     chr2L  chr2LHet     chr2R  chr2RHet     chr3L  chr3LHet     chr3R
> chr3RHet
>    903343       526    978985      3898    200033      8135    639261
> 6096
>      chr4      chrM      chrU chrUextra      chrX   chrXHet   chrYHet
>      6246      3241     11115     15711    106300      1028       111
>
> There's also a minor bug that returns an error if the argument 'dir' to
> 'pileup' is a factor with levels(dir)=c('+','-'), the routine requires the
> levels to be ('-','+') in this exact order. Simple fixing the levels(dir)==
> c('-','+') for an levels(dir) %in% c('-','+') should fix the issue. Also,
> notice that readAligned typically returns a factor with levels ('-','+','*')
> and this causes an error in pileup.

The code requires factors with levels in the specified order, but the
development (soon to be released) version has tidied this up. One
small facility is a method for strand() that takes a character vector
and creates a factor with the levels ordered consistently.

In general the code in ShortRead and supporting packages has changed
a lot, and using the 'devel' version of R and Bioconductor for these
applications is a good idea.

Here's my sessionInfo()

> sessionInfo()
R version 2.8.1 Patched (2009-02-25 r48007) 
x86_64-unknown-linux-gnu 

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  utils     datasets  grDevices methods  
[8] base     

other attached packages:
[1] ShortRead_1.0.7    lattice_0.17-15    Biobase_2.2.2      Biostrings_2.10.21
[5] IRanges_1.0.14    

loaded via a namespace (and not attached):
[1] grid_2.8.1         Matrix_0.999375-22

Martin


> Thank you, take care,
>
> David
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793



More information about the Bioconductor mailing list