[BioC] Fetch scaffold sequences from BSgenome package

Hervé Pagès hpages at fhcrc.org
Tue Sep 18 20:49:56 CEST 2012


Hi Julie,

On 09/18/2012 09:05 AM, Zhu, Lihua (Julie) wrote:
> Herve,
>
> It seems that the getSeq function does not work for scaffold sequence
> such as Zv9_scaffold3564:93,507-93,556 and Zv9_NA384:3,507-3,556 for
> BSgenome.Drerio.UCSC.danRer7_1.3.17. What is the best way to obtain such
> sequences using BSgenome package? Many thanks!

Indeed:

   > library(BSgenome.Drerio.UCSC.danRer7)
   > getSeq(Drerio, "Zv9_scaffold3564")
   Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], 
start[i],  :
     sequence Zv9_scaffold3564 found more than once, please use a 
non-ambiguous name

The problem is this:

   > grep("Zv9_scaffold3564", names(Drerio$Zv9_scaffold), value=TRUE)
   [1] "Zv9_scaffold3564"
   > grep("Zv9_scaffold3564", names(Drerio$upstream1000), value=TRUE)
   [1] "ENSDART00000099775_up_1000_Zv9_scaffold3564_137113_r 
Zv9_scaffold3564:137113-138112"
   [2] "ENSDART00000062791_up_1000_Zv9_scaffold3564_129501_r 
Zv9_scaffold3564:129501-130500"

and also that getSeq() here is looking for a sequence that *contains*
"Zv9_scaffold3564" in its name.

Note that when used with a GRanges object, getSeq() is looking for a
sequence with the exact specified name so that should address the
problem:

   > getSeq(Drerio, GRanges("Zv9_scaffold3564", IRanges(3507, 3556)))
     A DNAStringSet instance of length 1
       width seq
   [1]    50 CCTAAGTATCCACTTTAGTATCCATAACACAATAATCAGATGCTATTGTT

Note that our plan for BioC 2.12 is to get rid of the upstream sequences
in the BSgenome packages (they should never have been included here in
the first place, people will be able to use Paul's new getPromoterSeq()
from the GenomicFeatures package instead) so that will entirely solve
the ambiguous name problem.

Let me know if you have any questions about this.

Thanks,
H.

>
> Best regards,
>
> Julie

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list