[BioC] getSeq human genomic sequences

Hervé Pagès hpages at fhcrc.org
Mon Feb 2 23:15:26 CET 2009


Hi Tiandao,

I'm glad getSeq() is throwing an error instead of silently trying
to return something when the start/end are out of limits :-)

We supported hg16 (NCBI34) until BioC 2.1 (R-2.6). Starting with
BioC 2.2 (R-2.7, released in April 2008), we dropped it and kept
only hg17 and hg18. Note that hg16 is 5.5 years old now (from
Jul. 2003):

   http://genome.ucsc.edu/FAQ/FAQreleases#release1

I would recommend you try to get a position file with positions
relative to hg17 or hg18 if possible. If this is not an option,
then you can always build your own BSgenome data package for hg16.
The process of building a BSgenome data package is pretty
straightforward and is documented in the BSgenomeForge vignette
from the BSgenome software package. If you don't need to include
masks then the process is even simpler. Let me know if you need
help with this.

Cheers,
H.


Tiandao Li wrote:
> Hi Herve,
> 
> Thanks for your kind help. The position file my colleague sent was 
> included some position out of chromosome limits of current release of 
> human genome (NCBI36). And the positions on human genome were calculated 
> with NCBI34. If there any human genome annotation package based on NCBI34?
> 
> Thanks,
> 
> Tiandao
> 
> On Fri, Jan 30, 2009 at 2:00 PM, Hervé Pagès <hpages at fhcrc.org 
> <mailto:hpages at fhcrc.org>> wrote:
> 
>     Tiandao Li wrote:
> 
> 
> 
>         On Fri, Jan 30, 2009 at 1:04 PM, Hervé Pagès <hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>>> wrote:
> 
>            Hi Tiando,
> 
>            Have you checked the 'start' and 'end' you get when i = 5263?
>         They are
>            probably out of limits.
> 
>          > unique(chr6[,6])[5262]
>         [1] 122701977
>          > unique(chr6[,6])[5263]
>         [1] 122701996
>          > unique(chr6[,6])[5264]
>         [1] 39870984
> 
> 
>            Try to isolate the problem with something like:
> 
>             i <- 5263
>             start <- unique(chr6[,6])[i]
>             end <- start + 1
>             getSeq(Hsapiens, "chr6", start, end)
> 
>          > getSeq(Hsapiens, "chr6",
>         unique(chr6[,6])[5262],(unique(chr6[,6])[5262]+1))
>         [1] "CA"
>          > getSeq(Hsapiens, "chr6",
>         unique(chr6[,6])[5263],(unique(chr6[,6])[5263]+1))
>         [1] "TG"
>          > getSeq(Hsapiens, "chr6",
>         unique(chr6[,6])[5264],(unique(chr6[,6])[5264]+1))
>         [1] "TC"
>          >
> 
> 
>     So calling getSeq() outside the for() loop seems to work
>     for i = 5262, 5263 and 5264. Are you sure your for() loop
>     failed for i = 5263?
> 
> 
> 
> 
>            Do you have 1 <= start <= end <= length(Hsapiens$chr6) ?
> 
>         NO
> 
> 
>     Well, that's the problem then. getSeq() will fail if the start/end are
>     not within the limits of the sequence.
>     But what I don't understand is why you are saying NO. The start/end
>     you get are 122701996 and 122701997 (those are the numbers you are
>     showing
>     above) and length(Hsapiens$chr6) is 170899992. Don't you agree that
>     1 <= 122701996 <= 122701997 <= 170899992 ?
> 
>     Another thing I don't understand is that in your first post the
>     error message
>     you got was showing a start value of 63219120 (but that's apparently
>     not the
>     start value corresponding to i = 5263).
> 
> 
> 
> 
>            I've modified the subseq() function so the error mesg you get
>         is a
>            little
>            bit more informative (subseq() is defined in the IRanges
>         package and
>            called by getSeq()). After you update to IRanges 1.0.11
>         (available
>            within
>            24 hours), you will get:
> 
>             > getSeq(Hsapiens, 'chr6', start=170899994, end=170899995)
> 
>          > getSeq(Hsapiens, 'chr6', start=170899994, end=170899995)
>         Error in solveUserSEW(length(x), start = start, end = end, width
>         = width) :
>          solving row 1: 'allow.nonnarrowing' is FALSE and the supplied
>         start (170899994) is > refwidth + 1
> 
> 
>     Yes this is exactly what to expect because here the start/end are out of
>     limits. You'll see the improved error mesg printed by getSeq() when you
>     update IRanges to 1.0.11 (available in about 24 hours).
> 
> 
>          > sessionInfo()
>         other attached packages:
>         [1] BSgenome.Hsapiens.UCSC.hg18_1.3.11 BSgenome_1.10.3          
>                [3] Biostrings_2.10.15                 IRanges_1.0.9    
>                        
> 
> 
>             Error in solveUserSEW(length(x), start = start, end = end,
>         width =
>            width) :
>               solving row 1: 'allow.nonnarrowing' is FALSE and the supplied
>            start (170899994) is > refwidth + 1
>             Error in subseq(unmasked(x), start = start, end = end, width
>         = width) :
>               Invalid sequence coordinates.
>               Are you sure the supplied 'start', 'end' and 'width' arguments
>               are defining a region that is within the limits of the
>         sequence?
> 
>         YES!
> 
> 
>     Well no, not in your call to getSeq above (start=170899994,
>     end=170899995),
>     sorry!
> 
>     All this confusion could be avoided if you sent me some stand-alone
>     code that I can run myself in order to reproduce the problem.
> 
>     Cheers,
> 
>     H.
> 
> 
> 
> 
>            However, on thing is puzzling me. The error mesg you get
>         shows that
>            the start is 63219120 and that is of course within the limits
>         of Human
>            chr6 (this chromosome is 170899992 long in hg18), so maybe
>         something
>            else
>            is going on...
> 
>            Cheers,
>            H.
> 
> 
>            Tiandao Li wrote:
> 
>                Hello,
> 
>                I used getSeq to retrieve human genomic sequences from the
>                BSgenome data
>                package:
> 
>                library(BSgenome.Hsapiens.UCSC.hg18)
> 
>                # cpg is GTF file
>                chr6 <- cpg[grep("6", cpg[[5]]),] # chrom 6
>                length(unique(chr6[,6])) # 10683 genomic positions
> 
>                # loop to match "CG" and print 100bp to file
>                for (i in 1:10683)
>                {
>                if (getSeq(Hsapiens, "chr6",
>                unique(chr6[,6])[i],(unique(chr6[,6])[i]+1))=="CG")
>                {
>                write.table(paste(">",
>                unique(chr6[,6])[i],sep=""),file="chr6FOR",row.names=F,
>         col.names=F,
>                append=TRUE)
>                write.table(getSeq(Hsapiens, "chr6",
>              
>          (unique(chr6[,6])[i]-50),(unique(chr6[,6])[i]+50)),file="chr6FOR",row.names=F,
>                col.names=F, append=TRUE)
>                }
>                }
> 
>                However, I always got the error meg like the following
>         when i = 5263
> 
>                Error in solveUserSEW(length(x), start = start, end =
>         end, width
>                = width) :
>                 solving row 1: 'allow.nonnarrowing' is FALSE and the
>         supplied start
>                (63219120) is > refwidth + 1
> 
>                Please let me know if there's something I should be paying
>                attention to.
> 
>                Thanks!
> 
>                       [[alternative HTML version deleted]]
> 
>                _______________________________________________
>                Bioconductor mailing list
>                Bioconductor at stat.math.ethz.ch
>         <mailto:Bioconductor at stat.math.ethz.ch>
>                <mailto:Bioconductor at stat.math.ethz.ch
>         <mailto:Bioconductor at stat.math.ethz.ch>>
> 
>                https://stat.ethz.ch/mailman/listinfo/bioconductor
>                Search the archives:
>              
>          http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
>            --    Hervé Pagès
> 
>            Program in Computational Biology
>            Division of Public Health Sciences
>            Fred Hutchinson Cancer Research Center
>            1100 Fairview Ave. N, M2-B876
>            P.O. Box 19024
>            Seattle, WA 98109-1024
> 
>            E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
> 
>            Phone:  (206) 667-5791
>            Fax:    (206) 667-1319
> 
> 
> 
>     -- 
>     Hervé Pagès
> 
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M2-B876
>     P.O. Box 19024
>     Seattle, WA 98109-1024
> 
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone:  (206) 667-5791
>     Fax:    (206) 667-1319
> 
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list