[BioC] R Package for retrieving sequence features

Charles C. Berry cberry at tajo.ucsd.edu
Thu Mar 23 19:26:59 CET 2006


On Thu, 23 Mar 2006, Wuming Gong wrote:

> Dear list,
>
> I want to retrieve the sequence features, such as start/end position
> of UTR and CDS, according to given genbank accession numbers.  I found
> that several R functions have the ability to retrieve the sequences
> alone from genbank, such as getSEQ() from annotate package,
> read.GenBank() from apt package (not in Bioconductor) and seqNCBI()
> from GeneR package, but none of them could retrieve the information on
> sequence features.
>
> Is there any R package that can retrieve the sequence features (just
> like get_SeqFeatures() function in BioPerl)?


I do not know of such a capability in package, but it is not hard to 
'roll-your-own' using the file

 	http://hgdownload.cse.ucsc.edu/goldenPath/hg17/database/genscan.txt.gz

Like this:

genePred.fmt <- list(name = "a", chrom = "a", strand = "a",
                     txStart = 1, txEnd = 1, cdsStart = 1,
                     cdsEnd = 1, exonCount = 1, exonStarts = "a",
                     exonEnds = "a")
genPred.dat <- scan( gzfile( file.path( my.path,"genscan.txt.gz" ),
                         what = genePred.fmt)

get.features <-
 	function(x, y=genPred.dat) {
 		indx <- match( x, y$name )
 		sapply( y, "[", indx )
 		}

> get.features( c( "NT_077402.1", "NT_077402.4") )
      name          chrom  strand txStart  txEnd    cdsStart cdsEnd 
exonCount
[1,] "NT_077402.1" "chr1" "+"    "2052"   "4012"   "2052"   "4012"   "3"
[2,] "NT_077402.4" "chr1" "+"    "121020" "124696" "121020" "124696" "5"
      exonStarts
[1,] "2052,2475,3913,"
[2,] "121020,121450,122181,122997,124620,"
      exonEnds
[1,] "2090,2584,4012,"
[2,] "121200,121708,122244,123179,124696,"
>

HTH,

Chuck

>
> Thanks,
>
> Wuming
>
>
>
>
>    [ Part 3.9: "Included Message" ]
>

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717



More information about the Bioconductor mailing list