[BioC] accessing GenBank features

Thu Sep 22 18:18:15 CEST 2011

On 09/21/2011 02:55 PM, Andrew Yee wrote:
> Thinking about this more broadly, is there a Bioconductor package that
> lets you parse out the different features listed in a GenBank feature,
> somewhat akin to this:
>
> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

A helpful answer on Biostar by neilfws

http://biostar.stackexchange.com/questions/12405/extracting-xml-output-from-genbank-query

suggests using eutils with query

library(XML)

## formulate an e-utils query
baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml",
                baseurl, "nucleotide", "NM_000610")

I'd process this as

## retrieve the document
xml <- xmlInternalTreeParse(url)

## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to
## those with sub-node GBFeature_key with text value ('text()') equal
## to 'sig_peptide' -- there is just one of these
query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]"
node <- getNodeSet(xml, query)[[1]]

## query the extracted node, starting at the current location ('./')
query <- "./GBFeature_location/text()"
xpathApply(node, query, xmlValue)

## do string coercion in XML
query <- "string(.//GBInterval_accession/text())"
getNodeSet(node, query)

## one-liner -- associated gene name
query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]
           //GBQualifier[ GBQualifier_name/text()='gene' ]
           /GBQualifier_value/text()"
xpathApply(xml, query, xmlValue)

The key resources for this are the XPath documentation

http://www.w3.org/TR/xpath/

especially sections 2.5 (Abbreviated Syntax; the place to start) and 4 
(Core Function Library). Also while exploring the document structure 
visually is probably a good way to start, for the hard-core the  DTD is 
where the structure is defined,

http://www.ncbi.nlm.nih.gov/data_specs/dtd/

e.g.,

http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd

Martin
>
> Thanks,
> Andrew
>
> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee at post.harvard.edu>  wrote:
>> Thanks for the reply.  I guess on a broader level, is there a way to
>> extract the sig_peptide field from
>>
>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3
>>
>> I'm trying to figure out why the document reference in Carey's example
>> doesn't contain "sig_peptide" yet is visible on that web page.
>>
>> Perhaps there is another method of getting the annotation for
>> sig_peptide within GenBank?
>>
>> Thanks,
>> Andrew
>>
>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey
>> <stvjc at channing.harvard.edu>  wrote:
>>> I don't see a sig_peptide field.  You should have a look at
>>>
>>> http://www.omegahat.org/RSXML/shortIntro.html
>>>
>>> and references therein.
>>>
>>> It has been a long time since I did anything with XML per se. We did a
>>> certain amount of exposition in Chapter 8
>>> of the 2005 Springer monograph.  Since then more XPath support has come in
>>> and many new ideas help distance users from
>>> details of XML processing.  To illustrate a bit with your example, I trapped
>>> the actual document reference
>>>
>>> zz =
>>> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_000610")
>>>
>>> and then performed an XPath query
>>>
>>>> getNodeSet(zz, "//Seq-interval_from")
>>> [[1]]
>>> <Seq-interval_from>3244</Seq-interval_from>
>>>
>>> [[2]]
>>> <Seq-interval_from>3328</Seq-interval_from>
>>>
>>> [[3]]
>>> <Seq-interval_from>5695</Seq-interval_from>
>>>
>>> and so on.  I don't recall how to do a relatively simple task like
>>> "enumerate all tags in use in a document" but it can be done with the XML
>>> package tools.  I think it will be more effective to isolate the use case
>>> and see how to use eutils to solve it fairly directly as opposed to wading
>>> through XML, but perhaps wading is inevitable.
>>>
>>>
>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee at post.harvard.edu>  wrote:
>>>>
>>>> Hi, I'm looking for some guidance in terms of parsing the XML output
>>>> from a genbank query.
>>>>
>>>> result<- genbank('NM_000610', disp='data', type='uid')
>>>>
>>>> I'm trying to figure out how to use the XML package in order to parse
>>>> out the "sig_peptide" field from the XML output from the genbank
>>>> query.
>>>>
>>>> Any pointers or suggestions would be appreciated, as I'm new to XML.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>
>>>>
>>>>> sessionInfo()
>>>> R version 2.13.0 (2011-04-13)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] XML_3.2-0             annotate_1.29.4       AnnotationDbi_1.13.21
>>>> [4] Biobase_2.11.10
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] DBI_0.2-5     RSQLite_0.9-4 tools_2.13.0  xtable_1.5-6
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793