[BioC] accessing GenBank features

Thu Sep 22 18:34:20 CEST 2011

Thanks for posting this...I was going to post it as well.

Question:  is there a way for the genbank() function to be updated to
pull this additional information down by default?  In the Biostar post
by neilfws, he points out that the sig_peptide information isn't in
the XML data retrieved by genbank() .

Thanks,
Andrew

On Thu, Sep 22, 2011 at 12:18 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 09/21/2011 02:55 PM, Andrew Yee wrote:
>>
>> Thinking about this more broadly, is there a Bioconductor package that
>> lets you parse out the different features listed in a GenBank feature,
>> somewhat akin to this:
>>
>> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation
>
> A helpful answer on Biostar by neilfws
>
> http://biostar.stackexchange.com/questions/12405/extracting-xml-output-from-genbank-query
>
> suggests using eutils with query
>
> library(XML)
>
> ## formulate an e-utils query
> baseurl <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
> url <- sprintf("%s?db=%s&id=%s&rettype=gb&retmode=xml",
>               baseurl, "nucleotide", "NM_000610")
>
> I'd process this as
>
> ## retrieve the document
> xml <- xmlInternalTreeParse(url)
>
> ## extract GBFeature nodes ('//GBFeature') restricted ('[...]') to
> ## those with sub-node GBFeature_key with text value ('text()') equal
> ## to 'sig_peptide' -- there is just one of these
> query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]"
> node <- getNodeSet(xml, query)[[1]]
>
> ## query the extracted node, starting at the current location ('./')
> query <- "./GBFeature_location/text()"
> xpathApply(node, query, xmlValue)
>
> ## do string coercion in XML
> query <- "string(.//GBInterval_accession/text())"
> getNodeSet(node, query)
>
> ## one-liner -- associated gene name
> query <- "//GBFeature[ GBFeature_key/text()='sig_peptide' ]
>          //GBQualifier[ GBQualifier_name/text()='gene' ]
>          /GBQualifier_value/text()"
> xpathApply(xml, query, xmlValue)
>
> The key resources for this are the XPath documentation
>
> http://www.w3.org/TR/xpath/
>
> especially sections 2.5 (Abbreviated Syntax; the place to start) and 4 (Core
> Function Library). Also while exploring the document structure visually is
> probably a good way to start, for the hard-core the  DTD is where the
> structure is defined,
>
> http://www.ncbi.nlm.nih.gov/data_specs/dtd/
>
> e.g.,
>
> http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_GBSeq.mod.dtd
>
> Martin
>>
>> Thanks,
>> Andrew
>>
>> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee<yee at post.harvard.edu>  wrote:
>>>
>>> Thanks for the reply.  I guess on a broader level, is there a way to
>>> extract the sig_peptide field from
>>>
>>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3
>>>
>>> I'm trying to figure out why the document reference in Carey's example
>>> doesn't contain "sig_peptide" yet is visible on that web page.
>>>
>>> Perhaps there is another method of getting the annotation for
>>> sig_peptide within GenBank?
>>>
>>> Thanks,
>>> Andrew
>>>
>>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey
>>> <stvjc at channing.harvard.edu>  wrote:
>>>>
>>>> I don't see a sig_peptide field.  You should have a look at
>>>>
>>>> http://www.omegahat.org/RSXML/shortIntro.html
>>>>
>>>> and references therein.
>>>>
>>>> It has been a long time since I did anything with XML per se. We did a
>>>> certain amount of exposition in Chapter 8
>>>> of the 2005 Springer monograph.  Since then more XPath support has come
>>>> in
>>>> and many new ideas help distance users from
>>>> details of XML processing.  To illustrate a bit with your example, I
>>>> trapped
>>>> the actual document reference
>>>>
>>>> zz =
>>>>
>>>> xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_000610")
>>>>
>>>> and then performed an XPath query
>>>>
>>>>> getNodeSet(zz, "//Seq-interval_from")
>>>>
>>>> [[1]]
>>>> <Seq-interval_from>3244</Seq-interval_from>
>>>>
>>>> [[2]]
>>>> <Seq-interval_from>3328</Seq-interval_from>
>>>>
>>>> [[3]]
>>>> <Seq-interval_from>5695</Seq-interval_from>
>>>>
>>>> and so on.  I don't recall how to do a relatively simple task like
>>>> "enumerate all tags in use in a document" but it can be done with the
>>>> XML
>>>> package tools.  I think it will be more effective to isolate the use
>>>> case
>>>> and see how to use eutils to solve it fairly directly as opposed to
>>>> wading
>>>> through XML, but perhaps wading is inevitable.
>>>>
>>>>
>>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee<yee at post.harvard.edu>
>>>>  wrote:
>>>>>
>>>>> Hi, I'm looking for some guidance in terms of parsing the XML output
>>>>> from a genbank query.
>>>>>
>>>>> result<- genbank('NM_000610', disp='data', type='uid')
>>>>>
>>>>> I'm trying to figure out how to use the XML package in order to parse
>>>>> out the "sig_peptide" field from the XML output from the genbank
>>>>> query.
>>>>>
>>>>> Any pointers or suggestions would be appreciated, as I'm new to XML.
>>>>>
>>>>> Thanks,
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 2.13.0 (2011-04-13)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] XML_3.2-0             annotate_1.29.4       AnnotationDbi_1.13.21
>>>>> [4] Biobase_2.11.10
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] DBI_0.2-5     RSQLite_0.9-4 tools_2.13.0  xtable_1.5-6
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
>