[BioC] accessing GenBank features

Thu Sep 22 04:28:22 CEST 2011

Hi, Andrew.

BioPerl and Biopython both handle this type of thing fairly robustly,
so you could consider those languages/projects.  If you want to stick
with R, consider using something like the genbank2gff3.pl script that
ships with BioPerl to convert GenBank files into GFF3 files, which may
be a better fit for R.  The restriction with genbank2gff3 is that the
feature types must be ones supported in SOFA (Sequence Ontology
Feature Annotation).
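
For what it's worth, once a record has been converted, the GFF3 can be
read with base R alone.  The following is only a minimal sketch, not
output from genbank2gff3.pl: the two feature rows and their coordinates
are invented placeholders, read_gff3 is a hypothetical helper name, and
the assumption that a GenBank sig_peptide ends up under the SOFA term
signal_peptide should be checked against the actual converted file.

```r
# Hypothetical GFF3 content; the rows and coordinates are placeholders,
# not real annotation for any accession.
gff_lines <- c(
  "##gff-version 3",
  "seq1\tGenBank\tCDS\t100\t900\t.\t+\t0\tID=cds1",
  "seq1\tGenBank\tsignal_peptide\t100\t160\t.\t+\t.\tID=sp1;Parent=cds1"
)

# read_gff3 is a hypothetical helper: read the nine tab-separated GFF3
# columns into a data.frame, skipping '#' directive and comment lines.
# Arguments are passed through to read.delim, so either a file path or
# text = <character vector> can be supplied.
read_gff3 <- function(...) {
  read.delim(..., header = FALSE, comment.char = "#", quote = "",
             stringsAsFactors = FALSE,
             col.names = c("seqid", "source", "type", "start", "end",
                           "score", "strand", "phase", "attributes"))
}

gff <- read_gff3(text = gff_lines)

# Keep only the signal peptide features (assumed SOFA type name).
sig <- gff[gff$type == "signal_peptide", c("seqid", "start", "end", "strand")]
print(sig)
```

With a real file you would call read_gff3("converted.gff") instead,
where the file name is again just a placeholder.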

Sean

On Wed, Sep 21, 2011 at 5:55 PM, Andrew Yee <yee at post.harvard.edu> wrote:
> Thinking about this more broadly, is there a Bioconductor package that
> lets you parse out the different features listed in a GenBank record,
> somewhat akin to this:
>
> http://www.bioperl.org/wiki/HOWTO:Feature-Annotation
>
> Thanks,
> Andrew
>
> On Wed, Sep 21, 2011 at 5:41 PM, Andrew Yee <yee at post.harvard.edu> wrote:
>> Thanks for the reply.  I guess on a broader level, is there a way to
>> extract the sig_peptide field from
>>
>> http://www.ncbi.nlm.nih.gov/nuccore/NM_000610.3
>>
>> I'm trying to figure out why the document reference in Carey's example
>> doesn't contain "sig_peptide" even though it is visible on that web page.
>>
>> Perhaps there is another method of getting the annotation for
>> sig_peptide within GenBank?
>>
>> Thanks,
>> Andrew
>>
>> On Wed, Sep 21, 2011 at 4:07 PM, Vincent Carey
>> <stvjc at channing.harvard.edu> wrote:
>>> I don't see a sig_peptide field.  You should have a look at
>>>
>>> http://www.omegahat.org/RSXML/shortIntro.html
>>>
>>> and references therein.
>>>
>>> It has been a long time since I did anything with XML per se. We did a
>>> certain amount of exposition in Chapter 8
>>> of the 2005 Springer monograph.  Since then more XPath support has come in
>>> and many new ideas help distance users from
>>> details of XML processing.  To illustrate a bit with your example, I trapped
>>> the actual document reference
>>>
>>> library(XML)
>>> zz <- xmlInternalTreeParse("http://www.ncbi.nih.gov/entrez/eutils/efetch.fcgi?tool=bioconductor&rettype=xml&retmode=text&db=Nucleotide&id=NM_000610")
>>>
>>> and then performed an XPath query
>>>
>>> > getNodeSet(zz, "//Seq-interval_from")
>>> [[1]]
>>> <Seq-interval_from>3244</Seq-interval_from>
>>>
>>> [[2]]
>>> <Seq-interval_from>3328</Seq-interval_from>
>>>
>>> [[3]]
>>> <Seq-interval_from>5695</Seq-interval_from>
>>>
>>> and so on.  I don't recall how to do a relatively simple task like
>>> "enumerate all tags in use in a document" but it can be done with the XML
>>> package tools.  I think it will be more effective to isolate the use case
>>> and see how to use eutils to solve it fairly directly as opposed to wading
>>> through XML, but perhaps wading is inevitable.
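
On enumerating tags and picking out sig_peptide: here is a minimal
sketch with the XML package.  It assumes the record is fetched in the
GBSeq XML flavour (efetch with rettype=gb and retmode=xml, as far as I
know) rather than the native format queried above.  The fragment below
is a hand-written, heavily trimmed placeholder in that style, with
made-up locations; it is not real NM_000610 output.

```r
library(XML)  # the same package used in the thread

# Hypothetical GBSeq-style fragment; locations are placeholders.
gb_xml <- '<GBSeq>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_key>CDS</GBFeature_key>
      <GBFeature_location>100..900</GBFeature_location>
    </GBFeature>
    <GBFeature>
      <GBFeature_key>sig_peptide</GBFeature_key>
      <GBFeature_location>100..160</GBFeature_location>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>'

doc <- xmlParse(gb_xml, asText = TRUE)

# Enumerate every distinct tag name used in the document.
tags <- unique(xpathSApply(doc, "//*", xmlName))

# Pull the location of each feature whose key is sig_peptide.
sig_loc <- xpathSApply(
  doc,
  "//GBFeature[GBFeature_key='sig_peptide']/GBFeature_location",
  xmlValue
)

print(tags)
print(sig_loc)
```

Running the tag enumeration on the document actually returned by
efetch is a quick way to see which element names are available before
writing a more specific XPath query.
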
>>>
>>>
>>> On Wed, Sep 21, 2011 at 12:29 PM, Andrew Yee <yee at post.harvard.edu> wrote:
>>>>
>>>> Hi, I'm looking for some guidance in terms of parsing the XML output
>>>> from a genbank query.
>>>>
>>>> result <- genbank('NM_000610', disp='data', type='uid')
>>>>
>>>> I'm trying to figure out how to use the XML package in order to parse
>>>> out the "sig_peptide" field from the XML output from the genbank
>>>> query.
>>>>
>>>> Any pointers or suggestions would be appreciated, as I'm new to XML.
>>>>
>>>> Thanks,
>>>> Andrew
>>>>
>>>>
>>>>
>>>> > sessionInfo()
>>>> R version 2.13.0 (2011-04-13)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] XML_3.2-0             annotate_1.29.4       AnnotationDbi_1.13.21
>>>> [4] Biobase_2.11.10
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] DBI_0.2-5     RSQLite_0.9-4 tools_2.13.0  xtable_1.5-6
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>