[BioC] biomaRt: getSequence returns "Sequence unavailable" where I'd expect NA

Jan Kim jttkim at googlemail.com
Wed Jun 25 16:47:59 CEST 2014

Dear All,

I just noticed that the sequence columns of the data frame returned by
biomaRt's getSequence function contain the string "Sequence unavailable"
under certain conditions. Here's a demo:

    library(biomaRt)
    # connect to the chicken (Gallus gallus) gene dataset
    ggMart <- useDataset("ggallus_gene_ensembl", mart = useMart("ensembl"))
    getSequence(id = "ENSGALG00000017787", type = "ensembl_gene_id", seqType = "coding", mart = ggMart)

This gives me:

                    coding    ensembl_gene_id
    1 Sequence unavailable ENSGALG00000017787

The ENSEMBL gene in question is some RNA component of a telomerase [1],
which explains why there is no (protein) coding sequence.

Nonetheless, I was surprised that this is indicated by inserting a
human-readable string rather than the machine-recognisable value NA.
In more detail: I didn't notice the few "Sequence unavailable" entries
in a table of thousands of rows and wrote everything into a FASTA file.
My attention was drawn to the problem only when something further down
the pipeline complained about the "e" (which fortunately is not an
IUPAC nucleotide code).

So this post is to (1) alert others to this potentially surprising
behaviour and (2) suggest, should the biomaRt authors happen to read
this, replacing the "Sequence unavailable" entries with NAs.
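
Until then, a workaround along the following lines might help. This is
only a minimal sketch in base R: the data frame below is a made-up
stand-in for a getSequence result (no biomaRt query is made), the
HYPOTHETICAL_GENE identifiers and their sequences are invented for
illustration, and markUnavailable is a hypothetical helper, not part of
biomaRt.

```r
# Made-up stand-in for a getSequence() result (assumption: same column
# layout as in the demo above; only the middle row is from the real output).
seqs <- data.frame(
  coding = c("ATGGCCTAA", "Sequence unavailable", "ATGAAATGA"),
  ensembl_gene_id = c("HYPOTHETICAL_GENE_1", "ENSGALG00000017787",
                      "HYPOTHETICAL_GENE_2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the placeholder string with a proper NA.
markUnavailable <- function(df, seqColumn,
                            placeholder = "Sequence unavailable") {
  column <- df[[seqColumn]]
  # %in% never yields NA, so rows that are already NA are left alone
  column[column %in% placeholder] <- NA_character_
  df[[seqColumn]] <- column
  df
}

seqs <- markUnavailable(seqs, "coding")

# Keep only rows with an actual sequence before building FASTA records.
usable <- seqs[!is.na(seqs$coding), ]

# Interleave header and sequence lines: >id1, seq1, >id2, seq2, ...
fastaLines <- as.vector(rbind(paste0(">", usable$ensembl_gene_id),
                              usable$coding))
```

The resulting fastaLines vector could then be written out with
writeLines; the point is merely that the placeholder gets filtered out
before it reaches the FASTA file.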

Best regards, Jan

[1] http://www.ensembl.org/Gallus_gallus/Gene/Summary?db=core;g=ENSGALG00000017787;r=9:19428817-19428871;t=ENSGALT00000028494
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*
