[BioC] GEOquery: incomplete feature data from GPL soft file

Sean Davis sdavis2 at mail.nih.gov
Sat Jul 13 21:44:38 CEST 2013


On Mon, Jun 17, 2013 at 7:09 AM, Renaud Gaujoux
<renaud at mancala.cbio.uct.ac.za> wrote:
> Hi,
>
> I am getting incorrect feature annotation data when loading a dataset from
> GPL4133.
> The feature data looks like this:
>
> head(fData(eset)[, 1:2])
>        ID  COL
> 12     12  266
> NA   <NA> <NA>
> NA.1 <NA> <NA>
> 15     15  266
> 16     16  266
> NA.2 <NA> <NA>
>
> This possibly also results in having fewer features in the final expression
> matrix, if it is at some point restricted to feature names matching the
> ones in the loaded annotation data.
>
> The real issue here seems to be with the soft file being badly formatted,
> with lines having double quotes where there should not be:
>
> 12      266     148     A_24_P66027     A_24_P66027     FALSE
> NM_004900       NM_004900       9582    APOBEC3B        apolipoprotein B
> mRNA editing enzyme, catalytic polypeptide-like 3B"    Hs.226307 ...
>
> Looking at the way GEOquery loads the annotation soft files, we see that
> they are read using `quote="\""`, so an unmatched double quote like the one
> above corrupts the resulting data.frame.
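
For readers of the archive, the effect described above can be sketched with base R alone. This is only an illustration under stated assumptions: the table below is made-up data (not the real GPL4133 annotation), the names `soft`, `bad` and `good` are hypothetical, and it uses plain `read.table` rather than GEOquery's internal parsing code.

```r
# Hypothetical tab-separated table: rows 12 and 14 each carry a stray
# double quote, similar to the badly formatted line quoted above.
soft <- paste(
  "ID\tCOL\tDESC",
  "12\t266\tpolypeptide-like 3B\"",
  "13\t266\tsome gene",
  "14\t266\tanother gene\"",
  "15\t266\tlast gene",
  sep = "\n"
)

# Quote-aware parsing: text between the two stray quotes is treated as a
# single quoted field, so the rows in between are not returned as rows.
bad <- tryCatch(
  suppressWarnings(
    read.table(text = soft, sep = "\t", header = TRUE,
               quote = "\"", stringsAsFactors = FALSE)
  ),
  error = function(e) NULL
)

# Quoting disabled: every line is kept as its own row and the stray
# quote characters stay in the DESC column as literal text.
good <- read.table(text = soft, sep = "\t", header = TRUE,
                   quote = "", stringsAsFactors = FALSE)

print(nrow(bad))   # expected to be smaller than nrow(good)
print(nrow(good))
print(good$ID)
```

Comparing `bad$ID` with `good$ID` shows which feature IDs go missing, which would then appear as `NA` rows once the annotation is matched against the expression matrix.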

Thanks, Renaud, for the report.  I finally got around to making this
adjustment, so this should work for you now.

Sean


