[BioC] Inconsistency in illuminaHumanv4.db ?

Mark Dunning mark.dunning at gmail.com
Tue Dec 6 14:09:39 CET 2011


Hi Holger,

Apologies for the delay in replying to you about this. The annotation
packages I provide are different to most other Bioconductor annotation
packages in that we have attempted to re-map the manufacturer probes
to the genome and transcriptome. The issues that you have highlighted
are due to our re-annotation and not with the building of the
Bioconductor package.

In most cases there should only be a single genomic location, and no
space-separated values.

I've checked the examples where we have space-separated values in the
genomic location. There are 11 such cases in the 106343 unique human
probes across all chips we create annotations for. In every case, this
is where the annotation script has been unable to find a genomic
location from the BLAST searches against the transcriptomic sequence
databases or the reference genome. It then has a last-gasp attempt at
getting a location through a BLAT search against the genome, and in these
cases it has found multiple equal best-scoring hits. There is a real
problem with this BLAT search in that it only gets called right at the
end after the annotations (SNPs, repeats, etc.) and for those probes
the quality is given as "No match". This will be corrected in future
packages.
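
In the meantime, a minimal sketch along the following lines could be used
to flag the affected probes. It assumes only that multiple BLAT hits are
joined by single spaces in the GenomicLocation column, as in the output you
posted. The function name has_multiple_hits is made up for illustration and
is not part of illuminaHumanv4.db, and the third element of the example
vector is an illustrative single-location entry.

```r
# Hypothetical helper (not part of illuminaHumanv4.db): TRUE where a
# GenomicLocation string holds more than one space-separated hit.
has_multiple_hits <- function(loc) {
  !is.na(loc) & grepl(" ", loc, fixed = TRUE)
}

# Two strings taken from the reported output, one illustrative
# single-location entry, and one missing value.
loc <- c(
  "chr19:53832784:53832812:+ chr19:53268654:53268682:+",
  "chr1:149906516:149906531:+ chr1:185572746:185572761:+",
  "chr1:149906516:149906531:+",
  NA
)

multi <- has_multiple_hits(loc)

# Number of hits per entry (0 for missing locations).
n_hits <- ifelse(is.na(loc), 0L,
                 lengths(strsplit(loc, " ", fixed = TRUE)))

print(data.frame(multi = multi, n_hits = n_hits))
```

With the real annotation, loc would be the GenomicLocation column of the
re-annotation table rather than a hand-written vector.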

The BLAT searches are run against hg19 and are correct for the examples
I've just been looking at. I think the problem may be that the BLAT hits
are for partial alignments and in some cases only a small section of the
probe is aligning.

An example is ILMN_1773455, which is given the genomic locations
chr1:149906516:149906531:+ chr1:185572746:185572761:+. These correspond to
the two BLAT hits below, each of which aligns only 16 of the 50 bases:

>chr1
         Length = 249250621

 Score = 31 bits (80), Expect = 1e+00
 Identities = 16/16 (100%)
 Strand = Plus / Plus

Query: 31        atgaagaagaacagtg 46
                ||||||||||||||||
Sbjct: 149906516 atgaagaagaacagtg 149906531


 Score = 31 bits (80), Expect = 1e+00
 Identities = 16/16 (100%)
 Strand = Plus / Plus

Query: 21        tggaaatgctatgaag 36
                ||||||||||||||||
Sbjct: 185572746 tggaaatgctatgaag 185572761
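
The arithmetic behind "16 of the 50 bases" can be checked directly from the
location string. This is only a sketch: parse_genomic_location is a
hypothetical helper, not a function from the package, and it assumes the
chr:start:end:strand layout with inclusive coordinates seen above and a
50-base probe.

```r
# Hypothetical helper: split a space-separated GenomicLocation string
# into one row per hit, assuming fields chr:start:end:strand.
parse_genomic_location <- function(x) {
  hits <- strsplit(x, " ", fixed = TRUE)[[1]]
  fields <- strsplit(hits, ":", fixed = TRUE)
  data.frame(
    chr    = vapply(fields, `[`, character(1), 1),
    start  = as.integer(vapply(fields, `[`, character(1), 2)),
    end    = as.integer(vapply(fields, `[`, character(1), 3)),
    strand = vapply(fields, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

probe_length <- 50L  # probe length assumed from the text above

hits <- parse_genomic_location(
  "chr1:149906516:149906531:+ chr1:185572746:185572761:+"
)

# Coordinates are treated as inclusive, so width = end - start + 1.
hits$width    <- hits$end - hits$start + 1L
hits$coverage <- hits$width / probe_length

print(hits)
```

A low coverage value is what marks these entries as partial alignments
rather than genuine full-length matches.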



Hope this helps,

Mark


On Wed, Nov 30, 2011 at 12:39 AM, Holger [guest] <guest at bioconductor.org> wrote:
>
> I am using illuminaHumanv4.db for my research, so first of all,  thank you for maintaining this very valuable package!
>
>
> When working with the illuminaHumanv4listNewMappings, I realised that some genomic coordinates are separated with " " instead of ",". Almost all other multiple entries are separated with a ",". Additionally, the genomic positions of those entries do not seem to match the UCSC hg19 browser:
>
> require(illuminaHumanv4.db)
> test <- illuminaHumanv4fullReannotation()
> str(test)
> grep(" ", test$GenomicLocation, value=T)
> [1] "chr9:70645819:70645844:+ chr9:68298969:68298994:+ chr9:42251008:42251033:+ chr9:45442520:45442545:+"
> [2] "chr19:53832784:53832812:+ chr19:53268654:53268682:+"
> [3] "chr7:142008849:142008868:+ chr1:161139601:161139616:+"
> [4] "chr7:142008849:142008868:+ chr1:161139601:161139616:+"
> [5] "chrX:71034915:71034941:+ chrX:70888863:70888889:+ chrX:70885403:70885429:+"
> [6] "chrX:71034915:71034941:+ chrX:70888863:70888889:+ chrX:70885403:70885429:+"
> [7] "chr22:18979472:18979497:+ chrX:70888863:70888889:+ chrX:70885403:70885429:+"
> [8] "chr1:149906516:149906531:+ chr1:185572746:185572761:+"
> [9] "chr1:176811981:176811996:+ chr1:161139601:161139616:+"
>
> Is there any specific reason for this?
>
> When looking at illuminaHumanv4.db_1.10.0, other probes were affected, but the problem appeared to be present there, too.
>
>  -- output of sessionInfo():
>
> R version 2.14.0 (2011-10-31)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] illuminaHumanv4.db_1.12.1 org.Hs.eg.db_2.6.4
> [3] RSQLite_0.10.0            DBI_0.2-5
> [5] AnnotationDbi_1.16.5      Biobase_2.14.0
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.12.3
>
>
> --
> Sent via the guest posting facility at bioconductor.org.