[BioC] HapMap gene list

noxyport at gmail.com noxyport at gmail.com
Wed Aug 4 19:41:38 CEST 2010


I have a problem with the gene list (a GFF version 3 file) that HapMap
uses (ftp://ftp.ncbi.nlm.nih.gov/hapmap/gbrowse/2009-02_phaseII+III/gff/refGene_hg18_tests_11Apr07.gff.gz).
I tried loading the file into R and selecting all "mRNA" entries, but
something seems to go wrong:

> hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep="\t")
> nrow(hapmap)
[1] 171701
> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
> nrow(hapmap2)
[1] 12718
> hapmap[(2210:2220), (1:3)]
       V1     V2             V3
2210 chr1 UCSC_1           mRNA
2211 chr1 UCSC_1 five_prime_UTR
2212 chr1 UCSC_1 five_prime_UTR
2213 chr1 UCSC_1            CDS
2214 chr1 UCSC_1            CDS
2215 chr1 UCSC_1            CDS
2216 chr1 UCSC_1            CDS
2217 chr1 UCSC_1            CDS
2218 chr1 UCSC_1            CDS
2219 chr1 UCSC_1            CDS
2220 chr1 UCSC_1            CDS

Can anyone explain why this happens? The cause is probably the large
descriptive column (V9), but I cannot see where the read fails.
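
One thing that may be worth checking, as a minimal sketch and not a
confirmed diagnosis: by default read.table treats both " and ' as quote
characters and # as a comment character, so an apostrophe or # inside
the attribute column can merge or truncate records. The example below
assumes a tab-separated file and uses a made-up toy file (the records
and the names gff_file, records, gff and mrna are hypothetical, not
taken from the HapMap file). It drops the # header lines itself, turns
off quote and comment handling, and checks that the number of rows
equals the number of data lines.

```r
# Toy GFF3-style file (hypothetical records, NOT taken from the HapMap file).
# Two attribute fields contain an apostrophe, a default quote character.
gff_lines <- c(
  "##gff-version 3",
  "chr1\tUCSC_1\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Note=5' end",
  "chr1\tUCSC_1\tCDS\t150\t300\t.\t+\t0\tParent=tx1",
  "chr1\tUCSC_1\tmRNA\t2000\t2900\t.\t-\t.\tID=tx2;Note=3' end",
  "chr1\tUCSC_1\tCDS\t2100\t2500\t.\t-\t0\tParent=tx2"
)
gff_file <- tempfile(fileext = ".gff")
writeLines(gff_lines, gff_file)

# Keep only data lines: drop blank lines and '#' header/comment lines.
raw <- readLines(gff_file)
records <- raw[nzchar(raw) & !startsWith(raw, "#")]

# Parse with quote and comment handling switched off, so that ', " and #
# inside a field are treated as ordinary characters.
gff <- read.table(text = records, header = FALSE, sep = "\t",
                  quote = "", comment.char = "",
                  stringsAsFactors = FALSE)

# Sanity check: one row per data line, nine GFF columns.
stopifnot(nrow(gff) == length(records), ncol(gff) == 9)

# Select the mRNA records, as in the session above.
mrna <- gff[gff$V3 == "mRNA", ]
```

On the real file, comparing nrow() of the result with the number of
non-comment lines is a quick way to see whether records were lost
during parsing.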

I have to admit that this is probably not the best way to use this file,
but I cannot find any other source (RefSeq, UCSC) that contains the
same genomic regions for the genes as annotated in HapMap. Which NCBI
36 build did they use, and where can I download a gene file with
chromosome, gene start and gene stop positions matching HapMap?

Thanks for your help!
