[BioC] HapMap gene list

James W. MacDonald jmacdon at med.umich.edu
Thu Aug 5 02:36:07 CEST 2010


  The \t is a tab character. You may do better by using the default sep 
argument rather than by specifying one yourself.
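
If the separator turns out not to be the cause, quoting is another thing worth checking. By default read.table() treats both " and ' as quote characters, so an apostrophe in the free-text column 9 can swallow tabs and newlines until the next apostrophe. That would be consistent with the embedded \t and \n in the output quoted below. The following is only a sketch on a made-up four-line file (the records, IDs and notes are invented for illustration, not taken from the HapMap GFF). It keeps sep = "\t" explicit, on the assumption that column 9 contains spaces and the default sep splits on any white space.

```r
# Build a tiny tab-delimited, GFF-like file. All records are hypothetical;
# the apostrophes in column 9 mimic free-text notes.
gff_lines <- c(
  "chr1\tUCSC_1\tmRNA\t100\t900\t.\t-\t.\tID=NM_A;Note=the protein's role",
  "chr1\tUCSC_1\tCDS\t200\t300\t.\t-\t0\tParent=NM_A",
  "chr1\tUCSC_1\tmRNA\t1000\t1900\t.\t+\t.\tID=NM_B;Note=the enzyme's cofactor",
  "chr1\tUCSC_1\tCDS\t1200\t1300\t.\t+\t0\tParent=NM_B"
)
tmp <- tempfile(fileext = ".gff")
writeLines(gff_lines, tmp)

# quote = "" switches off quote handling, and comment.char = "" keeps any
# '#' inside a field from truncating the line.
no_quotes <- read.table(tmp, header = FALSE, sep = "\t", quote = "",
                        comment.char = "", stringsAsFactors = FALSE)

# One row per line of the file, so the mRNA subset holds only mRNA records.
mrna <- no_quotes[no_quotes$V3 == "mRNA", ]
nrow(no_quotes)
nrow(mrna)
mrna$V9
```

With quoting switched off, each line of the file becomes exactly one row, so the row count should match the number of non-comment lines in the file.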

Best,

Jim



On 8/4/10 4:49 PM, noxyport at gmail.com wrote:
> You are right! Sorry to bother you with this.
> However, there is still something wrong. When I export the file again
> (write.table) there are CDS and UTR included and when you run:
>
>> hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep="	")
>> nrow(hapmap)
> [1] 171701
>> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
>> nrow(hapmap2)
> [1] 12718
>> hapmap2[205,]
>         V1     V2   V3       V4       V5 V6 V7 V8
> 2759 chr1 UCSC_1 mRNA 11840109 11841579  .  -  .
> V9
> 2759 ID=NM_002521;Alias=NPPB;Note=natriuretic peptide precursor B
> preproprotein;summary=This gene is a member of the natriuretic peptide
> family and encodes a secreted protein which functions as a cardiac
> hormone. The protein undergoes two cleavage events%2C one within the
> cell and a second after secretion into the blood. The proteins
> biological actions include natriuresis%2C diuresis%2C
> vasorelaxation%2C inhibition of renin and aldosterone secretion%2C and
> a key role in cardiovascular homeostasis. A high concentration of this
> protein in the bloodstream is indicative of heart failure. Mutations
> in this gene have been associated with postmenopausal osteoporosis.
> Publication Note:  This RefSeq record includes a subset of the
> publications that are available for this gene. Please see the Entrez
> Gene record to access additional
> publications.\nchr1\tUCSC_1\tthree_prime_UTR\t11840109\t11840298\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840299\t11840315\t.\t-\t1\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11840858\t11841113\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tCDS\t11841346\t11841477\t.\t-\t0\tParent=NM_002521\nchr1\tUCSC_1\tfive_prime_UTR\t11841478\t11841579\t.\t-\t.\tParent=NM_002521\nchr1\tUCSC_1\tmRNA\t11902712\t11909067\t.\t-\t.\tID=NM_138346;Alias=KIAA2013;Note=hypothetical
> protein LOC90231\nchr1\tUCSC_1\tthree_prime_UTR\t11902712\t11902958\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11902959\t11902976\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11905280\t11906133\t.\t-\t1\tParent=NM_138346\nchr1\tUCSC_1\tCDS\t11907849\t11908881\t.\t-\t0\tParent=NM_138346\nchr1\tUCSC_1\tfive_prime_UTR\t11908882\t11909067\t.\t-\t.\tParent=NM_138346\nchr1\tUCSC_1\tmRNA\t11917333\t11958180\t.\t+\t.\tID=NM_000302;Alias=PLOD1;Note=lysyl
> hydroxylase precursor;summary=Lysyl hydroxylase is a membrane-bound
> homodimeric protein localized to the cisternae of the endoplasmic
> reticulum. The enzyme (cofactors iron and ascorbate) catalyzes the
> hydroxylation of lysyl residues in collagen-like peptides. The
> resultant hydroxylysyl groups are attachment sites for carbohydrates
> in col
> ... (shortened here)
>
> I have no idea where R takes these "\t.*" parts from but I think they
> screw up the whole data frame somehow. Any suggestions?
>
> Thanks
>
>
>
>
> On Wed, Aug 4, 2010 at 7:08 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com>  wrote:
>> On Wed, Aug 4, 2010 at 1:41 PM, noxyport at gmail.com<noxyport at gmail.com>  wrote:
>>> Hi,
>>>
>>> I have a problem with the gene list (gff version3 file) HapMap is
>>> using (ftp://ftp.ncbi.nlm.nih.gov/hapmap/gbrowse/2009-02_phaseII+III/gff/refGene_hg18_tests_11Apr07.gff.gz).
>>> I tried loading the file into R and selecting all "mRNA" entries but
>>> something seems to go wrong with it:
>>>
>>>> hapmap=read.table("refGene_hg18_tests_11Apr07.gff", header=F, sep="\t")
>>>> nrow(hapmap)
>>> [1] 171701
>>>> hapmap2=hapmap[which(hapmap$V3=="mRNA"), ]
>>>> nrow(hapmap2)
>>> [1] 12718
>>>> hapmap[(2210:2220), (1:3)]
>> Here, you want to use hapmap2 and not hapmap.
>>
>> Kasper
>>
>>
>>> 2210 chr1 UCSC_1           mRNA
>>> 2211 chr1 UCSC_1 five_prime_UTR
>>> 2212 chr1 UCSC_1 five_prime_UTR
>>> 2213 chr1 UCSC_1            CDS
>>> 2214 chr1 UCSC_1            CDS
>>> 2215 chr1 UCSC_1            CDS
>>> 2216 chr1 UCSC_1            CDS
>>> 2217 chr1 UCSC_1            CDS
>>> 2218 chr1 UCSC_1            CDS
>>> 2219 chr1 UCSC_1            CDS
>>> 2220 chr1 UCSC_1            CDS
>>> Can anyone explain why this could be? Probably, the large descriptive
>>> column (V9) but I don't see the failure.
>>>
>>> I have to admit that it is probably not the best way to use this file
>>> but I do not find any other source (RefSeq, UCSC), which contains the
>>> same genomic regions for the genes annotated as in HapMap. Which NCBI
>>> 36 build did they use and where can I download a gene file with
>>> chromosome, gene start and stop matching with HapMap?
>>>
>>> Thanks for your help!
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
