[BioC] IPI to entrez id

Dick Beyer dbeyer at u.washington.edu
Thu Mar 10 16:16:57 CET 2011


Hi Marc,

Thanks very much for the merge example.  That's so much cleaner than my usual approach.

As far as the differing numbers of IPIs, I agree that using the AC field gives a lot more IPIs than just using the ID field.  I guess I'm trying to solve a different problem than other folks.  I get these lists of IPIs from proteomics folks, and some of the IPIs wouldn't show up in the ID field of "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz", but are present in the AC field because they are no longer the current primary identifier (http://www.ebi.ac.uk/IPI/Algorithm.html#SECONDARIES).  What I am really after is the Entrez Gene ID, so for me, the IPI is just a pointer to the EG.  

However, with this approach:

library(org.Hs.eg.db)
allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO), by.x="gene_id", by.y="gene_id")

I get several thousand fewer EGs than using either ipi.HUMAN.dat or ipi.HUMAN.xrefs.  I looked at a few of the EGs that I get out of these that are not in org.Hs.eg.db, and they seem valid.  I'll do some more checking to be sure.  In ipi.HUMAN.xrefs I get 22113 unique EGs and 86719 unique IPIs from a download 2 days ago from "ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.xrefs.gz"
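One way to do that checking is to compare the two sets of Entrez Gene IDs directly. The following is only a sketch on made-up vectors: eg_xrefs and eg_orgdb are hypothetical names standing in for the EG column parsed from the xrefs file and the gene_id column of the org.Hs.eg.db tables, and the IDs are assumed to be stored as character strings.

```r
# Hypothetical Entrez Gene IDs, stored as character strings (an assumption).
eg_xrefs <- c("1", "2", "9", "10", "10")   # EGs seen in an xrefs-style file
eg_orgdb <- c("1", "2", "10", "12")        # EGs present in the annotation package

# EGs found in the xrefs-style source but absent from the package.
missing_from_orgdb <- setdiff(unique(eg_xrefs), eg_orgdb)
missing_from_orgdb

# Unique IDs on each side, plus the size of the overlap.
c(xrefs  = length(unique(eg_xrefs)),
  orgdb  = length(unique(eg_orgdb)),
  shared = length(intersect(eg_xrefs, eg_orgdb)))
```

The IDs returned by setdiff() are the ones worth looking up by hand at NCBI to see whether they are still current.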

What I had hoped I was creating was a table with any and all IPIs (primary and secondary) but each with valid EGs.
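A table of that shape could be built roughly as follows. This is a sketch on made-up data, not the code actually used: the entries table, its column names, and the semicolon-separated secondary column are all assumptions made for illustration. Whether a retired accession should inherit the gene of the entry that absorbed it is also an assumption of this sketch.

```r
# Made-up per-entry table: one row per IPI entry.
entries <- data.frame(
  primary   = c("IPI00000001", "IPI00000005", "IPI00000009"),
  secondary = c("IPI00000002;IPI00000003", "", "IPI00000010"),
  EG        = c("6780", "4893", NA),
  stringsAsFactors = FALSE
)

# Split the secondary accessions; entries with none give character(0).
sec <- strsplit(entries$secondary, ";", fixed = TRUE)

# One row per IPI (primary first, then its secondaries),
# each carrying the Entrez Gene ID of its entry.
n_ids <- 1 + lengths(sec)
ipi2eg <- data.frame(
  IPI    = unlist(Map(c, entries$primary, sec), use.names = FALSE),
  EG     = rep(entries$EG, times = n_ids),
  status = unlist(lapply(lengths(sec),
                         function(k) c("primary", rep("secondary", k)))),
  stringsAsFactors = FALSE
)

# Keep only the IPIs whose entry has an Entrez Gene ID.
ipi2eg <- ipi2eg[!is.na(ipi2eg$EG), ]
ipi2eg
```

Keeping the status column makes it easy to report later how many matches came only from secondary accessions.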

Thanks very much for your great help,
Dick
------------------------------

Message: 24
Date: Wed, 09 Mar 2011 17:25:34 -0800
From: Marc Carlson <mcarlson at fhcrc.org>
To: bioconductor at r-project.org
Subject: Re: [BioC] IPI to entrez id
Message-ID: <4D78288E.6060904 at fhcrc.org>
Content-Type: text/plain; charset=ISO-8859-1

Hi Dick,

Is there any reason why something like this won't work for you to attach
the GO Ids?

merge(dat.all, toTable(org.Hs.egGO), by.x="EG", by.y="gene_id")



When I build the org.Hs.eg.db package, I download the IPI Ids in the
mySQL database from the following source:

ftp://ftp.ebi.ac.uk/pub/databases/IPI/current


The most recent time I did this (last week), the result was an IPI to
gene table that contains only 86719 unique IPIs.  That number drops a
bit, to 77387 distinct IPI IDs, when I fold the IPI IDs into the org
database (which takes as its set of unique Entrez Gene IDs the ones
that are currently listed at NCBI).  And if you merge with a GO table,
I expect that it will drop a bit more.
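The drop on merging follows from merge() doing an inner join by default, so any row whose key has no partner in the other table disappears. Here is a small sketch with made-up tables (ipi_tab, go_tab, and all IDs in them are hypothetical) showing the effect and how all.x = TRUE keeps the unmatched rows:

```r
# Made-up tables standing in for an IPI-to-EG table and a GO annotation table.
ipi_tab <- data.frame(IPI = c("IPI_A", "IPI_B", "IPI_C"),
                      EG  = c("1", "2", "3"),
                      stringsAsFactors = FALSE)
go_tab  <- data.frame(gene_id = c("1", "1", "3"),
                      go_id   = c("GO:x", "GO:y", "GO:z"),
                      stringsAsFactors = FALSE)

# Default merge is an inner join: EG "2" has no GO row, so IPI_B disappears.
inner <- merge(ipi_tab, go_tab, by.x = "EG", by.y = "gene_id")

# all.x = TRUE keeps every row of ipi_tab; go_id is NA where nothing matched.
left <- merge(ipi_tab, go_tab, by.x = "EG", by.y = "gene_id", all.x = TRUE)

# IPIs lost by the inner join.
lost <- setdiff(ipi_tab$IPI, inner$IPI)
lost
```

Comparing the inner and left results is a quick way to see how much of a count difference comes from missing annotation rather than from missing identifiers.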

In looking at the file that you are parsing, I get exactly the same
number of unique IPI ids if I extract from the ID fields and match to
the Entrez Gene fields (which also gives me the exact same number of
entrez gene IDs).  I can only get the huge additional number of IPI Ids
from this file if I also mine the AC field and assume that these IPI Ids
also should map to the exact same things.  But the direct database dump
from EBI does not give me these mappings.  In fact, it does not seem to
even contain them.  This makes me concerned that these IDs may not be
what you think they are.


Anyhow, I hope this helps,


  Marc




On 03/08/2011 11:39 AM, Dick Beyer wrote:
Hi to all,

For several years now, I have been doing GO analysis on lists of
proteins derived from MS.  I am given IPIs by the proteomics folks and
need the corresponding Entrez Gene IDs.  Putting aside the issues of
non-unique mapping from IPI to EG, isoforms, etc., I was wondering if
anyone would comment on my method of getting the Entrez Gene IDs. I'd
really like to use Marc Carlson's merge method (shown below), but that
approach seems to miss several thousand IPI/EG matches that my method
finds.

I start with
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.HUMAN.dat.gz, and
extract a subset of the rows:

ipiHUMAN <- readLines(con = "ipi.HUMAN.3.80.dat",n=-1) # build 11feb2011
dbfetch.all <- ipiHUMAN
rm(ipiHUMAN)

# Explanation of the data format is found here
# http://www.ebi.ac.uk/2can/tutorials/formats.html#swiss

length(dbfetch.all) #  3180244
length(eg  <- grep("^DR   Entrez Gene",    dbfetch.all)) # 80296
length(ids <- grep("^ID",                  dbfetch.all)) # 86719
length(de  <- grep("^DE",                  dbfetch.all)) # 92454
length(ac  <- grep("^AC",                  dbfetch.all)) # 93720
length(ug  <- grep("^DR   UniGene",        dbfetch.all)) # 88314
length(up  <- grep("^DR   UniProtKB",      dbfetch.all)) # 110593
length(en  <- grep("^DR   ENSEMBL",        dbfetch.all)) # 77340
length(rs  <- grep("^DR   REFSEQ_REVIEWED",dbfetch.all)) # 14559

and eventually turn this into a data.frame with the columns:

"IPI","EG","GeneSymbol","UniGene","UniProtKB","ENSEMBL","REFSEQ_REVIEWED"

(Note: Not every IPI entry has every field)
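The step from grep() hits to a data.frame is not shown in the post. One possible way to do it is sketched below on made-up records. The line prefixes match the grep patterns above, but the rest of the layout (the "//" record terminator, the semicolon-separated fields, the position of the gene number) is assumed, and flat, rec, field, and dat are hypothetical names.

```r
# Made-up records in the assumed flat-file layout ("//" ends each entry).
flat <- c(
  "ID   IPI00000001.2         IPI;      PRT;   577 AA.",
  "AC   IPI00000001; IPI00000002;",
  "DR   Entrez Gene; 6780; STAU1.",
  "//",
  "ID   IPI00000005.1         IPI;      PRT;   189 AA.",
  "AC   IPI00000005;",
  "//"
)

# Assign each line to a record: the counter increases at every ID line.
rec <- cumsum(grepl("^ID", flat))

# Extract the first capture group of 'pattern' from the lines of one record.
# Several hits are joined with ";"; a record without the field gives NA.
field <- function(x, pattern) {
  hit <- regmatches(x, regexec(pattern, x))
  hit <- vapply(hit, function(m) if (length(m) > 1) m[2] else NA_character_, "")
  hit <- hit[!is.na(hit)]
  if (length(hit)) paste(hit, collapse = ";") else NA_character_
}

# One row per record; fields absent from a record stay NA.
dat <- do.call(rbind, lapply(split(flat, rec), function(x) {
  data.frame(
    IPI = field(x, "^ID   (IPI[0-9]+)"),
    AC  = field(x, "^AC   (.*)$"),
    EG  = field(x, "^DR   Entrez Gene; ([0-9]+);"),
    stringsAsFactors = FALSE
  )
}))
dat
```

Grouping by record first, rather than aligning separate grep() index vectors, keeps entries that lack a field from shifting the columns against each other.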

For this build of the IPI file, my data.frame ends up as
dim(dat.all)
[1] 183153      7

Of these 183153 rows, there are 171909 unique IPIs and 22342 unique
Entrez Gene IDs.

The merge method shown below from Marc Carlson gives 69315 unique IPIs
and 17783 unique Entrez Gene IDs (you get the same numbers whether you
use org.Hs.egGO2ALLEGS or org.Hs.egGO).

When I build my 7 column data.frame, I initially get 22305 unique
Entrez Gene IDs, and I then go through some additional steps of trying
to fill in the missing EGs.  I do this by taking the IPIs with no EGs,
and using biomaRt with UniGene, UniProtKB etc as inputs to getBM(),
and hope I get a few more EGs.

For example:

library(biomaRt)
mart <- useMart( "ensembl", dataset="hsapiens_gene_ensembl")
length(which(is.na(dat.all[,4])))
sum(z <- !is.na(dat.all[,4]))
w <- getBM(attributes=c("entrezgene","unigene","hgnc_symbol","description"),
           filters="unigene", values=dat.all[z,4], mart=mart)

By doing several of these getBM() steps, I add 37 more EGs!
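The fill-in step could look something like the sketch below. It uses a made-up three-column subset of dat.all and a hand-written stand-in for the getBM() result, so nothing is fetched over the network. All IDs are hypothetical, and only rows that lack an EG but have a UniGene ID are selected for lookup.

```r
# Made-up subset of the table; column names follow the list in the post.
dat.all <- data.frame(
  IPI     = c("IPI_A", "IPI_B", "IPI_C", "IPI_D"),
  EG      = c("1", NA, NA, NA),
  UniGene = c("Hs.1", "Hs.2", NA, "Hs.4"),
  stringsAsFactors = FALSE
)

# Rows worth querying: no EG yet, but a UniGene ID to search with.
todo  <- is.na(dat.all$EG) & !is.na(dat.all$UniGene)
query <- unique(dat.all$UniGene[todo])

# Stand-in for the getBM() result (hypothetical values, no network call).
w <- data.frame(entrezgene = "2", unigene = "Hs.2",
                stringsAsFactors = FALSE)

# Fill in EGs by matching on the UniGene ID; unmatched rows stay NA.
hit <- match(dat.all$UniGene[todo], w$unigene)
dat.all$EG[todo] <- w$entrezgene[hit]
sum(!is.na(dat.all$EG))
```

Note that match() returns only the first hit, so a UniGene ID that maps to several Entrez Gene IDs would need a merge() instead.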

My method is long and painful.  That merge approach is clean and
beautiful.

Is there something I could add to the merge call that would give me
the additional 100K+ IPIs and 4500+ EGs?

------------------------------
Message: 20
Date: Fri, 18 Feb 2011 13:17:18 -0800
From: "Carlson, Marc R" <mcarlson at fhcrc.org>
To: <bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] IPI to entrez id
Message-ID:
<1688456294.5987.1298063838120.JavaMail.root at zimbra4.fhcrc.org>
Content-Type: text/plain; charset="utf-8"

Hi Viritha,

These things can never be 1:1, but you can pretty easily just cram
them all into a huge data.frame by doing this:

library(org.Hs.eg.db)
allAnnots <- merge(toTable(org.Hs.egPROSITE), toTable(org.Hs.egGO),
by.x="gene_id", by.y="gene_id")
head(allAnnots)

Once you have done this, you may notice that not only are
these things almost never (if ever) 1:1, but that this could have been
even worse if I had used the GO2ALL mappings (and I probably should
have, but I can't really tell because I have almost no information
about what you want to do).  Anyhow, I hope this helps you, but if you
have a more specific use for this information that you are willing to
talk about then we might be able to give you a better answer.
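To make the 1:1 point concrete, here is a small sketch on made-up tables (ipi, go, and the IDs in them are hypothetical, and the column names are only illustrative). In the merged table each gene contributes one row for every combination of its IPIs and its GO terms.

```r
# Made-up gene-to-IPI and gene-to-GO tables.
ipi <- data.frame(gene_id = c("1", "1", "2"),
                  ipi_id  = c("IPI_A", "IPI_B", "IPI_C"),
                  stringsAsFactors = FALSE)
go  <- data.frame(gene_id = c("1", "1", "1", "2"),
                  go_id   = c("GO:a", "GO:b", "GO:c", "GO:a"),
                  stringsAsFactors = FALSE)

# Each gene contributes (number of IPIs) x (number of GO terms) rows.
both <- merge(ipi, go, by = "gene_id")
nrow(both)

# Rows per gene in the merged table.
per_gene <- table(both$gene_id)
per_gene
```

Counting unique() values of a single column, rather than rows, avoids being misled by this multiplication.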


Marc
------------------------------

Thanks very much,
Dick
*******************************************************************************

Richard P. Beyer, Ph.D.    University of Washington
Tel.:(206) 616 7378    Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696    4225 Roosevelt Way NE, # 100
Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/members_fc_bioinfo.html
http://staff.washington.edu/~dbeyer

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor


