[BioC] How to retrieve gene information

Thu Dec 27 19:35:16 CET 2007

Hi Allen --

affy snp wrote:
> Hi Martin,
>
> Thanks for brining this to my attention.
> I am wondering:
>
> (1) Should the location information only be specified
> in a way using 'e'? Will sth like 1234567 work?
There's nothing magical going on. 'filt' just compares numeric values in 
'loc' to another numeric value. Any numeric comparison is possible.

> (2) While the result return the gene function, can it also include
> the gene symbol?

The function returns a list, with names of the list elements being the 
Entrez ids. There are other maps in the org.* packages to get at other 
types of information, for instance org.Hs.egSYMBOL to get 'symbol' 
values associated with each Entrez id. List available maps with

 > library(org.Hs.eg.db)
 > ls("package:org.Hs.eg.db")

and find out about each with

 > ?org.Hs.egSYMBOL
> (3) In case I want to look up genes of multiple chromosomal regions,
> how easily can I do that if I supply with a table of regions such as:
>
> chr      Start                End     
> 1     2171559         245522847    
> 1     77465510       82255496    
> 2     1                   243018229        
> 2     145329129     187974870    
> 3     1                    59538720    

Pretty easily, though now you need to specify your problem a little more 
fully, and write some R code to do what you want. One possibility is 
that you would like the gene symbols satisfying all conditions within a 
row (chr=1 & loc>s2711559 & end<245522847), but for any of the rows. A 
first approach might start by generalizing 'filt'

 > filt <- function(eid, czome, start, end) {
+     loc <- abs(eid) # either strand
+     any(names(eid)==czome & loc>start & loc<end)
+ }

and putting the code to find Entrez ids into a function

 > eids <- function(crit, map) {
+     found <- unlist(eapply(map, filt,
+                            czome=crit[["chr"]],
+                            start=crit[["start"]],
+                            end=crit[["end"]]))
+     ls(map[found])
+ }

'crit' is the criterion we'll use to select entrez ides from 'map', it 
will be a named vector with elements 'chr', 'start', 'end'.

We'll 'apply' our 'eids' function to each row of a data frame 
representing your criteria:

 > df <- data.frame(chr=c(1,1,2,2,3),
+         start=c(2171559, 77465510, 1, 145329129, 1),
+         end=c(245522847, 82255496, 245522847, 187974870, 59538720))
 > res <- apply(df, 1, eids, map=org.Hs.egCHRLOC)
 > str(res)
List of 5
 $ : chr [1:1833] "24" "27" "34" "58" ...
 $ : chr [1:13] "5737" "8880" "10561" "10964" ...
 $ : chr [1:1165] "14" "33" "52" "72" ...
 $ : chr [1:162] "90" "92" "390" "518" ...
 $ : chr [1:425] "30" "93" "95" "211" ...

The result is a list, with each element corresponding to the Entrez ids 
found in the corresponding row of 'df'.

It turns out that there are

 > sum(sapply(res, length))
[1] 3598

Entrez IDs but some Entrez ids occur in more than one region

 > length(unique(unlist(res)))
[1] 3423

You could find the GENENAMEs for all of these with

 > nms <- lapply(res, lookUp, "org.Hs.eg.db", "GENENAME")

Depending on what you want to do in the future, you might also think 
about making these into GeneSets, e.g.,

 > library(GSEABase)
 > setNames <- apply(df, 1, paste, collapse="_")
 > gss <- mapply(GeneSet, res, setName=setNames,
+               MoreArgs=list(
+                 geneIdType=EntrezIdentifier(),
+                 collectionType=ComputedCollection()))
 > gsc <- GeneSetCollection(gss)
 > gsc
GeneSetCollection
  names: 1_2171559_245522847, 1_77465510_82255496, ..., 3_1_59538720 (5 
total)
  unique identifiers: 24, 27, ..., 727811 (3423 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: ComputedCollection (1 total)

Martin

>
> Thank you and happy holiday!
>
> Allen
>
> On Dec 26, 2007 12:26 PM, Martin Morgan <mtmorgan at fhcrc.org 
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>     Hi Allen --
>
>     Also the 'org.*' annotation packages provide organism-centric
>     annotations. These packages have environment-like key-value structures
>     that typically map Entrez identifiers to other information. The
>     CHRLOC
>     variables map Entrez ids to named integer vectors. Names on the vector
>     are chromosome locations, values are the strand (+ or -) and base
>     start position. Thus
>
>     > library(annotate)
>     > library(org.Hs.eg.db )
>     >
>     > filt <- function(eid) {
>     +     loc <- abs(eid) # either strand
>     +     any(names(eid)=="3" & loc>1e8 & loc<1.1e8)
>     + }
>     > found <- unlist(eapply(org.Hs.egCHRLOC , filt))
>     > sum(found) # named genes found
>     [1] 33
>     >
>     > eids <- ls(org.Hs.egCHRLOC[found]) # subset; extract entrez ids
>     > head(lookUp(eids, "org.Hs.eg.db", "GENENAME")) # first 6
>     $`214`
>     [1] "activated leukocyte cell adhesion molecule"
>
>     $`868`
>     [1] "Cas-Br-M (murine) ecotropic retroviral transforming sequence b"
>
>     $`961`
>     [1] "CD47 molecule"
>
>     $`1295`
>     [1] "collagen, type VIII, alpha 1"
>
>     $`6152`
>     [1] "ribosomal protein L24"
>
>     $`9666`
>     [1] "zinc finger DAZ interacting protein 3"
>
>     The annotation packages are from data sources taken from a
>     snapshot of
>     data sources (see org.Hs.eg_dbInfo()) that might differ from the data
>     sources used for biomaRt; the merits of this include (a)
>     reproduciblilty and (b) relative speed of computation (no internet
>     download required).
>
>     Martin
>
>
>     toedling at ebi.ac.uk <mailto:toedling at ebi.ac.uk> writes:
>
>     > Hi Allen,
>     >
>     > have a look at the biomaRt package:
>     > http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html
>     > and see the vignette section 4.5 Task 5.
>     > You may want to select additional attributes returned, such as
>     > "description" etc. Use the "listAttributes" function to obtain
>     of list of
>     > possible attributes.
>     >
>     > Best and Merry Christmas,
>     > Joern
>     >
>     >> Dear list,
>     >>
>     >> I am wondering if I supply with chromosome information, start
>     and end
>     >> position
>     >> of a region, can any Bioconductor package could return a list
>     of genes
>     >> with
>     >> short descriptions within those regions?
>     >>
>     >> Thanks and Merry Christmas!
>     >>
>     >> Allen
>     >>
>     >>
>     >
>     > _______________________________________________
>     > Bioconductor mailing list
>     > Bioconductor at stat.math.ethz.ch
>     <mailto:Bioconductor at stat.math.ethz.ch>
>     > https://stat.ethz.ch/mailman/listinfo/bioconductor
>     > Search the archives:
>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>     --
>     Martin Morgan
>     Computational Biology / Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N.
>     PO Box 19024 Seattle, WA 98109
>
>     Location: Arnold Building M2 B169
>     Phone: (206) 667-2793
>
>