[BioC] can not link the genomic positions queried and their specific annotation, when getting genomic variants annotated by biomaRt package

Tue Feb 8 18:02:25 CET 2011

Hi,

On Tue, Feb 8, 2011 at 11:47 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
> Dear listers, Sean and Steve,
>
> I posted a similar question to this list earlier, but I am still
> confused, so I will try to describe my question in more detail to
> make it clearer. PLEASE read all 6 sections that follow.
>
> Thanks a lot. My question is not student homework. This list is my
> only way to get help with R and Bioconductor; I taught myself both
> in a somewhat isolated environment, so any help is very valuable
> to me.
>
> Jian-Feng,
>
>
>
> (1) The genomic variant data I need to annotate:
> # SNPs,chromosome,start,end
> SNP_1,1,43,43
> SNP_2,2,56,56
>
> (2) What I want to get (annotation): there may be multiple terms for
> a given annotation column. They can either be combined in one cell or
> placed in separate rows of the same column. Either way, the genomic
> positions should stay together with their specific annotations.
>
> # SNPs,chromosome,start,end,annotation_term
> SNP_1,1,43,43,go_1:go_3
> SNP_2,2,56,56,go_100:go_1000
>
> or
>
> # SNPs,chromosome,start,end,go_term
> SNP_1,1,43,43,go_1
> SNP_1,1,43,43,go_3
> SNP_2,2,56,56,go_100
> SNP_2,2,56,56,go_1000
>
> (3) I was told that the biomaRt package has this functionality.
>
> (4) What I have tried using the biomaRt package:
> library(biomaRt)
> listMarts()
> plant = useMart("plant_mart_7")
> alyr=useDataset("alyrata_eg_gene", mart=plant)
> atha = useDataset ("athaliana_eg_gene",mart=plant)
>
> listAttributes(alyr)
> listFilters(alyr)
>
> chr<-c(rep(1, 10))
> start<-c(33, 999, 3000, 7000, 9000, 10000, 12000, 19000, 80000, 100000)
> end<-c(33, 999, 3000, 7000, 9000, 10000, 12000, 19000, 80000, 100000)
>
> getBM(attributes =
> c("chromosome_name","start_position","ensembl_gene_id",
> "go_biological_process_linkage_type"), filters = c("chromosome_name",
> "start", "end"), values = list(chr, start, end), mart=alyr, uniqueRows
> = TRUE)
>
> (5) What I got:
>
>   chromosome_name start_position end_position                ensembl_gene_id
> 1                1          48875        49123            Al_scaffold_0001_16
> 2                1          72255        72617            Al_scaffold_0001_21
> 3                1          10652        11944             Al_scaffold_0001_4
> 4                1          82573        83367 fgenesh1_pg.C_scaffold_1000018
> 5                1          87206        90301 fgenesh1_pg.C_scaffold_1000020
> 6                1          29681        31614 fgenesh1_pm.C_scaffold_1000009
> 7                1          51526        52636 fgenesh1_pm.C_scaffold_1000016
> 8                1          78367        80505 fgenesh1_pm.C_scaffold_1000020
> 9                1          35461        39593 fgenesh2_kg.1__12__AT1G02120.1
> 10               1          39949        42531 fgenesh2_kg.1__13__AT1G02110.1
> 11               1          46396        48761 fgenesh2_kg.1__19__AT1G02090.1
> 12               1          55814        56468 fgenesh2_kg.1__20__AT1G02070.1
> 13               1          74785        76652 fgenesh2_kg.1__23__AT1G02065.1
> 14               1          80941        82330 fgenesh2_kg.1__25__AT1G02050.1
> 15               1          80941        82330 fgenesh2_kg.1__25__AT1G02050.1
> 16               1          90714       113497 fgenesh2_kg.1__28__AT1G02010.1
> 17               1          90714       113497 fgenesh2_kg.1__28__AT1G02010.1
> 18               1           3311         6198  fgenesh2_kg.1__2__AT1G02190.2
> 19               1           3311         6198  fgenesh2_kg.1__2__AT1G02190.2
> 20               1           9512        10567  fgenesh2_kg.1__3__AT1G02180.1
> 21               1          12552        13416  fgenesh2_kg.1__5__AT1G02160.2
> 22               1             47         2523              scaffold_100001.1
> 23               1             47         2523              scaffold_100001.1
> 24               1           7429         7630              scaffold_100003.1
> 25               1          13702        15386              scaffold_100007.1
> 26               1          15665        19464              scaffold_100008.1
> 27               1          19692        20609              scaffold_100009.1
> 28               1          24515        27497              scaffold_100010.1
> 29               1          33055        34772              scaffold_100013.1
> 30               1          33055        34772              scaffold_100013.1
> 31               1          33055        34772              scaffold_100013.1
> 32               1          33055        34772              scaffold_100013.1
> 33               1          33055        34772              scaffold_100013.1
> 34               1          33055        34772              scaffold_100013.1
> 35               1          43130        46178              scaffold_100016.1
> 36               1          49553        51020              scaffold_100018.1
> 37               1          49553        51020              scaffold_100018.1
> 38               1          57579        57871              scaffold_100022.1
> 39               1          58865        72177              scaffold_100023.1
>   go_biological_process_linkage_type
> 1
> 2
> 3                                 IEA
> 4
> 5
> 6
> 7
> 8
> 9
> 10
> 11
> 12
> 13
> 14                                IEA
> 15                                IEA
> 16                                IEA
> 17                                IEA
> 18                                IEA
> 19                                IEA
> 20
> 21
> 22                                IEA
> 23                                IEA
> 24
> 25
> 26                                IEA
> 27
> 28
> 29                                IEA
> 30                                IEA
> 31                                IEA
> 32                                IEA
> 33                                IEA
> 34                                IEA
> 35
> 36                                IEA
> 37                                IEA
> 38
> 39
>
> (6) My problem is that I cannot link the genomic positions I queried
> to their specific annotations.

I don't understand what you're asking. But, as I pointed out in my
original email, your "getBM" call doesn't return any annotations; it
only returns the "type" of evidence behind each gene's annotation.
"IEA" (the GO evidence code for "inferred from electronic annotation")
tells you the source of the annotation you would have received had
you included a column for that annotation.

You'll note that in my first email, I changed your getBM call slightly:

result <- getBM(
  attributes = c("chromosome_name", "start_position", "ensembl_gene_id",
                 "go_biological_process_linkage_type",
                 "go_biological_process_id"),
  filters = c("chromosome_name", "start", "end"),
  values = list(chr, start, end),
  mart = alyr, uniqueRows = TRUE)

See how I added a "go_biological_process_id" as one of the
`attributes` to return? You should, too.
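
If the goal is then to tie each queried position back to the genes (and GO terms) it falls in, something like the following might work. It is only a rough sketch in base R on made-up data: the `snps` and `genes` tables, the gene IDs, and the go_* terms are all invented for illustration, and it assumes "end_position" was also requested as an attribute so that each gene has a start and an end to compare against.

```r
# Toy SNP table in the spirit of section (1); positions are made up.
snps <- data.frame(
  snp        = c("SNP_1", "SNP_2", "SNP_3"),
  chromosome = c("1", "1", "1"),
  position   = c(11000L, 81000L, 500000L),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a getBM() result. Assumes "end_position" and
# "go_biological_process_id" were both among the requested attributes.
genes <- data.frame(
  chromosome_name          = c("1", "1", "1", "1"),
  start_position           = c(10652L, 10652L, 80941L, 90714L),
  end_position             = c(11944L, 11944L, 82330L, 113497L),
  ensembl_gene_id          = c("gene_A", "gene_A", "gene_B", "gene_C"),
  go_biological_process_id = c("go_1", "go_3", "go_100", "go_1000"),
  stringsAsFactors = FALSE
)

# Pair every SNP with every gene on the same chromosome, then keep only
# the pairs where the SNP position lies inside the gene's range.
pairs <- merge(snps, genes, by.x = "chromosome", by.y = "chromosome_name")
hits  <- pairs[pairs$position >= pairs$start_position &
               pairs$position <= pairs$end_position, ]

# Long format: one row per SNP / GO term combination.
long <- unique(hits[, c("snp", "chromosome", "position",
                        "ensembl_gene_id", "go_biological_process_id")])

# Collapsed format: all GO terms for a SNP joined into one cell.
collapsed <- aggregate(
  go_biological_process_id ~ snp + chromosome + position,
  data = long,
  FUN  = function(x) paste(sort(unique(x)), collapse = ":")
)

# SNPs that hit no gene drop out above; merge back to keep them as NA.
annotated <- merge(snps, collapsed, all.x = TRUE)
print(long)
print(annotated)
```

The merge-then-filter step builds every SNP/gene pair per chromosome, which is fine for a handful of positions but gets expensive for large SNP sets; a range-overlap tool would probably be a better fit there.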

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact