[BioC] How to get a unique line of annotation for each specific genomic position by using biomaRt package

Mao Jianfeng jianfeng.mao at gmail.com
Tue Feb 8 16:22:32 CET 2011


Thanks a lot, Steve.

You have given me good guidance.

Jian-Feng,

2011/2/8 Steve Lianoglou <mailinglist.honeypot at gmail.com>:
> Hi,
>
> On Tue, Feb 8, 2011 at 9:24 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
>> Dear Steve,
>>
>> Thanks for your kindness. Could you please give me more directions on
>> this annotation problem?
>>
>> #########################
>> (1)
>> #########################
>> I want each of my SNPs to have just one line of annotation, in separate
>> columns. If there are multiple terms for the same attribute (for
>> example, multiple GO terms shared at that location), I would like
>> to include them in the same column, separated by a symbol
>> (such as ;  :  | ).
>>
>> for example I have SNPs like this:
>> # SNPs,chr,start,end
>> SNP_1,1,43,43
>> SNP_2,2,56,56
>>
>> I would like to have annotations like this:
>> # SNPs,chr,start,end,go_term
>> SNP_1,1,43,43,go_1:go_3
>> SNP_2,2,56,56,go_100:go_1000
>
> I'll give you this one ... continuing from my previous example: say
> the getBM call stores its return value in `result`:
>
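> library(plyr)
> summary <- ddply(result, .(chromosome_name, start_position), function(x) {
>  # Keep the first row of the group, then replace its GO id with all of
>  # the group's GO ids joined by "|"
>  new.x <- x[1,]
>  new.x$go_biological_process_id <- paste(x$go_biological_process_id,
>                                          collapse="|")
>  new.x
> })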
>
> I'll leave the rest as an exercise for you.
>
> -steve
>
>
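
For reference, a minimal sketch of the collapsing step above on toy data,
using base R's aggregate() instead of plyr. The data frame below is a made-up
stand-in for a getBM() result (the gene ids and values are illustrative only),
and dropping duplicate GO ids with unique() is an extra assumption, not part
of the reply above.

```r
# Toy stand-in for a getBM() result; values are illustrative, not real output.
result <- data.frame(
  chromosome_name = c(1, 1, 1, 2, 2),
  start_position  = c(33055, 33055, 33055, 56, 56),
  ensembl_gene_id = c("gene_A", "gene_A", "gene_A", "gene_B", "gene_B"),
  go_biological_process_id = c("GO:0006355", "GO:0006886", "GO:0006355",
                               "GO:0007165", "GO:0007264"),
  stringsAsFactors = FALSE
)

# Collapse the GO ids of each position/gene into one "|"-separated string,
# dropping duplicate ids first.
collapsed <- aggregate(
  go_biological_process_id ~ chromosome_name + start_position + ensembl_gene_id,
  data = result,
  FUN = function(ids) paste(unique(ids), collapse = "|")
)
collapsed
```

Note that the formula interface of aggregate() drops rows with NA in any of
the listed columns by default, so positions without any annotation would need
separate handling.
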
>>
>> #########################
>> (2)
>> #########################
>> Alternatively, I would like to have each SNP position combined with
>> its annotation results, so that I know which SNP each annotation line
>> corresponds to. I do not know how to do that using Bioconductor
>> packages. See the following example:
>>
>> for example I have SNPs like this:
>> # SNPs,chr,start,end
>> SNP_1,1,43,43
>> SNP_2,2,56,56
>>
>> I would like to have annotations like this:
>> # SNPs,chr,start,end,go_term
>> SNP_1,1,43,43,go_1
>> SNP_1,1,43,43,go_3
>> SNP_2,2,56,56,go_100
>> SNP_2,2,56,56,go_1000
>>
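
One possible way to get the long format in (2), sketched on toy data with base
R only. It assumes the annotation table also carries an "end_position" column
(i.e. that attribute was requested in the getBM call) and that a SNP should be
paired with every annotated feature it falls inside. The column name go_term
and all values are hypothetical.

```r
# Toy SNP table from the example above.
snps <- data.frame(
  snp   = c("SNP_1", "SNP_2"),
  chr   = c(1, 2),
  start = c(43, 56),
  end   = c(43, 56),
  stringsAsFactors = FALSE
)

# Toy annotation table; assumes "end_position" was requested as well.
annot <- data.frame(
  chromosome_name = c(1, 1, 2, 2),
  start_position  = c(10, 10, 50, 50),
  end_position    = c(100, 100, 90, 90),
  go_term         = c("go_1", "go_3", "go_100", "go_1000"),
  stringsAsFactors = FALSE
)

# Pair every SNP with every annotation row on the same chromosome, then keep
# only the pairs where the SNP lies inside the annotated feature.
pairs <- merge(snps, annot, by.x = "chr", by.y = "chromosome_name")
hits  <- pairs[pairs$start >= pairs$start_position &
               pairs$end   <= pairs$end_position, ]

# One line per SNP/annotation combination, with the SNP columns in front.
annotated <- hits[, c("snp", "chr", "start", "end", "go_term")]
annotated
```

The merge on chromosome builds all SNP/feature pairs before filtering, which
is fine for a handful of SNPs but would get large for genome-wide data.
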
>> Jian-Feng,
>>
>> 2011/2/8 Steve Lianoglou <mailinglist.honeypot at gmail.com>:
>>> Hi,
>>>
>>> On Tue, Feb 8, 2011 at 5:49 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
>>>> Dear listers,
>>>>
>>>> I am new to bioconductor.
>>>>
>>>> I have genomic variations (SNP, indel, CNV) with chromosome:start:end
>>>> coordinates in GFF/BED/VCF format. Each genomic variation is
>>>> defined by a specific genomic position (in base pairs).
>>>>
>>>> for example:
>>>> # SNPs,chr,start,end
>>>> SNP_1,1,43,43
>>>> SNP_2,2,56,56
>>>>
>>>> I would like to get such genomic variations annotated with various
>>>> gene/protein/pathway-centric annotations (as listed in BioMart
>>>> databases). I tried the R/Bioconductor biomaRt package, but I failed to
>>>> get a unique line of annotation for each specific genomic position.
>>>> Could you please give me any directions on that?
>>>
>>> Could you explain a bit more about what you mean when you say "get a
>>> unique line of annotation"?
>>>
>>> The only informative info your `getBM` query is returning is the gene id
>>> for the location and the GO term evidence code
>>> (go_biological_process_linkage_type). If you add, say,
>>> "go_biological_process_id", you get the biological process GO terms
>>> associated with the position, i.e.:
>>>
>>> result <- getBM(attributes=c("chromosome_name","start_position","ensembl_gene_id",
>>>  "go_biological_process_linkage_type", "go_biological_process_id"),
>>>  filters = c("chromosome_name", "start", "end"),
>>>  values = list(chr, start, end), mart=alyr, uniqueRows = TRUE)
>>>
>>> If your problem is that some positions have more than one row, like so:
>>>
>>> chromosome_name start_position   ensembl_gene_id ... go_biological_process_id
>>>               1          33055 scaffold_100013.1 ...               GO:0006355
>>>               1          33055 scaffold_100013.1 ...               GO:0006886
>>>               1          33055 scaffold_100013.1 ...               GO:0006913
>>>               1          33055 scaffold_100013.1 ...               GO:0007165
>>>               1          33055 scaffold_100013.1 ...               GO:0007264
>>>
>>> this happens because multiple GO terms are shared at that location.
>>> You can just pick one, but you'll have to decide how you want to
>>> do that.
>>>
>>> If you want to somehow summarize each chromosome/start_position into
>>> one row, you can iterate over the data by this combination easily
>>> with, say, the ddply function from the plyr package:
>>>
>>> library(plyr)
>>> summary <- ddply(result, .(chromosome_name, start_position), function(x) {
>>>  # x will have all of the rows for a given chromosome_name / start_position
>>>  # combo. We can arbitrarily just return the first row, but you'll likely
>>>  # want to do something smarter:
>>>  x[1,]
>>> })
>>>
>>> If you look at `summary`, you'll have one row per position.
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>
>>
>>
>> --
>> Jian-Feng, Mao
>>
>> the Institute of Botany,
>> Chinese Academy of Botany,
>>
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>



-- 
Jian-Feng, Mao

the Institute of Botany,
Chinese Academy of Botany,


