[BioC] How to get a unique line of annotation for each specific genomic position by using biomaRt package

Tue Feb 8 15:59:35 CET 2011

Hi,

On Tue, Feb 8, 2011 at 9:24 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
> Dear Steve,
>
> Thanks for your kindness. Could you please give me more directions on
> this annotation problem?
>
> #########################
> (1)
> #########################
> I want each SNP to have just one line of annotation, in separate
> columns. If there are multiple terms for the same attribute (for
> example, multiple GO terms shared at that location), I would like
> to include them in the same column, separated by a symbol (such as
> ; : or |).
>
> for example I have SNPs like this:
> # SNPs,chr,start,end
> SNP_1,1,43,43
> SNP_2,2,56,56
>
> I would have annotations like this:
> # SNPs,chr,start,end,go_term
> SNP_1,1,43,43,go_1:go_3
> SNP_2,2,56,56,go_100:go_1000

I'll give you this one ... continuing from my previous example: say
the getBM call stores its return value in `result`:

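library(plyr)
# One output row per chromosome_name / start_position combination
summary <- ddply(result, .(chromosome_name, start_position), function(x) {
  # Keep the first row for the other columns (note: its evidence code is
  # then just the first one seen for this position) ...
  new.x <- x[1, ]
  # ... and replace its GO id with all distinct GO ids, "|"-separated
  new.x$go_biological_process_id <- paste(unique(x$go_biological_process_id),
                                          collapse = "|")
  new.x
})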
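
For a quick check without a live BioMart connection, here is a minimal
sketch of the same collapse on made-up data. It uses base R's aggregate()
rather than plyr (a swapped-in technique, in case plyr is not installed);
the toy `result` rows, the gene ids, and the name `collapsed` are invented
for illustration.

```r
# Toy stand-in for the data frame returned by getBM(); all values are made up.
result <- data.frame(
  chromosome_name = c(1, 1, 1, 2),
  start_position = c(33055, 33055, 33055, 56),
  ensembl_gene_id = c("gene_a", "gene_a", "gene_a", "gene_b"),
  go_biological_process_id = c("GO:0006355", "GO:0006886",
                               "GO:0006355", "GO:0007165"),
  stringsAsFactors = FALSE
)

# Collapse the GO ids within each gene/position group,
# dropping repeated ids before pasting them together.
collapsed <- aggregate(
  go_biological_process_id ~ chromosome_name + start_position + ensembl_gene_id,
  data = result,
  FUN = function(ids) paste(unique(ids), collapse = "|")
)

print(collapsed)
```

One thing to watch: the formula interface of aggregate() drops rows that
have NA in any of the listed columns by default, so check how missing GO
ids are encoded in your real result before relying on it.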

I'll leave the rest as an exercise for you.
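
As a hint for (2), here is one possible shape for it, sketched on made-up
data. It assumes the annotation table also carries the gene's end
coordinate (with biomaRt that would mean adding "end_position" to the
attributes), and the names `snps`, `annot`, `long` and `wide` are
hypothetical. The idea is to pair each SNP with every annotation row on
its chromosome and keep the pairs where the SNP falls inside the gene.

```r
# Hypothetical SNP table, in the layout from the question.
snps <- data.frame(
  snp = c("SNP_1", "SNP_2"),
  chr = c(1, 2),
  start = c(43, 56),
  end = c(43, 56),
  stringsAsFactors = FALSE
)

# Made-up annotation: one row per gene/GO term, with the gene's bounds.
annot <- data.frame(
  chromosome_name = c(1, 1, 2, 2, 2),
  start_position = c(10, 10, 50, 50, 500),
  end_position = c(100, 100, 90, 90, 900),
  go_term = c("go_1", "go_3", "go_100", "go_1000", "go_7"),
  stringsAsFactors = FALSE
)

# Pair every SNP with every annotation row on the same chromosome ...
pairs <- merge(snps, annot, by.x = "chr", by.y = "chromosome_name")

# ... then keep only the pairs where the SNP interval overlaps the gene.
hits <- pairs[pairs$start <= pairs$end_position &
              pairs$end >= pairs$start_position, ]

# Long form: one row per SNP/GO term, as in (2).
long <- hits[, c("snp", "chr", "start", "end", "go_term")]

# Wide form: one row per SNP with the GO terms joined by ":", as in (1).
wide <- aggregate(
  go_term ~ snp + chr + start + end,
  data = long,
  FUN = function(ids) paste(sort(unique(ids)), collapse = ":")
)

print(long)
print(wide)
```

The merge-then-filter step builds every SNP/gene pair per chromosome, so
it is only reasonable for small tables. For genome-scale data an interval
overlap tool such as findOverlaps() from the GenomicRanges package is
probably the better fit.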

-steve

>
> #########################
> (2)
> #########################
> Alternatively, I would like the SNP positions to be combined with
> their annotation results, so that I know which SNP each annotation line
> corresponds to. I do not know how to do that using Bioconductor
> packages. See the following example:
>
> for example I have SNPs like this:
> # SNPs,chr,start,end
> SNP_1,1,43,43
> SNP_2,2,56,56
>
> I would have annotations like this:
> # SNPs,chr,start,end,go_term
> SNP_1,1,43,43,go_1
> SNP_1,1,43,43,go_3
> SNP_2,2,56,56,go_100
> SNP_2,2,56,56,go_1000
>
> Jian-Feng,
>
> 2011/2/8 Steve Lianoglou <mailinglist.honeypot at gmail.com>:
>> Hi,
>>
>> On Tue, Feb 8, 2011 at 5:49 AM, Mao Jianfeng <jianfeng.mao at gmail.com> wrote:
>>> Dear listers,
>>>
>>> I am new to bioconductor.
>>>
>>> I have genomic variations (SNP, indel, CNV) with coordinates given as
>>> chromosome:start:end in GFF/BED/VCF format. Each genomic variation is
>>> defined by a specific genomic position (in base pairs).
>>>
>>> for example:
>>> # SNPs,chr,start,end
>>> SNP_1,1,43,43
>>> SNP_2,2,56,56
>>>
>>> I would like to get such genomic variations annotated with various
>>> gene/protein/pathway-centric annotations (as listed in the BioMart
>>> databases). I tried the R/Bioconductor biomaRt package, but I failed to
>>> get a unique line of annotation for a specific genomic position. Could
>>> you please give me any directions on that?
>>
>> Could you explain a bit more about what you mean when you say "get a
>> unique line of annotation"?
>>
>> The only informative info the `getBM` query is returning is the gene id
>> for the location and the GO term evidence code
>> (go_biological_process_linkage_type). If you add, say,
>> "go_biological_process_id", you get the biological process GO terms
>> associated with the position, i.e.:
>>
>> result <- getBM(attributes=c("chromosome_name","start_position","ensembl_gene_id",
>>  "go_biological_process_linkage_type", "go_biological_process_id"),
>>  filters = c("chromosome_name", "start", "end"),
>>  values = list(chr, start, end), mart=alyr, uniqueRows = TRUE)
>>
>> If your problem is that some positions have more than one row, like so:
>>
>> chromosome_name  start_position  ensembl_gene_id    ...  go_biological_process_id
>>               1           33055  scaffold_100013.1  ...  GO:0006355
>>               1           33055  scaffold_100013.1  ...  GO:0006886
>>               1           33055  scaffold_100013.1  ...  GO:0006913
>>               1           33055  scaffold_100013.1  ...  GO:0007165
>>               1           33055  scaffold_100013.1  ...  GO:0007264
>>
>> this happens because multiple GO terms are shared at that location. If
>> you want to just pick one, you'll have to decide how you want to
>> do that.
>>
>> If you want to somehow summarize each chromosome/start_position into
>> one row, you can iterate over the data by this combination easily
>> with, say, the ddply function from the plyr package:
>>
>> library(plyr)
>> summary <- ddply(result, .(chromosome_name, start_position), function(x) {
>>  # x will have all of the rows for a given chromosome_name / start_position
>>  # combo. We can arbitrarily just return the first row, but you'll likely
>>  # want to do something smarter:
>>  x[1,]
>> })
>>
>> If you look at `summary`, you'll have one row per position.
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>
>
> --
> Jian-Feng, Mao
>
> the Institute of Botany,
> Chinese Academy of Botany,
>
