[BioC] GenomicFeatures more detail.

Mon Jun 13 16:10:44 CEST 2011

Hi Fabrice,

Currently there is no single function that does everything you describe 
below. We are working on a new package, VariantAnnotation, that will 
provide functions to annotate variants with respect to location (i.e., 
exon, intron, intergenic, ...), report amino acid coding changes, and 
let the user filter variants on criteria such as whether or not the 
variant is present in dbSNP. The package is currently in the devel 
branch and is not complete. You may want to check the package in a 
couple of weeks or contact me for a status update.

In the interim you can accomplish the tasks below by combining the 
"By" functions documented on the man page you get when you type 
?transcriptsBy. The example at the bottom of the man page demonstrates 
how to group transcripts and exons by gene, how to group coding 
sequences by transcript, and so on.

For example, in your first category below you want to find the part of 
the first exon that is not located in a CDS. Using the example data from 
the ?transcriptsBy man page,

    library(GenomicFeatures)
    txdb_file <- system.file("extdata", "UCSC_knownGene_sample.sqlite",
                             package="GenomicFeatures")
    txdb <- loadFeatures(txdb_file)

Identify the exons and transcripts per gene,

     exonByGene <- exonsBy(txdb, "gene")
     txByGene <- transcriptsBy(txdb, "gene")

From this information you can see the exons and transcripts present in 
the genes you are interested in. Use the transcript ID to look up the 
corresponding coding regions in a cdsByTx object,

     cdsByTx <- cdsBy(txdb, "tx")

You could then use findOverlaps(type="within") to identify whether the 
leading exon in exonByGene lies within the corresponding ranges in 
cdsByTx, or whether part of it falls outside the CDS.
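
As a rough sketch of that last step, here is the idea on hand-made toy 
ranges rather than on the example database. The coordinates and the 
object names (exons, cds, firstExon, firstNoncoding) are made up for 
illustration, and the setdiff() call is an addition that is not 
mentioned above: it is one possible way to pull out the uncovered part 
once findOverlaps() shows the exon is not fully coding.

```r
library(GenomicRanges)

# Toy plus-strand transcript (hypothetical coordinates): three exons,
# with the CDS starting inside the first exon and ending inside the last.
exons <- GRanges("chr1",
                 IRanges(start = c(100, 300, 500), end = c(200, 400, 600)),
                 strand = "+")
cds   <- GRanges("chr1",
                 IRanges(start = c(150, 300, 500), end = c(200, 400, 550)),
                 strand = "+")

# The "first exon" category only applies when there are at least 2 exons.
stopifnot(length(exons) >= 2)
firstExon <- exons[1]

# type = "within" reports a hit only if the whole exon sits inside a CDS
# range; zero hits means at least part of the exon is non-coding.
hits <- findOverlaps(firstExon, cds, type = "within")
length(hits)

# Part of the first exon not covered by any CDS range (here 100-149).
firstNoncoding <- setdiff(firstExon, cds)
firstNoncoding
```

On a minus-strand transcript the first exon is the one with the largest 
coordinates, so the subsetting step would need to take strand into 
account.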

Can you send me the reference for the paper you are following? I would 
be interested in looking at it.

Valerie

On 06/11/11 10:03, Fabrice Tourre wrote:
> Dear list,
>
> I am following a paper to annotate which part of a gene a SNP falls
> in. I first thought of the GenomicFeatures transcriptsBy method, but it
> seems transcriptsBy cannot do such a detailed check. Does anyone have a
> suggestion? Thanks. The annotation categories defined in the paper are
> as follows.
>
> 1, First (non-coding) exon. If the gene has at least 2 exons, this is the
> part of the first exon that is not located inside the CDS. If the
> gene has only one exon, we do not consider it to have a first exon.
>
> 2, First intron. If the gene has at least 2 exons, this is the intron
> following the first exon, provided that it is not located inside
> the CDS. Otherwise there is no first intron.
>
> 3, Noncoding exon. This is any part of an exon located outside
> the CDS region and excluding the first and last exons.
>
> 4, External intron. This is an intron located outside the CDS
> region and excluding the first and the last introns.
>
> 5, Coding exon. This is any part of an exon located inside the
> CDS region. Note that exons containing the translation start or
> stop generally contain both coding exon and noncoding (or
> first/last) exon. Coding SNPs were further subdivided into
> synonymous and nonsynonymous, according to their annotation
> in dbSNP.
>
> 6, Internal intron. This is an intron located inside the CDS region.
>
> 7, Last intron. If the gene has at least 2 exons, this is the intron
> preceding the last exon, provided that it is not located inside
> the CDS. Otherwise there is no last intron.
>
> 8, Last (noncoding) exon. If the gene has at least 2 exons, this is
> the part of the last exon that is not located inside the CDS.
> Otherwise there is no last exon.
>