[BioC] ReportingTools gene IDs

Tue Apr 29 15:55:10 CEST 2014

Hi Assa,

If you look up the help for ?"publish-methods", there is support for
DESeqResults (the 4th data type listed).  DESeqResults is the
DataFrame produced by DESeq2::results().  The point of creating this
class was to help simplify the hand-off to ReportingTools. Maybe this
will help?

Mike

On Tue, Apr 29, 2014 at 9:27 AM, Assa Yeroslaviz <frymor at gmail.com> wrote:
> Hi Jim,
>
> Thanks for the tip.
> Unfortunately I am not sure I understand the idea behind it.
>
> You say it is possible to work directly with the DESeqDataSet object, but
> then the function expects a data.frame to work with. If I understand the
> mechanism by which the publish function works, it takes the
> DESeqDataSet object and, using the results function, coerces it into a
> data.frame.
>
> This is the function I ended up using:
>
> fun <- function(df, object, ...){
>     # the coerced df has no ENSEMBL column, so take the IDs from the row names
>     df$ENSEMBL <- rownames(df)
>     annot <- select(org.Mm.eg.db, df$ENSEMBL, c("SYMBOL","GENENAME"),
>                     "ENSEMBL")
>     # naive handling of 1:many mappings: keep the first hit per ID
>     if(nrow(annot) > nrow(df)) annot <- annot[!duplicated(annot[,1]),]
>     df <- data.frame(annot, df)
>     # drop the duplicated ID column that data.frame() renamed to ENSEMBL.1
>     df <- df[ , setdiff(names(df), "ENSEMBL.1")]
>     df$ENSEMBL <- hwrite(as.character(df$ENSEMBL),
>                          link = paste0("http://www.ensembl.org/Mus_musculus/Gene/Summary?g=",
>                                        as.character(df$ENSEMBL)), table = FALSE)
>     df
> }
>
>
> As you can see, I fill the column df$ENSEMBL from the rownames of the
> coerced df. This is because the fit object doesn't have a column named
> ENSEMBL.
>
> Q. Is there a way to add columns to the object?
>
> Am I doing it in the most efficient way?
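On adding and dropping columns, here is a small base-R sketch of the two data.frame steps involved. The toy data.frame, its column names, and its values are made up for illustration; they only stand in for whatever publish() hands to a .modifyDF function.

```r
# Toy stand-in for the coerced data.frame (hypothetical columns and values).
df <- data.frame(
  log2FoldChange = c(1.2, -0.8),
  padj = c(0.01, 0.20),
  row.names = c("ENSMUSG00000000001", "ENSMUSG00000000028")
)

# Add a column: copy the row names into a regular ENSEMBL column.
df$ENSEMBL <- rownames(df)

# Drop a column by name. setdiff() leaves df unchanged when the column is
# absent, whereas negative indexing with which() selects zero columns in
# that case, because which() returns integer(0).
df_clean <- df[, setdiff(names(df), "ENSEMBL.1"), drop = FALSE]
df_empty <- df[, -which(names(df) %in% "ENSEMBL.1"), drop = FALSE]

print(names(df_clean))
print(ncol(df_empty))
```

Dropping by name is only a defensive habit here; when the column is always present, both forms give the same result.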
>
> thanks for the help and the tip about the Ensembl links (mouse genome -
> Mm).
>
>
> Assa
>
>
>
> On Fri, Apr 25, 2014 at 3:43 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
>> Hi Assa,
>>
>> Gabriel actually already gave you the answer, and it is yes. You just have
>> to add things to the .modifyDF argument. There are several examples in
>>
>> http://www.bioconductor.org/packages/release/bioc/vignettes/ReportingTools/inst/doc/basicReportingTools.pdf
>>
>> and here is one (untested) that should apply to your situation:
>>
>> fun <- function(df, object, ...){
>>     # the column name must be quoted, otherwise R looks for an object ENSEMBL
>>     if(!"ENSEMBL" %in% names(df))
>>         stop("The column name for ensembl ids has to be 'ENSEMBL'!")
>>     ensids <- df$ENSEMBL
>>     whichcol <- which(names(df) == "ENSEMBL")
>>     annot <- select(org.Mm.eg.db, ensids, c("SYMBOL","GENENAME"),
>>                     "ENSEMBL")
>>     # naive handling of 1:many mappings: keep the first hit per ID
>>     if(nrow(annot) > nrow(df)) annot <- annot[!duplicated(annot[,1]),]
>>     # drop = FALSE keeps a data.frame even if only one column remains
>>     df <- data.frame(annot, df[, -whichcol, drop = FALSE])
>>     df$ENSEMBL <- hwrite(as.character(df$ENSEMBL),
>>                          link = paste0("http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=",
>>                                        as.character(df$ENSEMBL)), table = FALSE)
>>     df
>> }
>>
>>
>> This function assumes (and checks) that there is an ENSEMBL
>> column in your data.frame that it can use to extract the Ensembl IDs. It
>> also builds links to the human Ensembl pages (for mouse, change Homo_sapiens
>> to Mus_musculus in the URL), and assumes that you have the org.Mm.eg.db
>> package already loaded. It then gets the symbol and genename for those IDs,
>> and does a really naive subsetting of the data if there are duplicates.
>> Other more sophisticated things are possible, but I leave it to you to make
>> any such modifications.
>>
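To make the "naive subsetting" concrete, here is a base-R sketch with a mocked-up annotation table. The symbols are placeholders, not real gene symbols, and the layout (one row per ID/symbol pair, NA for unmapped IDs) is an assumption about what a select() call might return.

```r
# Query IDs in the order they appear in the results table (made-up subset).
ids <- c("ENSMUSG00000000001", "ENSMUSG00000000003", "ENSMUSG00000000028")

# Mock annotation: the first ID maps to two symbols, the second has none.
annot <- data.frame(
  ENSEMBL = c("ENSMUSG00000000001", "ENSMUSG00000000001",
              "ENSMUSG00000000003", "ENSMUSG00000000028"),
  SYMBOL = c("symA", "symA2", NA, "symC"),
  stringsAsFactors = FALSE
)

# Naive approach: keep only the first row for each ID.
first_hit <- annot[!duplicated(annot$ENSEMBL), ]

# Slightly more defensive: line the annotation up with the query IDs
# explicitly, so the rows match the results table whatever order annot is in.
aligned <- annot[match(ids, annot$ENSEMBL), ]

print(first_hit)
print(aligned)
```

Both give one row per ID; match() just makes the row order explicit instead of relying on the order of the annotation table.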
>> You would use this (as Gabriel already said) as part of the list
>> passed in via .modifyDF. You also need modifyReportDF in that list. So your
>> publish call would now look like
>>
>> publish(fit, des2Report, pvalueCutoff = 0.05, annotation.db = "org.Mm.eg.db",
>>         factor = colData(fit)$condition, reportDir = "./reports",
>>         .modifyDF = list(modifyReportDF, fun))
>>
>> That at least is the basic idea, and you might need to play around to make
>> things work correctly.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> On 4/25/2014 4:21 AM, Assa Yeroslaviz wrote:
>>
>>> Hi Gabriel,
>>>
>>> Thanks for the quick answer. I will look into that as soon as I have the
>>> time.
>>> Another question was whether it is possible to work directly with the
>>> Ensembl IDs.
>>>
>>> I have a table of ~37K Ensembl IDs, of which almost 50% have no Entrez
>>> IDs, so I can't convert them. Is there a way to work directly with the
>>> Ensembl IDs and still benefit from the annotation.db possibilities?
>>>
>>> Thanks
>>>
>>> Assa
>>>
>>>
>>>
>>> On Thu, Apr 24, 2014 at 4:48 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>>>
>>>     I wrote my previous message too quickly. Apologies.
>>>
>>>     Your functions must have the signature
>>>
>>>     function(df, object, ...)
>>>
>>>     df is the current data.frame representation of the object,
>>>     object is the *original* object (so that the class can be identified),
>>>     ... are passed in from the call to publish
>>>
>>>     And you can just place the generic modifyReportDF function at the
>>>     beginning of the list, rather than using getMethod. The getMethod
>>>     thing I said is for when you want to apply the default handling
>>>     for a *different* class to your object. It is a rare use-case, but
>>>     came up recently so it was on my mind.
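As a rough illustration of that signature, here is a base-R sketch of how a list of such functions could be applied in order, each one receiving the output of the previous one. apply_modifiers, add_id_column, and flag_significant are hypothetical names written for this example; this is not ReportingTools' internal code.

```r
# Hypothetical helper (not part of ReportingTools): apply a list of
# modifier functions in order, feeding each one the previous result.
apply_modifiers <- function(df, object, modifiers, ...) {
  for (f in modifiers) {
    df <- f(df, object, ...)
    stopifnot(is.data.frame(df))  # every step must return a data.frame
  }
  df
}

# Two toy modifiers with the signature function(df, object, ...).
add_id_column <- function(df, object, ...) {
  df$ID <- rownames(df)
  df
}

flag_significant <- function(df, object, cutoff = 0.05, ...) {
  df$significant <- df$padj < cutoff
  df
}

# Made-up input; object is unused by these toy modifiers.
toy <- data.frame(padj = c(0.01, 0.20), row.names = c("geneA", "geneB"))
out <- apply_modifiers(toy, object = NULL,
                       modifiers = list(add_id_column, flag_significant),
                       cutoff = 0.1)
print(out)
```

The extra cutoff argument travels through "..." to every function in the list, which is why each one needs "..." in its signature even if it ignores it.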
>>>
>>>     That will teach me to respond quickly to emails early in the morning.
>>>
>>>     Sorry about that.
>>>
>>>     ~G
>>>
>>>
>>>     On Thu, Apr 24, 2014 at 7:18 AM, Gabriel Becker
>>>     <gmbecker at ucdavis.edu> wrote:
>>>
>>>         Assa,
>>>
>>>         In general yes, if you want to add to the table you will be
>>>         working with the data.frame.
>>>
>>>         You can do so after the initial conversion, though, so you
>>>         don't have to reinvent the wheel to get from your object to an
>>>         initial data.frame.
>>>
>>>         To modify the default table (data.frame) generated for an
>>>         object, you can pass publish()'s .modifyDF parameter a
>>>         function or list of functions, each of which should accept
>>>         object (the data.frame) and "..." and return a data.frame.
>>>
>>>         These will be called in order, each accepting the output from
>>>         the last. The output of the final function is what will be
>>>         transformed into HTML and inserted into the report.
>>>
>>>         You'll probably want to add the default handling of your
>>>         object type, which you can do by putting
>>>         getMethod("modifyReportDF", "<your object's class>") at the
>>>         beginning of the list.
>>>
>>>         See section 4 of the ReportingTools basics vignette for
>>>         example code.
>>>
>>>         HTH,
>>>         ~G
>>>
>>>
>>>         On Thu, Apr 24, 2014 at 6:54 AM, Assa Yeroslaviz
>>>         <frymor at gmail.com> wrote:
>>>
>>>             Thanks Jim,
>>>
>>>             I found in one of the forums a response from Jason
>>>             (thanks again) suggesting the option to set
>>>             annotation.db=NULL and thus force the publish command
>>>             to work with the IDs I provide in the DESeqDataSet object.
>>>
>>>             So this is now working, but I would also like to have the
>>>             option to add some annotations to the table.
>>>
>>>             Is this only possible when working directly with a
>>>             data.frame?
>>>
>>>             Thanks again
>>>             Assa
>>>
>>>             On Thu, Apr 24, 2014 at 3:45 PM, James W. MacDonald
>>>             <jmacdon at uw.edu> wrote:
>>>
>>>             > Hi Assa,
>>>             >
>>>             > There may well be a way to work with Ensembl IDs, and you will likely get
>>>             > an answer to your direct question from one of the maintainers.
>>>             >
>>>             > However you should note that ReportingTools simply takes the input object
>>>             > and then coerces the data to a data.frame, which is then used to create the
>>>             > HTML table. You can always create the data.frame to your own liking up
>>>             > front, and then pass that to publish(). While this is more work than just
>>>             > passing in the DESeqDataSet, you do have complete control over the output.
>>>             >
>>>             > Best,
>>>             >
>>>             > Jim
>>>             >
>>>             >
>>>             >
>>>             > On 4/24/2014 8:50 AM, Assa Yeroslaviz wrote:
>>>             >
>>>             >> Hi,
>>>             >>
>>>             >> Is it necessary to have Entrez gene IDs to work with this package?
>>>             >>
>>>             >> I am working on a dataset with Ensembl IDs. Do I need to convert them to
>>>             >> Entrez?
>>>             >>
>>>             >> When trying to create a report for a DESeqDataSet or DESeqResults object
>>>             >> I am getting the error message:
>>>             >>
>>>             >> Error: Ids do not appear to be Entrez Ids for the specified species.
>>>             >>
>>>             >> Is there a way to work directly with the Ensembl IDs?
>>>             >>
>>>             >> Thanks
>>>             >>
>>>             >> Assa
>>>             >>
>>>             >> my script:
>>>             >>
>>>             >> head(Counts_set)
>>>             >>                    A_pKO_aV_FCS G_pKO_aV_FCS M_pKO_aV_FCS D_pKO_aV J_pKO_aV
>>>             >> ENSMUSG00000000001         4744         4632         4535     4748     3736
>>>             >> ENSMUSG00000000003            0            0            0        0        0
>>>             >> ENSMUSG00000000028         1246         1420         1429     2304     1261
>>>             >> ENSMUSG00000000031            3           25           65        0       50
>>>             >> ENSMUSG00000000037            0            0            0        0        0
>>>             >> ENSMUSG00000000049            0            0            3        1        3
>>>             >>
>>>             >> cds <- DESeqDataSetFromMatrix (
>>>             >>      countData = Counts_set,
>>>             >>      colData   = colData,
>>>             >>      design    = ~  condition
>>>             >>      )
>>>             >>
>>>             >> fit = DESeq(cds)
>>>             >> des2Report <- HTMLReport(
>>>             >>      shortName = paste('RNAseq_analysis_', group1, "_", group2, sep=""),
>>>             >>      title = 'RNA-seq analysis of differential expression using DESeq2',
>>>             >>      reportDirectory = "./reports")
>>>             >> publish(fit, des2Report, pvalueCutoff=0.05, annotation.db="org.Mm.eg.db",
>>>             >>      factor = colData(fit)$condition, reportDir="./reports")
>>>             >> Error: Ids do not appear to be Entrez Ids for the specified species.
>>>             >> finish(des2Report)
>>>             >>
>>>             >>
>>>             >> > sessionInfo()
>>>             >> R version 3.1.0 (2014-04-10)
>>>             >> Platform: x86_64-pc-linux-gnu (64-bit)
>>>             >>
>>>             >> locale:
>>>             >>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>             >>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>             >>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>             >>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>             >>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>             >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>             >>
>>>             >> attached base packages:
>>>             >> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>             >> [8] base
>>>             >>
>>>             >> other attached packages:
>>>             >>  [1] org.Mm.eg.db_2.14.0  ReportingTools_2.4.0 AnnotationDbi_1.26.0
>>>             >>  [4] Biobase_2.24.0       RSQLite_0.11.4       DBI_0.2-7
>>>             >>  [7] knitr_1.5            DESeq2_1.4.0         RcppArmadillo_0.4.200.0
>>>             >> [10] Rcpp_0.11.1          GenomicRanges_1.16.2 GenomeInfoDb_1.0.2
>>>             >> [13] IRanges_1.22.3       BiocGenerics_0.10.0
>>>             >>
>>>             >> loaded via a namespace (and not attached):
>>>             >>  [1] annotate_1.42.0    AnnotationForge_1.6.0   BatchJobs_1.2
>>>             >>  [4] BBmisc_1.5         BiocParallel_0.6.0      biomaRt_2.20.0
>>>             >>  [7] Biostrings_2.32.0  biovizBase_1.12.0       bitops_1.0-6
>>>             >> [10] brew_1.0-6         BSgenome_1.32.0         Category_2.30.0
>>>             >> [13] cluster_1.14.4     codetools_0.2-8         colorspace_1.2-4
>>>             >> [16] dichromat_2.0-0    digest_0.6.4            edgeR_3.6.0
>>>             >> [19] evaluate_0.5.3     fail_1.2                foreach_1.4.2
>>>             >> [22] formatR_0.10       Formula_1.1-1           genefilter_1.46.0
>>>             >> [25] geneplotter_1.42.0 GenomicAlignments_1.0.0 GenomicFeatures_1.16.0
>>>             >> [28] ggbio_1.12.0       ggplot2_0.9.3.1         GO.db_2.14.0
>>>             >> [31] GOstats_2.30.0     graph_1.42.0            grid_3.1.0
>>>             >> [34] gridExtra_0.9.1    GSEABase_1.26.0         gtable_0.1.2
>>>             >> [37] Hmisc_3.14-4       hwriter_1.3             iterators_1.0.7
>>>             >> [40] lattice_0.20-24    latticeExtra_0.6-26     limma_3.20.1
>>>             >> [43] locfit_1.5-9.1     MASS_7.3-29             Matrix_1.1-2
>>>             >> [46] munsell_0.4.2      PFAM.db_2.14.0          plyr_1.8.1
>>>             >> [49] proto_0.3-10       RBGL_1.40.0             RColorBrewer_1.0-5
>>>             >> [52] RCurl_1.95-4.1     reshape2_1.2.2          R.methodsS3_1.6.1
>>>             >> [55] R.oo_1.18.0        Rsamtools_1.16.0        rtracklayer_1.24.0
>>>             >> [58] R.utils_1.29.8     scales_0.2.4            sendmailR_1.1-2
>>>             >> [61] splines_3.1.0      stats4_3.1.0            stringr_0.6.2
>>>             >> [64] survival_2.37-7    tools_3.1.0             VariantAnnotation_1.10.0
>>>             >> [67] XML_3.98-1.1       xtable_1.7-3            XVector_0.4.0
>>>             >> [70] zlibbioc_1.10.0
>>>             >>
>>>             >>
>>>             >> _______________________________________________
>>>             >> Bioconductor mailing list
>>>             >> Bioconductor at r-project.org
>>>             >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>             >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>             >>
>>>             >
>>>             > --
>>>             > James W. MacDonald, M.S.
>>>             > Biostatistician
>>>             > University of Washington
>>>             > Environmental and Occupational Health Sciences
>>>             > 4225 Roosevelt Way NE, # 100
>>>             > Seattle WA 98105-6099
>>>             >
>>>             >
>>>
>>>         --
>>>         Gabriel Becker
>>>         Graduate Student
>>>         Statistics Department
>>>         University of California, Davis
>>>