[BioC] ReportingTools - trouble incorporating annotations

James W. MacDonald jmacdon at uw.edu
Wed Jun 19 22:50:14 CEST 2013


Hi Gabriel,

On 6/19/2013 3:22 PM, Gabriel Becker wrote:
> James,
>
> Thanks for responding to Sam's question.
>
> One small point, you can control the coercion of the object to a 
> data.frame within the publish call via the .toDF argument, which is 
> applied directly before .modifyDF.

Nice! I think I missed that one. I'll take a closer look.

>
> The defaults of these steps (coercion and modification of data.frames) 
> can also be extended on an S4 class basis, by creating new methods for 
> toReportDF and modifyReportDF if you find yourself always using the 
> same customization across multiple scripts (this approaches creating a 
> package which depends on ReportingTools, however, and is likely 
> overkill in most situations).

Well, I have a package that I am converting from using annaffy to using 
ReportingTools, so it already depends on ReportingTools. I certainly 
have need to modify things.

As an example, it is not uncommon for me to do a GO hypergeometric test 
for multiple different contrasts, and want to output the results. 
However the default in ReportingTools for these objects is to create a 
table with a bunch of glyphs, and then automatically make tables for all 
the significant GO terms that contain the underlying genes.

This is OK as long as you don't try to A.) Do a bunch of contrasts and 
then go GO analyses on all of them (thereby creating literally thousands 
of little files) and then B.) zip up and send to a collaborator who is 
on Windows.

Turns out the Windows zip utility will silently fail when unzipping a 
file with thousands of files within it. No error, nothing. Just 
truncated output. And a PI who wants to know where the rest of it is...

So having my own personal way of outputting GO results will be quite 
useful. Thanks for the heads-up.

Best,

Jim


>
> There is a flowchart for the full publish mechanism in 
> inst/examples/publishFlowchart.svg, which I am attaching here as well 
> for convenience.
>
> ~G
>
>
> On Wed, Jun 19, 2013 at 10:55 AM, James W. MacDonald <jmacdon at uw.edu 
> <mailto:jmacdon at uw.edu>> wrote:
>
>     Hi Sam,
>
>     Sorry, my bad. One of the arguments for DGEList is 'genes', which is
>
>        genes: data frame containing annotation information for the
>               tags/transcripts/genes.
>
>     So yeah, when you are creating your DGELists, add in the
>     annotation data for each gene/transcript/whatever, and then go
>     from there.
>
>     Best,
>
>     Jim
>
>
>
>     On 6/19/2013 1:37 PM, Sam McInturf wrote:
>
>         Jim,
>            When I look at my DGELRT objects I don't have a $genes
>         field: (to be verbose, DGELists is a list of DGELRT objects
>         (list(glmLRT(), glmLRT(),...))
>
>         > DGELists$Roots$genes
>         NULL
>
>         Looking at what fields I do have: coefficients, fitted.values,
>         deviance, df.residual, design, offset, dispersion, method,
>         samples, AveLogCPM, table, comparison, df.test
>
>         Looking at the DGEGLM which my DGELists were made, I have no
>         gene slot either.  I added the rownames of the counts table in
>         the DGEGLM to a new $genes:
>
>         >fit$genes <- rownames(fit$counts)
>
>         but the the $genes slot did not get passed onto the DGELRTs.
>          Is it best just to add the $genes field to each DGELRT and
>         then follow what you stated before?
>
>         Thanks,
>
>         > sessionInfo()
>         R version 3.0.1 (2013-05-16)
>         Platform: x86_64-pc-linux-gnu (64-bit)
>
>         locale:
>          [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>          [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>          [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>          [7] LC_PAPER=C                 LC_NAME=C
>          [9] LC_ADDRESS=C               LC_TELEPHONE=C
>         [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
>         attached base packages:
>         [1] parallel  stats     graphics  grDevices utils     datasets
>          methods
>         [8] base
>
>         other attached packages:
>          [1] ReportingTools_2.0.1 org.At.tair.db_2.9.0 RSQLite_0.11.4
>          [4] DBI_0.2-7            AnnotationDbi_1.22.6 Biobase_2.20.0
>          [7] GenomicRanges_1.12.4 IRanges_1.18.1       BiocGenerics_0.6.0
>         [10] edgeR_3.2.3          limma_3.16.5        
>         BiocInstaller_1.10.2
>
>         loaded via a namespace (and not attached):
>          [1] annotate_1.38.0         AnnotationForge_1.2.1  
>         biomaRt_2.16.0
>          [4] Biostrings_2.28.0       biovizBase_1.8.1        bitops_1.0-5
>          [7] BSgenome_1.28.0         Category_2.26.0        
>         cluster_1.14.4
>         [10] colorspace_1.2-2        dichromat_2.0-0         digest_0.6.3
>         [13] genefilter_1.42.0       GenomicFeatures_1.12.2  ggbio_1.8.3
>         [16] ggplot2_0.9.3.1         GO.db_2.9.0            
>         GOstats_2.26.0
>         [19] graph_1.38.2            grid_3.0.1            
>          gridExtra_0.9.1
>         [22] GSEABase_1.22.0         gtable_0.1.2          
>          Hmisc_3.10-1.1
>         [25] hwriter_1.3             labeling_0.1          
>          lattice_0.20-15
>         [28] MASS_7.3-26             munsell_0.4             PFAM.db_2.9.0
>         [31] plyr_1.8                proto_0.3-10            RBGL_1.36.2
>         [34] RColorBrewer_1.0-5      RCurl_1.95-4.1        
>          reshape2_1.2.2
>         [37] R.methodsS3_1.4.2       R.oo_1.13.0            
>         Rsamtools_1.12.3
>         [40] rtracklayer_1.20.2      R.utils_1.23.2          scales_0.2.3
>         [43] splines_3.0.1           stats4_3.0.1            stringr_0.6.2
>         [46] survival_2.37-4         tools_3.0.1            
>         VariantAnnotation_1.6.6
>         [49] XML_3.96-1.1            xtable_1.7-1          
>          zlibbioc_1.6.0
>
>
>
>
>         On Wed, Jun 19, 2013 at 10:55 AM, James W. MacDonald
>         <jmacdon at uw.edu <mailto:jmacdon at uw.edu> <mailto:jmacdon at uw.edu
>         <mailto:jmacdon at uw.edu>>> wrote:
>
>             Hi Sam,
>
>             First, please always give us the results of sessionInfo().
>         This is
>             especially critical in the case of ReportingTools, which
>         has been
>             fundamentally altered between the previous and current
>         versions of
>             BioC.
>
>
>             On 6/19/2013 11:12 AM, Sam McInturf wrote:
>
>                 Bioconductors,
>                      I am working on a RNA seq analysis project and am
>         having
>                 trouble
>                 publishing an HTML report for it.  I am unsure of how
>         to make
>                 my DE genes
>                 have the same ID as what publish() will accept when
>         passing an
>                 argument to
>                 'annotation'.
>                      I mapped the reads using tophat and passed the
>         TAIR 10
>                 gtf file to the
>                 -G option.  When i counted my reads I used the
>                 summarizeOverlaps function
>                 from GenomicRanges and again used this same file.  I
>         called
>                 differential
>                 expression in edgeR using the GLM methods.  So the
>         rownames
>                  of my DE table
>                 are the AGI identifiers (AT#G#####).  I loaded the
>         org.At.tair.db
>                 annotations and passed it to HTMLReport in:
>
>                 publish(DGELists[["Roots"]], myHTML, countTable = cpmMat,
>                 conditions =
>                 group, annotation = "org.At.tair.db", pvaueCutoff =
>         0.01, lfc
>                 =2, n = 1000,
>                 name = "RootsLRT")
>                 Error: More than half of your IDs could not be mapped.
>                 In addition: Warning message:
>                 In .DGELRT.to.data.frame(object, ...) : NAs introduced
>         by coercion
>
>                 which makes sense, because publish() is looking for
>         Entrez IDs
>                 (right?)
>
>                 How do I proceed?
>
>
>             Here I assume you are running R-3.0.x and the current
>         release of BioC.
>
>             When you run publish() on anything but a data.frame, the first
>             step is to coerce to a data.frame using a set of
>         assumptions that
>             might not hold in your case (or there may be defaults that you
>             don't like). Because of this, I tend to just coerce to a
>             data.frame myself and then publish() that directly. This also
>             allows you to pass in arguments to .modifyDF which is kind
>         of sweet.
>
>             In the case of a DGELRT or DEGExact object, there is a 'genes'
>             slot that will be used to annotate the output of topTags().
>             Ideally you would just add the annotation you want to that
>         slot
>             first. So you could do something like
>
>             annot <- select(org.At.tair.db,
>         DGELists[["Roots"]]$genes[,<Tair
>             column goes here>], c("SYMBOL","GENENAME","OTHERSTUFF"))
>
>
>             and then put that in your DGEobjects. Now you can do
>         something like
>
>             outlst <- lapply(DGELists, topTags, otherargsgohere)
>
>             htmlst <- lapply(seq_len(length(DGELists)) function(x)
>
>             HTMLReport(namevector[x], titlevector[x], otherargs))
>
>             and you can define a function similar to this function I
>         use for
>             Entrez Gene IDs:
>
>             entrezLinks <- function (df, ...){
>                 df$ENTREZID <- hwriter::hwrite(as.character(df$ENTREZID),
>                     link = paste0("http://www.ncbi.nlm.nih.gov/gene/",
>
>             as.character(df$ENTREZID)),
>                     table = FALSE)
>                 return(df)
>             }
>
>             but modified for the Tair equivalent and then
>
>             lapply(seq_len(length(htmlst)), function(x)
>         publish(outlst[[x]],
>
>             htmlst[[x]], .modifyDF = samsTairLinkFun)))
>             lapply(htmlst, finish)
>
>             et voila!
>
>             You can also then use htmlst to make a bunch of links in an
>             index.html page.
>
>             indx <- HTMLReport("index", "A bunch of links for this expt",
>             reportDirectory=".", baseUrl = "")
>             publish(hwriter::hwrite("Here are links", page(indx),
>         header=2,
>             br=TRUE), indx)
>             publish(Link(htmlst, report=indx), indx)
>             finish(indx)
>
>             Best,
>
>             Jim
>
>
>
>                 Thanks in advance!
>
>
>             --     James W. MacDonald, M.S.
>             Biostatistician
>             University of Washington
>             Environmental and Occupational Health Sciences
>             4225 Roosevelt Way NE, # 100
>             Seattle WA 98105-6099
>
>
>
>
>         -- 
>         Sam McInturf
>
>
>     -- 
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>     _______________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>     Search the archives:
>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> -- 
> Gabriel Becker
> Graduate Student
> Statistics Department
> University of California, Davis

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list