[BioC] ReportingTools gene IDs

Tue Apr 29 16:25:53 CEST 2014

Hi Assa,

Mike has answered one of your questions, and his answer is also the
reason you have to use the argument

.modifyDF = list(modifyReportDF, fun)

The first list item is the already existing function in ReportingTools 
that knows what a DESeqResults object is, and what to do to coerce it 
to a data.frame.
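
As a rough illustration of why the order in that list matters, here is a
minimal base-R sketch of functions with the signature function(df, object, ...)
being applied one after another. This is not the actual ReportingTools code:
to_df, add_flag and apply_modifiers are made-up names, and a plain list
stands in for the DESeqResults object.

```r
# Hypothetical stand-in for modifyReportDF: builds the initial data.frame
# from the original object (here just a plain list, not a DESeqResults).
to_df <- function(df, object, ...) {
  data.frame(log2FoldChange = object$lfc, padj = object$padj,
             row.names = object$ids)
}

# Hypothetical second step: works on the data.frame made by the first step.
add_flag <- function(df, object, ...) {
  df$significant <- df$padj < 0.05
  df
}

# Call each function in order, feeding the output of one into the next.
apply_modifiers <- function(object, funs, ...) {
  df <- object
  for (f in funs) df <- f(df, object, ...)
  df
}

obj <- list(ids  = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
            lfc  = c(1.2, -0.4),
            padj = c(0.01, 0.30))
res <- apply_modifiers(obj, list(to_df, add_flag))
print(res)
```

If the two functions were swapped, add_flag would receive the raw object
instead of a data.frame, which is why the coercion step has to come first.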

To answer the other question, the answer is yes! In fact you are 
already adding columns to the data.frame (you add the two columns from 
the 'annot' object). You can add other columns in a similar fashion.
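
For example, an extra step along these lines would add one more column.
It is only a sketch in base R: add_label and the lookup table are made up
for illustration, and in practice the values would come from something
like select() on an annotation package.

```r
# Hypothetical lookup table keyed by Ensembl ID (made-up labels).
lookup <- data.frame(ENSEMBL = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
                     LABEL   = c("gene_A", "gene_B"),
                     stringsAsFactors = FALSE)

# Same signature as the other .modifyDF functions: takes the current
# data.frame and the original object, returns a data.frame.
add_label <- function(df, object, ...) {
  # match() keeps the row order of df and gives NA for IDs with no entry
  df$LABEL <- lookup$LABEL[match(rownames(df), lookup$ENSEMBL)]
  df
}

df <- data.frame(padj = c(0.01, 0.2, 0.5),
                 row.names = c("ENSMUSG00000000028", "ENSMUSG00000000003",
                               "ENSMUSG00000000001"))
out <- add_label(df, object = NULL)
print(out)
```

A function like this would simply become one more item in the list, for
instance .modifyDF = list(modifyReportDF, fun, add_label).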

Best,

Jim

On Tuesday, April 29, 2014 9:55:10 AM, Michael Love wrote:
> hi Assa,
>
> If you look up the help for ?"publish-methods", there is support for
> DESeqResults (the 4th data type listed).  DESeqResults is the
> DataFrame produced by DESeq2::results().  The point of creating this
> class was to help simplify the hand-off to ReportingTools. Maybe this
> will help?
>
> Mike
>
> On Tue, Apr 29, 2014 at 9:27 AM, Assa Yeroslaviz <frymor at gmail.com> wrote:
>> Hi Jim,
>>
>> thanks for the tip.
>> Unfortunately I am not sure I understand the idea behind it.
>>
>> You say it is possible to work directly with the DESeqDataSet object, but
>> then the function expects a data.frame to work with. If I understand the
>> mechanism by which the publish function works, it takes the
>> DESeqDataSet object and, using the results function, coerces it into a
>> data.frame.
>>
>> This is the function I ended up using:
>>
>> fun <- function(df, object, ...){
>>      # the coerced data.frame has no ENSEMBL column, so take the IDs from the row names
>>      df$ENSEMBL <- rownames(df)
>>      annot <- select(org.Mm.eg.db, df$ENSEMBL, c("SYMBOL","GENENAME"), "ENSEMBL")
>>      # keep only the first annotation row per Ensembl ID
>>      if(nrow(annot) > nrow(df)) annot <- annot[!duplicated(annot[,1]),]
>>      df <- data.frame(annot, df)
>>      # drop the duplicated ID column created by data.frame(), if present
>>      df <- df[ , !names(df) %in% "ENSEMBL.1", drop = FALSE]
>>      df$ENSEMBL <- hwrite(as.character(df$ENSEMBL),
>>                           link = paste0("http://www.ensembl.org/Mus_musculus/Gene/Summary?g=",
>>                                         as.character(df$ENSEMBL)), table = FALSE)
>>      df
>> }
>>
>>
>> As you can see, I now fill the column df$ENSEMBL from the rownames of the
>> coerced df. This is because the fit object doesn't have a column named
>> ENSEMBL.
>>
>> Q. Is there a way to add columns to the object?
>>
>> Am I doing it in the most efficient way?
>>
>> thanks for the help and the tip about the Ensembl links (mouse genome -
>> Mm).
>>
>>
>> Assa
>>
>>
>>
>> On Fri, Apr 25, 2014 at 3:43 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>>
>>> Hi Assa,
>>>
>>> Gabriel actually already gave you the answer, and it is yes. You just have
>>> to add things to the .modifyDF argument. There are several examples in
>>>
>>> http://www.bioconductor.org/packages/release/bioc/vignettes/ReportingTools/inst/doc/basicReportingTools.pdf
>>>
>>> and here is one (untested) that should apply to your situation:
>>>
>>> fun <- function(df, object, ...){
>>>      # the column name must be quoted, otherwise R looks for a variable called ENSEMBL
>>>      if(!"ENSEMBL" %in% names(df))
>>>          stop("The column name for ensembl ids has to be 'ENSEMBL'!")
>>>      ensids <- df$ENSEMBL
>>>      whichcol <- which(names(df) == "ENSEMBL")
>>>      annot <- select(org.Mm.eg.db, ensids, c("SYMBOL","GENENAME"), "ENSEMBL")
>>>      if(nrow(annot) > nrow(df)) annot <- annot[!duplicated(annot[,1]),]
>>>      df <- data.frame(annot, df[, -whichcol, drop = FALSE])
>>>      df$ENSEMBL <- hwrite(as.character(df$ENSEMBL),
>>>                           link = paste0("http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=",
>>>                                         as.character(df$ENSEMBL)), table = FALSE)
>>>      df
>>> }
>>>
>>>
>>> This function implicitly assumes (and checks) that there is an ENSEMBL
>>> column in your data.frame that it can use to extract the Ensembl IDs. It
>>> also assumes that your species is human, and that you have the org.Mm.eg.db
>>> package already loaded. It then gets the symbol and genename for those IDs,
>>> and does a really naive subsetting of the data if there are duplicates.
>>> Other more sophisticated things are possible, but I leave it to you to make
>>> any such modifications.
>>>
>>> You would use this (as Gabriel already said) as part of an argument
>>> passed in via .modifyDF. You need modifyReportDF as well, so your
>>> publish call would now look like
>>>
>>> publish(fit, des2Report, pvalueCutoff = 0.05, annotation.db = "org.Mm.eg.db",
>>>         factor = colData(fit)$condition, reportDir = "./reports",
>>>         .modifyDF = list(modifyReportDF, fun))
>>>
>>> That at least is the basic idea, and you might need to play around to make
>>> things work correctly.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> On 4/25/2014 4:21 AM, Assa Yeroslaviz wrote:
>>>
>>>> Hi Gabriel,
>>>>
>>>> Thanks for the quick answer. I will look into that as soon as I have the
>>>> time.
>>>> Another question was whether it is possible to work directly with the
>>>> Ensembl IDs.
>>>>
>>>> I have a table of ~37K Ensembl IDs, of which almost 50% have no Entrez
>>>> IDs, so I can't convert them. Is there a way to work directly with the
>>>> Ensembl IDs and still benefit from the annotation.db possibilities?
>>>>
>>>> Thanks
>>>>
>>>> Assa
>>>>
>>>>
>>>>
>>>> On Thu, Apr 24, 2014 at 4:48 PM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>>>>
>>>>      I wrote my previous message too quickly. Apologies.
>>>>
>>>>      Your functions must have the signature
>>>>
>>>>      function(df, object, ...)
>>>>
>>>>      df is the current data.frame representation of the object,
>>>>      object is the *original* object (so that the class can be identified),
>>>>      ... are passed in from the call to publish
>>>>
>>>>      And you can just place the generic modifyReportDF function at the
>>>>      beginning of the list, rather than using getMethod. The getMethod
>>>>      thing I said is for when you want to apply the default handling
>>>>      for a *different* class to your object. It is a rare use-case, but
>>>>      came up recently so it was on my mind.
>>>>
>>>>      That will teach me to respond quickly to emails early in the morning.
>>>>
>>>>      Sorry about that.
>>>>
>>>>      ~G
>>>>
>>>>
>>>>      On Thu, Apr 24, 2014 at 7:18 AM, Gabriel Becker
>>>>      <gmbecker at ucdavis.edu> wrote:
>>>>
>>>>          Assa,
>>>>
>>>>          In general yes, if you want to add to the table you will be
>>>>          working with the data.frame.
>>>>
>>>>          You can do so after the initial conversion, though, so you
>>>>          don't have to reinvent the wheel to get from your object to an
>>>>          initial data.frame.
>>>>
>>>>          To modify the default table (data.frame) generated for an
>>>>          object, you can pass publish()'s .modifyDF parameter a
>>>>          function or list of functions, each of which should accept
>>>>          object (the data.frame) and "..." and return a data.frame.
>>>>
>>>>          These will be called in order, each accepting the output from
>>>>          the last. The output of the final function is what will be
>>>>          transformed into HTML and inserted into the report.
>>>>
>>>>          You'll probably want to add the default handling of your
>>>>          object type, which you can do by putting
>>>>          getMethod("modifyReportDF", "<your object's class>") at the
>>>>          beginning of the list.
>>>>
>>>>          See section 4 of the ReportingTools basics vignette for
>>>>          example code.
>>>>
>>>>          HTH,
>>>>          ~G
>>>>
>>>>
>>>>          On Thu, Apr 24, 2014 at 6:54 AM, Assa Yeroslaviz
>>>>          <frymor at gmail.com> wrote:
>>>>
>>>>              Thanks Jim,
>>>>
>>>>              I have found in one of the forums a response from Jason
>>>>              (thanks again) suggesting the option to set annotation.db=NULL
>>>>              and thus force the publish command to work with the IDs I
>>>>              provide in the DESeqDataSet object.
>>>>
>>>>              So this is now working, but I would also like to have the
>>>>              option to add some annotations to the table.
>>>>
>>>>              Is this only possible when working directly with a
>>>>              data.frame?
>>>>
>>>>              Thanks again
>>>>              Assa
>>>>
>>>>              On Thu, Apr 24, 2014 at 3:45 PM, James W. MacDonald
>>>>              <jmacdon at uw.edu> wrote:
>>>>
>>>>              > Hi Assa,
>>>>              >
>>>>              > There may well be a way to work with Ensembl IDs, and you
>>>>              > will likely get an answer to your direct question from one
>>>>              > of the maintainers.
>>>>              >
>>>>              > However you should note that ReportingTools simply takes the
>>>>              > input object and then coerces the data to a data.frame, which
>>>>              > is then used to create the HTML table. You can always create
>>>>              > the data.frame to your own liking up front, and then pass
>>>>              > that to publish(). While this is more work than just passing
>>>>              > in the DESeqDataSet, you do have complete control over the
>>>>              > output.
>>>>              >
>>>>              > Best,
>>>>              >
>>>>              > Jim
>>>>              >
>>>>              >
>>>>              >
>>>>              > On 4/24/2014 8:50 AM, Assa Yeroslaviz wrote:
>>>>              >
>>>>              >> Hi,
>>>>              >>
>>>>              >> Is it necessary to have Entrez gene IDs to work with this
>>>>              >> package?
>>>>              >>
>>>>              >> I am working on a dataset with Ensembl IDs. Do I need to
>>>>              >> convert them to Entrez?
>>>>              >>
>>>>              >> When trying to create a report for a DESeqDataSet or
>>>>              >> DESeqResults object I am getting the error message:
>>>>              >>
>>>>              >> Error: Ids do not appear to be Entrez Ids for the specified species.
>>>>              >>
>>>>              >> Is there a way to work directly with the Ensembl IDs?
>>>>              >>
>>>>              >> Thanks
>>>>              >>
>>>>              >> Assa
>>>>              >>
>>>>              >> my script:
>>>>              >>
>>>>              >> head(Counts_set)
>>>>              >>                    A_pKO_aV_FCS G_pKO_aV_FCS M_pKO_aV_FCS D_pKO_aV J_pKO_aV
>>>>              >> ENSMUSG00000000001         4744         4632         4535     4748     3736
>>>>              >> ENSMUSG00000000003            0            0            0        0        0
>>>>              >> ENSMUSG00000000028         1246         1420         1429     2304     1261
>>>>              >> ENSMUSG00000000031            3           25           65        0       50
>>>>              >> ENSMUSG00000000037            0            0            0        0        0
>>>>              >> ENSMUSG00000000049            0            0            3        1        3
>>>>              >>
>>>>              >> cds <- DESeqDataSetFromMatrix (
>>>>              >>      countData = Counts_set,
>>>>              >>      colData   = colData,
>>>>              >>      design    = ~  condition
>>>>              >>      )
>>>>              >>
>>>>              >> fit = DESeq(cds)
>>>>              >> des2Report <- HTMLReport(
>>>>              >>      shortName = paste('RNAseq_analysis_', group1, "_", group2, sep=""),
>>>>              >>      title = 'RNA-seq analysis of differential expression using DESeq2',
>>>>              >>      reportDirectory = "./reports")
>>>>              >> publish(fit, des2Report, pvalueCutoff=0.05, annotation.db="org.Mm.eg.db",
>>>>              >>      factor = colData(fit)$condition, reportDir="./reports")
>>>>              >> Error: Ids do not appear to be Entrez Ids for the specified species.
>>>>              >> finish(des2Report)
>>>>              >>
>>>>              >>
>>>>              >> sessionInfo()
>>>>              >> R version 3.1.0 (2014-04-10)
>>>>              >> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>              >>
>>>>              >> locale:
>>>>              >>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>              >>   [3] LC_TIME=en_US.UTF-8      LC_COLLATE=en_US.UTF-8
>>>>              >>   [5] LC_MONETARY=en_US.UTF-8  LC_MESSAGES=en_US.UTF-8
>>>>              >>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>              >>   [9] LC_ADDRESS=C       LC_TELEPHONE=C
>>>>              >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>              >>
>>>>              >> attached base packages:
>>>>              >> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>              >> [8] base
>>>>              >>
>>>>              >> other attached packages:
>>>>              >>  [1] org.Mm.eg.db_2.14.0     ReportingTools_2.4.0    AnnotationDbi_1.26.0
>>>>              >>  [4] Biobase_2.24.0          RSQLite_0.11.4          DBI_0.2-7
>>>>              >>  [7] knitr_1.5               DESeq2_1.4.0            RcppArmadillo_0.4.200.0
>>>>              >> [10] Rcpp_0.11.1             GenomicRanges_1.16.2    GenomeInfoDb_1.0.2
>>>>              >> [13] IRanges_1.22.3          BiocGenerics_0.10.0
>>>>              >>
>>>>              >> loaded via a namespace (and not attached):
>>>>              >>  [1] annotate_1.42.0    AnnotationForge_1.6.0   BatchJobs_1.2
>>>>              >>  [4] BBmisc_1.5         BiocParallel_0.6.0      biomaRt_2.20.0
>>>>              >>  [7] Biostrings_2.32.0  biovizBase_1.12.0       bitops_1.0-6
>>>>              >> [10] brew_1.0-6         BSgenome_1.32.0         Category_2.30.0
>>>>              >> [13] cluster_1.14.4     codetools_0.2-8         colorspace_1.2-4
>>>>              >> [16] dichromat_2.0-0    digest_0.6.4            edgeR_3.6.0
>>>>              >> [19] evaluate_0.5.3     fail_1.2                foreach_1.4.2
>>>>              >> [22] formatR_0.10       Formula_1.1-1           genefilter_1.46.0
>>>>              >> [25] geneplotter_1.42.0 GenomicAlignments_1.0.0 GenomicFeatures_1.16.0
>>>>              >> [28] ggbio_1.12.0       ggplot2_0.9.3.1         GO.db_2.14.0
>>>>              >> [31] GOstats_2.30.0     graph_1.42.0            grid_3.1.0
>>>>              >> [34] gridExtra_0.9.1    GSEABase_1.26.0         gtable_0.1.2
>>>>              >> [37] Hmisc_3.14-4       hwriter_1.3             iterators_1.0.7
>>>>              >> [40] lattice_0.20-24    latticeExtra_0.6-26     limma_3.20.1
>>>>              >> [43] locfit_1.5-9.1     MASS_7.3-29             Matrix_1.1-2
>>>>              >> [46] munsell_0.4.2      PFAM.db_2.14.0          plyr_1.8.1
>>>>              >> [49] proto_0.3-10       RBGL_1.40.0             RColorBrewer_1.0-5
>>>>              >> [52] RCurl_1.95-4.1     reshape2_1.2.2          R.methodsS3_1.6.1
>>>>              >> [55] R.oo_1.18.0        Rsamtools_1.16.0        rtracklayer_1.24.0
>>>>              >> [58] R.utils_1.29.8     scales_0.2.4            sendmailR_1.1-2
>>>>              >> [61] splines_3.1.0      stats4_3.1.0            stringr_0.6.2
>>>>              >> [64] survival_2.37-7    tools_3.1.0             VariantAnnotation_1.10.0
>>>>              >> [67] XML_3.98-1.1       xtable_1.7-3            XVector_0.4.0
>>>>              >> [70] zlibbioc_1.10.0
>>>>              >>
>>>>              >> _______________________________________________
>>>>              >> Bioconductor mailing list
>>>>              >> Bioconductor at r-project.org
>>>>              >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>              >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>              >>
>>>>              >
>>>>              > --
>>>>              > James W. MacDonald, M.S.
>>>>              > Biostatistician
>>>>              > University of Washington
>>>>              > Environmental and Occupational Health Sciences
>>>>              > 4225 Roosevelt Way NE, # 100
>>>>              > Seattle WA 98105-6099
>>>>              >
>>>>              >
>>>>          --
>>>>          Gabriel Becker
>>>>          Graduate Student
>>>>          Statistics Department
>>>>          University of California, Davis

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099