[BioC] [devteam-bioc] getGmt error

Martin Morgan mtmorgan at fhcrc.org
Fri Sep 6 15:17:01 CEST 2013


On 09/05/2013 10:52 PM, Maintainer wrote:
>
> When I read a GMT file into R, like
>> C2allBroadSets <- getGmt("c2.all.v4.0.orig.gmt")
> Error in GeneSetCollection(lapply(lines, function(line) { :
>    error in evaluating the argument 'object' in selecting a method for function 'GeneSetCollection': Error in validObject(.Object) :
>    invalid class "GeneSet" object: gene symbols must be unique

The problem is that c2.all.v4.0.orig.gmt (from
http://www.broadinstitute.org/gsea/msigdb/collections.jsp) is malformed: one
gene set lists the same identifier twice. To track this down I ran the
following (the output is edited):

 > options(error=recover)
 > xx = getGmt("c2.all.v4.0.orig.gmt")

Enter a frame number, or 0 to exit

  1: getGmt("c2.all.v4.0.orig.gmt")
  2: GeneSetCollection(lapply(lines, function(line) {
     GeneSet(unlist(line[-(1
  3: lapply(lines, function(line) {
     GeneSet(unlist(line[-(1:2)]), geneIdType
  4: FUN(X[[4694]], ...)
  5: GeneSet(unlist(line[-(1:2)]), geneIdType = geneIdType, collectionType = col
  6: GeneSet(unlist(line[-(1:2)]), geneIdType = geneIdType, collectionType = col
  7: do.call(new, c("GeneSet", list(geneIds = type), list(... = ..., setIdentifi
  8: (function (Class, ...)
{
     ClassDef <- getClass(Class, where = topenv(pare
  9: initialize(value, ...)
10: initialize(value, ...)
11: .local(.Object, ...)
12: callNextMethod(.Object, .Template, ..., setIdentifier = mkScalar(setIdentif
13: eval(call, callEnv)
14: eval(expr, envir, enclos)
15: .nextMethod(.Object, .Template, ..., setIdentifier = mkScalar(setIdentifier
16: validObject(.Object)

Selection:

Frame 4, FUN(X[[4694]], ...), hints that the problem is at or near line 4694
of the file. I then selected frame 16 to inspect the object that failed
validation:

Selection: 16
Called from: top level
Browse[1]> getValidity(getClass("GeneSet"))
function (object)
{
     if (any(duplicated(geneIds(object))))
         "gene symbols must be unique"
     else TRUE
}
<environment: namespace:GSEABase>
Browse[1]> geneIds(object)[which(duplicated(geneIds(object)))]
[1] "NM_009369"

I then verified that, in the original file, this is indeed the only line with
a duplicated identifier:

 > txt = readLines("c2.all.v4.0.orig.gmt")
 > fld = strsplit(txt, "\t")
 > dups = sapply(fld, function(x) any(table(x) != 1))
 > which(dups)
[1] 4694
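
The check above counts every tab-separated field, including the set name and
description in the first two columns. As a minimal sketch, assuming you only
care about the gene identifiers (the fields getGmt() passes to GeneSet()), a
helper along these lines reports which identifiers are repeated on which line.
The function name findGmtDuplicates and the toy file are hypothetical, used
only for illustration; they are not part of GSEABase.

```r
# Hypothetical helper (not part of GSEABase): for each line of a GMT file,
# return the gene identifiers that occur more than once on that line.
# Fields 1 and 2 (set name and description) are skipped.
findGmtDuplicates <- function(path) {
    fld <- strsplit(readLines(path), "\t", fixed = TRUE)
    dupIds <- lapply(fld, function(x) {
        ids <- x[-(1:2)]
        unique(ids[duplicated(ids)])
    })
    names(dupIds) <- seq_along(dupIds)   # name each element by line number
    dupIds[lengths(dupIds) > 0]          # keep only lines with duplicates
}

# Toy two-line file standing in for the real GMT file
tmp <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\tdesc\tg1\tg2\tg3",
             "SET_B\tdesc\tg4\tg5\tg4"), tmp)
res <- findGmtDuplicates(tmp)
print(res)
```

The result is a list named by line number, so both the offending line and the
offending identifier are visible without entering the debugger.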

The short-term solution is to remove the duplicate entry from line 4694 and
write the result to a corrected copy of c2.all.v4.0.orig.gmt:

   txt[4694] = sub("NM_009369\t", "", txt[4694])
   writeLines(txt, "c2.all.v4.0.orig_MODIFIED_.gmt")
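
If more than one line were affected, the same idea could be applied to every
line instead of editing a single one by hand. This is only a sketch under
stated assumptions: dedupGmtLine is a hypothetical name, the toy input stands
in for the real file, and dropping repeated identifiers silently is assumed to
be acceptable for your analysis.

```r
# Hypothetical helper: drop repeated gene identifiers from one GMT line,
# keeping the first occurrence and leaving the set name and description
# (fields 1 and 2) untouched.
dedupGmtLine <- function(line) {
    x <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (length(x) <= 2) return(line)     # no gene identifiers on this line
    ids <- x[-(1:2)]
    paste(c(x[1:2], ids[!duplicated(ids)]), collapse = "\t")
}

# Toy input standing in for readLines("c2.all.v4.0.orig.gmt")
txt <- c("SET_A\tdesc\tg1\tg2\tg3",
         "SET_B\tdesc\tg4\tg5\tg4")
cleaned <- vapply(txt, dedupGmtLine, character(1), USE.NAMES = FALSE)

# Write the cleaned lines to a new file rather than overwriting the original
out <- tempfile(fileext = ".gmt")
writeLines(cleaned, out)
```

The cleaned file can then be passed to getGmt() in place of the original.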

The longer-term solution is to report the problem to the MSigDB maintainers.

Martin

>
> How can I fix it?
>
>   -- output of sessionInfo():
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
> [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
> [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
> [4] LC_NUMERIC=C
> [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
>
> attached base packages:
> [1] splines   grid      parallel  stats     graphics  grDevices utils
> [8] datasets  methods   base
>
> other attached packages:
> [1] GSVA_1.8.0                       GSVAdata_0.99.10
> [3] hgu95a.db_2.9.0                  hgu133plus2hsentrezgprobe_17.1.0
> [5] hgu133plus2hsentrezgcdf_17.1.0   hgu133plus2hsentrezg.db_17.1.0
> [7] hgu95av2.db_2.9.0                a4Classif_1.8.0
> [9] varSelRF_0.7-3                   randomForest_4.6-7
> [11] pamr_1.54.1                      survival_2.37-4
> [13] ROCR_1.0-5                       gplots_2.11.3
> [15] KernSmooth_2.23-10               caTools_1.14
> [17] gdata_2.13.2                     gtools_3.0.0
> [19] MLInterfaces_1.40.0              sfsmisc_1.0-24
> [21] cluster_1.14.4                   rda_1.0.2-2
> [23] rpart_4.1-3                      MASS_7.3-29
> [25] a4Preproc_1.8.0                  a4Core_1.8.0
> [27] glmnet_1.9-5                     Matrix_1.0-12
> [29] lattice_0.20-23                  GSEABase_1.22.0
> [31] affy_1.38.1                      GOstats_2.26.0
> [33] graph_1.38.3                     Category_2.26.0
> [35] VennDiagram_1.6.5                pheatmap_0.7.6
> [37] statmod_1.4.17                   limma_3.16.7
> [39] biomaRt_2.16.0                   annotate_1.38.0
> [41] genefilter_1.42.0                primeviewhsentrezgprobe_17.1.0
> [43] primeviewhsentrezg.db_17.1.0     org.Hs.eg.db_2.9.0
> [45] RSQLite_0.11.4                   DBI_0.2-7
> [47] primeviewhsentrezgcdf_17.1.0     AnnotationDbi_1.22.6
> [49] Biobase_2.20.1                   BiocGenerics_0.6.0
> [51] rj_1.1.3-1
>
> loaded via a namespace (and not attached):
> [1] affyio_1.28.0         AnnotationForge_1.2.2 BiocInstaller_1.10.3
> [4] bitops_1.0-6          GO.db_2.9.0           IRanges_1.18.3
> [7] mboost_2.2-2          preprocessCore_1.22.0 RBGL_1.36.2
> [10] RCurl_1.95-4.1        rj.gd_1.1.3-1         stats4_3.0.1
> [13] tools_3.0.1           XML_3.98-1.1          xtable_1.7-1
> [16] zlibbioc_1.6.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


