[BioC] package xps: export.filter issues

Thu Aug 30 20:28:39 CEST 2012

Dear Steven,

It seems, that adding fName does indeed cause problems, probably due to 
the fact that the Affymetrix gene annotation does contain characters 
which interfere with R data.frame. However, adding fSymbol only is fine.

Here is my example:

 > tmp <- validData(rma.ufr, which="UnitName")
 > dim(tmp)
[1] 54675     9

 > tmp <- export.filter(rma.ufr, treename = "*", treetype = "stt", 
varlist = "fUnitName:fc:pval:flag", as.dataframe = TRUE, verbose = TRUE)
Warning: tree <UniTest.ufr> does not exist or has not <54675> entries.>
 > dim(tmp)
[1] 54675     5

 > tmp <- export.filter(rma.ufr, treename = "*", treetype = "stt", 
varlist = "fUnitName:fSymbol:fc:pval:flag", as.dataframe = TRUE, verbose 
= TRUE)
Warning: tree <UniTest.ufr> does not exist or has not <54675> 
entries.Reading entries from <HG-U133_Plus_2.ann> ...Finished
 > dim(tmp)
[1] 54675     6

 > tmp <- export.filter(rma.ufr, treename = "*", treetype = "stt", 
varlist = "fUnitName:fName:fSymbol:fc:pval:flag", as.dataframe = TRUE, 
verbose = TRUE)
Warning: tree <UniTest.ufr> does not exist or has not <54675> 
entries.Reading entries from <HG-U133_Plus_2.ann> ...Finished
dim(tmp)
#[1] 28298     7

As you see, everything is fine as long as fName is not included. (You 
can ignore the warning)

As you have already seen, when you simply export the results, you will 
get the complete output:

 > export.filter(rma.ufr, treename = "*", treetype = "stt", varlist = 
"fUnitName:fName:fSymbol:fc:pval:flag", outfile = "UniFltr.txt", sep = 
"\t", as.dataframe = FALSE, verbose = TRUE)

If you try  to import table "UniFltr.txt" as data.frame you get the same 
problem:

 > tmp <- read.table("UniFltr.txt", header=TRUE, row.names=NULL, 
sep="\t", check.names=FALSE, stringsAsFactors=FALSE, comment.char="")
 > dim(tmp)
[1] 28298     7

You see that even setting read.table(...,comment.char="",...) does not 
help. Thus the only option is to use fSymbol only but not fName.

Since you only need to get the gene symbols there is no need to use 
hgu133plus2.db, as my example above has shown.

I hope this does help.

Best regards,
Christian

On 8/17/12 3:05 PM, steven wink wrote:
 > Dear Christian,
 >
 > Thank you for your reply. I noticed the problem occurs when loading the
 > data in R memory as a dataframe. Writing to a file directly: no problem.
 > Loading from file to R memory it gets messed up again. The dataframe
 > without the gene annotation is fine.
 > I wanted the full set in a data.frame to select a set of genes with
 > foldchanges etc. For now I used the bioc package hgu133plus2.db instead
 > of using xps: export.filter to attach gene symbols.
 >
 > I saw one warning:
 > Warning: tree <UniTest.ufr> does not exist or has not <54675>
 > entries.Reading entries from <HG-U133_Plus_2.ann> ...Finished
 >
 > I also followed the code on your email. Same effect.
 >
 > root version
 > Version   5.32/04      13 July 2012
 >
 > xps 1.16.0
 > =
 > netaffx-annotation-date=2011-06-22
 > =
 > sessionInfo()
 > R version 2.15.1 (2012-06-22)
 > Platform: x86_64-pc-linux-gnu (64-bit)
 >
 > locale:
 >   [1] LC_CTYPE=nl_NL.UTF-8       LC_NUMERIC=C
 >   [3] LC_TIME=nl_NL.UTF-8        LC_COLLATE=nl_NL.UTF-8
 >   [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=nl_NL.UTF-8
 >   [7] LC_PAPER=C                 LC_NAME=C
 >   [9] LC_ADDRESS=C               LC_TELEPHONE=C
 > [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
 >
 > attached base packages:
 > [1] stats     graphics  grDevices utils     datasets  methods   base
 >
 > other attached packages:
 > [1] rjson_0.2.9          hgu133plus2.db_2.7.1 org.Hs.eg.db_2.7.1
 > [4] RSQLite_0.11.1       DBI_0.2-5            AnnotationDbi_1.18.1
 > [7] Biobase_2.16.0       BiocGenerics_0.2.0   xps_1.16.0
 >
 > loaded via a namespace (and not attached):
 > [1] IRanges_1.14.4    probemapper_1.0.0 stats4_2.15.1     tools_2.15.1
 > =
 >
 > importing the previously created scheme file:
 >
 > scmdir<-"/media/Working Drive/Promotie/mArray biomarker 
study/Schemes/na32"
 > scheme<-root.scheme(file.path(scmdir,"hgu133plus2.root"))
 > class(scheme)
 > [1] "SchemeTreeSet"
 > attr(,"package")
 > [1] "xps"
 >
 >   library(xps)
 > celdir<-file.path("/media/Working Drive/Promotie/mArray biomarker
 > study/Lisa Paper/carbamazepine.Human.in_vitro.Liver/24 low")
 > 
celfiles<-c("carbamazepine_ctrl_24h_1.CEL","carbamazepine_ctrl_24h_2.CEL","carbamazepine_l_24h_1.CEL","carbamazepine_l_24h_2.CEL")
 > 
data.carb<-import.data(scheme,"tmpdt_dataCarb",celdir=celdir,celfiles=celfiles,verbose=FALSE)
 > data.rma<-rma(data.carb,"tmpdt_rma",verbose=FALSE)
 > png("pca_carb_24h_low_PHH.png",width=800,height=800)
 >  >
 > 
pcaplot(data.rma,group=c("ctrl","ctrl","l24h","l24h"),add.labels=TRUE,add.legend=TRUE)
 >  > dev.off()
 > null device
 >
 > unifltr<-UniFilter(foldchange=c(1.2,"both"),unifilter=c(0.1,"pval"))
 > uniTest(unifltr) <- c("t.test","two.sided","BH",0,0.0,FALSE,0.95,TRUE)
 >  > str(unifltr)
 > Formal class 'UniFilter' [package "xps"] with 5 slots
 >    ..@ foldchange:List of 2
 >    .. ..$ cutoff   : num 1.2
 >    .. ..$ direction: chr "both"
 >    ..@ prescall  : list()
 >    ..@ unifilter :List of 2
 >    .. ..$ cutoff  : num 0.1
 >    .. ..$ variable: chr "pval"
 >    ..@ unitest   :List of 8
 >    .. ..$ type       : chr "t.test"
 >    .. ..$ alternative: chr "two.sided"
 >    .. ..$ correction : chr "BH"
 >    .. ..$ numperm    : int 0
 >    .. ..$ mu         : num 0
 >    .. ..$ paired     : logi FALSE
 >    .. ..$ conflevel  : num 0.95
 >    .. ..$ varequ     : logi TRUE
 >    ..@ numfilters: num 2
 >
 >
 > rma.ufr<-unifilter(data.rma,"tmpdt_rmaufr5",getwd(),unifltr,
 > group=c("ctrl","ctrl","l24h","l24h"),verbose=FALSE)
 >
 > 
tmp<-export.filter(rma.ufr,treetype="stt",varlist="fUnitName:fName:fSymbol:fc:pval:flag",
 > + as.dataframe=TRUE,verbose=FALSE)
 >
 >
 > tmp[3670,]
 >       UNIT_ID  UnitName    GeneName GeneSymbol  P-Value FoldChange Flag
 > 3670    5641 206054_at kininogen 1       KNG1 0.369156    1.05887    0
 >
 >
 > Bellow I include a small portion of the output from row 3671. (In my
 > text file some blocks/ multiple rows of character/strings are quoted,
 > then blocks contain strings which are not.)
 >
 > tmp[3671, ]
 > 72\t0\n8438\t208882_s_at\tubiquitin protein ligase E3 component
 > n-recognin 5\tUBR5\t0.874644\t1.00482\t0\n8439\t208883_at\tubiquitin
 > protein ligase E3 component n-recognin
 > 5\tUBR5\t0.770298\t0.967776\t0\n8440\t208884_s_at\tubiquitin protein
 > ligase E3 component n-recognin
 > 5\tUBR5\t0.253671\t1.04481\t0\n8441\t208885_at\tlymphocyte cytosolic
 > protein 1 (L-plastin)\tLCP1\t0.977804\t1.00235\t0\n8442\t208886_at\tH1
 > histone family, member
 > 0\tH1F0\t0.500673\t0.955601\t0\n8443\t208887_at\teukaryotic translation
 > initiation factor 3, subunit
 > G\tEIF3G\t0.0389222\t1.05615\t0\n8444\t208888_s_at\tnuclear receptor
 > corepressor 2\tNCOR2\t0.372557\t1.22396\t0\n8445\t208889_s_at\tnuclear
 > receptor corepressor
 > 2\tNCOR2\t0.272628\t1.0801\t0\n8446\t208890_s_at\tplexin
 > B2\tPLXNB2\t0.0275446\t1.0633\t0\n8447\t208891_at\tdual specificity
 > phosphatase 6\tDUSP6\t0.837928\t0.981868\t0\n8448\t208892_s_at\tdual
 > specificity phosphatase
 > 6\tDUSP6\t0.491362\t0.932852\t0\n8449\t208893_s_at\tdual specificity
 > phosphatase 6\tDUSP6\t0.359966\t0.964169\t0\n8450\t208894_at\tmajor
 > histocompatibility complex, class II, DR
 > alpha\tHLA-DRA\t0.9963\t0.998907\t0\n8451\t208895_s_at\tDEAD
 > (Asp-Glu-Ala-Asp) box polypeptide
 > 18\tDDX18\t0.265118\t1.03538\t0\n8452\t208896_at\tDEAD (Asp-Glu-Ala-Asp)
 > box polypeptide
 > 18\tDDX18\t0.124475\t0.955801\t0\n8453\t208897_s_at\tDEAD
 > (Asp-Glu-Ala-Asp) box polypeptide
 > 18\tDDX18\t0.11651\t1.05331\t0\n8454\t208898_at\tATPase, H+
 > transporting, lysosomal 34kDa, V1 subunit
 > D\tATP6V1D\t0.118903\t0.970489\t0\n8455\t208899_x_at\tATPase, H+
 > transporting, lysosomal 34kDa, V1 subunit
 > D\tATP6V1D\t0.738698\t0.995617\t0\n8456\t208900_s_at\ttopoisomerase
 > (DNA) I\tTOP1\t0.660803\t0.965493\t0\n8457\t208901_s_at\ttopoisomerase
 > (DNA) I\tTOP1\t0.140795\t0.963365\t0\n8458\t208902_s_at\tribosomal
 > protein S28\tRPS28\t0.939403\t1.00859\t0\n8459\t208903_at\tribosomal
 > protein S28\tRPS28\t0.720425\t0.953657\t0\n8460\t208904_s_at\tribosomal
 > protein S28\tRPS28\t0.118968\t0.958277\t0\n8461\t208905_at\tcytochrome
 > c,
 > somatic\tCYCS\t0.801304\t0.993872\t0\n8462\t208906_at\tBerardinelli-Seip
 > congenital lipodystrophy 2
 > (seipin)\tBSCL2\t0.156978\t1.05388\t0\n8463\t208907_s_at\tmitochondrial
 > ribosomal protein
 > 
S18B\tMRPS18B\t0.113534\t1.02395\t0\n8464\t208908_s_at\tcalpastatin\tCAST\t0.067195\t1.03521\t0\n8465\t208909_at\tubiquinol-cytochrome
 > c reductase, Rieske iron-sulfur polypeptide
 > 1\tUQCRFS1\t0.570856\t1.0093\t0\n8466\t208910_s_at\tcomplement component
 > 1, q subcomponent binding
 > protein\tC1QBP\t0.0812231\t0.958867\t0\n8467\t208911_s_at\tpyruvate
 > dehydrogenase (lipoamide)
 > beta\tPDHB\t0.351079\t1.01438\t0\n8468\t208912_s_at\t2,3-cyclic
 > nucleotide 3 phosphodiesterase
 >       GeneSymbol  P-Value FoldChange Flag
 > 3671        CNP 0.886081   0.991868    0
 >
 >
 >
 >
 >
 >
 > ================================
 > following Christian's mail
 > ================================
 > datdir<-"/media/Working Drive/Promotie/mArray biomarker study/Lisa
 > Paper/selection of genes/carb data"
 >  >
 > 
data.carb<-import.data(scheme,"dataCarbamazepine",filedir=datdir,celdir=celdir,
 > celfiles=celfiles)
 >  >
 > 
data.rma<-rma(data.carb,"dataCarbamazepineRMA",tmpdir="",background="pmonly",normalize
 > = TRUE)
 >   expr.rma<-validData(data.rma)
 > export.expr(data.rma, treename="*", treetype="mdp",varlist ="*",
 > outfile="24hoursLow_carb_PHH.txt", sep="\t", as.dataframe=FALSE, verbose
 > = TRUE)
 > unifltr <- UniFilter(unitest=c("t.test", "two.sided", "BH", 0, 0.0,
 > FALSE, 0.95, TRUE))
 > rma.ufr<-unifilter(data.rma,"dataCarbUnifilter",getwd(),unifltr,
 > group=c("ctrl","ctrl","l24h","l24h"))
 > export.filter(rma.ufr, treename="*",treetype="stt", varlist =
 > "fUnitName:fName:fSymbol:fc:pval:flag", outfile = "UniFltr.txt", sep =
 > "\t", as.dataframe = FALSE, verbose = TRUE)
 > =verbose=
 > Opening file </media/Working Drive/Promotie/mArray biomarker
 > study/Schemes/na32/hgu133plus2.root> in <READ> mode...
 > Opening file </media/Working Drive/Promotie/mArray biomarker study/Lisa
 > Paper/selection of genes/dataCarbUnifilter_ufr.root> in <READ> mode...
 > Opening file </media/Working Drive/Promotie/mArray biomarker study/Lisa
 > Paper/selection of genes/dataCarbUnifilter_ufr.root> in <READ> mode...
 > Exporting data from tree <*> to file <UniFltr.txt>...
 > Warning: tree <UniTest.ufr> does not exist or has not <54675>
 > entries.Reading entries from <HG-U133_Plus_2.ann> ...Finished
 > NULL
 >  >
 > ==
 >
 > tmp<-validData(rma.ufr, which="UnitName")
 >   tmp <- export.filter(rma.ufr, treename = "*", treetype = "stt",
 > varlist = "fUnitName:fName:fSymbol:fc:pval:flag", as.dataframe = TRUE,
 > verbose = TRUE)
 >
 > Having done this the text file "UniFltr.txt" looks fine. However:
 >   test<-read.table("UniFltr.txt",sep="\t",header=TRUE)
 >  > test[3671,2]
 > [1] 206055_s_at
 > 28298 Levels: 1007_s_at 1053_at 117_at 1553075_a_at 1553995_a_at ...
 > AFFX-TrpnX-M_at