[BioC] Loading quantiles normalized root data in XPS

Tue Feb 21 21:29:45 CET 2012

Dear Paul,

 From your mail it is not quite clear to me what you mean: where do you 
want to attach which unitname?

Please try the following:
scheme.test3 <- 
root.scheme(paste(.path.package("xps"),"schemes/SchemeTest3.root",sep="/"))
data.test3 <- root.data(scheme.test3, 
paste(.path.package("xps"),"rootdata/DataTest3_cel.root",sep="/"))
id <- indexUnits(data.test3, which="", unitID="*")
id[id[,1]==34,]
Do you mean this?

Best regards
Christian

On 2/21/12 1:23 PM, Paul Geeleher wrote:
> Hi Christian,
>
> Thanks again for your very helpful replies. I think using the C++
> macros might be the way I go once I get everything ironed out.
>
> I have what I think is one last question. When try to convert gene
> symbol to internal ID:
>
> allInternalGeneIds[[i]]<- symbol2unitID(scheme.huex10stv2 ,
> symbol=geneSym, unittype="transcript", as.list =TRUE)
>
> I get the following message:
> slot ‘unitname’ is empty, importing data from ‘HuEx-1_0-st-v2.idx’ ...
>
> Which leads me to believe there is a way to attach "unitname"? I've
> been unable to find how to to this though? I think if you know if this
> is possible it could speed up my pipeline quite a bit, as I have to
> repeat this step for every gene!
>
> Thanks again,
>
> Paul.
>
> On 2/14/12, cstrato<cstrato at aon.at>  wrote:
>> Dear Paul,
>>
>> You could export the whole data as table using:
>>
>> export(data_qn, treename="*", treetype="cqu", varlist="fInten",
>> outfile="data_cqu.txt",as.dataframe=FALSE)
>>
>> Then you have a huge text-file, which you can then use for your
>> purposes. However, since you do not have enough memory to work with R
>> you would need to either import only parts of the table into R, or use
>> other languages such as perl for further analysis.
>>
>> For the DABG data you could use function export.call().
>>
>> Alternatively, you could also work completely from within ROOT, using
>> C++ macros. Examples how to use xps from within ROOT are shown in
>> directory xps/examples/macro4XPS.C
>>
>> Best regards
>> Christian
>>
>>
>> On 2/14/12 1:54 PM, Paul Geeleher wrote:
>>> Hi Christian,
>>>
>>> Thanks for your replies.
>>>
>>> As per your suggestions I've been been able to map between xy
>>> co-ordinates, probe IDs and PROBESET_IDS.
>>>
>>> I've been able to extract the expression level for a given gene (for a
>>> subset of samples) using this:
>>>
>>> id<- symbol2unitID(scheme.huex10stv2 , symbol="CDH11",
>>> unittype="transcript", as.list =TRUE)
>>> treenames<- unlist(treeNames(data_qn))
>>> data_qn<- attachInten(data_qn, treenames=treenames[1:3])
>>> xyIntens<- validData(data_qn, unitID=unlist(id))
>>>
>>> With "validData()" is that it seems I can only get the expression
>>> levels for samples which are "attached". This means I'd probably have
>>> to attach each sample separately (which will take a long time) as I
>>> don't have enough memory to attach all samples.
>>>
>>> I'm wondering if there is a way to extract this data without having to
>>> attach the data? If something similar is possible for the dabg data
>>> (at probeset level) that would also be a big help in speeding up my
>>> pipeline (which will be complete if I can implement these last two
>>> steps)!
>>>
>>> Thanks again for all your help its much appreciated,
>>>
>>> Paul.
>>>
>>>
>>> On Tue, Feb 7, 2012 at 8:48 PM, cstrato<cstrato at aon.at>   wrote:
>>>> Dear Paul,
>>>>
>>>> see replies below:
>>>>
>>>>
>>>> On 2/7/12 1:52 PM, Paul Geeleher wrote:
>>>>>
>>>>> Hi Christian, apologies for the lack of clarity in my previous email,
>>>>> I'll try to clear this up, see replies below:
>>>>>
>>>>> On Mon, Feb 6, 2012 at 7:51 PM, cstrato<cstrato at aon.at>     wrote:
>>>>>>
>>>>>> Dear Paul,
>>>>>>
>>>>>> I am afraid that it is not quite clear to me what you want to do.
>>>>>>
>>>>>> What do you mean with "obtain the expression levels of the probes"? Do
>>>>>> you
>>>>>> mean "probes" (i.e. oligos) or do you mean "probesets" as in your
>>>>>> question
>>>>>> about DABG?
>>>>>
>>>>>
>>>>> I've read in and quantiles normalized probe level data for 176 arrays
>>>>> and what I'm trying to do now is set every probe (individual probe,
>>>>> not probeset) that is not detected above background to zero. But from
>>>>> what I've read dabg.call() can only create p-values for probeset
>>>>> level? So as compromise I'm now trying to set the expression of all
>>>>> *individual probes* which belong to a *probeset* not detected above
>>>>> background to zero. I hope that makes sense.
>>>>>
>>>>
>>>>
>>>> Did you read the exon array whitepapers which you can download from
>>>> Affymetrix?
>>>>
>>>> Every probeset consists of 1-4 probes only, and every exon consists
>>>> usually
>>>> of 1-2 probesets. Each gene has a transcript_cluster_id, which consists
>>>> of
>>>> one or more probeset_ids. (You can see the mapping between ids using
>>>> function export.scheme(..,treetype="anp",..)
>>>>
>>>> Since the smallest unit is the probeset, function dabg.call() will only
>>>> work
>>>> at the probeset (and transcript) level. If you set all probes of a
>>>> probeset
>>>> to zero you may loose an entire exon.
>>>>
>>>>
>>>>
>>>>>>
>>>>>> How did you create "data_hapMap"? Maybe you could send me your complete
>>>>>> code
>>>>>> otherwise I am only able to guess.
>>>>>
>>>>>
>>>>> "data_hapMap" was a mistake, sorry I copied the wrong object name it
>>>>> should have been "data_qn" which is quantiles normalized probe level
>>>>> data that I created like this:
>>>>>
>>>>> data_hapMap<- root.data(scheme.huex10stv2,
>>>>> "/data2/paul/normalization_project/root_data/hapMap_root_data_cel.root")
>>>>> #read raw data
>>>>> data_qn<- normalize.quantiles(data_hapMap, "exon_quantiles",
>>>>> filedir=rootdata_Dir, exonlevel="all") #quantiles normalize at *probe*
>>>>> level
>>>>>
>>>>> So "data_qn" is the object I want to work with, based on your advice
>>>>> I've figured out that to access the expression levels in "data_qn" I
>>>>> need to use "attachInten()", then I can use "intensities()" to access
>>>>> the expression levels:
>>>>>
>>>>> treenames<- unlist(treeNames(data_qn))
>>>>> data_qn<- attachInten(data_qn, treenames=treenames[1:2]) #attach the
>>>>> first two samples
>>>>>
>>>>>> head(intensity(data_qn))
>>>>>
>>>>>     X Y GSM188869.cqu_MEAN GSM188870.cqu_MEAN
>>>>> 1 0 0             0.0000              0.000
>>>>> 2 1 0             0.0000              0.000
>>>>> 3 2 0             0.0000              0.000
>>>>> 4 3 0             0.0000              0.000
>>>>> 5 4 0             0.0000              0.000
>>>>> 6 5 0            85.3523            121.733
>>>>>
>>>>> Does the X column in the output above represent the probe ID? If so
>>>>> and I have a mapping of probeset IDs to corresponding probe IDs, it
>>>>> should be fairly straightforward to set probes that are not detected
>>>>> above background to zero? Perhaps there is a more straightforward way
>>>>> of doing this though?
>>>>>
>>>>
>>>> No, (X,Y) are the coordinates of the single probes on the exon array. You
>>>> can use function export.scheme(..,treetype="scm",..) to get the mapping
>>>> between (X,Y) and an internal PROBESET_ID, e.g.:
>>>>
>>>> UNIT_ID X       Y       ProbeLength     Mask    EXON_ID PROBESET_ID
>>>> 31      986     1674    25      512     31      31
>>>> 31      1092    677     25      512     31      31
>>>> 31      796     1862    25      512     31      31
>>>> 31      917     193     25      512     31      31
>>>> 31      341     1677    25      512     32      32
>>>> 31      144     2250    25      512     32      32
>>>> 31      689     262     25      512     32      32
>>>> 31      579     1670    25      512     32      32
>>>>
>>>> Then you can use export.scheme(..,treetype="pbs",..) to map PROBESET_ID
>>>> (=UNIT_ID) to the ProbesetID, e.g.:
>>>>
>>>> UNIT_ID ProbesetID      NumCells        NumAtoms        NumSubunits
>>>> UnitType
>>>> 31      2315101 4       4       1       512
>>>> 32      2315102 4       4       1       512
>>>>
>>>> As you see PROBESET_IDs 31 and 32 have each 4 probes and belong to the
>>>> Affymetrix ProbesetIDs 2315101 and 2315102, respectively.
>>>>
>>>> You could also use functions indexUnits(), and unitID2probesetID() or
>>>> unitID2transcriptID(), respectively.
>>>>
>>>> Best regards
>>>> Christian
>>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> I guess that "data_hapMap" contains the raw data. For these the slot
>>>>>> "data"
>>>>>> is empty to save memory. So you need to use either attachData() or
>>>>>> attachInten(). However since you are using exon arrays you may not have
>>>>>> enough RAM, so it would be better to use function export() or
>>>>>> export.data(),
>>>>>> or attach only a subset, see help ?attachData. See also vignette
>>>>>> xps.pdf
>>>>>> (chapter 2.3).
>>>>>
>>>>>
>>>>> I think RAM shouldn't be an issue if I attach the samples one at a
>>>>> time? (I actually have access to a machine with 32gigs RAM but ideally
>>>>> would like to get what I'm doing to run on a standard desktop, which
>>>>> is actually why I'm using XPS!).
>>>>>
>>>>>>
>>>>>> When you talk about "expression matrix", how did you create it? Maybe
>>>>>> you
>>>>>> could use function validExpr(), but w/o seeing your code it is hard to
>>>>>> tell.
>>>>>> For DABG there are functions pvalData() and presCall(), see the
>>>>>> examples
>>>>>> in
>>>>>> help ?dabg.call.
>>>>>
>>>>>
>>>>> Yes I've managed to use dabg.call() at probeset level and access the
>>>>> p-values using pvalData() alright.
>>>>>
>>>>> Thanks again for all of your help and patience!
>>>>>
>>>>> Kind Regards,
>>>>>
>>>>> Paul.
>>>>>
>>>>>
>>>>>>
>>>>>> Best regards
>>>>>> Christian
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/6/12 4:33 PM, Paul Geeleher wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> Thanks for your quick and informative reply.
>>>>>>>
>>>>>>> I have re-run the analysis and saved the R objects as you suggest. The
>>>>>>> next thing I'm trying to do is to obtain the expression levels of the
>>>>>>> probes, but this doesn't seem to be working for me:
>>>>>>>
>>>>>>>> a<- validData(data_hapMap)
>>>>>>>
>>>>>>>
>>>>>>> Error in .local(object, ...) : slot "data" has no data
>>>>>>>
>>>>>>> Based on the documentation I think validData() is the correct
>>>>>>> function.
>>>>>>>
>>>>>>> I've also performed probeset level DABG and I'm trying to set
>>>>>>> individual probes which belong to probesets with DABG<       .05 to 0
>>>>>>> in
>>>>>>> the expression matrix.
>>>>>>>
>>>>>>> But it seems I can't see the expression matrix using validData().
>>>>>>> Perhaps there is another function. Any ideas?
>>>>>>>
>>>>>>> Thank you again for your help with this, I'm very grateful!
>>>>>>>
>>>>>>> Paul.
>>>>>>>
>>>>>>> On 2/2/12, cstrato<cstrato at aon.at>       wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear Paul,
>>>>>>>>
>>>>>>>> The functions root.data(), root.call() and root.expr() were created
>>>>>>>> to
>>>>>>>> allow you access to the corresponding root files just in case that
>>>>>>>> you
>>>>>>>> did not save your R session.
>>>>>>>>
>>>>>>>> In the cases where you compute expression levels stepwise, or only
>>>>>>>> part
>>>>>>>> of them such as normalize.quantiles(), as seems to be the matter in
>>>>>>>> your
>>>>>>>> case, there is no corresponding root.xxx() function to access the
>>>>>>>> root
>>>>>>>> file directly. In these cases you need to save your R session to have
>>>>>>>> continued access to the resulting root file.
>>>>>>>>
>>>>>>>> Please note that saving the R session is the usual case to have
>>>>>>>> access
>>>>>>>> to the root files.
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>> Christian
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/2/12 1:12 PM, Paul Geeleher wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Christian,
>>>>>>>>>
>>>>>>>>> Thanks for your quick reply. I check what kind of trees I have using
>>>>>>>>> "getTreeNames()" as you'd suggested, it seems they are of type "cqu"
>>>>>>>>> rather than "int", this is presumably because my analysis required
>>>>>>>>> no
>>>>>>>>> background correction step?
>>>>>>>>>
>>>>>>>>> So I then tried:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root",
>>>>>>>>>> "cqu")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> but that gives me a huge number of errors that look like this:
>>>>>>>>>
>>>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>>>> Error: Could not get tree<ExportSet>.
>>>>>>>>> Error in root.expr(scheme.huex10stv2, "exon_quantiles.root",  :
>>>>>>>>>      error in function ‘ExportData’
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This file "exon_quantiles.root" definitely exists in the current
>>>>>>>>> working directory though... Thanks again for your help!
>>>>>>>>>
>>>>>>>>> Paul.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Feb 1, 2012 at 9:01 PM, cstrato<cstrato at aon.at>        wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Dear Paul,
>>>>>>>>>>
>>>>>>>>>> Please have a look at the help ?root.expr.
>>>>>>>>>>
>>>>>>>>>> If I understand you correctly, you did only do quantile
>>>>>>>>>> normalization?
>>>>>>>>>>
>>>>>>>>>> To see the tree names in your file you should do:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> getTreeNames("exon_quantiles.root")
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You will probably see trees with extension "int", see help
>>>>>>>>>> ?validTreetype.
>>>>>>>>>>
>>>>>>>>>> To load these trees you need to do:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root",
>>>>>>>>>>> "int")
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please let me know if this did solve your problem.
>>>>>>>>>>
>>>>>>>>>> Best regards
>>>>>>>>>> Christian
>>>>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>>>>> C.h.r.i.s.t.i.a.n   S.t.r.a.t.o.w.a
>>>>>>>>>> V.i.e.n.n.a           A.u.s.t.r.i.a
>>>>>>>>>> e.m.a.i.l:        cstrato at aon.at
>>>>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2/1/12 7:07 PM, Paul Geeleher wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I've used xps to quantiles normalize (at probe level) some Affy
>>>>>>>>>>> Exon
>>>>>>>>>>> Array data. I now have a "root" file called "exon_quantiles.root",
>>>>>>>>>>> but
>>>>>>>>>>> if I try to load it the same was I'd load my raw data (using the
>>>>>>>>>>> scheme file I created for Affy exon arrays) I get the error below?
>>>>>>>>>>> I
>>>>>>>>>>> can load my raw data just fine though. Any ideas? Do I perhaps
>>>>>>>>>>> need
>>>>>>>>>>> a
>>>>>>>>>>> different "root scheme" file for this normalized data?
>>>>>>>>>>> Unfortunately,
>>>>>>>>>>> I haven't been able to find an answer.
>>>>>>>>>>>
>>>>>>>>>>>> scheme.huex10stv2<- root.scheme("huex10stv2.root")
>>>>>>>>>>>> data_qn<- root.data(scheme.huex10stv2, "exon_quantiles.root")
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Error in if (chipname != treetitle) { : argument is of length zero
>>>>>>>>>>>
>>>>>>>>>>> Hope someone can help,
>>>>>>>>>>>
>>>>>>>>>>> Paul.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> sessionInfo()
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> R version 2.11.0 (2010-04-22)
>>>>>>>>>>> x86_64-redhat-linux-gnu
>>>>>>>>>>>
>>>>>>>>>>> locale:
>>>>>>>>>>>     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>>>>>>     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>>>>>>     [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>>>>>>>     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>>>>>>     [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>>>>>
>>>>>>>>>>> attached base packages:
>>>>>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods
>>>>>>>>>>> base
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>