[BioC] Loading quantiles normalized root data in XPS

Tue Feb 14 20:55:08 CET 2012

Dear Paul,

You could export the whole data as table using:

export(data_qn, treename="*", treetype="cqu", varlist="fInten", 
outfile="data_cqu.txt",as.dataframe=FALSE)

Then you have a huge text-file, which you can then use for your 
purposes. However, since you do not have enough memory to work with R 
you would need to either import only parts of the table into R, or use 
other languages such as perl for further analysis.

For the DABG data you could use function export.call().

Alternatively, you could also work completely from within ROOT, using 
C++ macros. Examples how to use xps from within ROOT are shown in 
directory xps/examples/macro4XPS.C

Best regards
Christian

On 2/14/12 1:54 PM, Paul Geeleher wrote:
> Hi Christian,
>
> Thanks for your replies.
>
> As per your suggestions I've been been able to map between xy
> co-ordinates, probe IDs and PROBESET_IDS.
>
> I've been able to extract the expression level for a given gene (for a
> subset of samples) using this:
>
> id<- symbol2unitID(scheme.huex10stv2 , symbol="CDH11",
> unittype="transcript", as.list =TRUE)
> treenames<- unlist(treeNames(data_qn))
> data_qn<- attachInten(data_qn, treenames=treenames[1:3])
> xyIntens<- validData(data_qn, unitID=unlist(id))
>
> With "validData()" is that it seems I can only get the expression
> levels for samples which are "attached". This means I'd probably have
> to attach each sample separately (which will take a long time) as I
> don't have enough memory to attach all samples.
>
> I'm wondering if there is a way to extract this data without having to
> attach the data? If something similar is possible for the dabg data
> (at probeset level) that would also be a big help in speeding up my
> pipeline (which will be complete if I can implement these last two
> steps)!
>
> Thanks again for all your help its much appreciated,
>
> Paul.
>
>
> On Tue, Feb 7, 2012 at 8:48 PM, cstrato<cstrato at aon.at>  wrote:
>> Dear Paul,
>>
>> see replies below:
>>
>>
>> On 2/7/12 1:52 PM, Paul Geeleher wrote:
>>>
>>> Hi Christian, apologies for the lack of clarity in my previous email,
>>> I'll try to clear this up, see replies below:
>>>
>>> On Mon, Feb 6, 2012 at 7:51 PM, cstrato<cstrato at aon.at>    wrote:
>>>>
>>>> Dear Paul,
>>>>
>>>> I am afraid that it is not quite clear to me what you want to do.
>>>>
>>>> What do you mean with "obtain the expression levels of the probes"? Do
>>>> you
>>>> mean "probes" (i.e. oligos) or do you mean "probesets" as in your
>>>> question
>>>> about DABG?
>>>
>>>
>>> I've read in and quantiles normalized probe level data for 176 arrays
>>> and what I'm trying to do now is set every probe (individual probe,
>>> not probeset) that is not detected above background to zero. But from
>>> what I've read dabg.call() can only create p-values for probeset
>>> level? So as compromise I'm now trying to set the expression of all
>>> *individual probes* which belong to a *probeset* not detected above
>>> background to zero. I hope that makes sense.
>>>
>>
>>
>> Did you read the exon array whitepapers which you can download from
>> Affymetrix?
>>
>> Every probeset consists of 1-4 probes only, and every exon consists usually
>> of 1-2 probesets. Each gene has a transcript_cluster_id, which consists of
>> one or more probeset_ids. (You can see the mapping between ids using
>> function export.scheme(..,treetype="anp",..)
>>
>> Since the smallest unit is the probeset, function dabg.call() will only work
>> at the probeset (and transcript) level. If you set all probes of a probeset
>> to zero you may loose an entire exon.
>>
>>
>>
>>>>
>>>> How did you create "data_hapMap"? Maybe you could send me your complete
>>>> code
>>>> otherwise I am only able to guess.
>>>
>>>
>>> "data_hapMap" was a mistake, sorry I copied the wrong object name it
>>> should have been "data_qn" which is quantiles normalized probe level
>>> data that I created like this:
>>>
>>> data_hapMap<- root.data(scheme.huex10stv2,
>>> "/data2/paul/normalization_project/root_data/hapMap_root_data_cel.root")
>>> #read raw data
>>> data_qn<- normalize.quantiles(data_hapMap, "exon_quantiles",
>>> filedir=rootdata_Dir, exonlevel="all") #quantiles normalize at *probe*
>>> level
>>>
>>> So "data_qn" is the object I want to work with, based on your advice
>>> I've figured out that to access the expression levels in "data_qn" I
>>> need to use "attachInten()", then I can use "intensities()" to access
>>> the expression levels:
>>>
>>> treenames<- unlist(treeNames(data_qn))
>>> data_qn<- attachInten(data_qn, treenames=treenames[1:2]) #attach the
>>> first two samples
>>>
>>>> head(intensity(data_qn))
>>>
>>>    X Y GSM188869.cqu_MEAN GSM188870.cqu_MEAN
>>> 1 0 0             0.0000              0.000
>>> 2 1 0             0.0000              0.000
>>> 3 2 0             0.0000              0.000
>>> 4 3 0             0.0000              0.000
>>> 5 4 0             0.0000              0.000
>>> 6 5 0            85.3523            121.733
>>>
>>> Does the X column in the output above represent the probe ID? If so
>>> and I have a mapping of probeset IDs to corresponding probe IDs, it
>>> should be fairly straightforward to set probes that are not detected
>>> above background to zero? Perhaps there is a more straightforward way
>>> of doing this though?
>>>
>>
>> No, (X,Y) are the coordinates of the single probes on the exon array. You
>> can use function export.scheme(..,treetype="scm",..) to get the mapping
>> between (X,Y) and an internal PROBESET_ID, e.g.:
>>
>> UNIT_ID X       Y       ProbeLength     Mask    EXON_ID PROBESET_ID
>> 31      986     1674    25      512     31      31
>> 31      1092    677     25      512     31      31
>> 31      796     1862    25      512     31      31
>> 31      917     193     25      512     31      31
>> 31      341     1677    25      512     32      32
>> 31      144     2250    25      512     32      32
>> 31      689     262     25      512     32      32
>> 31      579     1670    25      512     32      32
>>
>> Then you can use export.scheme(..,treetype="pbs",..) to map PROBESET_ID
>> (=UNIT_ID) to the ProbesetID, e.g.:
>>
>> UNIT_ID ProbesetID      NumCells        NumAtoms        NumSubunits
>> UnitType
>> 31      2315101 4       4       1       512
>> 32      2315102 4       4       1       512
>>
>> As you see PROBESET_IDs 31 and 32 have each 4 probes and belong to the
>> Affymetrix ProbesetIDs 2315101 and 2315102, respectively.
>>
>> You could also use functions indexUnits(), and unitID2probesetID() or
>> unitID2transcriptID(), respectively.
>>
>> Best regards
>> Christian
>>
>>
>>
>>>
>>>>
>>>> I guess that "data_hapMap" contains the raw data. For these the slot
>>>> "data"
>>>> is empty to save memory. So you need to use either attachData() or
>>>> attachInten(). However since you are using exon arrays you may not have
>>>> enough RAM, so it would be better to use function export() or
>>>> export.data(),
>>>> or attach only a subset, see help ?attachData. See also vignette xps.pdf
>>>> (chapter 2.3).
>>>
>>>
>>> I think RAM shouldn't be an issue if I attach the samples one at a
>>> time? (I actually have access to a machine with 32gigs RAM but ideally
>>> would like to get what I'm doing to run on a standard desktop, which
>>> is actually why I'm using XPS!).
>>>
>>>>
>>>> When you talk about "expression matrix", how did you create it? Maybe you
>>>> could use function validExpr(), but w/o seeing your code it is hard to
>>>> tell.
>>>> For DABG there are functions pvalData() and presCall(), see the examples
>>>> in
>>>> help ?dabg.call.
>>>
>>>
>>> Yes I've managed to use dabg.call() at probeset level and access the
>>> p-values using pvalData() alright.
>>>
>>> Thanks again for all of your help and patience!
>>>
>>> Kind Regards,
>>>
>>> Paul.
>>>
>>>
>>>>
>>>> Best regards
>>>> Christian
>>>>
>>>>
>>>>
>>>> On 2/6/12 4:33 PM, Paul Geeleher wrote:
>>>>>
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> Thanks for your quick and informative reply.
>>>>>
>>>>> I have re-run the analysis and saved the R objects as you suggest. The
>>>>> next thing I'm trying to do is to obtain the expression levels of the
>>>>> probes, but this doesn't seem to be working for me:
>>>>>
>>>>>> a<- validData(data_hapMap)
>>>>>
>>>>>
>>>>> Error in .local(object, ...) : slot "data" has no data
>>>>>
>>>>> Based on the documentation I think validData() is the correct function.
>>>>>
>>>>> I've also performed probeset level DABG and I'm trying to set
>>>>> individual probes which belong to probesets with DABG<      .05 to 0 in
>>>>> the expression matrix.
>>>>>
>>>>> But it seems I can't see the expression matrix using validData().
>>>>> Perhaps there is another function. Any ideas?
>>>>>
>>>>> Thank you again for your help with this, I'm very grateful!
>>>>>
>>>>> Paul.
>>>>>
>>>>> On 2/2/12, cstrato<cstrato at aon.at>      wrote:
>>>>>>
>>>>>>
>>>>>> Dear Paul,
>>>>>>
>>>>>> The functions root.data(), root.call() and root.expr() were created to
>>>>>> allow you access to the corresponding root files just in case that you
>>>>>> did not save your R session.
>>>>>>
>>>>>> In the cases where you compute expression levels stepwise, or only part
>>>>>> of them such as normalize.quantiles(), as seems to be the matter in
>>>>>> your
>>>>>> case, there is no corresponding root.xxx() function to access the root
>>>>>> file directly. In these cases you need to save your R session to have
>>>>>> continued access to the resulting root file.
>>>>>>
>>>>>> Please note that saving the R session is the usual case to have access
>>>>>> to the root files.
>>>>>>
>>>>>> Best regards
>>>>>> Christian
>>>>>>
>>>>>>
>>>>>> On 2/2/12 1:12 PM, Paul Geeleher wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> Thanks for your quick reply. I check what kind of trees I have using
>>>>>>> "getTreeNames()" as you'd suggested, it seems they are of type "cqu"
>>>>>>> rather than "int", this is presumably because my analysis required no
>>>>>>> background correction step?
>>>>>>>
>>>>>>> So I then tried:
>>>>>>>>
>>>>>>>>
>>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root", "cqu")
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> but that gives me a huge number of errors that look like this:
>>>>>>>
>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>> Error in<TFile::cd>: Unknown directory PreprocesSet
>>>>>>> Error: Could not get directory<PreprocesSet>.
>>>>>>> Error: Could not get tree<ExportSet>.
>>>>>>> Error in root.expr(scheme.huex10stv2, "exon_quantiles.root",  :
>>>>>>>     error in function ‘ExportData’
>>>>>>>
>>>>>>>
>>>>>>> This file "exon_quantiles.root" definitely exists in the current
>>>>>>> working directory though... Thanks again for your help!
>>>>>>>
>>>>>>> Paul.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 1, 2012 at 9:01 PM, cstrato<cstrato at aon.at>       wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear Paul,
>>>>>>>>
>>>>>>>> Please have a look at the help ?root.expr.
>>>>>>>>
>>>>>>>> If I understand you correctly, you did only do quantile
>>>>>>>> normalization?
>>>>>>>>
>>>>>>>> To see the tree names in your file you should do:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> getTreeNames("exon_quantiles.root")
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> You will probably see trees with extension "int", see help
>>>>>>>> ?validTreetype.
>>>>>>>>
>>>>>>>> To load these trees you need to do:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> data_qn<- root.expr(scheme.huex10stv2, "exon_quantiles.root", "int")
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please let me know if this did solve your problem.
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>> Christian
>>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>>> C.h.r.i.s.t.i.a.n   S.t.r.a.t.o.w.a
>>>>>>>> V.i.e.n.n.a           A.u.s.t.r.i.a
>>>>>>>> e.m.a.i.l:        cstrato at aon.at
>>>>>>>> _._._._._._._._._._._._._._._._._._
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2/1/12 7:07 PM, Paul Geeleher wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I've used xps to quantiles normalize (at probe level) some Affy Exon
>>>>>>>>> Array data. I now have a "root" file called "exon_quantiles.root",
>>>>>>>>> but
>>>>>>>>> if I try to load it the same was I'd load my raw data (using the
>>>>>>>>> scheme file I created for Affy exon arrays) I get the error below? I
>>>>>>>>> can load my raw data just fine though. Any ideas? Do I perhaps need
>>>>>>>>> a
>>>>>>>>> different "root scheme" file for this normalized data?
>>>>>>>>> Unfortunately,
>>>>>>>>> I haven't been able to find an answer.
>>>>>>>>>
>>>>>>>>>> scheme.huex10stv2<- root.scheme("huex10stv2.root")
>>>>>>>>>> data_qn<- root.data(scheme.huex10stv2, "exon_quantiles.root")
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Error in if (chipname != treetitle) { : argument is of length zero
>>>>>>>>>
>>>>>>>>> Hope someone can help,
>>>>>>>>>
>>>>>>>>> Paul.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> sessionInfo()
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> R version 2.11.0 (2010-04-22)
>>>>>>>>> x86_64-redhat-linux-gnu
>>>>>>>>>
>>>>>>>>> locale:
>>>>>>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>>>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>>>>>>    [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>>>>>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>>>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>>>>>>
>>>>>>>>> attached base packages:
>>>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>