[BioC] stat/math question on Category vignette

Thu Aug 23 16:12:49 CEST 2007

Hi Mark,

Mark W Kimpel wrote:
> I am working my way through the Category vignette and have a question as 
> to how the t statistics for categories are computed from the incidence 
> matrix and individual probeset t-statistics. The code that does this can 
> be found on the bottom of page 3 (development version vignette) and is 
> as follows:
> 
> There are 135 pathways (categories)...
> A = AmER2 %*% tobs$statistic
> A = tA/sqrt(rs2)
> ames(tA) = row.names(AmER2)

Actually you have a typo here. It should read

tA = AmER2 %*% tobs$statistic
tA = tA/sqrt(rs2)

As for the computation being done here, it is actually very simple. 
AmER2 is a matrix of dimension [npathways x nprobesets], where npathways 
is the number of pathways you are interrogating, and nprobesets is the 
number of probesets that remain after you do all the filtering steps 
that preceded this part.

Each row of AmER2 consists of zeros and ones; a zero if the 
corresponding probeset doesn't map to that particular pathway, and a one 
if it does. By computing AmER2 %*% tobs$statistic, we are (in one shot) 
doing the same as

apply(AmER2, 1, function(x) sum(tobs$statistic[as.logical(x)]))

In other words, for each row we are just summing the t-statistics of the 
probesets that are in that particular pathway. Since the number of 
statistics being summed differs from pathway to pathway, we then divide by 
sqrt(rs2), which is just the square root of the number of t-statistics 
summed. We do this to normalize the sums, so that pathways of different 
sizes are on a comparable scale.
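
To see the equivalence on something small, here is a minimal sketch with 
made-up data. The names A, tstat and n_per_path are hypothetical stand-ins 
for AmER2, tobs$statistic and rs2; none of the numbers come from the 
vignette.

```r
# Toy stand-ins (hypothetical): 2 pathways x 4 probesets
A <- matrix(c(1, 1, 0, 0,
              0, 1, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("pathA", "pathB"), paste0("ps", 1:4)))
tstat <- c(2.0, -1.0, 0.5, 1.5)   # made-up per-probeset t-statistics

# One-shot version: the matrix product sums the t-statistics in each pathway
sums_mat <- as.vector(A %*% tstat)

# Row-by-row version: the same sums, computed explicitly
sums_apply <- unname(apply(A, 1, function(x) sum(tstat[as.logical(x)])))

# Number of probesets per pathway (the role rs2 plays above)
n_per_path <- rowSums(A)

# Normalized per-pathway statistic
tA_toy <- sums_mat / sqrt(n_per_path)
names(tA_toy) <- rownames(A)
```

Both versions give the same sums; the matrix product simply does all 
rows at once.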

> 
> I know this is matrix multiplication, but don't know the mathematical or 
> statistical basis for the computation. I am interested in turning the t 
> statistic values in tA into p values, so I need to know the df. for each 
> resultant t. Is that the rs2?

So to answer this question, the values in tA aren't t-statistics. They 
are sums of t-statistics. If you look at the top of the page you are 
quoting, you can see that if we make some assumptions, these values are 
approximately multivariate normal, so you don't need to know the df.
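
The reasoning is that if the individual statistics are roughly 
independent with unit variance under the null, a sum of n of them has 
variance n, so dividing by sqrt(n) gives something close to standard 
normal. A minimal sketch of turning such values into p-values follows; 
the values in tA_toy are made up, and the choice of a two-sided test is 
my assumption rather than something taken from the vignette.

```r
# Hypothetical normalized pathway statistics standing in for tA
tA_toy <- c(pathA = 2.5, pathB = -0.3, pathC = 0)

# Two-sided p-values from the standard normal; no df is involved
p_norm <- 2 * pnorm(-abs(tA_toy))
```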

If you don't want to assume multivariate normal, you can permute to get 
the p-value as is done on page 6.
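
The general idea of a label-permutation test can be sketched as below. 
This is not the vignette's code: the data are simulated, and expr, group, 
A, row_t and path_stat are hypothetical names introduced only for 
illustration. It uses base R's t.test with equal variances assumed, which 
may differ from what the vignette does.

```r
set.seed(1)                                # reproducible toy example

# Hypothetical data: 6 probesets x 8 samples, two groups of 4
expr  <- matrix(rnorm(6 * 8), nrow = 6)
group <- factor(rep(c("a", "b"), each = 4))

# Hypothetical incidence matrix: 2 pathways x 6 probesets
A <- rbind(p1 = c(1, 1, 1, 0, 0, 0),
           p2 = c(0, 0, 1, 1, 1, 1))

# Per-probeset two-sample t-statistics (base R, equal variances assumed)
row_t <- function(x, g) {
  apply(x, 1, function(v) t.test(v ~ g, var.equal = TRUE)$statistic)
}

# Normalized per-pathway sum, as in the computation discussed above
path_stat <- function(x, g) {
  as.vector(A %*% row_t(x, g)) / sqrt(rowSums(A))
}

obs <- path_stat(expr, group)

# Null distribution: shuffle the sample labels and recompute each time
n_perm <- 200
perm <- replicate(n_perm, path_stat(expr, sample(group)))

# Two-sided permutation p-value per pathway: the fraction of permuted
# statistics at least as extreme as the observed one
p_perm <- rowMeans(abs(perm) >= abs(obs))
```

With real data you would want many more permutations than the 200 used 
here.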

Best,

Jim

> 
> This is no doubt a simple question for the statisticians in the group, 
> but not for me! :) Thanks for your help,
> 
> Mark
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623