[BioC] Help needed with CopyNumber analysis for Affymetrix 250K arrays

Mon Feb 4 10:45:49 CET 2008

Dear all,

Until now I have done all CopyNumber (and LOH) analysis using Affymetrix
CNAT4.
However, I would prefer to use Bioconductor for this purpose, thus I have a
couple of questions:

1, Normalization and summarization of mapping array 50K and 250K CEL-files:

Currently, there seem to be only two packages available, which are able to 
read mapping array CEL-files, namely: 
package "oligo" and packages "PLASQ" and "PLASQ500K", respectively.

Using package "oligo" I can do:
> library(oligo)
> snprma250 <- justSNPRMA(cels250, phenoData=pheno250)
Then I get the normalized intensities:
> asTA250 <- antisenseThetaA(snprma250)
> asTB250 <- antisenseThetaB(snprma250)
> sTA250  <- senseThetaA(snprma250)
> sTB250  <- senseThetaB(snprma250)

Using package "PLASQ500K" I can do:
> library(PLASQ500K)
> ref <- celExtNorm("SND", "Sty")
> sam <- celExtract("STD", "Sty")
I get a matrix of normalized probe intensities for reference (ref) and
samples (sam).

Are there other packages available which can use mapping array CEL-files?

2, Genotyping:

Package "oligo" can be used for genotyping:
> crlmmOut250 <- justCRLMM(cels250, phenoData=pheno250)
> genocall250 <- calls(crlmmOut250)
> genoconf250 <- callsConfidence(crlmmOut250)

However, the following results in an error:
> snprma250 <- justSNPRMA(cels250, phenoData=pheno250)
> crlmmOut250 <- crlmm(snprma250, correctionFile="outputEM.rda")
see:
https://stat.ethz.ch/pipermail/bioconductor/attachments/20080128/5049506c/att
achment.pl

Package "PLASQ500K" could also be used for genotyping:
> geno <- EMSNP(???)
Although I did not try it, this function seems to have a huge memory problem,
see below.

3, CopyNumber analysis:

Although there seem to be some packages which could use the output from the
Affymetrix CNAT4 results, it seems that there is currently no package able to
do copynumber analysis for Affymetrix mapping arrays. Is this correct?

3a, CNRLMM:
In a Johns Hopkins Tech Report, Paper 122, 2006, Wang, Caravalho et al
describe 
a new copynumber algorithm, which they want to make available at
Bioconductor. 
Does anybody know when the CNRLMM algorithm will be available?

3b, PLASQ500K
I tried to compute parent-specific copy number using PLASQ500K:
> library(PLASQ500K)
> psCN <-
pscn(StyFolder="STD",normStyFolder="SND",betasSty=NULL,quantSty=NULL,betasSty
File="betasSty.Rdata",rawCNStyfile="rawCNSty.Rdata")

Using only 18 250K Sty CEL-files it was impossible to finish this
calculation. 
On a 32GB RAM Linux server the job got killed, since function EMSNP() which
is
called from function getBetas() used up all RAM. Starting the computation on
our 64GB RAM Linux server, function EMSNP() could be executed, nevertheless, 
we had to kill the job, when it reached memory consumption of 74GB!!! at a
later stage!

3c, Compute raw copy numbers for unpaired copynumber analysis:
Using the results from justSNPRMA() I tried to compute the copynumbers
in the following way:

# Reference files
snprma250ref <- justSNPRMA(cels250ref, phenoData=pheno250ref)

# Sample files
snprma250sam <- justSNPRMA(cels250sam, phenoData=pheno250sam)

## separate allels combined as in CNAT4, see cnat_4_algorithm_whitepaper.pdf,
page 9:
# TCN(sumLog) = log2(SamA/RefA) + log2(SamB/RefB)

# Reference for allele A:
# allele A as array
ref250A <- array(NA,
dim=c(nrow(antisenseThetaA(snprma250ref)),ncol(antisenseThetaA(snprma250ref))
, 2),

dimnames=list(rownames(antisenseThetaA(snprma250ref)),colnames(antisenseTheta
A(snprma250ref)),c("antisense","sense")))
ref250A[,,1] <- antisenseThetaA(snprma250ref)
ref250A[,,2] <- senseThetaA(snprma250ref)

# Reference A: rowMeans over sense and antisense strand
refA <- sapply(1:dim(ref250A)[2],function(x)rowMeans(ref250A[,x,],na.rm=T))
colnames(refA) <- colnames(ref250A)

# Reference for allele B:
# allele B as array
ref250B <- array(NA,
dim=c(nrow(antisenseThetaB(snprma250ref)),ncol(antisenseThetaB(snprma250ref))
, 2),

dimnames=list(rownames(antisenseThetaB(snprma250ref)),colnames(antisenseTheta
B(snprma250ref)),c("antisense","sense")))
ref250B[,,1] <- antisenseThetaB(snprma250ref)
ref250B[,,2] <- senseThetaB(snprma250ref)

# Reference B: rowMeans over sense and antisense strand
refB <- sapply(1:dim(ref250B)[2],function(x)rowMeans(ref250B[,x,],na.rm=T))
colnames(refB) <- colnames(ref250B)

# Sample for allele A:
# allele A as array
sam250A <- array(NA,
dim=c(nrow(antisenseThetaA(snprma250sam)),ncol(antisenseThetaA(snprma250sam))
, 2),

dimnames=list(rownames(antisenseThetaA(snprma250sam)),colnames(antisenseTheta
A(snprma250sam)),c("antisense","sense")))
sam250A[,,1] <- antisenseThetaA(snprma250sam)
sam250A[,,2] <- senseThetaA(snprma250sam)

# Sample A: rowMeans over sense and antisense strand
samA <- sapply(1:dim(sam250A)[2],function(x)rowMeans(sam250A[,x,],na.rm=T))
colnames(samA) <- colnames(sam250A)

# Sample for allele B:
# allele B as array
sam250B <- array(NA,
dim=c(nrow(antisenseThetaB(snprma250sam)),ncol(antisenseThetaB(snprma250sam))
, 2),

dimnames=list(rownames(antisenseThetaB(snprma250sam)),colnames(antisenseTheta
B(snprma250sam)),c("antisense","sense")))
sam250B[,,1] <- antisenseThetaB(snprma250sam)
sam250B[,,2] <- senseThetaB(snprma250sam)

# Sample B: rowMeans over sense and antisense strand
samB <- sapply(1:dim(sam250B)[2],function(x)rowMeans(sam250B[,x,],na.rm=T))
colnames(samB) <- colnames(sam250B)

# Total CopyNumber TCN(sumLog), see cnat_4_algorithm_whitepaper.pdf, page 9
TCN.sL <- (samA - rowMeans(refA)) + (samB - rowMeans(refB))

# real copy number is: cn = 2^(2^cn) ?? (or 2^(cn+1) ??)
cn.sL <- 2^(2^TCN.sL)
head(cn.sL)
#              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
#SNP_A-1780271            1.801377            3.034645            2.314986
#SNP_A-1780274            2.017805            2.494345            2.370112
#SNP_A-1780277            1.558268            2.446690            2.983195
#SNP_A-1780278            1.879762            1.859002            1.697422
#SNP_A-1780283            2.064631            1.639300            1.912674
#SNP_A-1780290            2.142572            2.738094            2.029215

# or alternatively: cn = 2^cnA + 2^cnB ??
cn <- 2^(samA - rowMeans(refA)) + 2^(samB - rowMeans(refB))
head(cn)
#              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
#SNP_A-1780271            1.859447            2.786287            2.369363
#SNP_A-1780274            2.160573            2.315243            2.271201
#SNP_A-1780277            3.203198            2.453341            2.773667
#SNP_A-1780278            1.908932            1.990323            1.748716
#SNP_A-1780283            2.046767            1.691257            1.937375
#SNP_A-1780290            2.098621            2.416547            2.020832

- Is this computation correct?

- Is this way to compute the copynumbers a valuable option?

- Are there any alternatives to compute the copynumbers using R packages?

Thank you in advance
Best regards
Christian

==============================================
Christian Stratowa, PhD
Boehringer Ingelheim Austria
Dept NCE Lead Discovery - Bioinformatics
Dr. Boehringergasse 5-11
A-1121 Vienna, Austria
Tel.: ++43-1-80105-2470
Fax: ++43-1-80105-2782
email: christian.stratowa at vie.boehringer-ingelheim.com