[BioC] Help needed with CopyNumber analysis for Affymetrix 250K arrays

Tue Feb 5 00:28:54 CET 2008

Hi Christian,

On Feb 4, 2008 1:45 AM,
<Christian.Stratowa at vie.boehringer-ingelheim.com> wrote:
> Dear all,
>
> Until now I have done all CopyNumber (and LOH) analysis using Affymetrix
> CNAT4.
> However, I would prefer to use Bioconductor for this purpose, thus I have a
> couple of questions:
>
>
> 1, Normalization and summarization of mapping array 50K and 250K CEL-files:
>
> Currently, there seem to be only two packages available, which are able to
> read mapping array CEL-files, namely:
> package "oligo" and packages "PLASQ" and "PLASQ500K", respectively.
>
> Using package "oligo" I can do:
> > library(oligo)
> > snprma250 <- justSNPRMA(cels250, phenoData=pheno250)
> Then I get the normalized intensities:
> > asTA250 <- antisenseThetaA(snprma250)
> > asTB250 <- antisenseThetaB(snprma250)
> > sTA250  <- senseThetaA(snprma250)
> > sTB250  <- senseThetaB(snprma250)
>
> Using package "PLASQ500K" I can do:
> > library(PLASQ500K)
> > ref <- celExtNorm("SND", "Sty")
> > sam <- celExtract("STD", "Sty")
> I get a matrix of normalized probe intensities for reference (ref) and
> samples (sam).
>
> Are there other packages available which can use mapping array CEL-files?

The aroma.affymetrix package
[http://www.braju.com/R/aroma.affymetrix/] works of CEL (and CDF)
files.  See URL for examples and documentation.

>
>
> 2, Genotyping:
>
> Package "oligo" can be used for genotyping:
> > crlmmOut250 <- justCRLMM(cels250, phenoData=pheno250)
> > genocall250 <- calls(crlmmOut250)
> > genoconf250 <- callsConfidence(crlmmOut250)
>
> However, the following results in an error:
> > snprma250 <- justSNPRMA(cels250, phenoData=pheno250)
> > crlmmOut250 <- crlmm(snprma250, correctionFile="outputEM.rda")
> see:
> https://stat.ethz.ch/pipermail/bioconductor/attachments/20080128/5049506c/att
> achment.pl
>
> Package "PLASQ500K" could also be used for genotyping:
> > geno <- EMSNP(???)
> Although I did not try it, this function seems to have a huge memory problem,
> see below.
>
>
> 3, CopyNumber analysis:
>
> Although there seem to be some packages which could use the output from the
> Affymetrix CNAT4 results, it seems that there is currently no package able to
> do copynumber analysis for Affymetrix mapping arrays. Is this correct?

The aroma.affymetrix package can estimate paired & non-paired total
raw copy numbers, cf.

H. Bengtsson; R. Irizarry; B. Carvalho; T.P. Speed, Estimation and
assessment of raw copy numbers at the single locus level,
Bioinformatics, 2008. [pmid: 18204055]

The package also implement other methods for estimating raw CNs.
Currently GLAD and CBS are the supported segmentation methods, and it
is not that hard to add wrappers for other segmentation methods.

The aroma.affymetrix package does not do genotyping; at one stage
there was a wrapper to call CRLMM in 'oligo' but due to lack of time
it has become obsolete.  It's on the (way to long) to do list to fix
that.

Best,

Henrik

>
> 3a, CNRLMM:
> In a Johns Hopkins Tech Report, Paper 122, 2006, Wang, Caravalho et al
> describe
> a new copynumber algorithm, which they want to make available at
> Bioconductor.
> Does anybody know when the CNRLMM algorithm will be available?
>
> 3b, PLASQ500K
> I tried to compute parent-specific copy number using PLASQ500K:
> > library(PLASQ500K)
> > psCN <-
> pscn(StyFolder="STD",normStyFolder="SND",betasSty=NULL,quantSty=NULL,betasSty
> File="betasSty.Rdata",rawCNStyfile="rawCNSty.Rdata")
>
> Using only 18 250K Sty CEL-files it was impossible to finish this
> calculation.
> On a 32GB RAM Linux server the job got killed, since function EMSNP() which
> is
> called from function getBetas() used up all RAM. Starting the computation on
> our 64GB RAM Linux server, function EMSNP() could be executed, nevertheless,
> we had to kill the job, when it reached memory consumption of 74GB!!! at a
> later stage!
>
>
> 3c, Compute raw copy numbers for unpaired copynumber analysis:
> Using the results from justSNPRMA() I tried to compute the copynumbers
> in the following way:
>
> # Reference files
> snprma250ref <- justSNPRMA(cels250ref, phenoData=pheno250ref)
>
> # Sample files
> snprma250sam <- justSNPRMA(cels250sam, phenoData=pheno250sam)
>
> ## separate allels combined as in CNAT4, see cnat_4_algorithm_whitepaper.pdf,
> page 9:
> # TCN(sumLog) = log2(SamA/RefA) + log2(SamB/RefB)
>
> # Reference for allele A:
> # allele A as array
> ref250A <- array(NA,
> dim=c(nrow(antisenseThetaA(snprma250ref)),ncol(antisenseThetaA(snprma250ref))
> , 2),
>
> dimnames=list(rownames(antisenseThetaA(snprma250ref)),colnames(antisenseTheta
> A(snprma250ref)),c("antisense","sense")))
> ref250A[,,1] <- antisenseThetaA(snprma250ref)
> ref250A[,,2] <- senseThetaA(snprma250ref)
>
> # Reference A: rowMeans over sense and antisense strand
> refA <- sapply(1:dim(ref250A)[2],function(x)rowMeans(ref250A[,x,],na.rm=T))
> colnames(refA) <- colnames(ref250A)
>
> # Reference for allele B:
> # allele B as array
> ref250B <- array(NA,
> dim=c(nrow(antisenseThetaB(snprma250ref)),ncol(antisenseThetaB(snprma250ref))
> , 2),
>
> dimnames=list(rownames(antisenseThetaB(snprma250ref)),colnames(antisenseTheta
> B(snprma250ref)),c("antisense","sense")))
> ref250B[,,1] <- antisenseThetaB(snprma250ref)
> ref250B[,,2] <- senseThetaB(snprma250ref)
>
> # Reference B: rowMeans over sense and antisense strand
> refB <- sapply(1:dim(ref250B)[2],function(x)rowMeans(ref250B[,x,],na.rm=T))
> colnames(refB) <- colnames(ref250B)
>
> # Sample for allele A:
> # allele A as array
> sam250A <- array(NA,
> dim=c(nrow(antisenseThetaA(snprma250sam)),ncol(antisenseThetaA(snprma250sam))
> , 2),
>
> dimnames=list(rownames(antisenseThetaA(snprma250sam)),colnames(antisenseTheta
> A(snprma250sam)),c("antisense","sense")))
> sam250A[,,1] <- antisenseThetaA(snprma250sam)
> sam250A[,,2] <- senseThetaA(snprma250sam)
>
> # Sample A: rowMeans over sense and antisense strand
> samA <- sapply(1:dim(sam250A)[2],function(x)rowMeans(sam250A[,x,],na.rm=T))
> colnames(samA) <- colnames(sam250A)
>
> # Sample for allele B:
> # allele B as array
> sam250B <- array(NA,
> dim=c(nrow(antisenseThetaB(snprma250sam)),ncol(antisenseThetaB(snprma250sam))
> , 2),
>
> dimnames=list(rownames(antisenseThetaB(snprma250sam)),colnames(antisenseTheta
> B(snprma250sam)),c("antisense","sense")))
> sam250B[,,1] <- antisenseThetaB(snprma250sam)
> sam250B[,,2] <- senseThetaB(snprma250sam)
>
> # Sample B: rowMeans over sense and antisense strand
> samB <- sapply(1:dim(sam250B)[2],function(x)rowMeans(sam250B[,x,],na.rm=T))
> colnames(samB) <- colnames(sam250B)
>
> # Total CopyNumber TCN(sumLog), see cnat_4_algorithm_whitepaper.pdf, page 9
> TCN.sL <- (samA - rowMeans(refA)) + (samB - rowMeans(refB))
>
> # real copy number is: cn = 2^(2^cn) ?? (or 2^(cn+1) ??)
> cn.sL <- 2^(2^TCN.sL)
> head(cn.sL)
> #              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
> #SNP_A-1780271            1.801377            3.034645            2.314986
> #SNP_A-1780274            2.017805            2.494345            2.370112
> #SNP_A-1780277            1.558268            2.446690            2.983195
> #SNP_A-1780278            1.879762            1.859002            1.697422
> #SNP_A-1780283            2.064631            1.639300            1.912674
> #SNP_A-1780290            2.142572            2.738094            2.029215
>
> # or alternatively: cn = 2^cnA + 2^cnB ??
> cn <- 2^(samA - rowMeans(refA)) + 2^(samB - rowMeans(refB))
> head(cn)
> #              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
> #SNP_A-1780271            1.859447            2.786287            2.369363
> #SNP_A-1780274            2.160573            2.315243            2.271201
> #SNP_A-1780277            3.203198            2.453341            2.773667
> #SNP_A-1780278            1.908932            1.990323            1.748716
> #SNP_A-1780283            2.046767            1.691257            1.937375
> #SNP_A-1780290            2.098621            2.416547            2.020832
>
> - Is this computation correct?
>
> - Is this way to compute the copynumbers a valuable option?
>
> - Are there any alternatives to compute the copynumbers using R packages?
>
>
> Thank you in advance
> Best regards
> Christian
>
> ==============================================
> Christian Stratowa, PhD
> Boehringer Ingelheim Austria
> Dept NCE Lead Discovery - Bioinformatics
> Dr. Boehringergasse 5-11
> A-1121 Vienna, Austria
> Tel.: ++43-1-80105-2470
> Fax: ++43-1-80105-2782
> email: christian.stratowa at vie.boehringer-ingelheim.com
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>