[BioC] Help needed with CopyNumber analysis for Affymetrix 250K arrays

Tue Feb 5 16:40:20 CET 2008

On Feb 5, 2008 6:50 AM,
<Christian.Stratowa at vie.boehringer-ingelheim.com> wrote:
> Dear Henrik
>
> I have just downloaded and installed the aroma.affymetrix package, however,
> it seems that the package enforces a certain naming convention and directory
> structure, so I need some time to test it.

Yes.  That directory structure is there in order to help you - it was
designed so that the package can detect common user mistakes but also
make it easier for the user to see what analyses have been done from
looking at the directory names.  It's very straightforward once you
get started.

/Henrik

>
> Best regards
> Christian
>
> ==============================================
> Christian Stratowa, PhD
> Boehringer Ingelheim Austria
> Dept NCE Lead Discovery - Bioinformatics
> Dr. Boehringergasse 5-11
> A-1121 Vienna, Austria
> Tel.: ++43-1-80105-2470
> Fax: ++43-1-80105-2782
> email: christian.stratowa at vie.boehringer-ingelheim.com
>
>
>
> -----Original Message-----
> From: henrik.bengtsson at gmail.com [mailto:henrik.bengtsson at gmail.com] On
> Behalf Of Henrik Bengtsson
> Sent: Tuesday, February 05, 2008 12:29 AM
> To: Stratowa,Dr.,Christian FEX BIG-AT-V
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Help needed with CopyNumber analysis for Affymetrix 250K
> arrays
>
>
>
> Hi Christian,
>
> On Feb 4, 2008 1:45 AM, <Christian.Stratowa at vie.boehringer-ingelheim.com>
> wrote:
> > Dear all,
> >
> > Until now I have done all CopyNumber (and LOH) analysis using
> > Affymetrix CNAT4. However, I would prefer to use Bioconductor for this
> > purpose, thus I have a couple of questions:
> >
> >
> > 1, Normalization and summarization of mapping array 50K and 250K
> > CEL-files:
> >
> > Currently, there seem to be only two packages available, which are
> > able to read mapping array CEL-files, namely: package "oligo" and
> > packages "PLASQ" and "PLASQ500K", respectively.
> >
> > Using package "oligo" I can do:
> > > library(oligo)
> > > snprma250 <- justSNPRMA(cels250, phenoData=pheno250)
> > Then I get the normalized intensities:
> > > asTA250 <- antisenseThetaA(snprma250)
> > > asTB250 <- antisenseThetaB(snprma250)
> > > sTA250  <- senseThetaA(snprma250)
> > > sTB250  <- senseThetaB(snprma250)
> >
> > Using package "PLASQ500K" I can do:
> > > library(PLASQ500K)
> > > ref <- celExtNorm("SND", "Sty")
> > > sam <- celExtract("STD", "Sty")
> > I get a matrix of normalized probe intensities for reference (ref) and
> > samples (sam).
> >
> > Are there other packages available which can use mapping array
> > CEL-files?
>
> The aroma.affymetrix package [http://www.braju.com/R/aroma.affymetrix/] works
> of CEL (and CDF) files.  See URL for examples and documentation.
>
> >
> >
> > 2, Genotyping:
> >
> > Package "oligo" can be used for genotyping:
> > > crlmmOut250 <- justCRLMM(cels250, phenoData=pheno250) genocall250 <-
> > > calls(crlmmOut250) genoconf250 <- callsConfidence(crlmmOut250)
> >
> > However, the following results in an error:
> > > snprma250 <- justSNPRMA(cels250, phenoData=pheno250) crlmmOut250 <-
> > > crlmm(snprma250, correctionFile="outputEM.rda")
> > see:
> > https://stat.ethz.ch/pipermail/bioconductor/attachments/20080128/50495
> > 06c/att
> > achment.pl
> >
> > Package "PLASQ500K" could also be used for genotyping:
> > > geno <- EMSNP(???)
> > Although I did not try it, this function seems to have a huge memory
> > problem, see below.
> >
> >
> > 3, CopyNumber analysis:
> >
> > Although there seem to be some packages which could use the output
> > from the Affymetrix CNAT4 results, it seems that there is currently no
> > package able to do copynumber analysis for Affymetrix mapping arrays.
> > Is this correct?
>
> The aroma.affymetrix package can estimate paired & non-paired total raw copy
> numbers, cf.
>
> H. Bengtsson; R. Irizarry; B. Carvalho; T.P. Speed, Estimation and assessment
> of raw copy numbers at the single locus level, Bioinformatics, 2008. [pmid:
> 18204055]
>
> The package also implement other methods for estimating raw CNs. Currently
> GLAD and CBS are the supported segmentation methods, and it is not that hard
> to add wrappers for other segmentation methods.
>
> The aroma.affymetrix package does not do genotyping; at one stage there was a
> wrapper to call CRLMM in 'oligo' but due to lack of time it has become
> obsolete.  It's on the (way to long) to do list to fix that.
>
> Best,
>
> Henrik
>
> >
> > 3a, CNRLMM:
> > In a Johns Hopkins Tech Report, Paper 122, 2006, Wang, Caravalho et al
> > describe a new copynumber algorithm, which they want to make available
> > at Bioconductor.
> > Does anybody know when the CNRLMM algorithm will be available?
> >
> > 3b, PLASQ500K
> > I tried to compute parent-specific copy number using PLASQ500K:
> > > library(PLASQ500K)
> > > psCN <-
> > pscn(StyFolder="STD",normStyFolder="SND",betasSty=NULL,quantSty=NULL,b
> > etasSty
> > File="betasSty.Rdata",rawCNStyfile="rawCNSty.Rdata")
> >
> > Using only 18 250K Sty CEL-files it was impossible to finish this
> > calculation. On a 32GB RAM Linux server the job got killed, since
> > function EMSNP() which is
> > called from function getBetas() used up all RAM. Starting the computation
> on
> > our 64GB RAM Linux server, function EMSNP() could be executed,
> nevertheless,
> > we had to kill the job, when it reached memory consumption of 74GB!!! at a
> > later stage!
> >
> >
> > 3c, Compute raw copy numbers for unpaired copynumber analysis: Using
> > the results from justSNPRMA() I tried to compute the copynumbers in
> > the following way:
> >
> > # Reference files
> > snprma250ref <- justSNPRMA(cels250ref, phenoData=pheno250ref)
> >
> > # Sample files
> > snprma250sam <- justSNPRMA(cels250sam, phenoData=pheno250sam)
> >
> > ## separate allels combined as in CNAT4, see
> > cnat_4_algorithm_whitepaper.pdf, page 9: # TCN(sumLog) =
> > log2(SamA/RefA) + log2(SamB/RefB)
> >
> > # Reference for allele A:
> > # allele A as array
> > ref250A <- array(NA,
> > dim=c(nrow(antisenseThetaA(snprma250ref)),ncol(antisenseThetaA(snprma2
> > 50ref))
> > , 2),
> >
> > dimnames=list(rownames(antisenseThetaA(snprma250ref)),colnames(antisen
> > seTheta
> > A(snprma250ref)),c("antisense","sense")))
> > ref250A[,,1] <- antisenseThetaA(snprma250ref)
> > ref250A[,,2] <- senseThetaA(snprma250ref)
> >
> > # Reference A: rowMeans over sense and antisense strand
> > refA <-
> > sapply(1:dim(ref250A)[2],function(x)rowMeans(ref250A[,x,],na.rm=T))
> > colnames(refA) <- colnames(ref250A)
> >
> > # Reference for allele B:
> > # allele B as array
> > ref250B <- array(NA,
> > dim=c(nrow(antisenseThetaB(snprma250ref)),ncol(antisenseThetaB(snprma2
> > 50ref))
> > , 2),
> >
> > dimnames=list(rownames(antisenseThetaB(snprma250ref)),colnames(antisen
> > seTheta
> > B(snprma250ref)),c("antisense","sense")))
> > ref250B[,,1] <- antisenseThetaB(snprma250ref)
> > ref250B[,,2] <- senseThetaB(snprma250ref)
> >
> > # Reference B: rowMeans over sense and antisense strand
> > refB <-
> > sapply(1:dim(ref250B)[2],function(x)rowMeans(ref250B[,x,],na.rm=T))
> > colnames(refB) <- colnames(ref250B)
> >
> > # Sample for allele A:
> > # allele A as array
> > sam250A <- array(NA,
> > dim=c(nrow(antisenseThetaA(snprma250sam)),ncol(antisenseThetaA(snprma2
> > 50sam))
> > , 2),
> >
> > dimnames=list(rownames(antisenseThetaA(snprma250sam)),colnames(antisen
> > seTheta
> > A(snprma250sam)),c("antisense","sense")))
> > sam250A[,,1] <- antisenseThetaA(snprma250sam)
> > sam250A[,,2] <- senseThetaA(snprma250sam)
> >
> > # Sample A: rowMeans over sense and antisense strand
> > samA <-
> > sapply(1:dim(sam250A)[2],function(x)rowMeans(sam250A[,x,],na.rm=T))
> > colnames(samA) <- colnames(sam250A)
> >
> > # Sample for allele B:
> > # allele B as array
> > sam250B <- array(NA,
> > dim=c(nrow(antisenseThetaB(snprma250sam)),ncol(antisenseThetaB(snprma2
> > 50sam))
> > , 2),
> >
> > dimnames=list(rownames(antisenseThetaB(snprma250sam)),colnames(antisen
> > seTheta
> > B(snprma250sam)),c("antisense","sense")))
> > sam250B[,,1] <- antisenseThetaB(snprma250sam)
> > sam250B[,,2] <- senseThetaB(snprma250sam)
> >
> > # Sample B: rowMeans over sense and antisense strand
> > samB <-
> > sapply(1:dim(sam250B)[2],function(x)rowMeans(sam250B[,x,],na.rm=T))
> > colnames(samB) <- colnames(sam250B)
> >
> > # Total CopyNumber TCN(sumLog), see cnat_4_algorithm_whitepaper.pdf,
> > page 9 TCN.sL <- (samA - rowMeans(refA)) + (samB - rowMeans(refB))
> >
> > # real copy number is: cn = 2^(2^cn) ?? (or 2^(cn+1) ??) cn.sL <-
> > 2^(2^TCN.sL)
> > head(cn.sL)
> > #              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
> > #SNP_A-1780271            1.801377            3.034645            2.314986
> > #SNP_A-1780274            2.017805            2.494345            2.370112
> > #SNP_A-1780277            1.558268            2.446690            2.983195
> > #SNP_A-1780278            1.879762            1.859002            1.697422
> > #SNP_A-1780283            2.064631            1.639300            1.912674
> > #SNP_A-1780290            2.142572            2.738094            2.029215
> >
> > # or alternatively: cn = 2^cnA + 2^cnB ??
> > cn <- 2^(samA - rowMeans(refA)) + 2^(samB - rowMeans(refB))
> > head(cn)
> > #              CEU_NA06993_STY.CEL CEU_NA06994_STY.CEL CEU_NA07022_STY.CEL
> > #SNP_A-1780271            1.859447            2.786287            2.369363
> > #SNP_A-1780274            2.160573            2.315243            2.271201
> > #SNP_A-1780277            3.203198            2.453341            2.773667
> > #SNP_A-1780278            1.908932            1.990323            1.748716
> > #SNP_A-1780283            2.046767            1.691257            1.937375
> > #SNP_A-1780290            2.098621            2.416547            2.020832
> >
> > - Is this computation correct?
> >
> > - Is this way to compute the copynumbers a valuable option?
> >
> > - Are there any alternatives to compute the copynumbers using R
> > packages?
> >
> >
> > Thank you in advance
> > Best regards
> > Christian
> >
> > ==============================================
> > Christian Stratowa, PhD
> > Boehringer Ingelheim Austria
> > Dept NCE Lead Discovery - Bioinformatics
> > Dr. Boehringergasse 5-11
> > A-1121 Vienna, Austria
> > Tel.: ++43-1-80105-2470
> > Fax: ++43-1-80105-2782
> > email: christian.stratowa at vie.boehringer-ingelheim.com
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
>