[BioC] Is there an expected coefficient of variation for the hybridization probe intensities among technical replicates?

Thu Feb 27 19:41:41 CET 2014

Dear Matt,

Jim was already so kind to show you how he calculates the CoV, and just 
while I was trying to reply, Jim did also answer your question regarding 
using the raw data from the CEL-files, even before background correction 
and quantile normalization.

Maybe I can add some comments on the evaluation of the quality of 
technical replicates:

First note that in an ideal world technical replicates should give 
identical intensities for ALL probes, not only the AFFX controls. Thus 
the most simple way to evaluate the quality of the replicates is to draw 
scatter plots of the raw data. You can find an example on page 35 of 
vignette xps.pdf.

Furthermore you could use NUSE and RLE plots to see the effect of 
background correction and normalization compared to the raw data. Once 
again you can find examples on pages 26-27 of vignette xps.pdf.

Other possibilities are borderplot() and coiplot() (see pages 23-24).

If you want to know how good your replicates are I would suggest to 
download data examples from Affymetrix and sample sets from GEO which 
contain replicates, and compare the variation in your replicates to the 
other replicates.

Best regards,
Christian

On 2/27/14 12:02 AM, Thornton, Matthew wrote:
> Hello!
>
> I am digging through the control intensities for technical replicates obtained with an Affymetrix GeneChip Rat Gene 2.0 ST array.
>
> I have 3 technical replicates of the same sample. The kernel density plot overlay of the raw data shows differences in the intensity distributions.
>
> When I pull out the raw intensities with grepl("^AFFX", GeneName), I get the recorded intensities for each pixel mapped to a TranscriptClusterID corresponding to the particular AFFX control, some are 20 pixels others are 11.  The groups of AFFX controls I get are the hybridization controls: AFFX-BioB, AFFX-BioC, AFFX-BioDn,
> AFFX-CreX; then the PolyA controls:AFFX-DapX, AFFX-LysX, AFFX-PheX, AFFX-ThrX, AFFX-TrpnX (?! not listed anywhere, is it a PolyA or hyb?), AFFX-r2-Bs-dap, AFFX-r2-Bs-lys, AFFX-r2-Bs-phe, and AFFX-r2-Bs-thr. Another set of PolyA controls: AFFX-r2-Ec-bioB, AFFX-r2-Ec-bioC, AFFX-r2-Ec-bioD, AFFX-r2-P1-cre.  I also get the ERCC spike-in controls (which we did use). The maximum intensity from all controls is 25,069 grey levels the average is 1500 grey levels. What is a "good" average intensity?
>
> If I calculate the coefficient of variation across each row in my data set, which corresponds to each X,Y intensity measurement across my replicates for each probe.  I see an average coefficient of variation at 22% just for the hybridization controls. Is this consistent with the experience of the list? What should be the expected coefficient of variation for the hybridization controls across technical replicates? the coefficient of variation is very similar for samples that show very low intensity values which are probably not expressed and for samples close to the average intensity. Even though I haven't done it, I don't think excluding probes based on a threshold with improve the average, because the numbers are uniformly similar.
>
> Here is my code:
> #!/usr/bin/Rscript --vanilla --slave
>
> # Load Libraries
> library(Biobase)
> library(xps)
>
> # Directory to store ROOT scheme files
> scmdir <- "/data/met/scmdir/"
> scheme_RaGene <- root.scheme(file.path(scmdir, "scheme_RaGene20stv1.root"))
>
> # Collect scheme information - Pulls out information from the Root scheme
> ann <- export(scheme_RaGene, treetype="ann", outfile="scheme_RaGene_ann.txt", as.dataframe=TRUE, verbose=FALSE)
> scm <- export(scheme_RaGene, treetype="scm", outfile="scheme_RaGene_scm.txt", as.dataframe=TRUE, verbose=FALSE)
>
> # more file locations
> datdir <- "/data/met/26Feb14_TR_CTRs/"
> celdir <- "/data/met/26Feb14_TR_CTRs/CEL/"
> celfiles <- c("CTR_TR1.CEL", "CTR_TR2.CEL", "CTR_TR3.CEL")
> celnames <- c("CTR_TR1", "CTR_TR2", "CTR_TR3")
> outdir <- "/data/met/26Feb14_TR_CTRs/"
>
> # Import CEL files and begin
> data_raw <- import.data(scheme_RaGene, "raw_data_CTR", filedir=outdir, celdir=celdir, celfiles=celfiles, celnames=celnames, verbose=TRUE)
>
> # Attach info
> data_raw <- attachMask(data_raw)
> data_raw <- attachInten(data_raw)
>
> tmp <- intensity(data_raw)
>
> write.csv(tmp, file="raw_data.csv", row.names=FALSE)
>
> # Save an image
> save.image(file="24Feb14_1.RData")
>
> ##
> ## Manipulation of "Raw Data"
> ##
>
> # plots
> png(file="Raw_Data_Density_Plot.png", width=600, height=600)
> par(mar=c(6,3,1,1));
> hist(data_raw, add.legend=TRUE)
> dev.off()
>
> # Get AFFX controls for assessment of PolyA, ERCC, and Hyb
> annot_raw <- merge(tmp, scm, by=c("X", "Y"), all.x=TRUE)
> annot_raw2 <- merge(annot_raw, ann, all.x=TRUE)
> raw_data <- annot_raw2[ order(annot_raw2[,3], annot_raw2[,2]), ]
> write.csv(raw_data, file="raw_data_annotated.csv", row.names=FALSE)
>
> # get the only handle for ERCC
> set.affx <- subset(raw_data, grepl("^AFFX", GeneName))
> write.csv(set.affx, file="affx_controls_raw.csv", row.names=FALSE)
>
> # I usually open the csv file with a spreadsheet program delete the columns that I don't need and calculate the coefficient of variation for each set.
>
> Any comments, advice, or information are greatly appreciated!!
>
> Matt
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>