[BioC] Affy vs. cDNA : Low and not expressed genes

Park, Richard Richard.Park at joslin.harvard.edu
Tue Jun 3 14:43:11 MEST 2003


Dear Adai, 
I can't tell you the one correct way to do the analysis, but maybe I can give you a little insight by describing how I analyze my own chips. I am a computational biologist and have been doing most of my lab's microarray analysis for the past year. 

We use Affymetrix chips in our lab and we do not remove any genes before normalization. Our experiments range from wt vs. ko comparisons and various time course treatments to comparisons of cell types from various cell sorts. 

If you want to go with the idea of removing absent genes from your chips, you shouldn't remove all of the absent genes at each time point separately. Remove only those genes that are absent at all 3 time points, and then perhaps normalize. 
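As a minimal sketch of that filter (not taken from any particular package), the snippet below keeps a probeset unless it is called absent at every time point. The probe names and the calls are made up for illustration, and it assumes the replicate calls have already been summarized into one call per time point.

```python
# Hypothetical detection calls: one entry per probeset, one call per time point
# (0 h, 2 h, 48 h). "P" = present, "M" = marginal, "A" = absent.
calls = {
    "probe_1": ["A", "A", "A"],  # absent everywhere -> candidate for removal
    "probe_2": ["A", "A", "P"],  # switched on at 48 h -> keep
    "probe_3": ["P", "P", "P"],
    "probe_4": ["A", "M", "A"],  # marginal at 2 h, so not absent everywhere -> keep
}


def absent_everywhere(gene_calls):
    """Return True only if the probeset is called absent at every time point."""
    return all(call == "A" for call in gene_calls)


kept = [probe for probe, c in calls.items() if not absent_everywhere(c)]
removed = [probe for probe, c in calls.items() if absent_everywhere(c)]

print("kept:", kept)
print("removed:", removed)
```

Filtering each time point separately would instead drop probe_2, which is exactly the kind of gene a time course is meant to find.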

However, I have not been a big fan of the absent, present, and marginal calls ever since I moved away from Affymetrix 5.0 processing to RMA processing. Whenever I analyze a microarray experiment I always normalize everything together. This lets me see the big picture first, and from that point I may start filtering. Also, removing genes so early in the analysis restricts your ability to judge the quality of your replicates, because you have fewer points to reference. 

There are also various methods of coming up with a value at each time point. The standard one is to average the replicates. You can also run an outlier elimination algorithm on the replicates at each time point. Thirdly (for time course experiments), I have been testing a loess method that uses information from all of the time points to calculate a spline and come up with a value. 
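A small sketch of the first two options, using invented log2 values for a single gene. The mean is the standard average; the median is used here only as a simple stand-in for an outlier-resistant summary and is an assumption, not the specific outlier elimination algorithm mentioned above. The loess approach is not shown.

```python
from statistics import mean, median

# Hypothetical log2 expression values for one gene: 3 replicates per time point.
replicates = {
    "0h": [6.1, 6.3, 6.2],
    "2h": [7.0, 7.2, 9.9],  # the third replicate looks like an outlier
    "48h": [9.8, 10.1, 10.0],
}

# Standard approach: average the replicates at each time point.
mean_per_point = {t: mean(values) for t, values in replicates.items()}

# Outlier-resistant alternative (assumed here): take the median instead.
median_per_point = {t: median(values) for t, values in replicates.items()}

for t in replicates:
    print(t, round(mean_per_point[t], 2), median_per_point[t])
```

At 2 h the single high replicate pulls the mean up to about 8.03, while the median stays at 7.2.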

After you have a single value for each gene at each time point, the next logical step is to calculate fold change values between each pair of time points. With 3 replicates you can also calculate p-values between the time points. However, keep in mind that p-values are not a better indication of what is going on than fold change until you have more than 8 replicates per time point (from Terry Speed's website). 
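To make the fold change step concrete, here is a minimal sketch for one gene and two time points. It assumes the values are already on a log2 scale (as RMA output is), so the log2 fold change is just a difference of means. The replicate values and function names are invented for illustration. Welch's t statistic is shown as one possible per-gene test statistic; turning it into a p-value needs a t-distribution CDF from a statistics library and is left out.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical log2 expression values for one gene, 3 replicates per time point.
t0 = [6.0, 6.2, 6.4]
t48 = [9.0, 9.2, 9.4]


def log2_fold_change(a, b):
    """Difference of mean log2 values (b relative to a)."""
    return mean(b) - mean(a)


def welch_t(a, b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


lfc = log2_fold_change(t0, t48)
print("log2 fold change:", round(lfc, 3))
print("fold change:", round(2 ** lfc, 3))
print("Welch t:", round(welch_t(t0, t48), 2))
```

With only 3 replicates per group the variance estimates in the denominator are based on very few values, which is one reason to treat such statistics cautiously next to the fold change.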

For some of my recent time course analyses, I have found fold change vs. fold change plots very informative. Plotting the various combinations lets you see which genes are differentially expressed between the time points. Other informative plots are M vs. A plots (log fold change vs. average expression value) and volcano plots (fold change vs. p-values). 
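The coordinates for an M vs. A plot are simple to compute once the data are on a log2 scale: M is the difference of the two log2 values and A is their average. The sketch below uses invented gene names and values and only computes the coordinates; the plotting itself is left to whatever graphics tool you use.

```python
# Hypothetical per-gene log2 expression values at two time points.
log2_expr = {
    "gene_a": {"0h": 6.0, "48h": 9.0},
    "gene_b": {"0h": 8.0, "48h": 8.0},
    "gene_c": {"0h": 10.0, "48h": 7.0},
}


def ma_values(x, y):
    """Return (M, A): log2 fold change of y over x, and mean log2 expression."""
    m = y - x
    a = (x + y) / 2
    return m, a


ma = {gene: ma_values(v["0h"], v["48h"]) for gene, v in log2_expr.items()}

for gene, (m, a) in ma.items():
    print(gene, "M =", m, "A =", a)
```

Genes that do not change sit on the M = 0 line, so up- and down-regulated genes stand out above and below it across the whole intensity range.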

At this point I make lists of genes based on the various plots, and then highlight these lists in related graphs. (I use an in-house method of plotting gene lists, e.g. B-cell related genes or NK-cell related genes, onto microarray plots.) This lets me combine pathway biology with this type of microarray analysis. At this point, people tend to spend a lot of time researching the gene lists on PubMed using UniGene and LocusLink IDs to build a picture of what is going on. We also create random data sets based on the microarray data to use as a reference point for confirming our results. 
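How the random data sets are built is not described here. One common approach, assumed for this sketch, is to shuffle each gene's values across the samples and see how many genes still pass the same fold change cutoff by chance. The data, the cutoff of 2.0 on the log2 scale, and the function names are all hypothetical.

```python
import random
from statistics import mean

# Hypothetical log2 values: per gene, 3 replicates at 0 h followed by 3 at 48 h.
data = {
    "gene_a": [6.0, 6.2, 6.1, 9.0, 9.3, 9.1],
    "gene_b": [8.0, 8.1, 7.9, 8.0, 8.2, 7.9],
    "gene_c": [5.0, 5.2, 5.1, 5.1, 5.0, 5.2],
}


def count_hits(dataset, threshold=2.0):
    """Count genes whose |log2 fold change| (last 3 vs. first 3) exceeds threshold."""
    hits = 0
    for values in dataset.values():
        lfc = mean(values[3:]) - mean(values[:3])
        if abs(lfc) > threshold:
            hits += 1
    return hits


def shuffled_copy(dataset, rng):
    """Return a copy in which each gene's values are randomly reassigned to samples."""
    shuffled = {}
    for gene, values in dataset.items():
        permuted = values[:]
        rng.shuffle(permuted)
        shuffled[gene] = permuted
    return shuffled


rng = random.Random(0)  # fixed seed so the sketch is repeatable
observed = count_hits(data)
random_hits = [count_hits(shuffled_copy(data, rng)) for _ in range(200)]

print("hits in real data:", observed)
print("mean hits in shuffled data:", mean(random_hits))
```

If the shuffled data sets produce nearly as many hits as the real one, the gene list is not telling you much.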

I hope this helps, 

Richard Park 
Computational Data Analyzer
Joslin Diabetes Center 


-----Original Message-----
From: Adaikalavan Ramasamy [mailto:gisar at nus.edu.sg]
Sent: Tuesday, June 03, 2003 6:36 AM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] Affy vs. cDNA : Low and not expressed genes


Dear all,

Thank you for the very interesting discussion on the topic of
"replicates and low expression levels" in the last few days. I am facing
a related problem regarding normalization and would appreciate any
advice.

A small time course experiment was done on blood macrophages and
hybridized to affymetrix chip HGU-133A. There were 3 replicates and 3
time points (0, 2, 48 hour).

The main problem is that at time point 0 hour, there are 95 % Absent
calls. The percentage of Absent call decreases to 70% in 2 hour and 20%
in 2 days. Initially I assumed that there was some physical problem with
the array. But later I was corrected by the biologists that it was
expected as many genes are not expressed in blood macrophages. Thus most
of the 95 % Absent were due to not expressed genes ... Apparently this
is common in developmental biology.

My first question is how does one normalize this kind of data ? The
assumption in two-colour cDNA data of "most of the genes are not
differentially expressed" does not hold here. Median normalization would
not be meaningful in this scenario.

We then explored the possibility of using housekeeping genes for
normalization. But it seems that the 100 housekeeping genes for HGU-133A
are standard and not specified for our experiment. This is because only
28 of these 100 genes are expressed throughout all time points.


The biologists have decided to redo the experiment and I think
they are more likely to hear our advice BEFORE doing the experiments. My
second question is this: Will 2-colour cDNA with UHR as reference
overcome this problem ?

Now I would expect to see most of the un-expressed genes (and previously
Absent in affy) to have very negative log ratio values. But I don't
think the assumption of "most genes are not differentially expressed"
will hold again. And how does one deal with this ... 

My last question is has anyone done a comparison of Affymetrix to cDNA
results/efficiency/advantages. I am interested in quantifying the
benefits of spending 5 times as much money on something that has
typically 40% absent calls. Thank you very much in advance.

Regards, Adai.

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor


