[BioC] quality assessment and preprocessing for tiling array-based CGH data

Sean Davis sdavis2 at mail.nih.gov
Wed Oct 22 16:46:24 CEST 2008


On Wed, Oct 22, 2008 at 10:32 AM, Leon Yee <yee.leon at gmail.com> wrote:
> Sean Davis wrote:
>>
>> On Wed, Oct 22, 2008 at 9:51 AM, Leon Yee <yee.leon at gmail.com> wrote:
>>>
>>> Dear all,
>>>
>>>   Is there any well-established routine for quality assessment and
>>> preprocessing of array CGH data, especially tiling array-based CGH data?
>>> Most of the quality assessment methods I have found are for expression
>>> arrays, and few address array CGH data.
>>>   We are using the Agilent 244k rat CGH array, and we now have the text
>>> files produced by Feature Extraction, but we don't know whether they are
>>> of good quality. Could anyone provide some clues? Thanks in advance!
>>>
>>>   After read.maimages(), we got the RGList object, which contains several
>>> components including R, G, Rb, Gb, and so on.  The probes are of 3 types:
>>> -1, 1 and 0. 0 means normal probe; -1 means negative control, I guess, and
>>> the probe names are like (-)3xSLv1, NC1_00000002, etc. [no corresponding
>>> probe sequence]; 1 means positive control, I guess, and the probe names
>>> are like DarkCorner, DCP_008001.0, RnCGHBrightCorner, SRN_800002, etc. [no
>>> corresponding probe sequence].  There are 1275 probes of type -1 and 4217
>>> of type 1, each with its own R, Rb, G and Gb values. Do we need these
>>> values for quality assessment and normalization? How?
>>>   In addition, among the normal probes, 1000 probes are each repeated 3
>>> times on the array. How could we use these data for quality assessment
>>> and normalization?
>>
>> You generally will not want to do any normalization besides a possible
>> shift of the center.  Any linear normalization that affects the slope
>> of the M vs. A plot, or any nonlinear normalization, will likely decrease
>> signal.  As for quality control, a good, general measure to track is
>> the dlrs (derivative log ratio spread), a robust estimate of the
>> probe-to-probe noise standard deviation.
>>
>>
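>> dlrs <-
>>  function(x) {
>>    # x: log ratios for one array, in chromosome and position order
>>    nx <- length(x)
>>    if (nx<3) {
>>      stop("Vector length>2 needed for computation")
>>    }
>>    # Pair each probe with its neighbour: tmp[,1] is x[i+1], tmp[,2] is x[i]
>>    tmp <- embed(x,2)
>>    diffs <- tmp[,2]-tmp[,1]
>>    # IQR/1.34 estimates the SD of the differences; dividing by sqrt(2)
>>    # converts that to the SD of a single log ratio
>>    dlrs <- IQR(diffs)/(sqrt(2)*1.34)
>>    return(dlrs)
>>  }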
>>
>> For Agilent arrays, most of the dlrs values should be around or under
>> 0.2, generally.  However, this might vary a bit from lab to lab.  In any
>> case, an array whose dlrs is a clear outlier is suspect.  The input to
>> the above function is the log ratios for a single array arranged in
>> chromosome and position order.
>>
>> Sean
>>
>
> Hi, Sean
>
>   Thanks for your advice. However, I still have several questions:
>
>   1. The input of dlrs is the log ratios: the log ratio extracted from the
> text file produced by Feature Extraction, or the one calculated from RGList
> --> MAList?  I searched the mailing list and saw a post of yours that
> mentioned the difference between the log ratio from Feature Extraction and
> the default M value from read.maimages.

You can read the Agilent FE manual for more details, but you can
probably use either and come to very similar conclusions.  If you use
the MAList version, make sure to use only median centering, or no
normalization at all.
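
As a rough illustration of what median centering does, here is a minimal
base-R sketch on simulated intensities.  The vectors R, G and M are
hypothetical stand-ins for one array's columns, not limma objects, and the
0.3 dye offset is made up.  In limma, normalizeWithinArrays() with
method="median" should give the same kind of shift, but check the
documentation for your version.

```r
set.seed(1)

# Simulated background-corrected intensities for one array (hypothetical data)
G <- 2^rnorm(1000, mean = 10, sd = 1)
R <- G * 2^rnorm(1000, mean = 0.3, sd = 0.15)  # 0.3 is an artificial dye offset

M <- log2(R / G)                            # log ratio per probe
M_centered <- M - median(M, na.rm = TRUE)   # shift of the center only

# Compact equivalent of the dlrs function posted earlier in this thread
dlrs <- function(x) IQR(diff(x)) / (sqrt(2) * 1.34)

median(M_centered)   # essentially zero after centering
dlrs(M)              # noise estimate before centering
dlrs(M_centered)     # same value: a constant shift cancels in the differences
```

Because dlrs is built from differences between neighbouring probes, the
centering step changes the location of the log ratios but not the dlrs.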

>   2. I can get the log ratios of all features including control type of -1
> and 1, but these features don't have chromosome positions, does this mean I
> don't need all of them for quality assessment?

We have not routinely used these probes, no.  If an array fails
miserably, then these control probes might be useful for determining
the reason for the failure, though.
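
A small sketch of setting the control features aside, assuming the
per-feature annotation is available as a data frame with a ControlType
column coded 0, -1 and 1 as you describe.  The table below and the names
A_01 to A_03 are made up for illustration.

```r
# Hypothetical per-feature table for one array
probes <- data.frame(
  ProbeName   = c("A_01", "(-)3xSLv1", "A_02", "DarkCorner", "A_03"),
  ControlType = c(0, -1, 0, 1, 0),
  M           = c(0.05, -2.10, -0.02, 3.40, 0.01),
  stringsAsFactors = FALSE
)

# Normal probes go on to ordering and dlrs
features <- probes[probes$ControlType == 0, ]

# Controls are kept aside in case a failed array needs troubleshooting
controls <- probes[probes$ControlType != 0, ]

table(probes$ControlType)  # counts per control type
```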

>   3. Some probes with the name of "chr2_random:xxxxx-yyyyyy" will not get a
> proper mapping on the chromosome, so I should remove these values from the
> input of dlrs. Is it so?

You can either remove them or treat chr2_random as a separate chromosome.
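
Both options can be sketched in base R as below, assuming the probe
locations come as strings of the form "chr:start-end" in a column I have
called SystematicName.  The coordinates and log ratios are made up.

```r
# Hypothetical probe table; coordinates are invented for illustration
probes <- data.frame(
  SystematicName = c("chr2:500-560", "chr1:900-960", "chr2_random:100-160",
                     "chr1:100-160", "chr10:300-360", "chrX:50-110"),
  M = c(0.02, -0.05, 0.40, 0.01, 0.03, -0.01),
  stringsAsFactors = FALSE
)

# Split "chr:start-end" into a chromosome label and a numeric start position
probes$chr   <- sub(":.*$", "", probes$SystematicName)
probes$start <- as.numeric(sub("^[^:]*:([0-9]+)-.*$", "\\1",
                               probes$SystematicName))

# Option 1: remove probes on *_random sequences, then sort
placed <- probes[!grepl("_random$", probes$chr), ]
placed <- placed[order(placed$chr, placed$start), ]

# Option 2: keep them; chr2_random simply sorts as its own chromosome
all_sorted <- probes[order(probes$chr, probes$start), ]

placed$M  # log ratios in chromosome and position order, ready for dlrs
```

The order of the chromosomes relative to each other does not matter much
for dlrs.  What matters is that the probes within each chromosome are
contiguous and sorted by position, so a plain character sort of the
chromosome label is enough here.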

>   4. How could I handle those 1000 probes repeating 3 times?  Each group
> of three will be mapped to the same chromosome position.

You could choose one at random or use a mean or median of them.  My
guess is that they agree very closely with one another so the choice
should not affect the results much.
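
A minimal sketch of the median option, again on a made-up table.  The
column names chr, start and M are assumptions, and P2 plays the role of
a probe printed three times.

```r
# Hypothetical table in which probe P2 appears three times at one position
probes <- data.frame(
  ProbeName = c("P1", "P2", "P2", "P2", "P3"),
  chr       = c("chr1", "chr1", "chr1", "chr1", "chr1"),
  start     = c(100, 200, 200, 200, 300),
  M         = c(0.01, 0.10, 0.12, 0.11, -0.03),
  stringsAsFactors = FALSE
)

# One value per genomic position: the median of the replicate log ratios
collapsed <- aggregate(M ~ chr + start, data = probes, FUN = median)
collapsed <- collapsed[order(collapsed$chr, collapsed$start), ]

# Range of M within each probe, to check how closely the replicates agree
spread <- tapply(probes$M, probes$ProbeName, function(m) diff(range(m)))

collapsed
spread
```

If the replicate spread turns out to be large compared with the dlrs of
the array, that would itself be worth a closer look.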

Sean


