[BioC] Illumina Probe_ID used in the LIMMA package for neqc function

Wed Jul 10 07:01:13 CEST 2013

Hi Wei,

Thanks for this. Can I ask specifically what you mean by 'replicates'? 
Is this ALL your microarrays? Or is it to do with the propexpr function? If I 
filter to keep only those probes that satisfy p<0.05 across ALL samples 
(n=36), I am only left with 11,102 probes.

My understanding is that I should keep those probes that are 
significantly different to background in at least one sample. With a 
detection p-value cutoff of p<0.05, 26,816 probes remain; with p<0.01, 
16,877 remain. Based on this, would you suggest I use p<0.05? This 
keeps approximately half of the original 48,701 probes.
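
To make the two filtering rules above concrete, here is a minimal sketch in plain Python. The probe names and detection p-values are hypothetical toy data, and the function names are illustrative only; this is not limma code, just the "at least one sample" and "all samples" rules written out.

```python
# Hypothetical detection p-values: one entry per probe, one value per sample.
detection = {
    "probe_A": [0.001, 0.20, 0.03],
    "probe_B": [0.40, 0.60, 0.90],
    "probe_C": [0.01, 0.02, 0.04],
}


def keep_if_any(pvals, cutoff=0.05):
    """Keep a probe that is detected (p < cutoff) in at least one sample."""
    return any(p < cutoff for p in pvals)


def keep_if_all(pvals, cutoff=0.05):
    """Keep a probe only if it is detected (p < cutoff) in every sample."""
    return all(p < cutoff for p in pvals)


def filter_probes(table, rule, cutoff=0.05):
    """Return the names of the probes that pass the given rule."""
    return [probe for probe, pvals in table.items() if rule(pvals, cutoff)]


any_05 = filter_probes(detection, keep_if_any, 0.05)
all_05 = filter_probes(detection, keep_if_all, 0.05)
any_01 = filter_probes(detection, keep_if_any, 0.01)

print("at least one sample, p<0.05:", any_05)
print("all samples,         p<0.05:", all_05)
print("at least one sample, p<0.01:", any_01)
```

On this toy table the "all samples" rule and the stricter cutoff both keep fewer probes than "at least one sample" at p<0.05, which is the same pattern as the probe counts above.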

Kind regards,
Wil
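
The quoted reply below suggests checking whether the exported values are detection p-values or detection scores by looking at a few probes. A rough sketch of that check in plain Python follows. The intensities and values are hypothetical, the "brightest probe" heuristic is an assumption made for illustration, and a score cutoff of 1 - p is taken from the p>0.95 advice in the reply.

```python
def looks_like_scores(intensity, value):
    """Heuristic (assumption): the brightest probe should be clearly detected.

    If its detection value is close to 1 rather than close to 0, the column
    probably holds detection scores rather than detection p-values.
    """
    brightest = max(range(len(intensity)), key=lambda i: intensity[i])
    return value[brightest] > 0.5


def is_detected(value, cutoff=0.05, is_score=False):
    """Apply the cutoff in the right direction for p-values or scores."""
    if is_score:
        return value > 1.0 - cutoff  # e.g. score > 0.95
    return value < cutoff            # e.g. p < 0.05


# Hypothetical data for four probes in one sample.
intensity = [12000.0, 95.0, 110.0, 4300.0]
value = [0.999, 0.12, 0.48, 0.97]

scores = looks_like_scores(intensity, value)
detected = [is_detected(v, 0.05, scores) for v in value]

print("values look like detection scores:", scores)
print("detected:", detected)
```

Here the two high-intensity probes have values near 1, so the column is treated as scores and the cutoff direction is flipped.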

On 10/07/2013 9:51 AM, Wei Shi wrote:
> Hi Wil,
>
> You removed about two thirds of your probes, which is pretty high. You could try a cutoff of p<0.05 to see how many are filtered out. Typically, around half of the probes are filtered out in our analyses. We often use a cutoff of p<0.05, but we also require all the replicates to satisfy this criterion.
>
> You should also check if the p values in your data are 'detection scores' or 'detection p-values'. If they are detection scores, then the low p value means low intensity and you should use p>0.95 for the filtering. You can easily check this by just looking at a few probes.
>
> Cheers,
> Wei
>
> On Jul 9, 2013, at 5:21 PM, Wil D'Avigdor wrote:
>
>> Hi Wei,
>>
>> For probe filtering, I have been using a p-value cut-off of p=0.01 with at least one sample passing this threshold across my data set, which reduces the number of probes from 48,701 to 16,877.
>>
>> Could you confirm whether this is a suitable threshold for my analyses?
>>
>> Many thanks in advance,
>> Wil
>>
>> On 04/07/2013, at 6:37 PM, Wei Shi <shi at wehi.EDU.AU> wrote:
>>
>>> Hi William,
>>>
>>> Please keep the posts on the list.
>>>
>>> You should certainly remove from the analysis those probes which are not expressed in any of your samples, i.e. keep only the probes which are expressed in at least one sample. You can do so by applying a detection p-value cutoff (e.g. 0.05 or 0.01), or you may run the propexpr function to estimate the proportion of expressed probes and then use that information to filter out probes. See ?propexpr for more details.
>>>
>>> Best wishes,
>>>
>>> Wei
>>>
>>> On Jul 4, 2013, at 2:55 PM, William D'Avigdor wrote:
>>>
>>>> Hi Wei,
>>>>
>>>> Many thanks for your response.
>>>>
>>>> I would like to ask you another question, specifically about probe filtering.
>>>>
>>>> So far I have performed all my analyses on UNFILTERED Illumina data from Genome Studio. Is it still VALID to use unfiltered Illumina data, as opposed to probes filtered against background signal at a particular detection p-value (e.g. p=0.01, or 0.1 according to your paper 'Illumina WG-6 BeadChip strips should be normalised separately')?
>>>>
>>>> I am assuming that when performing hierarchical clustering on the full data, the genes at background level will not contribute significantly to the clustering. However, I do notice that the clustering distances are narrowed, obviously because the samples appear closer than they otherwise would.
>>>>
>>>> Further, when performing t-tests / LIMMA on the full data, those genes that are close to background level should not contribute to significant differences across groups. Is this correct? And is there anything I am missing out on? Apart from maybe a contribution by FDR.
>>>>
>>>> Many thanks,
>>>> Wil
>>>>
>>>> On 2/07/2013 7:18 PM, Wei Shi wrote:
>>>>> Dear William,
>>>>>
>>>>> What you have done is correct. As you have found, the 'ProbeID' is the same as the Array_Address_ID. The 'ProbeID' column was used in the old versions of Illumina BeadChip arrays, and it was later replaced with 'PROBE_ID' in the newer versions of BeadChips.
>>>>>
>>>>> The neqc() function uses negative control probes to carry out background correction. The 'TargetID' column in the control probe profile file indicates the types of control probes, and the negative control probes have the type 'NEGATIVE'. neqc() also uses all the probes, including regular probes and all types of control probes (negative controls, housekeeping, ...), to perform a quantile between-array normalization.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Wei
>>>>>
>>>>> On Jul 2, 2013, at 3:56 PM, William D'Avigdor wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am doing some Illumina analysis using HumanWG-6_V2 microarrays, and have been using the annotation file: HumanWG-6_V2_0_R4_11223189_A.bgx, and I am normalising using the NEQC function in the LIMMA package.
>>>>>>
>>>>>> I know there are traditionally a number of Illumina identifiers and I am concerned that I may have been using the wrong ones, and I'm not sure whether this has affected the normalisation procedure, or anything at all.
>>>>>>
>>>>>> After summarisation in Genome Studio, when looking at the 'Sample Probe Profile', the main identifiers that come up (and which I have used in LIMMA) are 'PROBE_ID' and 'SYMBOL', the first row being ILMN_1762337 and 7A5 respectively. I also noticed that this PROBE_ID column was the one used in the Illumina example in the LIMMA manual.
>>>>>>
>>>>>> HOWEVER, in Genome Studio, there is also a column called 'ProbeID'. This does not exist in the original annotation file (HumanWG-6_V2_0_R4_11223189_A), but it is identical to the Array_Address_ID (except for the preceding 000s), the latter of which is both in Genome Studio and in the annotation file, and UNIQUE to the version of the microarray.
>>>>>>
>>>>>> IN CONTRAST, in the 'Control Probe Profile' in Genome Studio, there is only the 'TargetID' and the 'ProbeID' available, the latter of which (I believe) is the Array_Address_ID?
>>>>>>
>>>>>> HENCE, for the LIMMA input, I am wondering whether I am correct when I have included the Sample Probe ID text file (which includes PROBE_ID, that is, ILMN_1762337), and the Control Probe ID text file (which includes ProbeID instead, which is most likely the Array Address ID).
>>>>>>
>>>>>> Many thanks in advance,
>>>>>> William d'Avigdor
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> ______________________________________________________________________
>>>>> The information in this email is confidential and intended solely for the addressee.
>>>>> You must not disclose, forward, print or use it without the permission of the sender.
>>>>> ______________________________________________________________________