[BioC] Filtering out duplicate probes in Affy data

Himanshu Sharma hsharm03 at students.poly.edu
Sat Jan 12 16:27:42 CET 2013


Thanks a lot James. I really appreciate your help. Also, when I annotate the ids, shouldn't there be the same number of probes as after filtering? How do I lose more when I annotate them?
Thanks,
Himanshu
From: James W. MacDonald [jmacdon at uw.edu]
Sent: Saturday, January 12, 2013 9:48 AM
To: Himanshu Sharma
Cc: bioconductor at r-project.org mailman
Subject: Re: [BioC] Filtering out duplicate probes in Affy data

Hi Himanshu,

On 1/11/2013 4:57 PM, Himanshu Sharma wrote:
> Dear List,
> I have a set of mouse affy data. The platform is the Affy mouse 430a2 chip.
> There are 8 samples, 4 for each condition.
> I normalized the data using rma. The array has 22090 probes originally.
> Then, in order to filter out the genes which have no Entrez ID or are duplicates for the same gene, I used the following command.
>
> filter <- nsFilter(eset1, require.entrez=TRUE, remove.dupEntrez=TRUE, var.func=IQR, var.filter=TRUE)

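You are filtering on three things here. First you require that all
probesets have an Entrez Gene ID, then you remove any duplicates, then
you filter on the inter quartile range of the remaining data. Since you
did not set var.cutoff, the default of 0.5 applies, and with the default
filterByQuantile=TRUE that is a quantile rather than an absolute value,
so the half of the probesets with the lowest IQR is removed.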
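
A minimal sketch in base R of what an IQR cutoff of 0.5 can mean, on
simulated data. Whether nsFilter reads the value as an absolute
threshold or as a quantile depends on its filterByQuantile argument
(TRUE by default, as far as I know), so both readings are shown. The
matrix and the probe names are made up for illustration.

```r
# Simulated log2 expression matrix: 1000 hypothetical probesets x 8 samples.
set.seed(1)
expr <- matrix(rnorm(1000 * 8, mean = 8, sd = 1), nrow = 1000,
               dimnames = list(paste0("probe", 1:1000), paste0("s", 1:8)))

# Per-probeset inter quartile range across the samples.
iqr <- apply(expr, 1, IQR)

# Reading 1: 0.5 as an absolute threshold on the IQR itself.
keep_absolute <- iqr > 0.5

# Reading 2: 0.5 as a quantile, i.e. the cutoff is the median IQR.
cutoff <- quantile(iqr, probs = 0.5)
keep_quantile <- iqr > cutoff

# Number of probesets kept under each reading.
c(absolute = sum(keep_absolute), quantile = sum(keep_quantile))
```

Under the quantile reading, half of the probesets are dropped no matter
how variable the data are, which is worth keeping in mind when the
filtered list looks short.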

This is one way of doing things. Depending on your goals, there may be
better or worse approaches. If for instance you don't want to lose DCK,
regardless of possible low variation, you could skip the filter on
variation.

But 'better' is a subjective term, and you are the only one who can
decide what is better or worse in your particular situation.

>
> This leaves me with 6579 genes after filtering. I think I lose many of the genes here. Is there a better way to do the same?
>
> Also, the other problem that I am facing is that after this step, I create an expression matrix of these remaining 6579 probes.
>
> Now, in order to annotate them, I use the library mouse4302.db.
> I select the ids from my list and then use the following command:
> Symbol <- mouse4302SYMBOL[ids]
>
> This gives me a smaller number of probes and genes. I lose more data here.
> For example, I am interested in the gene DCK. I checked the original Affymetrix annotation file and there are 3 probes present for this gene, which means it should have an annotation. But I do not find it in the final dataset.
>
> Can anyone suggest a better method or any corrections to the approach that I am using? I eventually need to merge this data with other Affy data and check the expression values, but I figured out that I am not getting the right amount of genes.
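
One way to see where probes go missing at the annotation step is to
compare the filtered ids against the lookup directly. The sketch below
uses a made-up named vector in place of an annotation package, so the
ids and the lookup contents are hypothetical. With real data the lookup
would come from the annotation package for the chip; note that the post
names the 430a2 chip but loads mouse4302.db, and the thread does not
settle whether that matters here.

```r
# Hypothetical probeset IDs left after filtering.
ids <- c("p1", "p2", "p3", "p4", "p5")

# Hypothetical probeset -> symbol lookup standing in for an annotation
# package. "p4" has no symbol (NA) and "p5" is not in the lookup at all.
symbol_lookup <- c(p1 = "Dck", p2 = "Actb", p3 = "Gapdh", p4 = NA)

# Look up by name; IDs missing from the lookup come back as NA.
symbols <- unname(symbol_lookup[ids])
annotated <- data.frame(probe = ids, symbol = symbols,
                        stringsAsFactors = FALSE)

# IDs that the lookup does not know about at all.
not_in_lookup <- setdiff(ids, names(symbol_lookup))

# IDs that are known to the lookup but have no symbol assigned.
no_symbol <- ids[ids %in% names(symbol_lookup) & is.na(symbols)]
```

Keeping one row per input id, with NA where nothing maps, makes it
easy to confirm that the row count still matches the filtered set and
to list exactly which ids failed to annotate.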

There is no such thing as 'the right amount of genes'. There are only
assumptions and tradeoffs. You can make the assumption that genes with
an IQR below the cutoff are not really changing enough to consider, and
then filter them out. Or you can assume that smaller variation is still
biologically meaningful, and reduce the IQR cutoff, or eliminate it
entirely. Or you can assume that duplicated genes on the Affy Mouse 430
chip are really measuring different splice variants or some such, and
you want to keep them all in the data set.
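
The duplicate question can be sketched in base R as a choice between
two options. As far as I know, remove.dupEntrez=TRUE keeps, for each
Entrez Gene ID, the probeset with the largest value of var.func; the
code below only mimics that idea on made-up probe and gene names and
does not call nsFilter itself.

```r
# Hypothetical toy data: six probesets mapping to three made-up gene IDs.
iqr  <- c(p1 = 0.20, p2 = 0.90, p3 = 0.40, p4 = 0.10, p5 = 0.70, p6 = 0.30)
gene <- c(p1 = "g1", p2 = "g1", p3 = "g2", p4 = "g3", p5 = "g3", p6 = "g3")

# Option A: keep every probeset, on the assumption that duplicates
# may be measuring different transcripts of the same gene.
keep_all <- names(iqr)

# Option B: keep one probeset per gene, the one with the largest IQR.
keep_one <- unname(vapply(split(iqr, gene),
                          function(x) names(x)[which.max(x)],
                          character(1)))
```

Option B shrinks the multiplicity of later tests, while option A
preserves probesets that might carry distinct signal; which one is
appropriate is exactly the kind of assumption discussed above.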

All these assumptions have tradeoffs, including the possibility that you
are wrong and you are polluting your dataset with noise, or
unnecessarily increasing the multiplicity of your comparisons. But in
the end it is up to the analyst to decide what assumptions are to be
made, and to be prepared to defend those assumptions to those higher up
(your PI, your funding source, journal reviewers, whomever).

Best,

Jim


>
> Any help is much appreciated. I am a newbie to R and Bioconductor, so I am sorry if it is a basic question.
> Thank you all in advance for your help.
> Thanks,
> Himanshu

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


