[BioC] outlier removal from gene chip

Wed Sep 20 02:38:12 CEST 2006

You should really check the original data, not just the ratios, and
then decide, rather than blindly choosing to keep or remove those
extreme values. As Kasper said, some could well represent genes that
show strong expression in one condition only, either because they
become silenced or activated, and these are potentially very interesting.
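
For what it is worth, here is a rough sketch of the 1.5 IQR rule that
Fangxin mentions below, used only to flag values for a closer look and
not to remove anything. It is written in Python purely for illustration
(in R, quantile() and IQR() provide the same ingredients). The function
names, the multiplier k = 1.5 and the "inclusive" quartile method are my
own assumptions, and quartile conventions differ between tools, so the
fences can shift slightly. The input is the ten V57 values from the
quoted table plus the -14086.750 minimum from the quoted summary.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_extremes(values, k=1.5):
    """Return the indices of values outside the fences (nothing is removed)."""
    lower, upper = iqr_fences(values, k)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# The ten V57 fold changes shown in the quoted table.
v57 = [1.935484, -1.831978, 3.921053, -6.807692, -1.033708,
       -9.382979, 4.878788, -6.020833, -1.178947, -1.172058]

# Illustration only: append the column minimum reported in the summary.
values = v57 + [-14086.750]

for i in flag_extremes(values):
    print("inspect the original intensities for row", i, "value", values[i])
```

The returned indices are meant as a list of probes whose original
intensities should be looked up, not as a removal list.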

Jose

Quoting Weiwei Shi <helprhelp at gmail.com>:

> Thanks for all the suggestions here.
>
> I will go ahead without removing those "outliers" first and post
> updated results if necessary.
>
> On 9/19/06, Kasper Daniel Hansen <khansen at stat.berkeley.edu> wrote:
>>
>> On Sep 19, 2006, at 12:18 PM, Weiwei Shi wrote:
>>
>> > My current approach uses the mahalanobis() distance.
>> >
>> > To Sean:
>> > Do you think that example (-14k) is OK?
>>
>> That example could be a case of the gene being expressed in one
>> condition and not being expressed in another. I do not remember where
>> the data are from (or if you have even described that) or platform
>> or ..., but I would agree with Sean and say that you do not want to
>> blindly remove the genes. Note that we are not advising that you
>> shouldn't remove the gene, just that you should take a careful look
>> at the data and try to decide what to do.
>>
>> As Fangxin clearly writes, it is hard to really know what is an outlier.
>>
>> Kasper
>>
>>
>> >
>> > On 9/19/06, fhong at salk.edu <fhong at salk.edu> wrote:
>> >> Dear Weiwei,
>> >> The definition of an outlier is not clear, and no data point should
>> >> be treated as an outlier unless there is reason to believe it is one.
>> >> A simple way to detect candidates is the 1.5 IQR criterion, which
>> >> you can code yourself in one or two lines. Let me know if there are
>> >> other methods for detecting outliers.
>> >>
>> >> Fangxin
>> >>
>> >>
>> >>> dear listers:
>> >>>
>> >>> I have a question on whether bioconductor has some tool-kit to
>> >>> detect
>> >>> outliers and remove them.
>> >>>
>> >>> my original dataset looks like this:
>> >>>             V1       V51       V53        V55       V57
>> >>> 1   -493249600  1.459459 -3.069444  -1.300000  1.935484
>> >>> 2  -1613096495 -1.139269 -5.525281 -16.592593 -1.831978
>> >>> 3   1626196571 -3.500000 -1.011662   2.223881  3.921053
>> >>> 4  -1397009217 -3.571429  1.685714  -1.180297 -6.807692
>> >>> 5   1428659728 -1.405405 -1.469004  -4.779754 -1.033708
>> >>> 6    459853658 -2.158879 -7.510823  -1.085581 -9.382979
>> >>> 7    530182506 -1.431677 -1.336343  -3.126437  4.878788
>> >>> 8   1173842263  1.215385  1.856410  -2.059794 -6.020833
>> >>> 9        28847  2.407895 -2.048889  -1.730337 -1.178947
>> >>> 10 -1961875610  2.864159 -2.301234  -4.733264 -1.172058
>> >>>
>> >>> V1: internal probe id
>> >>> The rest are different samples. The cells are fold changes of
>> >>> disease/normal.
>> >>>
>> >>> A summary of the sample columns (V51, ... V57) gives the following:
>> >>>       V51                V53                 V55                V57
>> >>>  Min.   :-482.000   Min.   : -55.7342   Min.   :-122.074   Min.   :-14086.750
>> >>>  1st Qu.:  -2.159   1st Qu.:  -1.7312   1st Qu.:  -2.125   1st Qu.:    -1.831
>> >>>  Median :  -1.199   Median :  -1.0416   Median :  -1.200   Median :    -1.080
>> >>>  Mean   :  -0.918   Mean   :   0.1662   Mean   :  -1.027   Mean   :    -1.874
>> >>>  3rd Qu.:   1.441   3rd Qu.:   1.5721   3rd Qu.:   1.419   3rd Qu.:     1.521
>> >>>  Max.   : 198.434   Max.   :1478.1639   Max.   :  95.768   Max.   :   683.519
>> >>>
>> >>>
>> >>> My question is: is there any package that can detect those outliers
>> >>> (like -14086.750), remove them, and compute an "average" for each
>> >>> gene (instead of each probe)?
>> >>>
>> >>> Thank you.
>> >>>
>> >>> Weiwei
>> >>>
>> >>> --
>> >>> Weiwei Shi, Ph.D
>> >>> Research Scientist
>> >>> GeneGO, Inc.
>> >>>
>> >>> "Did you always know?"
>> >>> "No, I did not. But I believed..."
>> >>> ---Matrix III
>> >>>
>> >>> _______________________________________________
>> >>> Bioconductor mailing list
>> >>> Bioconductor at stat.math.ethz.ch
>> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >>> Search the archives:
>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>>
>> >>>
>> >>
>> >>
>> >> --------------------
>> >> Fangxin Hong  Ph.D.
>> >> Plant Biology Laboratory
>> >> The Salk Institute
>> >> 10010 N. Torrey Pines Rd.
>> >> La Jolla, CA 92037
>> >> E-mail: fhong at salk.edu
>> >> (Phone): 858-453-4100 ext 1105
>>
>>

-- 
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK