[R] detecting noise in data?

HARROLD, Tim THARR at doh.health.nsw.gov.au
Tue Jan 24 23:55:11 CET 2012


You might want to provide an example? It's a pretty vague problem at the moment.

If the bad data can be easily picked out by eye, think about the criteria you're using to pick out a contaminated result. If you can express them as a rule, so that you don't need to scan each observation (e.g. if a snapper weighs >= 300000kg then somebody entered that data incorrectly), then you can create an indicator variable that flags those records and continue with your analysis.
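
As a rough sketch of that kind of rule-based flag in base R (the data frame `fish`, the column `weight_kg`, the values and the 300000 cut-off are all made up for illustration, not taken from your data):

```r
# Hypothetical data: snapper weights in kg, with one obvious entry error
fish <- data.frame(id = 1:6,
                   weight_kg = c(2.1, 3.4, 1.8, 450000, 2.9, 3.0))

# Indicator variable: TRUE where the rule marks the record as implausible
fish$suspect <- fish$weight_kg >= 300000

# Carry on with the analysis using only the unflagged records
clean <- subset(fish, !suspect)
mean(clean$weight_kg)
```

Keeping the flag as a column, rather than deleting rows, means you can still inspect or report the flagged records later.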

Other than that, some sort of cluster analysis might be able to pick up 2 distinct groups, provided there's a reasonable level of homogeneity within each group. From there, you can do a basic inference test on the group means to check whether the groups differ significantly.
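
One possible way to try this in base R is `kmeans()` followed by `t.test()`. This is only a sketch on simulated data: the group means, the 80/20 split and the choice of k-means are my assumptions, not something known about your data.

```r
set.seed(1)  # simulated data only, so the example is reproducible

# Assumed setup: majority around mean 10, minority "noise" around mean 25
x <- c(rnorm(80, mean = 10, sd = 1), rnorm(20, mean = 25, sd = 1))

# Two-group k-means on the single variable
km <- kmeans(x, centers = 2, nstart = 10)

# Treat the smaller cluster as the suspected noise
minority <- which.min(km$size)
suspect <- km$cluster == minority
table(suspect)

# Welch t-test comparing the means of the two clusters
t.test(x[suspect], x[!suspect])
```

Bear in mind that the clusters were chosen to be as far apart as possible, so a test on their means will tend to look more significant than it really is. Treat it as a rough check rather than proof of two sources.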

Cheers,
Tim



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Michael
Sent: Wednesday, 25 January 2012 9:31 AM
To: r-help
Subject: Re: [R] detecting noise in data?

Hi all,

I just wanted to add that I am looking for a solution that's in R ... to
handle this...

And also, in a given sample, the correct data are the majority and the
noise is the minority.

Thank you!

On Tue, Jan 24, 2012 at 4:09 PM, Michael <comtech.usa at gmail.com> wrote:

> Hi all,
>
> I have data which are unfortunately contaminated by noise.
>
> We know that the noise is at a different level than the correct data, i.e.
> the noise data can be easily picked out by human eyes.
>
> It looks as if there are two people that generated the two very different
> data with different mean levels, and they got mixed together.
>
> i.e. assuming the two sets of data follow an unknown distribution DF,
>
> and the two mean levels are u1 and u2... (unknown)
>
> Then the correct data are generated by DF(u1)
>
> and the noise are generated by DF(u2),
>
> and they got mixed...
>
> Now, how do I flag those suspicious data? At least is there a way I could
> answer the question:
>
> Given a sample of mixed data - are these data generated from the
> above-mentioned two sources, or are they in fact generated from one
> source only?
>
> i.e. are there two substantially distinct species in the given data?
>
> Thanks a lot!
>
>


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________________________________________________________________________________
This email has been scanned for the NSW Ministry of Health by the Websense Hosted Email Security System. 
Emails and attachments are monitored to ensure compliance with the NSW Ministry of Health's Electronic Messaging Policy.
______________________________________________________________________________________________________________________


______________________________________________________________________________________________________________________
Disclaimer: This message is intended for the addressee named and may contain confidential information. 
If you are not the intended recipient, please delete it and notify the sender. 
Views expressed in this message are those of the individual sender, and are not necessarily the views of the NSW Ministry of Health.
______________________________________________________________________________________________________________________


