[R] Confused - better empirical results with error in data

Noah Silverman noah at smartmediacorp.com
Mon Sep 7 22:39:19 CEST 2009


You both make good points.

Ideally, it would be nice to know WHY it works.

Without going into too much detail, the system is designed to 
predict the outcome of certain events.  The "broken" model predicts 
outcomes correctly much more frequently than one with the broken data 
withheld.  So, to answer Mark's question, we say it's "better" because we 
see much better results from the "broken" model when it is applied to 
the real-world data we use for testing.

I have one theory.

The data is listed in our CSV file from newest to oldest.  We are 
supposed to calculate a value that is an "average" of some items.  We 
loop through some queries to our database and increment two variables, 
$total_found and $total_score.  The final value is simply $total_score / 
$total_found.

Our programmer forgot to reset both $total_score and $total_found back 
to zero for each record we process.  So both grow.
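
For anyone following along, here is a minimal sketch (in R, with made-up 
record data standing in for the real database queries) of the kind of loop 
I mean; the accumulation bug is the same:

# Toy stand-in for the per-record database query results (hypothetical data).
records <- list(
  list(scores = c(8, 6)),      # newest record
  list(scores = c(4, 4, 7)),
  list(scores = c(2, 9))       # oldest record
)

total_score <- 0
total_found <- 0
avg_buggy   <- numeric(length(records))
avg_correct <- numeric(length(records))

for (i in seq_along(records)) {
  s <- records[[i]]$scores

  # Buggy version: the running totals are never reset, so each record's
  # "average" actually includes every record processed before it.
  total_score  <- total_score + sum(s)
  total_found  <- total_found + length(s)
  avg_buggy[i] <- total_score / total_found

  # Correct version: the average uses this record's items only.
  avg_correct[i] <- sum(s) / length(s)
}

cbind(avg_buggy, avg_correct)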

I think this may, in a way, be a warped form of recency 
weighting.  The newer records (processed first) have a score that is 
strongly affected by their own "contribution" to the wrongly growing 
totals.  A record closer to the end of the data set starts with HUGE 
values already sitting in $total_score and $total_found, so adding its 
own values has very little effect.
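
To make that concrete, here is a quick illustration (again just a sketch 
with made-up numbers and one item per record): the buggy value for the 
k-th record processed is the cumulative mean of records 1..k, so a 
record's own score only carries weight 1/k, and the newest records 
dominate every later value:

# One score per record, listed newest to oldest as in our CSV (made-up numbers).
x <- c(10, 2, 8, 4, 6)

# Buggy feature: running mean over everything processed so far.
buggy <- cumsum(x) / seq_along(x)

# The k-th record's own score enters its feature with weight 1/k;
# the remaining (k-1)/k of the weight belongs to the newer records.
own_weight <- 1 / seq_along(x)

data.frame(score = x, buggy_feature = buggy, own_weight = own_weight)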

We've done the following so far today.  (Note: scores are relative and 
just indicate performance; higher is better.)
1) Run with "bad" data = 6.9
2) Run with "bad" data missing = 5.5
3) Run with "correct" data = ?? (We're running it now; it will take a few 
hours to compute.)


I might also try to plot the bad data.  It would be interesting to see 
what shape it has...
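
If I do plot it, something like this is probably all it takes (just a 
sketch; the file name and column name here are hypothetical):

# Assume the CSV has been read in, newest record first.
dat <- read.csv("scores.csv")
plot(dat$bad_average, type = "l",
     xlab = "record (newest to oldest)", ylab = "buggy average")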

On 9/7/09 1:05 PM, Mark Knecht wrote:
> On Mon, Sep 7, 2009 at 12:33 PM, Noah Silverman<noah at smartmediacorp.com>  wrote:
> <SNIP
>
>> So, this is really a philosophical question.  Do we:
>>     1) Shrug and say, "who cares", the SVM figured it out and likes that bad
>> data item for some inexplicable reason
>>     2) Tear into the math and try to figure out WHY the SVM is predicting
>> more accurately
>>
>> Any opinions??
>>
>> Thanks!
>>
>>
> Boy, I'd sure think you'd want to know why it worked with the 'wrong'
> calculations. It's not that the math is wrong, really, but rather that
> it wasn't what you thought it was. I cannot see why you wouldn't want
> to know why this mistake helped. Won't future project benefit?
>
> Just my 2 cents,
> Mark
>
