[R] HELP! Excel and R give me totally different regression results using the exact same data

Wed Nov 7 21:43:54 CET 2012

On Nov 7, 2012, at 11:47 AM, frauke wrote:

> Okay. Sorry for being vague in my earlier message. I had missed a few lines
> from your message because they were hiding well in my own email. I am really
> on the learning side with this, so it will take some time. Sorry.
> 
> There seem to be two issues: (1) Me preparing the data incorrectly and (2)
> the data not being fit for regression. Right?

Well, the second point might be more correctly stated as: the data do not meet the conditions for valid inference using linear regression. Since the goals of the exercise have never been stated, it is difficult to say whether other regression methods might be more applicable.
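
One common way to look at those conditions is to plot the residuals against the fitted values. The following is only a sketch on simulated data (the variables x and y and the noise model are assumptions made for illustration, not the poster's data), in which the error variance deliberately grows with the predictor:

```r
set.seed(42)                                  # arbitrary seed, for reproducibility only
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)     # simulated noise whose spread grows with x

fit <- lm(y ~ x)
res <- residuals(fit)
fv  <- fitted(fit)

# Residuals vs fitted values: a funnel shape suggests non-constant variance.
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Crude numeric check: residual spread in the lower and upper half of the fitted values.
upper <- fv > median(fv)
c(lower = sd(res[!upper]), upper = sd(res[upper]))
```

If the two spreads differ markedly, the standard errors and p-values reported by summary() should not be taken at face value, even though the coefficient estimates themselves can still be computed.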

> 
> Ad1. Point about header taken. As to using characters in a matrix, I extract
> the data from data files from the National Weather Service. I extract
> observations together with dates and location names. Each row consists
> of date, location and observations.  I chose to store them in matrices
> because I can combine them to arrays. A matrix can only have one type of
> data, so I chose to leave them all as characters.

That is generally the reason people use data.frames.

> When I proceed to do a
> regression analysis I transform the observations  into numbers using
> as.numeric(). Do you have a different suggestion? Will R give me different
> results if I store characters in a matrix?

It shouldn't, but it seems unnecessarily convoluted and prone to errors.
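
A minimal sketch of the data.frame alternative, using made-up values (the column names date, location, obs1 and obs2 are hypothetical and not taken from the original script). Each column keeps its own type, so no as.numeric() round trip should be needed before calling lm():

```r
# Hypothetical records: one row per date/location; observations stay numeric.
records <- data.frame(
  date     = as.Date(c("2012-11-01", "2012-11-02", "2012-11-03")),
  location = c("A", "B", "A"),
  obs1     = c(1.91, 1.90, 1.93),
  obs2     = c(1.86, 1.90, 1.91),
  stringsAsFactors = FALSE
)

str(records)                 # date is Date, location is character, obs* are numeric

# Forcing the same values into a matrix coerces every cell to character.
as_matrix <- as.matrix(records)
typeof(as_matrix)
```

Rows extracted from several files can then be combined with rbind() rather than stacked into a character array.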

> Even though such excerpts from a long script aren't very informative, to be
> complete:
> collection <- matrix(rep(NA,25),ncol=25)  # collection will be a row of the output matrix later on
> #extract dates
> 
> collection[1] <- paste(year, "/", substring(.file, 125, 126), "/", substring(.file, 127, 128), sep="")

That is only going to change the first element of 'collection'. You should study the help page for "[". If you were changing the first column it would need to be a different call on the LHS.
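
A small sketch of the difference, using a toy matrix (the name m and its contents are made up for illustration; the quoted script uses a 1 x 25 matrix, but a 3 x 4 one makes the behaviour visible):

```r
m <- matrix(NA_character_, nrow = 3, ncol = 4)

m[1]   <- "first element"    # single index: only element [1, 1] changes
m[, 2] <- "second column"    # empty row index: every row of column 2 changes

m
sum(!is.na(m))               # counts the cells that were filled: one from m[1], three from m[, 2]
```

With a single-row matrix the two forms happen to touch the same cell, which may be why the issue did not show up in the script.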

> #extract observations
> collection[start.write+i] <- substring(input, fields[[i]][1], fields[[i]][2])

Again, possibly not what you thought you were doing. Lack of context prevents further analysis.

> 
> Ad2.  You mention heteroscedasticity and non-normality of residuals. To keep
> it short I had provided just a subset of the data I have (100 of 4000 matrix
> rows). But the same is true for the whole dataset. I attached the whole
> thing this time.  test_complete.txt
> <http://r.789695.n4.nabble.com/file/n4648759/test_complete.txt>  How do I
> deal with this?

> str(dat)
'data.frame':	3548 obs. of  5 variables:
 $ V1: num  1.91 1.9 1.93 2.16 1.9 1.87 1.87 2.01 2.8 2.11 ...
 $ V2: num  1.86 1.9 1.91 1.88 1.87 1.88 6.94 2.01 2.03 2.09 ...
 $ V3: num  1.89 1.94 1.9 1.85 1.86 1.88 2.01 2 2.03 2.06 ...
 $ V4: num  1.92 1.96 1.91 1.83 1.85 1.87 2.01 2.03 2.04 2.03 ...
 $ V5: num  2.1 2 1.93 1.92 1.85 1.86 2.02 2.15 2.08 2.03 ...
> lm(V1 ~ ., data=dat)

Call:
lm(formula = V1 ~ ., data = dat)

Coefficients:
(Intercept)           V2           V3           V4           V5  
     0.1291       0.3378       0.2079       0.2635       0.1460  

> summary( lm(V1 ~ ., data=dat))

Call:
lm(formula = V1 ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.3116  -0.1825  -0.0304   0.0959  27.0989 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.12906    0.03840   3.361 0.000784 ***
V2           0.33783    0.01768  19.111  < 2e-16 ***
V3           0.20789    0.01686  12.329  < 2e-16 ***
V4           0.26346    0.01784  14.768  < 2e-16 ***
V5           0.14596    0.01672   8.728  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.781 on 3543 degrees of freedom
Multiple R-squared: 0.7693,	Adjusted R-squared: 0.7691 
F-statistic:  2954 on 4 and 3543 DF,  p-value: < 2.2e-16 

> with(dat, plot(V2, V1) )
Hit <Return> to see next plot: 

Attachment (scrubbed from the archive): Rplot.png, image/png, 139409 bytes
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20121107/ecd2057a/attachment-0002.png>

There appears to be quite a bit of "structure" in that plot. And a rather similar structure in

with(dat, plot(V3, V1) )

> I admit I am pretty clueless in this case. Can I do
> meaningful regression at all? (I didn't expect test[,3] to be a good predictor
> but had hopes for test[,2].)

What are these data and what are the scientific questions? You appear to think a) I can look over your shoulder and see your display and b) deduce your goals from extremely fragmentary evidence. I have a lower opinion of my ability to accomplish those tasks.

> 
> The residuals are definitely not normally distributed.

Not generally the biggest concern. But again you provide no code. Nabble users are unfortunately notorious on rhelp for not reading the Posting Guide, and some do not even seem to understand that rhelp is not Nabble.

> They do not seem to be related to either of the two predictors.

Well, that second outcome would be the expected (even the desired) outcome of a regression, wouldn't it? You would want the relationships to be in the prediction and the residuals to have zero correlations with the predictors.
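
To illustrate the point about residuals: in an ordinary least-squares fit that includes an intercept, the residuals are uncorrelated with each predictor by construction. A minimal sketch on simulated data (x1, x2, y and the coefficients are assumptions made for illustration only):

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Both correlations are zero up to floating-point error.
cor(res, x1)
cor(res, x2)
```

A zero linear correlation does not rule out curvature or changing spread, so a plot of the residuals against each predictor is still worth a look.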

> What is the conclusion from that? 
> 
> Thanks for your patience!

I'm rapidly running out of patience, however. Please read the Posting Guide more thoroughly than you appear to have done so far.

> --
> View this message in context: http://r.789695.n4.nabble.com/HELP-Excel-and-R-give-me-totally-different-regression-results-using-the-exact-same-data-tp4648648p4648759.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Alameda, CA, USA