[R] Request for functions to calculate correlated factors influencing an outcome.

Mon May 4 15:55:16 CEST 2015

This would be better posted on a statistical list like
stats.stackexchange.com, as it is largely about statistical
methodology, not R code. Once you have determined what kinds of
methods you want, you might then post back here -- or better yet, just
search! -- for packages that implement those methods in R.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll

On Mon, May 4, 2015 at 1:40 AM, Lalitha Viswanathan
<lalitha.viswanathan79 at gmail.com> wrote:
> Hi
> I used the MASS library
> library(MASS)  (by reading about examples at
> http://www.statmethods.net/stats/regression.html
> )
> fit <- lm(Mileage~Disp+HP+Weight+Reliability,data=newx)
> step <- stepAIC(fit, direction="both")
> step$anova # display results
>
> It showed the most relevant variables affecting Mileage.
> While that is a start, I am looking for a model that fits the entire data
> (including Mileage), not factors that influence Mileage.
>
> Multi model inference / selection.
>
> I was reading about glmulti.
> Are there any other packages I could look at for inferring models that best
> fit the data?
>
> To use nlm / nls, I need a formula as one of the parameters, and I am
> looking for functions that will help infer that formula from the data.
>
> Thanks
> lalitha
>
> On Sun, May 3, 2015 at 11:33 PM, Prashant Sethi <theseth.prashant at gmail.com>
> wrote:
>
>> Hi,
>>
>> I'm not an expert in data analysis (still a beginner learning the tricks of
>> the trade), but I believe that since you're trying to determine how a
>> dependent variable relates to a number of factors, you should try a
>> regression analysis. The function to use for that is lm(). You can use
>> forward selection or backward elimination to build a model that includes
>> only the most critical factors.
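
A rough sketch of the lm() plus stepwise approach described above. It uses R's built-in mtcars data as a stand-in, so the column names (mpg, disp, hp, wt) are assumptions and not the columns of the poster's file. step() from the stats package selects terms by AIC, which is only one of several possible criteria.

```r
# Stand-in data: built-in mtcars, NOT the poster's fuelEfficiency.txt
full <- lm(mpg ~ disp + hp + wt, data = mtcars)  # all candidate predictors
null <- lm(mpg ~ 1, data = mtcars)               # intercept-only model

# Backward elimination: start from the full model, drop terms by AIC
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the null model, add terms by AIC;
# a single formula given as `scope` is used as the upper limit
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

print(formula(backward))
print(formula(forward))
```

The two directions need not end at the same model, so it may be worth comparing both with summary().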
>>
>> Maybe you can give it a try.
>>
>> Thanks and regards,
>> Prashant Sethi
>> On 3 May 2015 23:18, "Lalitha Viswanathan" <
>> lalitha.viswanathan79 at gmail.com> wrote:
>>
>>> Hi
>>> I am sorry, I saved the file after removing the dot after Disp (I was
>>> going wrong on a read.delim call, which threw an error about !header,
>>> etc.). The dot was not the culprit, but I continued to leave it out.
>>> Let me paste the full code here.
>>> x <- read.table("/Users/Documents/StatsTest/fuelEfficiency.txt",
>>>                 header = TRUE, sep = "\t")
>>> # read.table() already returns a data frame, so this line is a no-op
>>> x <- data.frame(x)
>>> # Print the rows for each country in turn
>>> for (i in unique(x$Country)) {
>>>   print(i)
>>>   y <- subset(x, x$Country == i)
>>>   print(y)
>>> }
>>> newx <- subset(x, select = c(Price, Reliability, Mileage, Weight, Disp, HP))
>>> cor(newx, method="pearson")
>>> my.cor <-cor.test(newx$Weight, newx$Price, method="spearman")
>>> my.cor <-cor.test(newx$Weight, newx$HP, method="spearman")
>>> my.cor <-cor.test(newx$Disp, newx$HP, method="spearman")
>>> Putting exact=NULL still doesn't remove the warning
>>> my.cor <-cor.test(newx$Disp, newx$HP, method="kendall", exact=NULL)
>>> I tried to find the correlation coefficient for various combinations of
>>> variables, but am unable to interpret the results. (Results pasted below in
>>> an earlier post.)
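
One possible way to avoid the ties warning, sketched on made-up vectors (x and y below are hypothetical, not the poster's data): pass exact = FALSE so that cor.test() uses the asymptotic p-value instead of attempting the exact one. exact = NULL only lets R decide, and for smaller samples it still attempts the exact test.

```r
# Hypothetical vectors containing tied values
x <- c(1, 2, 2, 3, 4, 4, 5, 6)
y <- c(2, 1, 3, 3, 5, 4, 6, 6)

# exact = FALSE requests the asymptotic approximation, which tolerates ties
rho <- cor.test(x, y, method = "spearman", exact = FALSE)
tau <- cor.test(x, y, method = "kendall",  exact = FALSE)

print(rho$estimate)  # Spearman's rho
print(rho$p.value)
print(tau$estimate)  # Kendall's tau
print(tau$p.value)
```

The estimate is a rank correlation between -1 and 1, and a small p-value is evidence against the null hypothesis of zero rank correlation.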
>>>
>>> Followed it up with a normality test
>>> shapiro.test(newx$Disp)
>>> shapiro.test(newx$HP)
>>>
>>> Then decided to do a kruskal.test(newx)
>>> with the result
>>> Kruskal-Wallis chi-squared = 328.94, df = 5, p-value < 2.2e-16
>>>
>>> Question is: I am trying to find factors influencing efficiency (in this
>>> case mileage).
>>>
>>> What is the range of functions / examples I should be looking at to find
>>> a factor or combination of factors influencing efficiency?
>>>
>>> Any pointers will be helpful
>>>
>>> Thanks
>>> Lalitha
>>>
>>> On Sun, May 3, 2015 at 2:49 PM, Lalitha Viswanathan <
>>> lalitha.viswanathan79 at gmail.com> wrote:
>>>
>>> > Hi
>>> > I have a dataset of the type attached.
>>> > Here's my code thus far.
>>> > dataset <-data.frame(read.delim("data", sep="\t", header=TRUE));
>>> > newData<-subset(dataset, select = c(Price, Reliability, Mileage, Weight,
>>> > Disp, HP));
>>> > cor(newData, method="pearson");
>>> > Results are
>>> >                  Price Reliability    Mileage     Weight       Disp         HP
>>> > Price        1.0000000          NA -0.6537541  0.7017999  0.4856769  0.6536433
>>> > Reliability         NA           1         NA         NA         NA         NA
>>> > Mileage     -0.6537541          NA  1.0000000 -0.8478541 -0.6931928 -0.6667146
>>> > Weight       0.7017999          NA -0.8478541  1.0000000  0.8032804  0.7629322
>>> > Disp         0.4856769          NA -0.6931928  0.8032804  1.0000000  0.8181881
>>> > HP           0.6536433          NA -0.6667146  0.7629322  0.8181881  1.0000000
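
The NA entries for Reliability are what cor() returns by default when a column contains missing values. A minimal sketch, on hypothetical toy data (the data frame d below is invented for illustration), of how the use argument of cor() changes that:

```r
# Hypothetical toy data; the NA mimics a missing Reliability rating
d <- data.frame(
  Mileage     = c(33, 30, 27, 24, 21, 19),
  Weight      = c(2200, 2400, 2700, 3000, 3300, 3600),
  Reliability = c(4, NA, 3, 5, 2, 1)
)

# Default (use = "everything"): pairs involving a column with an NA come out NA
print(cor(d))

# Drop every row that has any NA before computing the correlations
print(cor(d, use = "complete.obs"))

# Drop NAs separately for each pair of columns
print(cor(d, use = "pairwise.complete.obs"))
```

Whether dropping incomplete rows is appropriate depends on why the values are missing, which is a statistical question rather than an R one.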
>>> >
>>> > It appears that Wt and Price, Wt and Disp, Wt and HP, Disp and HP, HP and
>>> > Price are strongly correlated.
>>> > To find the statistical significance,
>>> > I am trying  sample.correln<-cor.test(newData$Disp, newData$HP,
>>> > method="kendall", exact=NULL)
>>> > Kendall's rank correlation tau
>>> >
>>> > data:  newx$Disp and newx$HP
>>> > z = 7.2192, p-value = 5.229e-13
>>> > alternative hypothesis: true tau is not equal to 0
>>> > sample estimates:
>>> >       tau
>>> > 0.6563871
>>> >
>>> > If I try the same with
>>> > sample.correln<-cor.test(newData$Disp, newData$HP, method="spearman",
>>> > exact=NULL)
>>> > I get Warning message:
>>> > In cor.test.default(newx$Disp, newx$HP, method = "spearman", exact = NULL) :
>>> >   Cannot compute exact p-value with ties
>>> > > sample.correln
>>> >
>>> > Spearman's rank correlation rho
>>> >
>>> > data:  newx$Disp and newx$HP
>>> > S = 5716.8, p-value < 2.2e-16
>>> > alternative hypothesis: true rho is not equal to 0
>>> > sample estimates:
>>> >       rho
>>> > 0.8411566
>>> >
>>> > I am not sure how to interpret these values.
>>> > Basically, I am trying to figure out which combination of factors
>>> > influences efficiency.
>>> >
>>> > Thanks
>>> > Lalitha
>>> >
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>