[R] Non-normal data issues in PhD software engineering experiment

Thu Jul 10 16:01:29 CEST 2008

Hi All,

I hope I am not breaching any terms of this forum with this rather general
post. There are some very R-specific elements in this rather long posting.

I will do my best to clearly explain my experiment, goals and problems here
but please let me know if I have left out any vital information or if there
is any ambiguity that I need to address such that you can help me. 

I have a very limited background in statistics - I have just completed a
postgraduate course in Statistics at TCD Dublin, Ireland.

*** Experimental setup ***
I have conducted a software engineering experiment in which I have taken
measures of quality for a software system built using 2 different design
paradigms (1 and 2) over 10 evolutionary versions of the system (1 - 10). So
for each version I have a pair of systems that are identical in that they do
precisely the same thing and differ only in that they are built using 2
different design paradigms. 

For each version and paradigm type I have collected a data set of measures
called sensitivity measures. So for instance I have 20 different data sets,
10 for the 10 versions of software under design paradigm 1 and 10 for the 10
versions of software under design paradigm 2.  

*** Data ***

My data can be found at - https://www.cs.tcd.ie/~ajackso/data.csv

In this data file there are a number of columns -
"version","paradigm","location","coverage","execution","infection","propogation","sensitivity"

Sensitivity is the main response - please ignore
"coverage","execution","infection","propogation" as these were used to
calculate sensitivity. 

All 20 of my data sets are in this file - the columns version (1 - 10) and
paradigm (1 or 2) differentiate them.
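
A minimal sketch of how this layout could be inspected in R. The data
frame below is simulated (the name `sim`, the cell sizes and the
exponential response are all assumptions, not values from data.csv);
only the column names version, paradigm and sensitivity come from the
file description above.

```r
# Simulated stand-in for data.csv (the real values are NOT used here).
set.seed(1)
cells <- expand.grid(version = 1:10, paradigm = 1:2)   # the 20 data sets
n <- sample(5:15, nrow(cells), replace = TRUE)         # unequal sizes (assumption)
sim <- cells[rep(seq_len(nrow(cells)), times = n), ]
sim$sensitivity <- rexp(nrow(sim), rate = 2)           # skewed response (assumption)
rownames(sim) <- NULL

# With the real file this would be something like:
#   sim <- read.csv("data.csv")   # read.csv expects a header row and commas

# Number of observations in each version x paradigm cell
cell_n <- table(sim$version, sim$paradigm, dnn = c("version", "paradigm"))
print(cell_n)

# Median sensitivity in each cell
cell_med <- aggregate(sensitivity ~ version + paradigm, data = sim, FUN = median)
print(head(cell_med))
```

The table of counts makes the unequal data set sizes visible at a
glance, which matters for any test that expects one value per cell.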

*** Goals ***
With this data collected I now want to do a number of things -

1) I want to look at the analysis of variance to see if there is a
difference in mean for each paradigm over the 10 versions. I want to remove
the version related variance by blocking on version. With this done I should
be able to get a picture of the variance related to paradigm only. My null
hypothesis is that the means of both data sets are the same. I also
want to look at each data set individually to see if there is any
difference between each pair of system designs.

2) I want to create two regression models, one for each paradigm to enable
me to see how the quality of each paradigm is affected over time (versions).
It would also be nice to have both confidence and prediction boundaries. 

3) I want to be able to look at the power of all of this and possibly see
how many times I would need to do this to have concrete evidence that one
paradigm is different/the same/better/worse than the other.

4) I am not 100% sure if it's relevant - but the analysis of divergence
(Something I came across when reading an R book - Introductory Statistics
with R - Peter Dalgaard - Springer - p197) may fit what I am looking for to
assess the difference between the two regression models stated in goal
statement 2. I think that this will assess the degree to which the
regression models diverge over time.
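
To make goals 1, 2 and 4 concrete, here is a rough sketch of their
standard parametric form in R. It runs on simulated data with roughly
normal errors (an assumption that the real data reportedly violates),
so it only illustrates the model structure, not a recommended analysis.
The names `sim`, `fit_block`, `fit_p1` and `fit_both` are made up for
the illustration.

```r
set.seed(2)
cells <- expand.grid(version = 1:10, paradigm = 1:2)
sim <- cells[rep(seq_len(nrow(cells)), each = 8), ]
# Simulated response: linear trend in version plus a paradigm shift (assumption)
sim$sensitivity <- 0.5 + 0.02 * sim$version +
  0.05 * (sim$paradigm == 2) + rnorm(nrow(sim), sd = 0.1)
sim$paradigm_f <- factor(sim$paradigm)
sim$version_f  <- factor(sim$version)

# Goal 1: paradigm effect with version entered first as a blocking factor
fit_block <- aov(sensitivity ~ version_f + paradigm_f, data = sim)
print(summary(fit_block))

# Goal 2: a linear regression for one paradigm, with both kinds of bands
p1 <- subset(sim, paradigm == 1)
fit_p1 <- lm(sensitivity ~ version, data = p1)
new_versions <- data.frame(version = 1:10)
conf_band <- predict(fit_p1, new_versions, interval = "confidence")
pred_band <- predict(fit_p1, new_versions, interval = "prediction")
print(cbind(new_versions, conf_band))

# Goal 4 (one common approach): a single model with an interaction term;
# the version:paradigm_f2 coefficient estimates the difference in slopes
fit_both <- lm(sensitivity ~ version * paradigm_f, data = sim)
print(coef(summary(fit_both)))
```

The prediction band is always wider than the confidence band because it
also accounts for the scatter of individual observations around the
fitted line.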

*** Problems ***

1) The problem I have is that the 20 data sets are of different sizes.
These data sets are also non-normal. I have assessed this using
normality tests (ad.test etc. in R and Minitab). So as far as I understand
it I have two choices - the first is to transform my non-normal data into
normal data. The second is to look at using non-parametric approaches. 

So I tried to use R to conduct a Box-Cox transformation for each of my 20
data sets. I couldn't figure it out past generating an optimal lambda. I
then turned to Minitab and found that I could make transformations there -
the problem however was that there was a subset group option I didn't
understand. I set it at various numbers but always seemed to get the same
result so it didn't seem to affect the outcome much, if at all. The
result of this was non-normal data again. I then turned to the Johnson
transformation and found that it also failed to transform my
non-normal data into normal data. 
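
For the step after the optimal lambda, a minimal sketch of applying the
transformation in R might look like the following. It uses boxcox() from
the MASS package on a simulated, strictly positive sample; the vector
`y` and the helper `box_cox()` are hypothetical, and real sensitivity
values containing zeros or negatives would need a shift first.

```r
library(MASS)  # boxcox() lives in MASS, a recommended package shipped with R

set.seed(3)
# Strictly positive, right-skewed stand-in for one data set (assumption)
y <- rexp(200, rate = 2) + 0.01

# Profile log-likelihood over a grid of lambda values, without plotting.
# y = TRUE and qr = TRUE store the pieces boxcox() needs in the fit.
fit0 <- lm(y ~ 1, y = TRUE, qr = TRUE)
bc <- boxcox(fit0, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the Box-Cox transform with the chosen lambda (log when lambda is 0)
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
y_t <- box_cox(y, lambda)

# Re-check normality on the transformed values
print(lambda)
print(shapiro.test(y_t))
```

A transformed sample can still fail a normality test, which would be
consistent with what is described above.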

3) I have looked at the Friedman test as a means of performing two way
analysis of variance to address my scenario. I have tried to execute it
in R and Minitab but really can't figure out what my arguments should
be. 

Using R: I read my data into a data frame using "read.table(data)". I
then proceed with the following - friedman.test( data$sensitivity ~
data$paradigm | data$version, data, data$version, na.action=na.exclude).
This produces the following error "incorrect specification for 'formula'". I
see that my formula needs to be of length == 3 for this test to be used
(https://svn.r-project.org/R/trunk/src/library/stats/R/friedman.test.R). I
don't think that my formula should even be like this but I wanted to be as
close as possible to the example provided by R.

I then tried to use the kruskal.test as follows -
kruskal.test(data$sensitivity ~ data$sensitivity, data = data,
na.action=na.exclude) - this gave me a result - however there was no account
of the variance between versions.

-- kruskal.test(data$sensitivity ~ data$version + data$paradigm, data =
sensResults, na.action=na.exclude)
--
--	Kruskal-Wallis rank sum test
--
-- data:  data$sensitivity by data$version by data$paradigm 
-- Kruskal-Wallis chi-squared = 12.1449, df = 9, p-value = 0.2053

I have no idea if these tests are the right thing to do here. This test is
advertised as a substitute for one way anova. My instinct tells me that I need
to use the friedman.test - but as you can see I am not having much luck
with it. I have looked at the code in R as you can see from the link above
and can see where it is rejecting my formula - I just don't understand what
I need to do to my model for it to be accepted.
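
One possible way to get a formula that friedman.test accepts is sketched
below, on simulated data. As far as the linked source reads, the check
wants single bare names on each side of the "|", so terms written as
data$paradigm are rejected. The test also expects exactly one value per
paradigm and version combination, so this sketch first collapses each
cell to its median. That collapsing step is an assumption made for
illustration and discards within-cell information; `sim` and `cell_med`
are hypothetical names.

```r
set.seed(4)
cells <- expand.grid(version = 1:10, paradigm = 1:2)
n <- sample(5:15, nrow(cells), replace = TRUE)       # unequal sizes (assumption)
sim <- cells[rep(seq_len(nrow(cells)), times = n), ]
sim$sensitivity <- rexp(nrow(sim), rate = 2)         # skewed response (assumption)

# Collapse each version x paradigm cell to one value (median is a choice)
cell_med <- aggregate(sensitivity ~ version + paradigm, data = sim, FUN = median)

# response ~ treatment | block, using bare column names plus data =
ft <- friedman.test(sensitivity ~ paradigm | version, data = cell_med)
print(ft)
```

With 2 paradigms the test has 1 degree of freedom, and the 10 versions
act as the blocks.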

4) I have looked at the outputs to the kruskal.test and friedman.test and
they differ from the anova table -

By following and executing the R man examples I can see the friedman.test
produces the following output:

-- > friedman.test(x ~ w | t, data = wb)
-- 
-- 	Friedman rank sum test
-- 
-- data:  x and w and t 
-- Friedman chi-squared = 0.3333, df = 1, p-value = 0.5637

You can also see from the above point that the output of the kruskal.test
looks similar enough. This is in big contrast to an anova table. In an anova
table I can see the components of variance and the significance of each F
test. These alternative tests do not seem to provide me with this information.

Using Minitab

I go to stats->nonparametrics->Friedman

This prompts me to provide columns for response, treatment and blocks 

I provide the following

response <- sensitivity
treatment <- paradigm 
blocks <- version 

When I try to execute this I get the following error

Friedman 'sensitivity' 'paradigm' 'version' 'RESI1' 'FITS1'.

* ERROR * Must have one observation per cell.
* ERROR * Completion of computation impossible.

5) I have looked briefly at the non-parametric approaches to regression -
there seem to be many
(http://socserv.mcmaster.ca/jfox/Courses/Oxford-2005/R-nonparametric-regression.html)
paths that can be taken. I need some guidance on which approach I should
follow. What are the tradeoffs? How do I do this?
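
As one example of such a path, a local regression (loess) fit per
paradigm could be sketched as below. The data are simulated, and the
span and degree values are arbitrary assumptions rather than tuned
choices; loess is only one of the options and is not necessarily the
right one here.

```r
set.seed(5)
cells <- expand.grid(version = 1:10, paradigm = 1:2)
sim <- cells[rep(seq_len(nrow(cells)), each = 12), ]
# Skewed response with a mild upward trend in version (assumption)
sim$sensitivity <- rexp(nrow(sim), rate = 2) + 0.03 * sim$version

grid <- data.frame(version = 1:10)

# One smooth curve per paradigm, with pointwise standard errors
fits <- lapply(split(sim, sim$paradigm), function(d) {
  lo <- loess(sensitivity ~ version, data = d, span = 0.9, degree = 1)
  pr <- predict(lo, newdata = grid, se = TRUE)
  data.frame(version = grid$version, fit = pr$fit, se = pr$se.fit)
})
print(fits[["1"]])
print(fits[["2"]])
```

A larger span gives a smoother curve that may hide real changes between
versions, while a smaller span follows the data more closely but is
noisier.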

Thank you and best regards,
Andrew Jackson
