[R] Regression using age and Duration of disease as continuous factors

Marc Schwartz marc_schwartz at me.com
Tue Jul 21 21:31:43 CEST 2009


On Jul 21, 2009, at 11:29 AM, 1Rnwb wrote:
> Thanks Steve, Thanks for the explanation,  I agree the question is too vague,
> I do not what a regression is I have switched to R a couple of months ago,
> after working in Excel for a long time.  I also know the lm, glm functions
> in R. but due to my data I am completely lost.  it looks like the experts
> individuals just come to poke fun at our expense who has no background of
> statistics.
>
> I have a 8 proteins and I have two groups with 840 samples in control and
> 1140 samples in diseases further stratified by sex, draw age, duration of
> disease. all these groups and sub groups is making the thing very confusing
> as how to do the regression in R. the purpose is to show the changes in the
> levels of these proteins as the disease progress or changes in their levels
> with respect to progression in age, effect of gender, SNPs for these
> proteins, it is a pretty big dataset.
>
> The suggestion that consult the statistician is kind of funny as the
> statistician in my center is my co-mentor and from past 5 years he is
> sitting on the data without any output.
>
> I am not here to ask someone to do my data analysis, but to get an
> understanding of the process as well as a proper direction to look for the
> analysis.  after all I do have to explain all these things to my boss as
> well.
>
> Thanks

<snip>

First, welcome to R.

Notwithstanding other replies, a key issue here is that the specific
data and analytic domains you are asking about are not ones that can
really be learned remotely. These are not "simple" regression models,
and this is certainly not an area that the point-and-click approach of
Excel would even begin to address, quite apart from the plethora of
other criticisms of Excel's use for statistical analysis.

To the question that you pose in the final paragraph above, the proper  
direction for you at this point is to seek out a professional  
statistician with expertise in this particular domain. I would think  
that after 5 years, even your boss would be more comfortable in  
knowing that this was done with the requisite expertise applied.

It sounds like you are a clinical researcher/physician. If your  
current statistician is not in a position to offer assistance after 5  
years, for whatever reason, then as I note above, you need to seek  
another with experience in this domain who can work with you in close  
collaboration on this project. Neither statistician nor clinician
should work in isolation here. The value lies in collaboration, where
each brings their own expertise to the table; that is what leads to a
reasonable result.

The purpose of R's e-mail lists is not to provide general statistical  
consultancy, but to address specific issues as they pertain to R. Your  
initial queries fall into the former. In other words, your questions
so far focus more on learning what are, in fact, quite complex
statistical methods and insights. That being said, there will be some
interactions on the lists pertaining to general statistical issues  
when presented with *focused* questions, even though they may not be R  
specific.

The nature of your data suggests that you might benefit from the use of
tools that have been made available via the Bioconductor project:

   http://www.bioconductor.org

which is built upon R and intended for this domain. There are entire  
books written on this subject in particular and on regression in  
general, some of which have been referenced by others in this thread.  
Bioconductor exists because it addresses specific needs for analytic
tools within a statistical subspecialty that R in general may not.

Just as there are specialties within medicine, they exist within  
statistics. You would not have an orthopaedic surgeon perform a mitral  
valve replacement any more than you would have a cardiac surgeon do a  
hip replacement, even though they are both surgeons, went to medical  
school and share general surgery training. They both went on to  
additional years of study within their specialties, diverging in their  
skills and knowledge base at that point.

The same holds in this domain.

There are fundamental questions that you will need to address  
regarding the means by which your data have been collected which can  
and will impact how you go about analyzing it. It sounds like this  
dataset may be the result of a retrospective collection process or  
'data of opportunity' rather than a prospective study design.

Do you have serial protein measurements from the same subjects over  
time, or will your time based hypotheses be inferred based upon single  
protein samples from each subject where the subjects happened to be  
available at differing ages and with differing disease duration/ 
progression at the time of data collection?
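
To make that distinction concrete, here is a minimal sketch in R using
simulated data. The column names (subject, sex, age, duration, protein)
and the effect sizes are assumptions for illustration only, not your
actual data. Counting rows per subject shows which situation applies:
one row each points toward a cross-sectional fit with lm(), while
repeated rows would call for something like a random intercept per
subject, for example via lme() from the nlme package.

```r
# Simulated, hypothetical data: one protein value per disease subject.
set.seed(42)
n <- 200
dat <- data.frame(
  subject  = seq_len(n),
  sex      = factor(sample(c("F", "M"), n, replace = TRUE)),
  age      = round(runif(n, 20, 70)),
  duration = round(runif(n, 0, 15), 1)
)
dat$protein <- 10 + 0.05 * dat$age + 0.2 * dat$duration + rnorm(n)

# Number of measurements available for each subject.
n_per_subject <- table(dat$subject)
is_longitudinal <- any(n_per_subject > 1)

if (is_longitudinal) {
  # Repeated measures: allow each subject their own baseline level.
  fit <- nlme::lme(protein ~ sex + age + duration,
                   random = ~ 1 | subject, data = dat)
} else {
  # One sample per subject: age and duration enter as continuous covariates.
  fit <- lm(protein ~ sex + age + duration, data = dat)
}
summary(fit)
```

With one sample per subject, any "change over time" is inferred from
differences between subjects rather than observed within them, which is
a much weaker basis for statements about progression.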

Why are there not equal sample sizes in your two groups? Does this
imply sample selection bias that will have to be taken into account?
What other sources of bias may be present? What other differences in  
the two groups will you have to adjust for? Do key variables other  
than the protein measurements change over time that you will have to  
consider that in turn may influence the protein measures?
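
A first look at how the groups differ can be sketched as below. The data
are simulated; only the group sizes (840 and 1140) come from your post,
and the column names and age distributions are assumptions made up for
the example.

```r
# Simulated, hypothetical data with the group sizes mentioned in the post.
set.seed(3)
dat <- data.frame(
  group = factor(rep(c("control", "disease"), times = c(840, 1140))),
  sex   = factor(sample(c("F", "M"), 1980, replace = TRUE)),
  age   = c(rnorm(840, mean = 40, sd = 12), rnorm(1140, mean = 48, sd = 12))
)

# Group sizes, sex distribution within each group, and age summaries.
group_sizes  <- table(dat$group)
sex_by_group <- prop.table(table(dat$group, dat$sex), margin = 1)
age_by_group <- tapply(dat$age, dat$group, summary)

group_sizes
sex_by_group
age_by_group
```

Tables like these only describe the imbalance; deciding which
differences need adjustment, and how, is part of the design discussion
with your statistician.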

What level of missing data is there, is it missing at random  
(unlikely) and how will you account for it?
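
As a rough illustration of a first pass at that question, the sketch
below counts missing values per variable and checks whether the missing
fraction differs by group. The data are simulated and the column names
are hypothetical; the missing values are deliberately placed in one
group so that the pattern is visible.

```r
# Simulated, hypothetical data.
set.seed(7)
n <- 300
dat <- data.frame(
  group   = factor(rep(c("control", "disease"), each = n / 2)),
  age     = round(runif(n, 20, 70)),
  protein = rnorm(n, mean = 10)
)

# Remove 30 protein values, all in the disease group, so that
# missingness depends on group (it is not completely at random).
dat$protein[sample(which(dat$group == "disease"), 30)] <- NA

# Missing values per column, missing fraction by group, complete rows.
miss_count    <- colSums(is.na(dat))
miss_by_group <- tapply(is.na(dat$protein), dat$group, mean)
n_complete    <- sum(complete.cases(dat))

miss_count
miss_by_group
n_complete
```

Keep in mind that, with R's default settings, lm() silently drops rows
with missing values in the model variables, so a pattern like the one
above would shrink one group more than the other without any warning.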

These are just some questions that I would pose at the outset, without  
knowing more about your data other than what you have posted. If you  
have not posed the same questions and many others to yourself, then it  
further supports your need for a local statistical expert. Just taking  
a dataset and throwing it into a regression model, even one with what  
appears to be a reasonable formulation, without some consideration for  
these issues and many others is not the way to go. The result will not
be worth the time you put into it and, in the worst case, can be
entirely misleading in any conclusions inferred.

HTH,

Marc Schwartz



