[R] NMDS with missing data?

Mon Jun 17 22:17:58 CEST 2013

Principal components analysis and factor analysis are two techniques
that have different histories, but overlap in the computational
procedures used. Strictly speaking, principal components is a
descriptive procedure used to project a multivariate data set into a
space with fewer dimensions. The first principal component is the
direction of maximum variance through the data cloud (computed from
the covariance or correlation matrix). The second is the direction
of the next highest variance that is also uncorrelated with the
previous component, etc. The principal component loadings indicate
generally which variables are important in defining each component.
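
As a minimal sketch of the description above, assuming the built-in
USArrests data as a stand-in (it is not the data discussed in this
thread), prcomp() on standardized variables shows the decreasing
variance of successive components, the loadings, and the fact that
the component scores are uncorrelated with one another:

```r
# Illustration only: USArrests is a built-in stand-in data set.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component (decreasing)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 3))

# Loadings: weight of each original variable on each component
print(round(pca$rotation, 2))

# Scores on different components are uncorrelated (identity matrix)
score_cor <- cor(pca$x)
print(round(score_cor, 3))
```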

Factor analysis attempts to discover latent variables under the
assumption that the measured variables are "caused" by unobservable
factors and the correlations between the observed variables provide
evidence of these latent variables. Factors are often initially
extracted using principal components analysis and then rotated so
that they are more interpretable. The rotation tries to create
factors with either very low or very high loadings for each
variable. Psychologists generally come from a factor analysis
background and tend to prefer rotated factors. Researchers using
principal components to simplify their data to look for clusters or
other patterns prefer to keep the original components since they
reflect the covariance structure of the data in a way that is lost
by rotation.
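
A rough sketch of what rotation does, assuming the built-in USArrests
data as a stand-in and keeping two components (an arbitrary choice
for illustration). It applies varimax() from base R to loadings built
from prcomp(); principal() in psych may scale or report its output
differently:

```r
# Illustration only: USArrests and k = 2 are assumptions, not from the thread.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
k <- 2

# Scale the eigenvectors by the component standard deviations so the
# loadings are correlations between variables and components.
unrotated <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])

# Orthogonal varimax rotation of the retained loadings
rotated <- unclass(varimax(unrotated)$loadings)

print(round(unrotated, 2))
print(round(rotated, 2))

# Variance explained per column is redistributed by the rotation,
# but each variable's communality (row sum of squares) is unchanged.
print(colSums(unrotated^2))
print(colSums(rotated^2))
```

The rotated columns no longer correspond to successive directions of
maximum variance, which is the information lost by rotating.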

Your latest post suggests that you are planning to use the
components in a regression analysis - hence as latent variables.
Rotation may make it easier to interpret those components.
Multidimensional scaling will give you something analogous to
unrotated principal components but you do not get loadings so you
have no easy way to relate the MDS dimensions back to the original
variables (although you could run correlations between the original
variables and the MDS dimensions to get similar information).
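
To illustrate the correlation approach mentioned above, here is a
small sketch. It assumes the built-in USArrests data as a stand-in and
uses classical (metric) MDS via cmdscale() from base R rather than
NMDS, so it is only an analogue of what metaMDS() would give:

```r
# Illustration only: USArrests is a stand-in; cmdscale() is classical
# MDS, not the non-metric MDS fitted by vegan::metaMDS().
z <- scale(USArrests)                 # z-scores for each variable
mds <- cmdscale(dist(z), k = 2)       # two MDS dimensions
colnames(mds) <- c("Dim1", "Dim2")

# Correlations between original variables and MDS dimensions,
# read much like a table of loadings.
var_dim_cor <- cor(z, mds)
print(round(var_dim_cor, 2))
```

With Euclidean distances on standardized variables, the classical MDS
dimensions match the unrotated principal component scores up to sign,
so these correlations carry the same information as the loadings.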

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

From: Elizabeth Beck [mailto:elizabethbeck0 at gmail.com] 
Sent: Monday, June 17, 2013 1:43 PM
To: Bert Gunter
Cc: David Carlson; r-help at r-project.org
Subject: Re: [R] NMDS with missing data?

Hi Bert & David - 

I'm putting aside the issues with the missing data for the moment -
the NAs are due to not enough sample volume for testing and there
are only about 6 of them for 1 variable. I have multiple data sets
to look at, and not all have missing values. I do intend to find some
local consulting options once I have a bit more of a grasp on my
options. 

If I were to stick with the principal() function using my
standardized variables...rotate="none" would make sense initially,
although several papers I have read with very similar data sets have
used a Varimax factor rotation (orthogonal transformation). 

My reasoning behind the PCA is to reduce the number of variables (as
many are likely correlated) and then use those new factors to run a
perMANOVA. All of my categorical factors are explanatory variables
(sex, exposure, treatment) so will be used in the final model. 

Is PCA still the preferred ordination method for this type of data?
Are there advantages to NMDS instead? 

I appreciate the input...
Elizabeth

On Mon, Jun 17, 2013 at 12:28 PM, Bert Gunter
<gunter.berton at gene.com> wrote:
David et al.:

I hate to be a pest but ...

On Mon, Jun 17, 2013 at 11:02 AM, David Carlson <dcarlson at tamu.edu>
wrote:
> First, Bert is correct. I should have said to use prcomp(dat,
> center=TRUE, scale.=TRUE). That will run the svd on the standardized
> variables which is equivalent to using princomp(dat, cor=TRUE).

> ***You will have to remove the cases with missing values or impute
> the missing values using one of many options in R.***

Depending on the number of missing values and the nature of the
missingness, this can be a crucial issue. Omitting all data with
missing entries makes very strong assumptions about the nature of the
missingness and can lead to highly biased results, which is
problematic even for exploration. The same is true with imputation --
you need to do it properly. Again, this depends on the number of cases
at issue.

So it may be wise for Elizabeth to consult a local statistical expert
and not rely on superficial background from a text and remote advice.
There may be dragons ...

Cheers,
Bert

>
> The principal() function in package psych should be fine and will
> probably give nearly identical results. It does have the ability to
> generate a pairwise-deletion correlation matrix so you could include
> your cases with missing values. I would set rotate="none" at least
> initially. Hopefully your text will explain why this is a good idea.
>
> I assume you are looking for interesting patterns in the data
> rather than trying to test a specific hypothesis. Given that, you
> should try both (or all three with principal()) and see if there are
> any interesting differences between them.
>
> Earlier I asked if all your variables are numeric (or
> dichotomies). If any are categorical (factors), these suggestions
> may have to be revised.
>
> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
>
>
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Monday, June 17, 2013 12:35 PM
> To: Elizabeth Beck
> Cc: David Carlson; r-help at r-project.org
> Subject: Re: [R] NMDS with missing data?
>
> Just wanted to note that one does **not** use
> "prcomp() on the correlation matrix of the variables."
>
> As ?prcomp says, it uses the svd of the data matrix, which is
> generally preferable.
>
> Cheers,
> Bert
>
> On Mon, Jun 17, 2013 at 10:02 AM, Elizabeth Beck
> <elizabethbeck0 at gmail.com> wrote:
>> Hello David,
>>
>> Yes my variables are all numeric....I have a few questions regarding
>> your 2 options.
>>
>> Would these still be the best options if missing data was not an
>> issue? I was told that I should be performing NMDS as it has few
>> assumptions on the data distribution but neither of your options use
>> this.
>>
>> If NMDS is not preferred and I were to perform a PCA, can you tell me
>> why you chose prcomp()? My statistical text (Discovering Statistics
>> Using R) explains PCA quite well using principal() in the psych
>> package so I am just wondering the advantages of one over the
>> other... I am overwhelmed by the number of ordination methods!
>>
>> Thank you,
>> Elizabeth
>>
>> On Mon, May 13, 2013 at 9:44 AM, David Carlson <dcarlson at tamu.edu>
>> wrote:
>>
>>> First. Do not use html messages. They are converted to plain text
>>> and your table ends up a mess. See below. It appears the variables
>>> are all numeric? If so, there are two standard approaches to
>>> handling multiple scales and magnitudes with cluster analysis:
>>>
>>> 1. Use z-scores. The scale() function will convert each variable
>>> into a standard score with a mean of 0 and a standard deviation
>>> of 1. Then use Euclidean distance in the dist() function which
>>> will adjust for your missing values.
>>>
>>> 2. Use prcomp() on the correlation matrix of the variables to
>>> extract a set of principal components and use the principal
>>> component scores in the cluster analysis. This may allow you to
>>> reduce the number of variables in the data set if the 29 variables
>>> are correlated with one another.
>>>
>>> -------------------------------------
>>> David L Carlson
>>> Associate Professor of Anthropology
>>> Texas A&M University
>>> College Station, TX 77840-4352
>>>
>>> From: Elizabeth Beck [mailto:elizabethbeck0 at gmail.com]
>>> Sent: Friday, May 10, 2013 1:20 PM
>>> To: dcarlson at tamu.edu
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] NMDS with missing data?
>>>
>>> Hi David,
>>>
>>> You are right in that Bray-Curtis is not suitable for my dataset,
>>> and that my variables are very different. Given your suggestions, I
>>> am struggling with how to transform or standardize my data given
>>> that they vary so much. Additionally, looking at the dist()
>>> function I am not sure which distance measure would be most
>>> appropriate. Euclidean seems to be the most widely used but I'm not
>>> sure if it is appropriate for my data (there is much more help for
>>> ecology data than toxicology). Given a sample of my data below
>>> (total of 287 obs. of 29 variables) can you suggest a starting
>>> point?
>>>
>>> SODIUM    K   CL  HCO3  ANION    CA     P  GLUCOSE
>>>    145  3.3  102    24     22   2.9  2.45      9.8
>>>    146  3.2  102    21     26  2.89  2.68     11.1
>>>    147  2.5  103    22     25  2.96  2.59       10
>>>    148  2.5  101    21     29  2.91  2.91     10.6
>>>
>>> CHOLEST  GGT  GLDH    CK  AST  PROTEIN  ALBUMIN  GLOBULIN
>>>     5.7    3     3   678    5       34       15        19
>>>    6.78    3     4  1290    9       36       18        18
>>>    5.78    3     6  1582   11       35       17        18
>>>    5.83    3     3  1479    8       35       17        18
>>>
>>>  A_G   UA  BA  CORTICO    T3     T4   THYROID
>>> 0.79  180   6    70.97  1.31  12.77  0.102376
>>>    1  170  13     79.1  3.51  18.78  0.186751
>>> 0.94  272  10    65.84  1.84   15.5  0.118602
>>> 0.94  317   8     74.9  2.59  20.68  0.125389
>>>
>>> Thank you!
>>> Elizabeth
>>>
>>> On Thu, May 9, 2013 at 7:50 AM, David Carlson <dcarlson at tamu.edu>
>>> wrote:
>>> Since you pass your entire data.frame to metaMDS(), your first
>>> error probably comes from the fact that you have included ID as one
>>> of the variables. You should look at the results of
>>>
>>> str(dat)
>>>
>>> You can drop cases with missing values using
>>>
>>> > dat2 <- na.omit(dat)
>>> > metaMDS(dat2[,-1])
>>>
>>> which would run the analysis on all but the first column (ID) with
>>> all the cases containing complete data. But that assumes that sex
>>> and exposure are not factors.
>>>
>>> Or you could use one of the distance functions in dist() which
>>> adjust for missing values. However dist() does not have an option
>>> to use Bray-Curtis (the default in metaMDS()). Bray-Curtis is
>>> designed for comparing species counts or proportions so it is not
>>> clear that it is an appropriate dissimilarity measure for your
>>> data. Further, your data seem to contain a mixture of measurement
>>> scales and/or magnitudes so some variable standardization or
>>> transformations are probably necessary before you can get any
>>> useful results from MDS.
>>>
>>> -------------------------------------
>>> David L Carlson
>>> Associate Professor of Anthropology
>>> Texas A&M University
>>> College Station, TX 77840-4352
>>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org
>>> [mailto:r-help-bounces at r-project.org] On Behalf Of Elizabeth Beck
>>> Sent: Wednesday, May 8, 2013 3:39 PM
>>> To: r-help at r-project.org
>>> Subject: [R] NMDS with missing data?
>>>
>>> Hi,
>>> I'm trying to run NMDS (non-metric multidimensional scaling) with R
>>> vegan (metaMDS) but I have a few NAs in my data set. I've tried to
>>> run it 2 ways.
>>>
>>> The first way with my entire data set which includes variables such
>>> as ID, sex, exposure, treatment, sodium, potassium, chloride....
>>>
>>> mydata.mds<-metaMDS(dat)
>>>
>>> I get the following error:
>>>
>>> Error in if (any(autotransform, noshare > 0, wascores) && any(comm < 0)) { :
>>>   missing value where TRUE/FALSE needed
>>> In addition: Warning messages:
>>> 1: In Ops.factor(left, right) : < not meaningful for factors
>>> 2: In Ops.factor(left, right) : < not meaningful for factors
>>> 3: In Ops.factor(left, right) : < not meaningful for factors
>>> 4: In Ops.factor(left, right) : < not meaningful for factors
>>> 5: In Ops.factor(left, right) : < not meaningful for factors
>>>
>>> The second way with only those last biochemical variables (29 in
>>> total).
>>>
>>> mydata.mds<-metaMDS(measurements)
>>>
>>> I get this error:
>>>
>>> Error in if (any(autotransform, noshare > 0, wascores) && any(comm < 0)) { :
>>>   missing value where TRUE/FALSE needed
>>>
>>> My go-to "na.rm=TRUE" does nothing. Any ideas on how to account for
>>> NAs and if so which of the above options I should be using?
>>> Thanks!
>>> Elizabeth
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>>
>>
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>
