[R] reshape2's dcast() Adds NAs to Data Frame

Thu Aug 9 15:26:44 CEST 2012

On Wed, 8 Aug 2012, Jeff Newmiller wrote:

> I took a closer look, and unused factor levels are not the problem... the
> problem is defining id variables appropriately.

Jeff,

   OK.

> 1) "sample" is the name of a built-in function, so it is not advisable to use
> it as the name of data. I have used "samp" instead of "sample".

   Ah, I did not realize that 'sample' was the name of a built-in function.
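
   A small sketch of why the name is risky: R allows the assignment, and a
call to sample() still finds the built-in function, but the two meanings are
easy to confuse. The data frame here is made up for illustration.

```r
# Assigning a data frame to the name 'sample' is legal in R.
sample <- data.frame(x = 1:3)

# When a name is used in a call, R skips non-function objects while looking
# it up, so the built-in base::sample() is still found.
shuffled <- sample(1:5)

print(is.data.frame(sample))  # the name now also refers to data
print(length(shuffled))       # the function call still worked

rm(sample)  # drop the data frame so only the function remains
```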

> 2) Your input data is essentially in long form already, so you don't need to 
> melt it.

   Despite multiple readings of the reshape2 docs I did not pick up on
this. All data to be analyzed with R are written out from database tables,
which are already in long format. I had assumed that all reshaping
requires the data to be melted to a common intermediate form before being
cast to another form, in my case wide.
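
   As a minimal sketch of that point, long-form data can go straight to
dcast() with no melt() step. The data frame below is invented for
illustration; only the column names site, sampdate, param and quant follow
the thread.

```r
library(reshape2)

# Hypothetical long-form data: one row per site/date/parameter measurement.
samp <- data.frame(
  site     = c("A", "A", "B", "B"),
  sampdate = c("2012-01-01", "2012-01-01", "2012-01-01", "2012-01-01"),
  param    = c("As", "Cu", "As", "Cu"),
  quant    = c(0.01, 0.20, 0.03, 0.15)
)

# Cast directly: id variables go on the left of ~, and the column whose
# values become the new column names goes on the right.
samp.wide <- dcast(samp, site + sampdate ~ param, value.var = "quant")
print(samp.wide)
```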

> 3) It is almost never a good idea to use a floating point column as an id 
> variable. Perhaps you were imagining something like:
>
> > samp.cast <- dcast(samp[,1:5], site+sampdate+era~param, value.var="quant")

> Clearly, this still includes NA values, but if we look at the input data
> corresponding to the first row:

   <snip>

> There are only 26 chemicals corresponding to that row, but there are a
> total of 54 different possible chemicals to quantify in the first row.
> Thus, there must be NA values inserted to fill out the data frame. (The
> problem gets worse when you try to keep those other data columns as id
> columns... they represent additional distinct combinations so you end up
> with more rows and fewer values in each row.)

   I see this now. This is a common situation with permit compliance
monitoring data. There are not always concentrations for each chemical at
every site and sampling date.
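
   A toy version of the same effect, assuming made-up data in which one
site was never analyzed for one chemical: the wide table needs a cell for
every id/parameter combination, so dcast() has to put NA where no
measurement exists.

```r
library(reshape2)

# Hypothetical data: site B has no Cu measurement on this date.
samp <- data.frame(
  site     = c("A", "A", "B"),
  sampdate = c("2012-01-01", "2012-01-01", "2012-01-01"),
  param    = c("As", "Cu", "As"),
  quant    = c(0.01, 0.20, 0.03)
)

samp.wide <- dcast(samp, site + sampdate ~ param, value.var = "quant")

# The B/Cu cell has no source row, so it is filled with NA.
print(samp.wide)

# Number of measurements actually present for each parameter.
print(colSums(!is.na(samp.wide[, c("As", "Cu")])))
```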

> I am not familiar with the NADA library, so I cannot suggest what you
> SHOULD be doing, but it does seem that you should perhaps study some more
> examples of its use to figure out what form you should have your data in.

   Agreed. This is what I've been trying to do, but without success so far.
Perhaps I need to subset the long form for each parameter so the data frame
has no missing values. I'll delve more deeply into learning how to apply the
NADA methods to my data.
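
   One way the per-parameter subsetting might look, using base R's split()
on invented data; whether this is the form the NADA functions want is still
an open question for me.

```r
# Hypothetical long-form data with an uneven set of parameters per site.
samp <- data.frame(
  site  = c("A", "A", "B"),
  param = c("As", "Cu", "As"),
  quant = c(0.01, 0.20, 0.03),
  stringsAsFactors = FALSE
)

# One data frame per parameter; each holds only the rows that were measured,
# so no NA padding is introduced.
by.param <- split(samp, samp$param)

print(names(by.param))
print(by.param[["As"]])
```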

Thanks very much,

Rich