[R] merge a list of data frames

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Thu Sep 6 17:36:07 CEST 2012


"not practical [...] to rename the column to something unique."

On the contrary, since R is a scripting language, renaming the columns programmatically is quite practical. Depending on the format of your data, it is probably necessary as well.

If all of the files have exactly the same number of rows corresponding to the same key values, you may be able to load each file and cbind the V3 columns into a matrix. That is the only scenario I see where you don't need to rename the columns.
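A minimal sketch of that cbind scenario, assuming the list is called data as in your str() output. The toy values and the names same_keys, scores and mean_score are made up for illustration:

```r
# Toy stand-in for the list of score data frames (hypothetical values).
data <- list(
  data.frame(V1 = c("a", "b", "c"), V2 = c(0L, 1L, 0L),
             V3 = c(0.9, 0.2, 0.5), stringsAsFactors = FALSE),
  data.frame(V1 = c("a", "b", "c"), V2 = c(0L, 1L, 0L),
             V3 = c(0.7, 0.4, 0.1), stringsAsFactors = FALSE)
)

# cbind is only safe if every frame has the same keys in the same order,
# so check that against the first frame before combining.
same_keys <- vapply(data, function(d) {
  identical(d$V1, data[[1]]$V1) && identical(d$V2, data[[1]]$V2)
}, logical(1))
stopifnot(all(same_keys))

# Bind the V3 columns side by side: one row per key, one column per frame.
scores <- do.call(cbind, lapply(data, `[[`, "V3"))
rownames(scores) <- data[[1]]$V1

# Mean score per row across all frames.
mean_score <- rowMeans(scores)
```

If the stopifnot() check fails, as it would for frames whose rows are in a different order, this shortcut does not apply and the keys have to be aligned first.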

If you do need the key alignment capability that merge offers, then it may be most effective to proceed as follows. Load the data into a list of data frames, and add a column to each data frame containing the desired column name (perhaps derived from the name of the file the data were loaded from, or just a letter followed by a sequence number). Stack the data frames into a single frame (rbind, or ldply from the plyr package). Then use the dcast function from the reshape2 package to form the wide, combined data frame with the necessary unique column names against your key c(V1,V2).
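The stack-then-widen steps might look roughly like the sketch below. To keep it runnable without extra packages it uses base R's reshape() in place of dcast, and the reshape2 call is shown in a comment. The toy data and the names model, long and wide are assumptions for illustration only:

```r
# Toy stand-in for the list of score data frames (hypothetical values).
# The second frame deliberately has its rows in a different order.
data <- list(
  data.frame(V1 = c("a", "b", "c"), V2 = c(0L, 1L, 0L),
             V3 = c(0.9, 0.2, 0.5), stringsAsFactors = FALSE),
  data.frame(V1 = c("c", "a", "b"), V2 = c(0L, 0L, 1L),
             V3 = c(0.1, 0.7, 0.4), stringsAsFactors = FALSE)
)

# Tag each frame with the column name it should get in the wide result:
# here just a letter followed by a sequence number.
for (i in seq_along(data)) {
  data[[i]]$model <- paste0("m", i)
}

# Stack everything into one long data frame.
long <- do.call(rbind, data)

# Widen: one row per (V1, V2) key, one V3 column per model.
# With reshape2 installed, the equivalent would be
#   reshape2::dcast(long, V1 + V2 ~ model, value.var = "V3")
wide <- reshape(long, idvar = c("V1", "V2"), timevar = "model",
                v.names = "V3", direction = "wide", sep = "_")
```

The score columns come out as V3_m1, V3_m2 and so on, so no column is ever renamed by hand, however many score files there are.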

Depending on the algorithm you plan to use, you may not need or want to do the dcast step at all. The plyr package, the sqldf package, or the base function ave let you perform computations on groups of rows instead of across columns.
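As a rough illustration of skipping the wide step, the mean score per key can be computed directly on the stacked frame with ave. The frame long below is a made-up stand-in for the stacked data, and mean_score and combined are hypothetical names:

```r
# Toy stacked ("long") data: two score rows per (V1, V2) key (made-up values).
long <- data.frame(
  V1 = c("a", "b", "c", "c", "a", "b"),
  V2 = c(0L, 1L, 0L, 0L, 0L, 1L),
  V3 = c(0.9, 0.2, 0.5, 0.1, 0.7, 0.4),
  stringsAsFactors = FALSE
)

# ave() returns a vector as long as its input, holding the group mean
# of V3 for the (V1, V2) group that each row belongs to.
long$mean_score <- ave(long$V3, long$V1, long$V2, FUN = mean)

# Collapse to one row per key and sort by decreasing mean score.
combined <- unique(long[, c("V1", "V2", "mean_score")])
combined <- combined[order(-combined$mean_score), ]
```

Swapping FUN = mean for FUN = sum gives the sum instead. For very many distinct keys a dedicated grouping tool such as plyr or sqldf may be a better fit than ave.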
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

Sam Steingold <sds at gnu.org> wrote:

>> * David Winsemius <qjvafrzvhf at pbzpnfg.arg> [2012-09-05 21:02:16 -0700]:
>>
>> On Sep 5, 2012, at 8:51 PM, Sam Steingold wrote:
>>
>>> I have a list of data frames:
>>> 
>>>> str(data)
>>> List of 4
>>> $ :'data.frame':	700773 obs. of  3 variables:
>>>  ..$ V1: chr [1:700773] "200130446465779" "200070050127778" "200030633708779" "200010587002779" ...
>>>  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
>>>  ..$ V3: num [1:700773] 1 1 1 1 1 ...
>>> $ :'data.frame':	700773 obs. of  3 variables:
>>>  ..$ V1: chr [1:700773] "200130446465779" "200070050127778" "200030633708779" "200010587002779" ...
>>>  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
>>>  ..$ V3: num [1:700773] 1 1 1 1 1 ...
>>> $ :'data.frame':	700773 obs. of  3 variables:
>>>  ..$ V1: chr [1:700773] "200130446465779" "200070050127778" "200030633708779" "200010587002779" ...
>>>  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
>>>  ..$ V3: num [1:700773] 1 1 1 1 1 ...
>>> $ :'data.frame':	700773 obs. of  3 variables:
>>>  ..$ V1: chr [1:700773] "200160325893778" "200130647544079" "200130446465779" "200120186959078" ...
>>>  ..$ V2: int [1:700773] 0 0 0 0 0 0 0 0 0 0 ...
>>>  ..$ V3: num [1:700773] 1 1 1 1 1 1 1 1 1 1 ...
>>> 
>>> I want to merge them.
>>
>> Why? What are you expecting?
>
>these are the results of applying a model to the test data.
>the first column is the ID
>the second column is the actual value
>the third column is the model score
>
>after I will merge the frames, I will
>1. check that all the V2 columns are identical and drop all but one
>(I guess I could just merge on c("V1","V2") instead, right?)
>
>2. compute the sum (or the mean, whatever is easier) of all the V3 columns
>
>3. sort by the sum/mean of the V3 columns and evaluate the combined
>model using the lift quality metric
>(http://dl.acm.org/citation.cfm?id=380995.381018)
>
>I have many more score files (not just 4), so it is not practical for me
>to rename the column to something unique.



