[R] subsetting large data frames.

Sun Dec 7 20:04:18 CET 2008

on 12/07/2008 11:16 AM hesicaia wrote:
> Hi all,
>   I have a question regarding subsetting of large data frames. I have two
> data frames “catches” and “tows” and they both have the same 30 variables
> (columns). I would like to select rows in the data frame “tows” where all 5
> specific variables are NOT matched in “catches. That is to say, the
> combination of these 5 variables is unique. One or more of the variables
> could be the same but the combination would be unique. This is confusing to
> explain so here is a short example to explain what I am trying to explain:
> 
> Example data catches:
> 
> Row	Cruise	Order	Townumber	Towtype	Ship	Netlocation	Var1	Var2
> 1	 22    	1	               4	              A	   B	        S      	X1	X2
> 2	 22	        1	               4	              A	   B 	        S      	X1	X2
> 3	 22	        1	               4	              BL	   AM	S      	X1	X2
> 4	 22	        1	               4 	              BL	   AM	S      	X1	X2
> 5	 260	        1	               4	              BL	    B  	S      	X1	X2
> 6	 260	        1	               4	              BL     	    B  	S      	X1
> X2
>  
> Example data tows:
> 
> Row	Cruise	Order	Townumber	Towtype	Ship	Netlocation	Var1	Var2
> 1	22     	1               	4       	A      	B      	S      	X1	X2
> 2	400    	1               	4       	BL	        AM    	S      	X1	X2
> 3	260    	1               	4       	BL     	B      	S      	X1	X2
> 4	260   	10             	10     	BL     	B      	S      	X1	X2
> 5	22     	99             	4       	BL     	B      	S      	X1	X2
> 
> I would want to select rows 2, 4, and 5 from “tows” due to the fact that the
> same collection of “cruise”, ”order”, ”townumber”, ”towtype”, ”ship”, and
> ”netlocation” are not found in “catches”. All rows in data set “tows” are
> unique. Clear as mud? Sorry I couldn’t provide real data, but these datasets
> are quite large. 
> 
> So far I have tried:
> 
> New<-tows[(tows$cruise != catches$cruise) & (tows$order != catches$order) &
> (tows$townumber !=  catches$townumber) & (tows$towtype != catches$towtype) &
> (tows$ship != catches$ship) & (tows$netlocation != catches$netlocation),]
>  
> But this didn’t work. 
> Thanks for your time and help (in advance).
> Dan.

Your statement above won't work, as there are a different number of rows
in each dataframe.

Such comparisons would be made on a row-by-row basis. Thus rather than
looking for any non-matching combinations across rows, you would be only
comparing row 1 with row 1, row 2 with row 2 and so on.

More generally, subset() is a better approach when using complicated
logicals in a subsetting operation.

With this situation, one approach would be to use merge() to check for
rows that do have matches in the relevant columns. Then take the Row
values (assuming that these are unique) that do not end up in the
merge()d dataset.

So, the first step is to use merge() to link the two dataframes based
upon the matching values and force all of the rows in tows to be
included via 'all.x = TRUE':

> merge(tows, catches, by = c("Cruise", "Order", "Townumber", "Towtype",
                              "Ship", "Netlocation"),
        all.x = TRUE)
  Cruise Order Townumber Towtype Ship Netlocation Row.x Var1.x Var2.x
1     22     1         4       A    B           S     1     X1     X2
2     22     1         4       A    B           S     1     X1     X2
3     22    99         4      BL    B           S     5     X1     X2
4    260     1         4      BL    B           S     3     X1     X2
5    260     1         4      BL    B           S     3     X1     X2
6    260    10        10      BL    B           S     4     X1     X2
7    400     1         4      BL   AM           S     2     X1     X2
  Row.y Var1.y Var2.y
1     1     X1     X2
2     2     X1     X2
3    NA   <NA>   <NA>
4     5     X1     X2
5     6     X1     X2
6    NA   <NA>   <NA>
7    NA   <NA>   <NA>

Note that 'Row', since there is the same column in each dataframe, has a
'.x' and '.y' appended to denote the initial source dataframe.

Note further, that Row.y has NA values, for the rows in catches that did
not have matches in tows.

We can adjust the above to take only these two columns:

DF <- merge(tows, catches,
            by = c("Cruise", "Order", "Townumber", "Towtype",
                    "Ship", "Netlocation"),
            all.x = TRUE)[, c("Row.x", "Row.y")]

> DF
  Row.x Row.y
1     1     1
2     1     2
3     5    NA
4     3     5
5     3     6
6     4    NA
7     2    NA

Now, get the values of Row.x as a vector, where Row.y is NA:

> subset(DF, is.na(Row.y))[[1]]
[1] 5 4 2

Thus:

> tows[subset(DF, is.na(Row.y))[[1]], ]
  Row Cruise Order Townumber Towtype Ship Netlocation Var1 Var2
5   5     22    99         4      BL    B           S   X1   X2
4   4    260    10        10      BL    B           S   X1   X2
2   2    400     1         4      BL   AM           S   X1   X2

See ?merge and ?subset

HTH,

Marc Schwartz