[R] problem with duplicated function

Curtis Burkhalter curtisburkhalter at gmail.com
Sun May 24 23:34:13 CEST 2015


Hello everyone,

I have two very large data frames (~1 million rows x 5 columns each), in
which two of the columns are lat/long coordinates. The data frames are named
'data07' and 'data08'. data08 has a few more sampling points than data07, so
I want to subset data08 by matching on the unique lat/long coordinates so
that it has the same number of data points as data07.

Here are the associated data structures:

str(data07)
'data.frame':   969109 obs. of  5 variables:
 $ cell    : int  710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
 $ prN     : int  288 276 286 304 258 257 264 272 286 316 ...
 $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
 $ Xcor    : num  -111 -111 -111 -111 -111 ...
 $ Ycor    : num  41.7 41.7 41.7 41.7 41.8 ...

str(data08)
'data.frame':   969810 obs. of  5 variables:
 $ cell    : int  705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
 $ prN     : int  293 281 299 278 276 266 282 255 287 280 ...
 $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
 $ Xcor    : num  -111 -111 -111 -111 -111 ...
 $ Ycor    : num  41.8 41.7 41.7 41.7 41.7 ...

I've tried using the following code to do this:

tt <- rbind(data07, data08)

# mark rows whose lat/long (columns 4 and 5) already appeared in an earlier row
tt.dup <- duplicated(tt[, 4:5])

# remove all data07 entries (the first n)
tt.dup <- tt.dup[-seq_len(nrow(data07))]

# index only the TRUE/duplicated elements from data08
test <- data08[tt.dup, ]
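
For reference, here is a minimal sketch of how the rbind/duplicated approach above behaves on a tiny example. The data frames `d07` and `d08` and all of their values are made up for illustration; they are stand-ins, not the real data.

```r
# Toy stand-ins for data07 and data08 (hypothetical values)
d07 <- data.frame(cell = 1:3,
                  Xcor = c(-111.044, -111.044, -111.043),
                  Ycor = c(41.7403, 41.7245, 41.7766))
d08 <- data.frame(cell = 4:7,
                  Xcor = c(-111.044, -111.044, -111.044, -111.043),
                  Ycor = c(41.7517, 41.7403, 41.7245, 41.7653))

both <- rbind(d07, d08)

# TRUE for any row whose (Xcor, Ycor) pair already appeared in an earlier row
dup <- duplicated(both[, c("Xcor", "Ycor")])

# Drop the flags that belong to the d07 rows (the first nrow(d07) entries)
dup08 <- dup[-seq_len(nrow(d07))]

# Keep only the d08 rows whose coordinates were already seen
matched <- d08[dup08, ]
print(matched)
```

Note that duplicated() only flags a row when every compared column is exactly equal to an earlier row, and that a coordinate pair repeated within the second data frame itself would be flagged as well.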

When I run the code, 'tt.dup' is FALSE for every entry, which I know is
wrong because some coordinates occur in both data frames.
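
One thing that could be checked here, purely as an assumption and not something established in this post, is whether the coordinates are exactly equal as stored or only look equal at the printed precision. The sketch below uses two hypothetical one-row data frames, `a` and `b`, whose Ycor values print the same by default but differ in a later decimal place, and compares duplicated() before and after rounding to 4 decimal places (the number of places is also an assumption).

```r
# Hypothetical values: Ycor prints as 41.7403 in both, but the stored
# doubles are not equal
a <- data.frame(Xcor = -111.044, Ycor = 41.7403)
b <- data.frame(Xcor = -111.044, Ycor = 41.74030001)

ab <- rbind(a, b)
print(ab)                          # default printing hides the difference
print(ab$Ycor, digits = 12)        # more digits reveal it

# Exact comparison: the second row is not seen as a duplicate
exact <- duplicated(ab)

# Comparison after rounding both columns to 4 decimal places
rounded <- duplicated(round(ab, 4))

print(exact)
print(rounded)
```

If the rounded comparison finds matches where the exact one does not, the values differ beyond the displayed digits; whether rounding is appropriate depends on the spatial resolution of the data.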

Here's a small subset of the data so that you can see exactly where the
duplicates are; rows whose coordinates appear in both data frames are marked
with asterisks:

data07[1:10,]
                 cell prN Location     Xcor    Ycor
710229 *710228 288     Sage -111.044 41.7403*
715546 *715545 276     Sage -111.044 41.7245*
720691 *720690 286     Sage -111.044 41.7131*
720825 *720824 304     Sage -111.044 41.7109*
695612 695611 258     Sage -111.043 41.7766
700491 700490 257     Sage -111.043 41.7653
700627 700626 264     Sage -111.043 41.7630
705372 705371 272     Sage -111.043 41.7517
705508 705507 286     Sage -111.043 41.7495
710364 710363 316     Sage -111.043 41.7381

 data08[1:10,]
                 cell prN Location     Xcor    Ycor
705529 705528 293     Sage -111.044 41.7517
710322 *710321 281     Sage -111.044 41.7403*
710457 710456 299     Sage -111.044 41.7381
715678 *715677 278     Sage -111.044 41.7245*
720763 *720762 276     Sage -111.044 41.7131*
720897 *720896 266     Sage -111.044 41.7109*
699954 699953 282     Sage -111.043 41.7767
700636 700635 255     Sage -111.043 41.7653
700772 700771 287     Sage -111.043 41.7631
705665 705664 280     Sage -111.043 41.7495


If anyone has any suggestions as to where I might be going wrong I'd
greatly appreciate it.

Thank you




-- 
Curtis Burkhalter
Postdoctoral Research Associate, Audubon Rockies

https://sites.google.com/site/curtisburkhalter/
