[R] problem with duplicated function

Bert Gunter gunter.berton at gene.com
Sun May 24 23:55:43 CEST 2015


I have NOT looked at your code in detail -- I might have if you had
used dput() to make available small subsets of your data frames that
exhibited the problems. However, the following, from ?duplicated,
sounds like it may be relevant:

"When used on a data frame with more than one column, or an array or
matrix when comparing dimensions of length greater than one, this
tests for identity of character representations. This will catch
people who unwisely rely on exact equality of floating-point numbers!"
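
A minimal sketch of how this can bite. The toy values and the choice of 4 decimal places are my assumptions, not taken from your data. Two coordinates can print identically at the default 7 significant digits yet differ further out, in which case duplicated() will not flag them. Rounding to an explicit tolerance before comparing is one possible workaround.

```r
# Hypothetical toy data: the two rows differ only in the 7th decimal place.
a <- data.frame(Xcor = -111.0440001, Ycor = 41.7403001)
b <- data.frame(Xcor = -111.0440002, Ycor = 41.7403002)
tt <- rbind(a, b)

# At the default digits = 7 both rows are displayed as -111.044 41.7403.
print(tt)

# The stored values are not equal, so nothing is marked as a duplicate.
dup_raw <- duplicated(tt)

# Assumption: agreement to 4 decimal places counts as "the same point".
dup_rounded <- duplicated(round(tt, 4))

print(dup_raw)
print(dup_rounded)
```

On the real data the analogous call would be something like duplicated(round(tt[, 4:5], 4)), assuming 4 decimal places is an acceptable tolerance. Running print(data07[1:10, 4:5], digits = 15), or dput() on a few rows, would show whether the stored coordinates really agree.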

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Sun, May 24, 2015 at 2:34 PM, Curtis Burkhalter
<curtisburkhalter at gmail.com> wrote:
> Hello everyone,
>
> I have two very large data frames (~1 million rows x 5 columns), two of
> whose columns are lat/long coordinates. The data frames are named
> 'data07' and 'data08'. data08 has a few more sampling points than data07,
> so I want to subset data08, using the unique lat/long coordinates, so that
> it has the same number of data points as data07.
>
> Here are the associated data structures:
>
> str(data07)
> 'data.frame':   969109 obs. of  5 variables:
>  $ cell    : int  710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
>  $ prN     : int  288 276 286 304 258 257 264 272 286 316 ...
>  $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
>  $ Xcor    : num  -111 -111 -111 -111 -111 ...
>  $ Ycor    : num  41.7 41.7 41.7 41.7 41.8 ...
>
> str(data08)
> 'data.frame':   969810 obs. of  5 variables:
>  $ cell    : int  705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
>  $ prN     : int  293 281 299 278 276 266 282 255 287 280 ...
>  $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
>  $ Xcor    : num  -111 -111 -111 -111 -111 ...
>  $ Ycor    : num  41.8 41.7 41.7 41.7 41.7 ...
>
> I've tried using the following code to accomplish this:
>
> tt <- rbind(data07, data08)
>
> # mark rows whose lat/long (columns 4 and 5) duplicate an earlier row
> tt.dup <- duplicated(tt[, 4:5])
>
> # drop the data07 entries (the first nrow(data07) elements)
> tt.dup <- tt.dup[-seq_len(nrow(data07))]
>
> # keep only the TRUE (duplicated) rows of data08
> test <- data08[tt.dup, ]
>
> When I run the code, 'tt.dup' is FALSE for all entries, which I know isn't
> true.
>
> Here's a small subset of the data so that you can see exactly where the
> duplicates are (marked with asterisks):
>
> data07[1:10,]
>                  cell prN Location     Xcor    Ycor
> 710229 *710228 288     Sage -111.044 41.7403*
> 715546 *715545 276     Sage -111.044 41.7245*
> 720691 *720690 286     Sage -111.044 41.7131*
> 720825 *720824 304     Sage -111.044 41.7109*
> 695612 695611 258     Sage -111.043 41.7766
> 700491 700490 257     Sage -111.043 41.7653
> 700627 700626 264     Sage -111.043 41.7630
> 705372 705371 272     Sage -111.043 41.7517
> 705508 705507 286     Sage -111.043 41.7495
> 710364 710363 316     Sage -111.043 41.7381
>
>  data08[1:10,]
>                  cell prN Location     Xcor    Ycor
> 705529 705528 293     Sage -111.044 41.7517
> 710322 *710321 281     Sage -111.044 41.7403*
> 710457 710456 299     Sage -111.044 41.7381
> 715678 *715677 278     Sage -111.044 41.7245*
> 720763 *720762 276     Sage -111.044 41.7131*
> 720897 *720896 266     Sage -111.044 41.7109*
> 699954 699953 282     Sage -111.043 41.7767
> 700636 700635 255     Sage -111.043 41.7653
> 700772 700771 287     Sage -111.043 41.7631
> 705665 705664 280     Sage -111.043 41.7495
>
>
> If anyone has any suggestions as to where I might be going wrong I'd
> greatly appreciate it.
>
> Thank you
>
>
>
>
> --
> Curtis Burkhalter
> Postdoctoral Research Associate, Audubon Rockies
>
> https://sites.google.com/site/curtisburkhalter/
>