[R] Remove duplicates from a data frame but with some special requirements

gcam gcam032 at gmail.com
Thu Dec 17 20:31:15 CET 2009


Thanks Gray,

This helps; I'd completely forgotten about the subset command.  However, it
doesn't quite get me where I need to be.  Perhaps an example will help.  I
will simplify my data frame to the three important variables:

ESR_ref   ESR_ref_edit   Loaded
1.1       1.1            Y
1.1.1     1.1            NC
1.1.2     1.1            Y
2.1       2.1            N
2.1.1     2.1            Y
2.1.2     2.1            PU
2.1.3     2.1            Y
3.1       3.1            Y
4.1       4.1            N
4.1.1     4.1            PU
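
To make the shortfall concrete, here is a minimal sketch that rebuilds the
table above as a data frame and applies the suggested subset() call.  Storing
the references as character strings is an assumption on my part; the real
Samps has more columns.

```r
# Example data copied from the table above. Columns are kept as character
# (an assumption) so that "1.1" and "1.1.1" stay distinct labels.
Samps <- data.frame(
  ESR_ref      = c("1.1", "1.1.1", "1.1.2", "2.1", "2.1.1",
                   "2.1.2", "2.1.3", "3.1", "4.1", "4.1.1"),
  ESR_ref_edit = c("1.1", "1.1", "1.1", "2.1", "2.1",
                   "2.1", "2.1", "3.1", "4.1", "4.1"),
  Loaded       = c("Y", "NC", "Y", "N", "Y", "PU", "Y", "Y", "N", "PU"),
  stringsAsFactors = FALSE
)

# The suggested call: keep a row if it is the first of its sample
# OR if it is coded "Y".
res <- subset(Samps, !duplicated(Samps$ESR_ref_edit) | Samps$Loaded == "Y")
print(res)

# Count how many rows survive per sample.
print(table(res$ESR_ref_edit))
```

Every "Y" row is kept in addition to the first row of each sample, so 1.1 and
2.1 still come back with more than one line.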

I created the "edit" variable so that I can test for duplicates (i.e.
samples with more than one subsample), because the individual subsamples are
not of interest at this point.  I just want one subsample per sample.
However, consider 2.1: removing duplicates would leave only the first line,
which has an "N".  If at least one of the subsamples has a "Y", I want that
line rather than a subsample with another code.  1.1 in this example works by
default because the first subsample is a "Y", so that information is
retained.
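
One way this could be done in base R is sketched below: order the rows so
that "Y" comes first, take the first row of each sample, then restore the
original row order.  The data frame is rebuilt from the example table
(character columns are an assumption), and ord and keep are just illustrative
names.

```r
# Example data copied from the table in this message (character columns
# are an assumption; the real Samps has more variables).
Samps <- data.frame(
  ESR_ref      = c("1.1", "1.1.1", "1.1.2", "2.1", "2.1.1",
                   "2.1.2", "2.1.3", "3.1", "4.1", "4.1.1"),
  ESR_ref_edit = c("1.1", "1.1", "1.1", "2.1", "2.1",
                   "2.1", "2.1", "3.1", "4.1", "4.1"),
  Loaded       = c("Y", "NC", "Y", "N", "Y", "PU", "Y", "Y", "N", "PU"),
  stringsAsFactors = FALSE
)

# Row indices with the "Y" rows first. FALSE sorts before TRUE, and order()
# leaves ties in their original order, so non-"Y" rows keep their sequence.
ord <- order(Samps$Loaded != "Y")

# Walking through the rows in that order, keep the first row seen for each
# sample: a "Y" row if the sample has one, otherwise its first row.
keep <- ord[!duplicated(Samps$ESR_ref_edit[ord])]

# Restore the original row order of the retained rows.
Samps_working <- Samps[sort(keep), ]
print(Samps_working)
```

On the example this keeps 1.1, 2.1.1, 3.1 and 4.1; for 4.1, which has no "Y",
the first row is kept.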

Thanks

Gareth


Gray Calhoun-2 wrote:
> 
> Hi,
> Try:
> 
> subset(Samps, !duplicated(Samps$ESR_ref_edit) | Samps$Loaded == "Y")
> 
> I'd need specific code to be sure that this is exactly what you want
> (ie you specify input and desired output), but indexing with a logical
> vector is probably going to be the solution.
> 
> Best,
> Gray
> 
> On Wed, Dec 16, 2009 at 7:55 PM, gcam <gcam032 at gmail.com> wrote:
>>
>> Hi all.
>>
>> So I have a data frame with multiple columns/variables.  The first variable
>> is a major sample name for which there are some sub-samples.  Currently I
>> have used the following command to remove the duplicates:
>>
>> Samps_working <- Samps[!duplicated(Samps$ESR_Ref_edit), ]
>>
>> This removes all of the duplicated sample rows.
>>
>> However, I just realised that, of course, this keeps only the first
>> observation of each duplicated set, whereas I wish to retain any that have
>> the code "Y" in another variable, Samps$Loaded.  I'm at a bit of a loss as
>> to how best to approach this problem.
>>
>> Just to reiterate.  I want to remove all duplicate lines based on sample
>> name, but, I want the lines to be removed with a preference given to those
>> that do not include a "Y" in the Loaded variable column.
>> --
>> View this message in context:
>> http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p965745.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> 
> 
> -- 
> Gray Calhoun
> 
> Assistant Professor of Economics
> Iowa State University
> 

-- 
View this message in context: http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p974312.html
Sent from the R help mailing list archive at Nabble.com.



