[R] how to find number of unique rows for combination of r columns

Fri Nov 8 18:39:55 CET 2019

Correction:
df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
df[!duplicated(df[,2:3]), ]  ## Note the ! sign

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> Sorry, but you ask basic questions.You really need to spend some more time
> with an R tutorial or two. This list is not meant to replace your own
> learning efforts.
>
> You also do not seem to be reading the docs carefully. Under ?unique, it
> links ?duplicated and tells you that it gives indices of duplicated rows of
> a data frame. These then can be used by subscripting to remove those rows
> from the data frame. Here is a reproducible example:
>
> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
> df[-duplicated(df[,2:3]), ]  ## Note the - sign
>
> If you prefer, the "Tidyverse" world has what are purported to be more
> user-friendly versions of such data handling functionality that you can use
> instead.
>
>
> Bert
>
> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija <sokovic.anamarija using gmail.com>
> wrote:
>
>> would you know how would I extract from my original data frame, just
>> these unique rows?
>> because this gives me only those 3 columns, and I want all columns
>> from the original data frame
>>
>> > head(udt)
>>    chr   pos         gene_id
>> 1 chr1 54490 ENSG00000227232
>> 2 chr1 58814 ENSG00000227232
>> 3 chr1 60351 ENSG00000227232
>> 4 chr1 61920 ENSG00000227232
>> 5 chr1 63671 ENSG00000227232
>> 6 chr1 64931 ENSG00000227232
>>
>> > head(dt)
>>     chr   pos         gene_id pval_nominal pval_ret       wl      wr
>> META
>> 1: chr1 54490 ENSG00000227232     0.608495 0.783778 31.62278 21.2838
>> 0.7475480
>> 2: chr1 58814 ENSG00000227232     0.295211 0.897582 31.62278 21.2838
>> 0.6031214
>> 3: chr1 60351 ENSG00000227232     0.439788 0.867959 31.62278 21.2838
>> 0.6907182
>> 4: chr1 61920 ENSG00000227232     0.319528 0.601809 31.62278 21.2838
>> 0.4032200
>> 5: chr1 63671 ENSG00000227232     0.237739 0.988039 31.62278 21.2838
>> 0.7482519
>> 6: chr1 64931 ENSG00000227232     0.276679 0.907037 31.62278 21.2838
>> 0.5974800
>>
>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija using gmail.com>
>> wrote:
>> >
>> > Thank you so much! Converting it to data frame resolved the issue!
>> >
>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
>> > <gerrit.eichner using math.uni-giessen.de> wrote:
>> > >
>> > > It seems as if dt is not a (base R) data frame but a
>> > > data table. I assume, you will have to transform dt
>> > > into a data frame (maybe with as.data.frame) to be
>> > > able to apply unique in the suggested way. However,
>> > > I am not familiar with data tables. Perhaps somebody
>> > > else can provide a more profound guess.
>> > >
>> > >   Regards  --  Gerrit
>> > >
>> > > ---------------------------------------------------------------------
>> > > Dr. Gerrit Eichner                   Mathematical Institute, Room 212
>> > > gerrit.eichner using math.uni-giessen.de   Justus-Liebig-University Giessen
>> > > Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
>> > > http://www.uni-giessen.de/eichner
>> > > ---------------------------------------------------------------------
>> > >
>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija:
>> > > > I tried it but I got this error:
>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")])
>> > > > Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
>> > > >    When i is a data.table (or character vector), the columns to
>> join by
>> > > > must be specified using 'on=' argument (see ?data.table), by keying
>> x
>> > > > (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
>> > > > column names between x and i (i.e., a natural join). Keyed joins
>> might
>> > > > have further speed benefits on very large data due to x being sorted
>> > > > in RAM.
>> > > >
>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
>> > > > <gerrit.eichner using math.uni-giessen.de> wrote:
>> > > >>
>> > > >> Hi, Ana,
>> > > >>
>> > > >> doesn't
>> > > >>
>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")])
>> > > >> nrow(udt)
>> > > >>
>> > > >> get close to what you want?
>> > > >>
>> > > >>    Hth  --  Gerrit
>> > > >>
>> > > >>
>> ---------------------------------------------------------------------
>> > > >> Dr. Gerrit Eichner                   Mathematical Institute, Room
>> 212
>> > > >> gerrit.eichner using math.uni-giessen.de   Justus-Liebig-University
>> Giessen
>> > > >> Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen,
>> Germany
>> > > >> http://www.uni-giessen.de/eichner
>> > > >>
>> ---------------------------------------------------------------------
>> > > >>
>> > > >> Am 08.11.2019 um 15:38 schrieb Ana Marija:
>> > > >>> Hello,
>> > > >>>
>> > > >>> I have a data frame like this:
>> > > >>>
>> > > >>>> head(dt,20)
>> > > >>>        chr    pos         gene_id pval_nominal  pval_ret
>>  wl      wr
>> > > >>>    1: chr1  54490 ENSG00000227232    0.6084950 0.7837780 31.62278
>> 21.2838
>> > > >>>    2: chr1  58814 ENSG00000227232    0.2952110 0.8975820 31.62278
>> 21.2838
>> > > >>>    3: chr1  60351 ENSG00000227232    0.4397880 0.8679590 31.62278
>> 21.2838
>> > > >>>    4: chr1  61920 ENSG00000227232    0.3195280 0.6018090 31.62278
>> 21.2838
>> > > >>>    5: chr1  63671 ENSG00000227232    0.2377390 0.9880390 31.62278
>> 21.2838
>> > > >>>    6: chr1  64931 ENSG00000227232    0.2766790 0.9070370 31.62278
>> 21.2838
>> > > >>>    7: chr1  81587 ENSG00000227232    0.6057930 0.6167630 31.62278
>> 21.2838
>> > > >>>    8: chr1 115746 ENSG00000227232    0.4078770 0.7799110 31.62278
>> 21.2838
>> > > >>>    9: chr1 135203 ENSG00000227232    0.4078770 0.9299130 31.62278
>> 21.2838
>> > > >>> 10: chr1 138593 ENSG00000227232    0.8464560 0.5696060 31.62278
>> 21.2838
>> > > >>>
>> > > >>> it is very big,
>> > > >>>> dim(dt)
>> > > >>> [1] 73719122        8
>> > > >>>
>> > > >>> To count number of unique rows for all 3 columns: chr, pos and
>> gene_id
>> > > >>> I could just join those 3 columns and than count. But how would I
>> find
>> > > >>> unique number of rows for these 4 columns without joining them?
>> > > >>>
>> > > >>> Thanks
>> > > >>> Ana
>> > > >>>
>> > > >>> ______________________________________________
>> > > >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> > > >>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > > >>> and provide commented, minimal, self-contained, reproducible code.
>> > > >>>
>> > > >>
>> > > >> ______________________________________________
>> > > >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
>> > > >> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > > >> and provide commented, minimal, self-contained, reproducible code.
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

	[[alternative HTML version deleted]]