[R] Is there a hash data structure for R

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Wed Nov 3 16:35:25 CET 2021


Jack, I was agreeing with you and pointing out that although changing names
of columns to be unique has a positive side, it makes it hard to use for
anything that needs to look like either a set or a bag and of course a
dictionary/hash. All the above want to put things in using some identifier
and expect to get back the same.

R actually has other places names can be changed or dynamically created
often with defaults. This can be convenient or annoying as you need to look
at the resulting names to be able to use them. A feature that is nice in
some programs and parts of the tidyverse packages is the ability to specify
things like suffixes to be used.

I use lots of computer languages and keep running into people who expect
another language to just support what the first does. If that were true, we
might not create so many. If your other language supports indefinite sized
integers, good for you. Many others do not and perhaps may not easily do so
even if you try to create your own emulation if some other code gets your
version and does not work on it.

So assuming you use an available package or roll your own and can now make
some kind of hash data structure. As you pointed out, hashing may not be
required if your implementation is already fast and hashing can use lots of
memory. What are the allowed keys in your implementation? Will an integer of
1 be distinct from a floating point of 1.0? Can you hash objects of the
half-dozen or so kinds R seems to have? Whatever your answers are to many
questions like these, you may not get quite the same answers in Python or
PERL or ...

R is not really meant to be a general-purpose language although it can be.
It is not really fully-blown Object Oriented  but in many ways it can be
using newer grafts. So if you need or want the things R is designed for and
in particular there are good packages available for your needs, then use R.
If your needs include things not easily part of R, and you have a language
that works for you, why switch?

I will note that there is an intermediate path. I often run programs in
RSTUDIO that include code for both R and Python. The data structures used
sort of convert back and forth as needed so you can begin in R and read in
data and make some changes and then hand it over to Python for more, perhaps
multiple times, then generate a graph within R or whatever.

So if you want a dictionary, you can sort of keep it on the Python side and
use Python commands to create it and add to it or access contents. The
results may be handed over to the R side as needed but not as a dict but
instead to a pairlist or named list/vector or whatever and when needed, you
can have python take your results. The same applies to lots of other things
Python does that R may not do quite the same or at all. I mean generators
and all kinds of object-oriented programs including multiple inheritance and
so much more. You can use each language for what it is good at or that work
with the way you think and with some overhead get the best of both worlds.

Of course, this comes with costs and any programs you send out to be used by
others would require both languages to be installed properly and ...

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Jan van der Laan
Sent: Wednesday, November 3, 2021 5:47 AM
To: r-help using r-project.org
Subject: Re: [R] Is there a hash data structure for R



On 03-11-2021 00:42, Avi Gross via R-help wrote:

> 
> Finally, someone mentioned how creating a data.frame with duplicate 
> names for columns is not a problem as it can automagically CHANGE them 
> to be unique. That is a HUGE problem for using that as a dictionary as 
> the new name will not be known to the system so all kinds of things will
fail.

I think you are referring to my remark which was:

 > However, the data.frame construction method will detect this and  >
generate unique names (which also might not be what you want):

I didn't say this means that duplicate names aren't a problem; I just
mentioned the the behaviour is different. Personally, I would actually
prefer the behaviour of list (keep the duplicated name) with a warning.

Most of the responses seem to assume that the OP actually wants a hash
table. Yes, he did ask for that and for a hash table an environment (with
some work) would be a good option. But in many cases, where other languages
would use a hash-table-like object (such as a dict) in R you would use other
types of objects. Furthermore, for many operations where you might use hash
tables to implement the operation, R has already built in options, for
example %in%, match, duplicated. These are also vectorised; so two vectors:
one with keys and one with values might actually be faster than an
environment in some use cases.

Best,
Jan


> 
> And there are also packages for many features like sets as well as 
> functions to manipulate these things.
> 
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Bill Dunlap
> Sent: Tuesday, November 2, 2021 1:26 PM
> To: Andrew Simmons <akwsimmo using gmail.com>
> Cc: R Help <r-help using r-project.org>
> Subject: Re: [R] Is there a hash data structure for R
> 
> Note that an environment carries a hash table with it, while a named 
> list does not.  I think that looking up an entry in a list causes a 
> hash table to be created and thrown away.  Here are some timings 
> involving setting and getting various numbers of entries in 
> environments and lists.  The times are roughly linear in n for
environments and quadratic for lists.
> 
>> vapply(1e3 * 2 ^ (0:6), f, L=new.env(parent=emptyenv()),
> FUN.VALUE=NA_real_)
> [1] 0.00 0.00 0.00 0.02 0.03 0.06 0.15
>> vapply(1e3 * 2 ^ (0:6), f, L=list(), FUN.VALUE=NA_real_)
> [1]  0.01  0.03  0.15  0.53  2.66 13.66 56.05
>> f
> function(n, L, V = sprintf("V%07d", sample(n, replace=TRUE))) {
>      system.time(for(v in V)L[[v]]<-c(L[[v]],v))["elapsed"] }
> 
> Note that environments do not allow an element named "" (the empty
string).
> 
> Elements named NA_character_ are treated differently in environments 
> and lists, neither of which is great.  You may want your hash table 
> functions to deal with oddball names explicitly.
> 
> -Bill
> 
> On Tue, Nov 2, 2021 at 8:52 AM Andrew Simmons <akwsimmo using gmail.com> wrote:
> 
>> If you're thinking about using environments, I would suggest you 
>> initialize them like
>>
>>
>> x <- new.env(parent = emptyenv())
>>
>>
>> Since environments have parent environments, it means that requesting 
>> a value from that environment can actually return the value stored in 
>> a parent environment (this isn't an issue for [[ or $, this is 
>> exclusively an issue with assign, get, and exists) Or, if you've 
>> already got your values stored in a list that you want to turn into 
>> an
>> environment:
>>
>>
>> x <- list2env(listOfValues, parent = emptyenv())
>>
>>
>> Hope this helps!
>>
>>
>> On Tue, Nov 2, 2021, 06:49 Yonghua Peng <yong using pobox.com> wrote:
>>
>>> But for data.frame the colnames can be duplicated. Am I right?
>>>
>>> Regards.
>>>
>>> On Tue, Nov 2, 2021 at 6:29 PM Jan van der Laan <rhelp using eoos.dds.nl>
>> wrote:
>>>
>>>>
>>>> True, but in a lot of cases where a python user might use a dict an 
>>>> R user will probably use a list; or when we are talking about 
>>>> arrays of dicts in python, the R solution will probably be a 
>>>> data.frame (with
>> each
>>>> dict field in a separate column).
>>>>
>>>> Jan
>>>>
>>>>
>>>>
>>>>
>>>> On 02-11-2021 11:18, Eric Berger wrote:
>>>>> One choice is
>>>>> new.env(hash=TRUE)
>>>>> in the base package
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 2, 2021 at 11:48 AM Yonghua Peng <yong using pobox.com> wrote:
>>>>>
>>>>>> I know this is a newbie question. But how do I implement the hash
>>>> structure
>>>>>> which is available in other languages (in python it's dict)?
>>>>>>
>>>>>> I know there is the list, but list's names can be duplicated here.
>>>>>>
>>>>>>> x <- list(x=1:5,y=month.name,x=3:7)
>>>>>>
>>>>>>> x
>>>>>>
>>>>>> $x
>>>>>>
>>>>>> [1] 1 2 3 4 5
>>>>>>
>>>>>>
>>>>>> $y
>>>>>>
>>>>>>    [1] "January"   "February"  "March"     "April"     "May"
>>>   "June"
>>>>>>
>>>>>>    [7] "July"      "August"    "September" "October"   "November"
>>>> "December"
>>>>>>
>>>>>>
>>>>>> $x
>>>>>>
>>>>>> [1] 3 4 5 6 7
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks a lot.
>>>>>>
>>>>>>           [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>        [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>          [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list