[R] how to convert a data.frame to tree structure object such as dendrogram

Bert Gunter gunter.berton at gene.com
Wed Mar 13 21:12:49 CET 2013


Here is a simpler, less clumsy version of my previous recursive R
solution that I sent you privately, which I'll also cc to the list
this time. It's now almost a one-liner.

To avoid problems with unused factor levels, I still prefer the data
frame columns to be character vectors rather than factors, so:

df <- data.frame(a=c('A','A', 'A', 'B','B','C','C','C'), b=c('Aa',
'Ab','Ab','Ba','Bd', 'C1','C2','C3'), c=c('Aa1', 'Ab1', 'Ab2', 'Ba1',
'Bd2', 'C11','C12','C13'), stringsAsFactors=FALSE)
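
As an aside, the factor-level issue can be illustrated with a small sketch. The names f, idx, by_factor and by_char are made up for this illustration and are not part of the solution. Subsetting a factor keeps all of its levels, so split() returns an empty element for every level that does not occur in the subset:

```r
# Illustration only: split() on a factor keeps every level, used or not.
f   <- factor(c("Aa", "Ab", "Ba"))
idx <- 1:2                                      # rows of one branch only

by_factor <- split(idx, f[idx])                 # grouping by the factor
by_char   <- split(idx, as.character(f[idx]))   # grouping by character values

names(by_factor)            # still lists the unused level "Ba"
length(by_factor[["Ba"]])   # 0, i.e. an empty branch
names(by_char)              # only the values that actually occur
```

If the columns must stay factors, droplevels() or split(..., drop = TRUE) should have a similar effect, though I have not tried either in the recursion.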

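makeTree2 <- function(x, i, n)
{
  ## x: row indices of df in the current branch; i: current column; n: last column
  ## NB: df is looked up in the calling environment, not passed as an argument
  if (i == n) df[x, i]                         ## leaves: values of the last column
  else {
    spl <- split(x, df[x, i])                  ## group the rows by column i
    lapply(spl, function(x) makeTree2(x, i + 1, n))   ## Can't use Recall()
  }
}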

This is now called as

> makeTree2(seq_len(nrow(df)),1,ncol(df))  ## no list structure needed for x
## yielding (with the root implicit now)

$A
$A$Aa
[1] "Aa1"

$A$Ab
[1] "Ab1" "Ab2"


$B
$B$Ba
[1] "Ba1"

$B$Bd
[1] "Bd2"


$C
$C$C1
[1] "C11"

$C$C2
[1] "C12"

$C$C3
[1] "C13"
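
Because makeTree2 reads df from the enclosing environment, it only works for a data frame with that name. A minimal sketch of a self-contained variant follows; the names make_tree, dat and rows are hypothetical, and it assumes a plain data.frame (not a tibble), so that dat[rows, i] returns a vector:

```r
# Hypothetical variant of makeTree2: the data frame is an argument (dat)
# instead of a global, and the recursion still works on row indices.
make_tree <- function(dat, rows = seq_len(nrow(dat)), i = 1L) {
  # Last column reached: return the leaf values for these rows.
  if (i == ncol(dat)) return(dat[rows, i])
  # Group the current rows by column i; as.character() avoids empty
  # branches from unused factor levels.
  groups <- split(rows, as.character(dat[rows, i]))
  lapply(groups, function(r) make_tree(dat, r, i + 1L))
}

dat <- data.frame(a = c("A", "A", "A", "B", "B", "C", "C", "C"),
                  b = c("Aa", "Ab", "Ab", "Ba", "Bd", "C1", "C2", "C3"),
                  c = c("Aa1", "Ab1", "Ab2", "Ba1", "Bd2", "C11", "C12", "C13"),
                  stringsAsFactors = FALSE)
tree <- make_tree(dat)
str(tree)   # compact view of the nested list
```

The defaults for rows and i mean the caller only passes the data frame; the number of columns is taken from dat at each step.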



On Wed, Mar 13, 2013 at 10:25 AM, Not To Miss <not.to.miss at gmail.com> wrote:
> The ideal solution, I think, is probably recursive. At the last minute I
> decided to write a Python script to do this (using Python instead of Perl or
> R because of Python's mutable dict data structure), although I would have
> preferred to keep all my code in one R piece. I post the code here just in
> case you are interested. It generates a dict of dict of dict ...
>
> Hopefully I will not get beaten up for posting Python code on an R mailing
> list. :-)
>
>     import sys
>     tree = {}
>     ## input file is a TAB-delimited table
>     with open(sys.argv[1]) as fh:
>         for line in fh:
>             if line.startswith('#'):
>                 continue
>             items = line.strip().split('\t')
>             tmp = tree
>             for item in items:
>                 # create the child node on first sight, then descend into it
>                 if item not in tmp:
>                     tmp[item] = {}
>                 tmp = tmp[item]
>
> The tree looks like this for the example:
> {'A': {'Aa': {'Aa1': {}}, 'Ab': {'Ab1': {}, 'Ab2': {}}}, 'C': {'C3': {'C13':
> {}}, 'C2': {'C12': {}}, 'C1': {'C11': {}}}, 'B': {'Bd': {'Bd2': {}}, 'Ba':
> {'Ba1': {}}}}
>
> On Wed, Mar 13, 2013 at 10:35 AM, David Winsemius <dwinsemius at comcast.net>
> wrote:
>>
>>
>> On Mar 12, 2013, at 9:22 PM, Not To Miss wrote:
>>
>> Nope, Bert, you miss me? :-D
>>
>> I apologize that I didn't provide a more realistic example and describe
>> the problem more clearly. The real data are just too complicated to post in
>> emails, so I made up a simple example, which perhaps seems a little
>> oversimplified now, but the basic structure is the same. Here is a more
>> appropriate one:
>> >data.frame(a=c('A','A', 'A', 'B','B','C','C','C'), b=c('Aa',
>> > 'Ab','Ab','Ba','Bd', 'C1','C2','C3'), c=c('Aa1', 'Ab1', 'Ab2', 'Ba1', 'Bd2',
>> > 'C11','C12','C13'))
>>   a  b   c
>> 1 A Aa Aa1
>> 2 A Ab Ab1
>> 3 A Ab Ab2
>> 4 B Ba Ba1
>> 5 B Bd Bd2
>> 6 C C1 C11
>> 7 C C2 C12
>> 8 C C3 C13
>>
>> The data structure to convert to:
>>      |---Aa------Aa1
>>  A---|        /--Ab1
>>  |   |---Ab--|
>>  |            \--Ab2
>>  |   |---Ba------Ba1
>>  B---|
>>  |   |---Bd------Bd2
>>  |
>>  |    /---C1-----C11
>>  C---|----C2-----C12
>>       \---C3-----C13
>>
>> It's multi-level nested, and I won't know the number of rows and columns of
>> the data.frame ahead of time. If it's not easy to do in R, I plan to write a
>> Perl script to do the conversion, since I'm more familiar with Perl. Thanks,
>> Don and Greg, for suggesting solutions.
>>
>>
>> After a bit of coding I am going to say your proposed answer is wrong (or
>> at least improperly specified). The first level can be recovered as you
>> suggest :
>>
>> > sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]]) ])
>> $A
>> [1] "Aa" "Ab" "Ab"
>>
>> $B
>> [1] "Ba" "Bd"
>>
>> $C
>> [1] "C1" "C2" "C3"
>>
>>
>> But the second level cannot be recovered the same way. Because grep()
>> matches "C1" as a substring, every third-level item beginning with "C1"
>> (C11, C12 and C13) is grouped under C1, and there are no terminal nodes
>> left for C2 or C3 at the third level.
>>
>> > sapply(unique(dfrm[[2]]), function(x) dfrm[[3]][grep(x, dfrm[[3]]) ])
>> $Aa
>> [1] "Aa1"
>>
>> $Ab
>> [1] "Ab1" "Ab2"
>>
>> $Ba
>> [1] "Ba1"
>>
>> $Bd
>> [1] "Bd2"
>>
>> $C1
>> [1] "C11" "C12" "C13"
>>
>> $C2
>> character(0)
>>
>> $C3
>> character(0)
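
The substring behaviour shown above can be reproduced in isolation. This is only a sketch with made-up vectors b and lvl standing in for the second and third columns of the C rows; it also shows one possible alternative, which is to group by the parent column itself rather than by pattern matching:

```r
# Illustration of the substring issue: grep() treats "C1" as a pattern,
# so it also matches "C11", "C12" and "C13".
b   <- c("C1", "C2", "C3")        # parent labels, one per row
lvl <- c("C11", "C12", "C13")     # leaf labels, one per row

lvl[grep("C1", lvl)]   # all three leaves end up under C1
lvl[grep("C2", lvl)]   # nothing is left for C2

# Grouping by the parent column avoids pattern matching altogether:
# each leaf goes to the parent recorded in its own row.
by_parent <- split(lvl, b)
by_parent
```

This relies on the row-wise pairing of parent and child in the data frame, not on any naming convention in the labels.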
>>
>> lev1 <- sapply(unique(dfrm[[1]]),
>>                function(x) dfrm[[2]][grep(x, dfrm[[2]])])
>> lapply(lev1, function(ll)
>>        lapply(ll, function(lll) dfrm[[3]][grep(lll, dfrm[[3]])]))
>>
>> $A
>> $A[[1]]
>> [1] "Aa1"
>>
>> $A[[2]]
>> [1] "Ab1" "Ab2"
>>
>> $A[[3]]
>> [1] "Ab1" "Ab2"
>>
>>
>> $B
>> $B[[1]]
>> [1] "Ba1"
>>
>> $B[[2]]
>> [1] "Bd2"
>>
>>
>> $C
>> $C[[1]]
>> [1] "C11" "C12" "C13"
>>
>> $C[[2]]
>> character(0)
>>
>> $C[[3]]
>> character(0)
>>
>> --
>> David.
>>
>>
>>
>> On Tue, Mar 12, 2013 at 2:18 PM, Bert Gunter <gunter.berton at gene.com>
>> wrote:
>>>
>>> So Mr. "not.tomiss" missed?
>>>
>>> :(
>>>
>>> -- Bert
>>>
>>> On Tue, Mar 12, 2013 at 1:08 PM, David Winsemius <dwinsemius at comcast.net>
>>> wrote:
>>> >
>>> > On Mar 12, 2013, at 9:37 AM, Not To Miss wrote:
>>> >
>>> >> Thanks. Is there any more elegant solution? What if I don't know how
>>> >> many
>>> >> levels of nesting ahead of time?
>>> >
>>> > It's even worse than what you now offer as a potential complication.
>>> > You did not provide an example of a data object that would illustrate the
>>> > complexity of the task nor what you consider the correct procedure (i.e. the
>>> > order of the columns to be used for splitting) nor the correct results. The
>>> > task is woefully underspecified at the moment. It's a bit akin to asking
>>> > "how do I do classification" without saying what you want to classify.
>>> >
>>> > --
>>> > David.
>>> >>
>>> >>
>>> >> On Tue, Mar 12, 2013 at 8:51 AM, Greg Snow <538280 at gmail.com> wrote:
>>> >>
>>> >>> You can use the lapply or rapply functions on the resulting list to
>>> >>> break
>>> >>> each piece into a list itself, then apply the lapply or rapply
>>> >>> function to
>>> >>> those resulting lists, ...
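
One possible reading of this suggestion, for exactly two levels of nesting, is sketched below. The data frame dat and the names level1 and level2 are invented for the illustration; a deeper or unknown number of levels would need the step repeated or made recursive:

```r
# Sketch: split once by the first column, then use lapply() to split
# each piece again by the second column.
dat <- data.frame(a = c("A", "A", "B"),
                  b = c("Aa", "Ab", "Ba"),
                  c = c("Aa1", "Ab1", "Ba1"),
                  stringsAsFactors = FALSE)

level1 <- split(dat, dat$a)                            # one data frame per value of a
level2 <- lapply(level1, function(d) split(d$c, d$b))  # leaves grouped by b within a

str(level2)
```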
>>> >>>
>>> >>>
>>> >>> On Mon, Mar 11, 2013 at 3:41 PM, Not To Miss
>>> >>> <not.to.miss at gmail.com>wrote:
>>> >>>
>>> >>>> Thanks. That's just a simple example - what if there are more
>>> >>>> columns and
>>> >>>> more rows? Is there any easy way to create a nested list?
>>> >>>>
>>> >>>> Best,
>>> >>>> Zech
>>> >>>>
>>> >>>>
>>> >>>> On Mon, Mar 11, 2013 at 2:12 PM, MacQueen, Don <macqueen1 at llnl.gov>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> You will have to decide what R data structure is a "tree
>>> >>>>> structure". But
>>> >>>>> maybe this will get you started:
>>> >>>>>
>>> >>>>>> foo <- data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
>>> >>>>>> split(foo$y, foo$x)
>>> >>>>> $A
>>> >>>>> [1] "Ab" "Ac"
>>> >>>>>
>>> >>>>> $B
>>> >>>>> [1] "Ba" "Bd"
>>> >>>>>
>>> >>>>> I suppose it is at least a little bit tree-like.
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> Don MacQueen
>>> >>>>>
>>> >>>>> Lawrence Livermore National Laboratory
>>> >>>>> 7000 East Ave., L-627
>>> >>>>> Livermore, CA 94550
>>> >>>>> 925-423-1062
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 3/10/13 9:19 PM, "Not To Miss" <not.to.miss at gmail.com> wrote:
>>> >>>>>
>>> >>>>>> I have a data.frame object like:
>>> >>>>>>
>>> >>>>>>> data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
>>> >>>>>> x  y
>>> >>>>>> 1 A Ab
>>> >>>>>> 2 A Ac
>>> >>>>>> 3 B Ba
>>> >>>>>> 4 B Bd
>>> >>>>>>
>>> >>>>>> how could I create a tree structure object like this:
>>> >>>>>>     |---Ab
>>> >>>>>> A---|
>>> >>>>>> |   |---Ac
>>> >>>>>> |
>>> >>>>>> |   |---Ba
>>> >>>>>> B---|
>>> >>>>>>     |---Bd
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> Zech
>>> >>>>>>
>>> >>>>>>      [[alternative HTML version deleted]]
>>> >>>>>>
>>> >>>>>> ______________________________________________
>>> >>>>>> R-help at r-project.org mailing list
>>> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> >>>>>> PLEASE do read the posting guide
>>> >>>>>> http://www.R-project.org/posting-guide.html
>>> >>>>>> and provide commented, minimal, self-contained, reproducible code.
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Gregory (Greg) L. Snow Ph.D.
>>> >>> 538280 at gmail.com
>>> >>>
>>> >>
>>> >
>>> > David Winsemius
>>> > Alameda, CA, USA
>>> >
>>>
>>>
>>>
>>> --
>>>
>>> Bert Gunter
>>> Genentech Nonclinical Biostatistics
>>>
>>> Internal Contact Info:
>>> Phone: 467-7374
>>> Website:
>>>
>>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>>
>>
>>
>> David Winsemius
>> Alameda, CA, USA
>>
>





