[R] how to remove factors from whole dataframe?

Bert Gunter bgunter.4567 using gmail.com
Mon Sep 20 03:31:23 CEST 2021


I am not trying to "get you"; but you need to do your homework before
posting. Factor implementation is fully explained in Section 2.3.1 of
the "R Language Definition." You can also search on "enumerated
types" (mentioned in the Help page), a long-established C construct,
for a fuller explanation. Which makes all of your following remarks
just a bit pointless, no?

"I made a series of factors of various kinds such as integer and
logical and character and they all are simply a class of "factor" as
they have an attribute of 'class' and a typeof() "integer"  as the
payload is now a series of small integers.  A dictionary of sorts is
kept in the attribute of 'levels' but that seems to always be of type
character as in: ... etc."

Cheers,
Bert

On Sun, Sep 19, 2021 at 5:48 PM Avi Gross via R-help
<r-help using r-project.org> wrote:
>
> Bert, you got me. Factors seem to be implemented as a sort of one-way street. Mea maxima culpa!
>
> I did some experiments and clearly I misunderstood the way factors in R are set up. I made a series of factors of various kinds such as integer and logical and character and they all are simply a class of "factor" as they have an attribute of 'class' and a typeof() "integer"  as the payload is now a series of small integers.  A dictionary of sorts is kept in the attribute of 'levels' but that seems to always be of type character as in:
>
> levels(somelogsfac)
> [1] "FALSE" "TRUE"
> > typeof(levels(somelogsfac)[1])
> [1] "character"
>
>
> So it sounds like if I read in or create something like a data.frame and I convert some columns to factors to get the results I want like more compressed storage or maybe get ggplot to use my data in some specified order, I also lose any idea of what it was once. That is fine for some purposes such as where the info is wanted in text form, perhaps less so if it has to constantly be converted back into some other form.
>
> But it seems to be a destructive operation in the sense that once done, there is no info preserved as to what was there before. Mind you, that is not really a problem as doing many transformations like as.integer() also replaces what was with what now is.
>
> But in a sense it can be made reversible if you choose to extend it a bit. Below is some code I threw together that if called instead of "factor()" will hide away the original type so it can be used later as an argument to as() to bring back the original type. This is not a change to the original factor() function but I wonder if it would cause any incompatibility if this was made standard. Yes, it costs a little extra in storage.
>
>
> > ####
>   > # PURPOSE: Given a factor that has embedded knowledge of what
>   > # the underlying type once was, carefully reconstruct a vector
>   > # to put it back where it should have been. If this is a normal R
>   > # vector or factor, just return it as type character.
>   > # If an attribute recording the original type
>   > # was saved, convert to that type.
>   >
>   > unfactormem <- function(unorig) {
>     +   what_kind <- attr(unorig, "OnceWas")
>     +   unorigchar <- as.character(unorig)
>     +   if (is.null(what_kind)) return(unorigchar)
>     +   as(unorigchar, what_kind)
>     + }
> >
>   > # Make a logical vector
>   > somelogs <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
> >
>   > # convert it to factor the new way that saves that it was logical.
>   > somelogsfac <- factormem(somelogs)
> >
>   > # display what it looks like
>   > somelogsfac
> [1] TRUE  FALSE FALSE TRUE  FALSE
> attr(,"OnceWas")
> [1] logical
> Levels: FALSE TRUE
> >
>   > attributes(somelogsfac)
> $levels
> [1] "FALSE" "TRUE"
>
> $class
> [1] "factor"
>
> $OnceWas
> [1] "logical"
>
> >
>   > attr(somelogsfac, "OnceWas")
> [1] "logical"
> >
>   > # Revivify it by hand, not the function I made.
>   > revived <- as(as.character(somelogsfac), attr(somelogsfac, "OnceWas"))
> >
>   > # show what it looks like:
>   > revived
> [1]  TRUE FALSE FALSE  TRUE FALSE
> >
>   > typeof(revived)
> [1] "logical"
> >
>   > # Alternately, use the function I created to bring it back.
>   >
>   > unfactormem(somelogsfac)
> [1]  TRUE FALSE FALSE  TRUE FALSE
> >
>   > typeof(unfactormem(somelogsfac))
> [1] "logical"
> >
>   > # A complex example:
>   >
>   > somecomplex <- c( 3+5i, 4+6i, 13, 4+6i, 7i, 13)
> > typeof(somecomplex)
> [1] "complex"
> > somecomplexfac <- factormem(somecomplex)
> > attributes(somecomplexfac)
> $levels
> [1] "0+7i"  "3+5i"  "4+6i"  "13+0i"
>
> $class
> [1] "factor"
>
> $OnceWas
> [1] "complex"
>
> > revivified <- unfactormem(somecomplexfac)
> > revivified
> [1]  3+5i  4+6i 13+0i  4+6i  0+7i 13+0i
> > typeof(revivified)
> [1] "complex"
>
>
> AGAIN, I am not asking this be done as a change in R, just that I think it is an idea of how to be able to say undo factorization properly when needed, such as before saving it to disk in some data structure, or when passing it for some other analysis where a non-factor form would work better.
>
> NOTE: I threw this together quickly and may well have made errors or not made it bulletproof. Feel free to point out where I am wrong or how to improve it.
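
The transcript above shows unfactormem() but never the definition of factormem(). A minimal sketch of what such a helper might look like follows. It assumes the helper simply records class(orig)[1] in the "OnceWas" attribute, which is consistent with the printed attributes but is a guess, not the code that was actually run, and it assumes only basic atomic types that as() can coerce from character.

```r
library(methods)  # provides as(); attached by default in interactive sessions

# Hypothetical reconstruction: factor() plus a record of the original class.
factormem <- function(orig, ...) {
  fac <- factor(orig, ...)
  attr(fac, "OnceWas") <- class(orig)[1]
  fac
}

# Same logic as the unfactormem() shown in the transcript.
unfactormem <- function(unorig) {
  what_kind <- attr(unorig, "OnceWas")
  unorigchar <- as.character(unorig)
  if (is.null(what_kind)) return(unorigchar)  # plain factor: fall back to text
  as(unorigchar, what_kind)
}

somelogs <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
somelogsfac <- factormem(somelogs)
revived <- unfactormem(somelogsfac)
```

Two caveats on this design: the round trip goes through as.character(), so doubles may not come back bit-for-bit, and base operations that rebuild a factor (subsetting, for example) generally keep only the levels and class, so the extra attribute may not survive them.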
>
> I also note my approach is partially based on interactions I once had with Adrian Dusa when he shared his package "declared" and he needed to maintain additional info on some data brought into R that had multiple distinct categories of missing data. It had to carefully use attributes but also did much more to integrate the functionality more fully. So, yes, that might be something that could be done but this is just an academic exercise for me.
>
> -----Original Message-----
> From: Bert Gunter <bgunter.4567 using gmail.com>
> Sent: Sunday, September 19, 2021 7:19 PM
> To: Avi Gross <avigross using verizon.net>
> Cc: Luigi Marongiu <marongiu.luigi using gmail.com>; Rui Barradas <ruipbarradas using sapo.pt>; r-help <r-help using r-project.org>
> Subject: Re: [R] how to remove factors from whole dataframe?
>
> You do not understand factors. There is no "base type" that can be recovered.
>
> > f <- factor(c(5.1, 6.2), labels = c("whoa","baby"))
> > f
> [1] whoa baby
> Levels: whoa baby
>
> > unclass(f)
> [1] 1 2
> attr(,"levels")
> [1] "whoa" "baby"
>
> > typeof(f)
> [1] "integer"
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
> On Sun, Sep 19, 2021 at 2:15 PM Avi Gross via R-help <r-help using r-project.org> wrote:
> >
> > Glad we have solutions BUT I note that the more abstract question is how to convert any columns that are factors to their base type and that may well NOT be character. They can be integers or doubles or complex or Boolean and maybe even raw.
> >
> > So undoing factorization may require using something like typeof() to get the base type and then depending on what final type you have, you may have to do things like as.integer(as.character(the_factor)) to get it as an integer and for a logical, as.logical(factor(c(TRUE, TRUE, FALSE, TRUE, FALSE))) and so on.
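
One detail behind the as.integer(as.character(the_factor)) form: applied directly to a factor, as.integer() or as.numeric() returns the internal codes, not the numbers the levels spell out. A small sketch with made-up data (the values are illustrative only):

```r
# A factor built from numbers; its levels are the strings "10", "20", "30".
the_factor <- factor(c(20, 10, 30, 10))

# Direct coercion returns the internal codes (positions in the level set).
codes <- as.integer(the_factor)

# Going through character parses the level text back into numbers.
values <- as.integer(as.character(the_factor))

# Equivalent idiom from ?factor: convert the levels once, then index by code.
values2 <- as.numeric(levels(the_factor))[the_factor]
```

Here codes is 2 1 3 1 while values is 20 10 30 10, which is the difference the intermediate as.character() step is there to avoid.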
> >
> > This seems like a fairly basic need so I wonder if anyone has already
> > done it. I can see a fairly straightforward way to build a string and
> > use eval and I suspect others might use something else like do.call()
> > and yet others use multiple if statements or a case_when or something
> >
> >
> >
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Luigi
> > Marongiu
> > Sent: Sunday, September 19, 2021 4:43 PM
> > To: Rui Barradas <ruipbarradas using sapo.pt>
> > Cc: r-help <r-help using r-project.org>
> > Subject: Re: [R] how to remove factors from whole dataframe?
> >
> > Awesome, thanks!
> >
> > On Sun, Sep 19, 2021 at 4:22 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> > >
> > > Hello,
> > >
> > > Using Jim's lapply(., is.factor) but simplified, you could do
> > >
> > >
> > > df1 <- df
> > > i <- sapply(df1, is.factor)
> > > df1[i] <- lapply(df1[i], as.character)
> > >
> > >
> > > > A one-liner that modifies df itself, rather than the copy df1, is
> > >
> > >
> > > df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)],
> > > as.character)
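
Put together with the data frame from the original question, the approach above can be run end to end. This sketch uses vapply() in place of sapply() only to make the logical return type explicit; that substitution is an editorial choice and not part of the suggestion above.

```r
# Data frame from the original question.
df <- data.frame(region  = factor(c("A", "B", "C", "D", "E")),
                 sales   = c(13, 16, 22, 27, 34),
                 country = factor(c("a", "b", "c", "d", "e")))

# TRUE for each column that is a factor.
is_fac <- vapply(df, is.factor, logical(1))

# Replace only the factor columns with their character form.
df[is_fac] <- lapply(df[is_fac], as.character)

str(df)
```

Non-factor columns such as sales are left untouched because they are never selected by is_fac.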
> > >
> > >
> > > Hope this helps,
> > >
> > > Rui Barradas
> > >
> > > > At 11:03 on 19/09/21, Luigi Marongiu wrote:
> > > > Thank you Jim, but I obtain:
> > > > ```
> > > >> str(df)
> > > > 'data.frame': 5 obs. of  3 variables:
> > > >   $ region : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
> > > >   $ sales  : num  13 16 22 27 34
> > > >   $ country: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> > > >> df1<-df[,!unlist(lapply(df,is.factor))]
> > > >> str(df1)
> > > >   num [1:5] 13 16 22 27 34
> > > >> df1
> > > > [1] 13 16 22 27 34
> > > > ```
> > > > I was expecting
> > > > ```
> > > > str(df)
> > > > 'data.frame': 5 obs. of  3 variables:
> > > >   $ region : char "A","B","C","D",..: 1 2 3 4 5
> > > >   $ sales  : num  13 16 22 27 34
> > > > >   $ country: char "a","b","c","d",..: 1 2 3 4 5
> > > > > ```
> > > >
> > > > On Sun, Sep 19, 2021 at 11:37 AM Jim Lemon <drjimlemon using gmail.com> wrote:
> > > >>
> > > >> Hi Luigi,
> > > >> It's easy:
> > > >>
> > > >> df1<-df[,!unlist(lapply(df,is.factor))]
> > > >>
> > > >> _except_ when there is only one column left, as in your example.
> > > >> In that case, you will have to coerce the resulting vector back
> > > >> into a one column data frame.
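
The one-column case can also be handled at subsetting time instead of coercing afterwards: passing drop = FALSE to `[` keeps the result a data frame, and so does list-style indexing with a single argument. A short sketch using the data frame from the question:

```r
df <- data.frame(region  = factor(c("A", "B", "C", "D", "E")),
                 sales   = c(13, 16, 22, 27, 34),
                 country = factor(c("a", "b", "c", "d", "e")))

# TRUE for the columns that are NOT factors.
keep <- !vapply(df, is.factor, logical(1))

dropped  <- df[, keep]                # one column left: collapses to a vector
kept_df  <- df[, keep, drop = FALSE]  # stays a one-column data frame
kept_df2 <- df[keep]                  # list-style indexing also stays a data frame
```

Note that this removes the factor columns rather than converting them, which is what the line above does as well.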
> > > >>
> > > >> Jim
> > > >>
> > > >> On Sun, Sep 19, 2021 at 6:18 PM Luigi Marongiu <marongiu.luigi using gmail.com> wrote:
> > > >>>
> > > >>> Hello,
> > > > >>> I would like to remove factors from all the columns of a dataframe.
> > > > >>> I can do it one column at a time with
> > > > >>> ```
> > > >>>
> > > >>> df <- data.frame(region=factor(c('A', 'B', 'C', 'D', 'E')),
> > > >>>                   sales = c(13, 16, 22, 27, 34),
> > > >>> country=factor(c('a', 'b', 'c', 'd', 'e')))
> > > >>>
> > > > >>> df$region <- droplevels(df$region)
> > > > >>> ```
> > > >>>
> > > >>> What is the syntax to remove all factors at once (from all columns)?
> > > >>> For this does not work:
> > > >>> ```
> > > >>>> str(df)
> > > >>> 'data.frame': 5 obs. of  3 variables:
> > > >>>   $ region : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
> > > >>>   $ sales  : num  13 16 22 27 34
> > > >>>   $ country: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> > > >>>> df = droplevels(df)
> > > >>>> str(df)
> > > >>> 'data.frame': 5 obs. of  3 variables:
> > > >>>   $ region : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
> > > >>>   $ sales  : num  13 16 22 27 34
> > > >>>   $ country: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> > > > >>> ```
> > > > >>> Thank you
> > > >>>
> > > >>> ______________________________________________
> > > >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more,
> > > >>> see https://stat.ethz.ch/mailman/listinfo/r-help
> > > >>> PLEASE do read the posting guide
> > > >>> http://www.R-project.org/posting-guide.html
> > > >>> and provide commented, minimal, self-contained, reproducible code.
> > > >
> > > >
> > > >
> >
> >
> >
> > --
> > Best regards,
> > Luigi
> >


