[R] script to data clear

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Tue Aug 12 17:06:49 CEST 2014


Without a representative sample of data, it is very hard to understand your question or to be specific about suggestions. See [1] for some ideas about how to communicate questions online.

Note that "clearing" data would usually mean deleting it, as in rm(data). From context I assume you mean "cleaning", where invalid characters need to be removed.

Also assuming that you have a data frame with some columns that are categorical data:

1) If the values are contaminated or incomplete (the data don't have rows representing every possible category) then it is almost always better to delay converting to factor until after the data are cleaned. The read.table family of functions includes a "stringsAsFactors=FALSE" option that prevents automatic conversion of character columns into factors. This is also useful for contaminated numeric columns, which are otherwise read in as factors too. Only after the vector of character data is clean and as complete as it can be should you convert to factor.
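As a minimal sketch of that import step (no sample data were posted, so the file contents, the column names "size" and "count", and the object name dta are all made up for illustration):

```r
# Hypothetical sample standing in for the real file; the column names
# "size" and "count" are assumptions for illustration only.
csv_text <- "size,count
Small ,3
  medium,x
LARGE,7"

# stringsAsFactors = FALSE keeps text columns as character vectors
dta <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Both columns come in as character: "count" is contaminated by the "x"
str(dta)
```

With the columns left as character, the stray spaces and the bad "x" can be dealt with before anything is turned into a factor or a number.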

Note that most data sets have a variety of column types, and even after resolving the issues discussed here your function is not necessarily going to work with every input data file that you encounter. Specifically, not every column of data should be converted to factor. With this in mind, it can be helpful to look for ways to confirm that the data you are processing are what you expect them to be. Often this is implemented by confirming that specific columns have specific kinds of data in them. That is, using a loop over every column may be TOO flexible, so apply this cleaning loop cautiously.
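One way such a check might look, again with made-up column names and a made-up object name dta (this is only a sketch, not a general validation scheme):

```r
# Hypothetical data frame; the column names are assumptions for illustration.
dta <- data.frame(size = c("small", "large"),
                  count = c(3, 7),
                  stringsAsFactors = FALSE)

# Stop early if the input does not look the way the cleaning code expects
stopifnot(
  all(c("size", "count") %in% names(dta)),
  is.character(dta$size),
  is.numeric(dta$count)
)

# Restrict the cleaning to the columns known to hold categorical text
categorical_cols <- c("size")
```

Looping over categorical_cols instead of over every column keeps the cleaning away from columns it was never meant for.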

2) Most functions in R can process whole vectors of data at once, so your inner loop should not be necessary. Specifically, the line

data[[i]] <- gsub( " +", " ", data[[i]] )

would replace all sequences of one or more spaces in every element of the vector with a single space.

(Your j loop also runs too many times: str_replace_all(data[[i]], "  ", " ") already affects the whole column on each pass, so you repeat it unnecessarily.)
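A small illustration of the vectorized call on a made-up character vector (the names x and cleaned are only for this example):

```r
# Made-up vector with runs of two, five, three and one space(s)
x <- c("a  b", "c     d   e", "f g")

# One call handles every element and every run length; no inner loop needed
cleaned <- gsub(" +", " ", x)
cleaned
# [1] "a b"   "c d e" "f g"
```

Because the pattern " +" matches a run of any length, a single pass is enough even where there are more than two spaces in a row.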

3) I don't know what a "depurate" value is.

4) You should be able to convert your cleaned character column to factor with the "factor" function... like

data[[i]] <- factor( data[[i]] )

Note that if you know certain levels should be possible but not all of them are actually present (e.g. "Small", "Medium", and "Large" but no data with "Small" are present) then you will need to specify the levels as a parameter to the factor function. See the help file ?factor.
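For example, with a made-up vector in which "small" never occurs (the names sizes and f are only for illustration):

```r
sizes <- c("medium", "large", "large", "medium")

# Without levels=, only the values that actually occur become levels
levels(factor(sizes))

# With levels=, the absent category is kept and the order is fixed
f <- factor(sizes, levels = c("small", "medium", "large"))
levels(f)

# "small" is listed with a count of 0
table(f)
```

Keep in mind that any value not listed in levels= becomes NA, so the character data need to be clean before this step.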

5) You have several lines of code at the end that appear to execute regardless of whether the column is a factor or not. They should be within the braces of the if statement.
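Putting the points so far together, the loop might be restructured roughly as follows. This is only a sketch: it assumes the data were imported with stringsAsFactors=FALSE, that every character column really is categorical (which, as cautioned under point 1, is not always true), and it uses base R's trimws() in place of str_trim(). The data frame dta and its columns are made up.

```r
# Hypothetical data frame, imported without automatic factor conversion
dta <- data.frame(size = c("  Small", "LARGE  ", "small   "),
                  count = c(3, 7, 2),
                  stringsAsFactors = FALSE)

for (i in seq_along(dta)) {
  if (is.character(dta[[i]])) {
    # Every cleaning step stays inside the braces, so numeric
    # columns such as "count" are left untouched
    x <- gsub(" +", " ", dta[[i]])  # collapse runs of spaces
    x <- trimws(x)                  # drop leading/trailing whitespace
    x <- tolower(x)
    dta[[i]] <- factor(x)           # convert only after cleaning
  }
}

str(dta)
```

In the original script tolower() ran on every column, which would silently turn a numeric column into character; keeping it inside the if block avoids that.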

6) Please read the Posting Guide mentioned at the end of this and every post on this list, specifically regarding posting in plain text. Your code was partially damaged by the HTML email format.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On August 12, 2014 5:42:13 AM PDT, "Maicel Monzón Pérez" <maicel at infomed.sld.cu> wrote:
>Hello List,
>
>I did this script to clear data after import (I don't know is ok ).
>After its execution levels and label values got lost. Could some
>explain me to reassign levels again in the script (new depurate value)?
>
>Best regard
>
>Maicel Monzon MD, PHD
>
>Center of Cybernetic Apply to Medicine
>
># data cleaning  script
>
>library(stringr)
>
>for(i in 1:length(data)) { 
>
>  if (is.factor(data[[i]])==T) 
>
>  {for(j in 1:sum(str_detect(data[,i], "  "))) 
>
>  {data[[i]]<-str_replace_all(data[[i]], "  ", " ")}}
>
>  data[[i]]<-str_trim (data[[i]],side = "both")
>
>  data[[i]]<-tolower(data[[i]])
>
>}
>
>Note: "  " is 2 blank spaces and " " only one
>
>	[[alternative HTML version deleted]]
>
>
>
>------------------------------------------------------------------------
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.


