[R] For loop on column names

Bert Gunter gunter.berton at gene.com
Sat Jan 18 16:28:37 CET 2014


I doubt it.

1. The OP failed to specify how "populatedness" is defined. Is it
NULL, NA, "", " ",...?

2. What is percent() ? Is this the OP's function or one from a package
or pseudocode or ... ?

3.  lapply(df, f)
is generally preferable in R to:
for (name in colnames(df)) f(df[, name])
where f stands for whatever function is to be applied to each column.

The former packages everything neatly in a list, while with the latter
you are stuck mucking about with canonical naming schemes and/or
assignments that may clutter up your workspace. The plyr package may
also be helpful here, especially for a novice.
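
To make points 1 and 3 concrete, here is a minimal sketch, not a
definitive answer. The toy data frame and the helper is_populated() are
made up for illustration, the definition of "populated" (not NA and not
blank after trimming) is only one possible assumption, and sprintf()
stands in for percent(), whose origin is unknown. trimws() needs
R >= 3.2.0.

```r
# Toy data frame standing in for the OP's file (hypothetical values).
mydf <- data.frame(
  PARTY_ID = c(1, 2, NA, 4),
  CITY     = c("Oslo", "", " ", NA),
  stringsAsFactors = FALSE
)

# is.null() tests the object as a whole, so for any column it returns a
# single FALSE, and length() of that result is always 1.
length(is.null(mydf$PARTY_ID))  # 1

# Assumed definition: a value is populated if it is not NA and, once
# converted to character and trimmed, is not the empty string.
is_populated <- function(x) !is.na(x) & nzchar(trimws(as.character(x)))

# Fraction populated per column. sapply() is lapply() with the result
# simplified to a named numeric vector.
frac <- sapply(mydf, function(col) mean(is_populated(col)))

# Base-R stand-in for percent(): format the fractions as percentages.
pct <- sprintf("%.1f%%", 100 * frac)
names(pct) <- names(frac)
print(pct)
```

Because the result is one named vector, nothing else is left behind in
the workspace, and changing the definition of "populated" means editing
only is_populated().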

Given the OP's admitted ignorance of both programming and R, it seems
to me that the obvious advice is to stop knocking around in the dark
this way and spend time with some R tutorials. A good R book, perhaps
tuned to his/her discipline, would probably also be a worthwhile
purchase.

Cheers,

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
H. Gilbert Welch




On Sat, Jan 18, 2014 at 2:40 AM, Frede Aakmann Tøgersen
<frtog at vestas.com> wrote:
> Hi
>
> Try
>
> for (cname in colnames(mydf))
>  print(percent(length(is.null(mydf[, cname])) / lines))
>
> Br. Frede
>
>
> -------- Original message --------
> From: Jeff Johnson
> Date: 18/01/2014 02.10 (GMT+01:00)
> To: R help
> Subject: [R] For loop on column names
>
> I'm trying to find a more efficient way to calculate the percent a field is
> populated and repeat it for each field (column).
>
> First, I'm counting the number of lines:
> lines <- as.integer(countLines(extract) - 1)
> dput(lines)
> 100000L
>
> extract <- 'C:/Users/jeffjohn/Desktop/batchextract_100k_sample.csv'
> mydf <- read.csv(file = extract, header = TRUE)
>
> Here's the list of columns in my file:
>> dput(colnames(mydf))
> c("PERSONPROFILE_POS", "PARTY_ID", "PERSON_FIRST_NAME", "PERSON_LAST_NAME",
> "PERSON_MIDDLE_NAME", "PARTY_NUMBER", "ACCOUNT_NUMBER", "ABILITEC_LINK",
> "ADDRESS1", "ADDRESS2", "ADDRESS3", "ADDRESS4", "CITY", "COUNTY",
> "STATE", "PROVINCE", "POSTAL_CODE", "COUNTRY", "PRIMARY_PER_TYPE",
> "SELLTOADDR_LOS", "LOCATION_ID", "SELLTOADDR_SOS", "PARTY_SITE_ID",
> "PRIMARYPHONE_CPOS", "CONTACT_POINT_ID_PCP", "CONTACT_POINT_PURPOSE_PCP",
> "PHONE_LINE_TYPE", "PRIMARY_FLAG_PCP", "PHONE_COUNTRY_CODE",
> "PHONE_AREA_CODE", "PHONE_NUMBER", "EMAIL_CPOS", "CONTACT_POINT_ID_ECP",
> "CONTACT_POINT_PURPOSE_ECP", "PRIMARY_FLAG_ECP", "EMAIL_ADDRESS",
> "BB_PARTY_ID")
>
> I want to count the percentage populated for each field. Rather than do:
> percent(length(is.null(mydf$PERSONPROFILE_POS)) / lines)
> percent(length(is.null(mydf$PARTY_ID)) / lines)
> etc.
> and repeat for each field manually, I want to use a for loop.
>
> I am trying the following:
> a <- length(colnames(mydf)) # this is to get the total number of columns
>
> for (i in 1:a)
>  print((percent(length(is.null(a)) / lines))
>
> which isn't correct. I'm new to programming, so I don't quite know how to
> deal with this. Any suggestions? Thanks much.
> --
> Jeff
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



