[R] Creating a data frame from scratch

G.Maubach at gmx.de
Tue May 24 20:54:44 CEST 2016


Hi All,

I need to create a data frame from scratch and fill variables created on the fly with values. What I have so far:

-- snip --

# Example dataset
gene <- c("ENSG00000208234","ENSG00000199674","ENSG00000221622","ENSG00000207604", 
  "ENSG00000207431","ENSG00000221312","ENSG00134940305","ENSG00394039490",
  "ENSG09943004048")
hsap <- c(0,0,0, 0, 0, 0, 1,1, 1)
mmul <- c(NA,2 ,3, NA, 2, 1 , NA,2, NA)
mmus <- c(NA,2 ,NA, NA, NA, 2 , NA,3, 1)
rnor <- c(NA,2 ,NA, 1 , NA, 3 , NA,NA, 2)
cfam <- c(NA,2,NA, 2, 1, 2, 2,NA, NA)

ds_example <- data.frame(gene, hsap, mmul, mmus, rnor, cfam)
ds_example$gene <- as.character(ds_example$gene)

t_count_na <- function(dataset,
                       variables = "all")
  # credit: http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame
  {
  ds_na <- data.frame()
  # if variables = "all" create character vector of variable names
  if (variables == "all") {
    variable_list <- dimnames(dataset)[[ 2 ]] 
  }
  # if a character vector with variable names is given
  # to run the function on a defined set of selected variables
  else {
    variable_list <- variables
  }
  
  for (var in variable_list) {
    new_name <- paste0("na_", var)
    ds_na[[ new_name ]] <- as.data.frame(is.na(dataset[[ var ]]))
  }
  
  ds_na[[ "na_count" ]] <- rowSums(ds_na)
  return(ds_na)
}

test <- t_count_na(dataset = ds_example, variables = c("mmul", "mmus"))

-- snip --

gives:

 Error in `[[<-.data.frame`(`*tmp*`, new_name, value = list(`is.na(dataset[[var]])` = c(TRUE,  : 
  replacement has 9 rows, data has 0 In addition: Warning message:
In if (variables == "all") { :
  the condition has length > 1 and only the first element will be used

My goal is to create a dataset from scratch on the fly which has the same number of variables as the dataset ds_example plus a single variable storing the number of NAs in a row for the given variables. This is the basis for a decision on which cases to keep and which to drop.
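
A minimal sketch of the row-wise NA count described above, for the case where only the count itself is needed. The data frame `ds_small` and the threshold `max_na` are hypothetical and used only for illustration; the keep/drop rule is an assumption, not part of the original code.

```r
# Hypothetical small data set, modelled on two columns of ds_example
ds_small <- data.frame(
  mmul = c(NA, 2, 3, NA, 2, 1, NA, 2, NA),
  mmus = c(NA, 2, NA, NA, NA, 2, NA, 3, 1)
)

# is.na() on a data frame returns a logical matrix;
# rowSums() then counts the TRUE values in each row
na_count <- rowSums(is.na(ds_small[, c("mmul", "mmus")]))

# Assumed rule for illustration: keep rows with at most max_na missing values
max_na <- 1
keep <- na_count <= max_na
```

Here `na_count` and `keep` are separate vectors, so `ds_small` itself is not modified.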

I do not want to alter the base dataset (ds_example here), nor do I want to make a copy of the existing dataset, because of the memory this would require. The function should also work with big data, e.g. datasets that take up more than 1 GB of memory.

I also do not want the newly created variables to be stored in the original data frame. They shall be separate.

A similar earlier solution worked:
http://r.789695.n4.nabble.com/Creating-variables-on-the-fly-td4720034.html

Why doesn't this one?
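
The two messages shown above can be reproduced in isolation, which may help narrow down the cause. This is a sketch using hypothetical names (`empty_df`, `assign_result`, `cmp`, `use_all`). It shows that `data.frame()` has zero rows, so adding a longer column fails, and that `variables == "all"` yields one logical value per element of `variables`. Note that from R 4.2.0 onwards a condition of length greater than one in `if ()` is an error rather than a warning.

```r
# An empty data frame has zero rows, so a longer column cannot be added to it
empty_df <- data.frame()
n_rows_before <- nrow(empty_df)

assign_result <- tryCatch(
  {
    empty_df[["na_mmul"]] <- c(TRUE, FALSE, TRUE)
    "assigned"
  },
  error = function(e) "failed"
)

# Comparing a length-2 vector with "all" yields two logical values, not one
variables <- c("mmul", "mmus")
cmp <- variables == "all"

# identical() always returns a single TRUE or FALSE
use_all <- identical(variables, "all")
```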

How do I create the variables within the data frame if the data frame is empty?
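
One possible approach, sketched under assumptions: collect the NA flags in a list first and convert the list to a data frame in a single step, so that the row count is never fixed at zero. The function name `count_na_flags` and the data frame `ds_demo` are hypothetical. The flags are new logical vectors, so some extra memory is still needed, but the input data frame is not modified.

```r
# Sketch of an alternative; count_na_flags is a hypothetical name
count_na_flags <- function(dataset, variables = "all") {
  # identical() gives a single TRUE/FALSE even if variables has several elements
  if (identical(variables, "all")) {
    variables <- names(dataset)
  }

  # One logical vector per selected variable, collected in a named list
  flags <- lapply(dataset[variables], is.na)
  names(flags) <- paste0("na_", variables)

  # Build the data frame in one step so the row count comes from the flags
  ds_na <- as.data.frame(flags)
  ds_na[["na_count"]] <- rowSums(ds_na)
  ds_na
}

# Hypothetical demo data
ds_demo <- data.frame(
  gene = c("g1", "g2", "g3"),
  mmul = c(NA, 2, 3),
  mmus = c(NA, 2, NA),
  stringsAsFactors = FALSE
)

result <- count_na_flags(ds_demo, variables = c("mmul", "mmus"))
```

The result holds only the `na_` columns and `na_count`, separate from the input data.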

Kind regards

Georg Maubach


