[R] Approach for Storing Result Data

Wed Mar 8 16:27:08 CET 2017

Hi All,

Today I have a more general question about how best to store the 
different values that result from analysing multiple variables.

My task is to compare the distributions in a universe with the 
distributions among the respondents across a whole bunch of variables. 
The comparison is to be done on relative frequencies (proportions).
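
To make "relative frequencies" concrete, here is a minimal sketch in base 
R. The counts and the names n_universe, n_respondents and diff_pp are made 
up for illustration only; they are not my real data.

```r
# Made-up counts per category (illustration only, not real survey data)
n_universe    <- c(A = 500, B = 300, C = 200)
n_respondents <- c(A = 40,  B = 35,  C = 25)

# Relative frequencies in percent
p_universe    <- 100 * prop.table(n_universe)
p_respondents <- 100 * prop.table(n_respondents)

# Difference in percentage points (respondents minus universe)
diff_pp <- p_respondents - p_universe
print(round(diff_pp, 1))
```

A positive difference would mean the category is over-represented among 
the respondents compared with the universe.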

I was thinking about the structure I should store the results in and came 
up with the following:

-- cut --

library(ggplot2)  # needed for ggplot(), aes(), geom_bar() etc. below
library(stringi)

# Result data frame
# Some sort of tidy data set where
# each value is stored in its own row.
# This way all values for all variables can be stored in
# one single data structure.
# If an additional variable is added for the name of the
# survey, one could also build a result data set across
# surveys.
# Values for measure could be "number" for 'raw' values or
# "freq" for frequencies/counts.
# Values for unit could be "n" for 'numbers' and
# "%" for percentages.
d_test <- data.frame(
    group = rep(c("Universe", "Respondents"), each = 16),
    variable = rep("State", 32),
    value = rep(c(11.3,
                    12.7,
                    3.3,
                    5,
                    0.6,
                    8.1,
                    6.2,
                    5.8,
                    6.4,
                    14.5,
                    8.3,
                    0.3,
                    3.8,
                    2.5,
                    8.1,
                    3), 2),
    label = rep(c("Baden-Wuerttemberg",
                "Bayern",
                "Berlin",
                "Brandenburg",
                "Bremen",
                "Hamburg",
                "Hessen",
                "Mecklenburg-Vorpommern",
                "Niedersachsen",
                "Nordrhein-Westfalen",
                "Rheinland-Pfalz",
                "Saarland",
                "Sachsen",
                "Sachsen-Anhalt",
                "Schleswig-Holstein",
                "Thueringen"),2),
    measure = rep("freq", 32),
    unit = rep("%", 32),
    stringsAsFactors = FALSE
)

# This way the variables can be selected using simple
# value selection from Base R functionality.
data <- d_test[d_test$variable == "State" ,]

# And plot results for every variable.
ggplot(
  data = data,
  aes(
    x = label,
    y = value,
    fill = group)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_discrete(name = stringi::stri_trans_totitle(names(data)[1])) +
  scale_x_discrete(name = data$variable[1]) +
  # value is numeric, so the y axis needs a continuous scale
  scale_y_continuous(name = data$unit[1])

-- cut --

The reporting / presentation is done in R Markdown. I would load the 
result data set once at the beginning and then run the comparisons as 
plots for each variable listed under "variable" in the result data set.
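
Roughly, the loop I have in mind would look like the sketch below. The 
helper plot_variable(), the data frame d_res and all values in it are made 
up for illustration; it only assumes the same long format as above and 
that ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical miniature result set in the long format described above
# (all values are invented for illustration)
d_res <- data.frame(
    group = rep(c("Universe", "Respondents"), each = 2, times = 2),
    variable = rep(c("State", "Gender"), each = 4),
    label = c("Bayern", "Berlin", "Bayern", "Berlin",
              "female", "male", "female", "male"),
    value = c(60, 40, 55, 45, 51, 49, 58, 42),
    unit = "%",
    stringsAsFactors = FALSE
)

# Hypothetical helper: one dodged bar chart for a single variable
plot_variable <- function(d, var) {
    sub <- d[d$variable == var, ]
    ggplot(sub, aes(x = label, y = value, fill = group)) +
        geom_col(position = "dodge") +
        labs(x = var, y = sub$unit[1], fill = "Group")
}

# Build one plot per distinct entry in the "variable" column
vars <- unique(d_res$variable)
plots <- lapply(vars, function(v) plot_variable(d_res, v))
names(plots) <- vars

# Inside an R Markdown chunk each plot has to be printed explicitly
for (v in names(plots)) print(plots[[v]])
```

With this layout, adding a new variable to the result data set would 
produce one more plot without any change to the reporting code.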

If I follow this approach for my customer relationship survey, do you 
think I would face drawbacks or run into serious trouble?

I am interested in your opinion and open to other approaches and 
suggestions.

Kind regards

Georg