[R] Excluding "small data" from plot.

Kieran kroberts012 at gmail.com
Wed Feb 17 10:19:47 CET 2016


To R-help users:

I want to use ggplot to plot summary statistics on the frequency of
letters from a page of text. My data frame has four columns:

(1) The line number [1 to 30]
(2) The letter [a to z]
(3) The frequency of the letter [assuming there are 80 letters per line]
(4) The factor 'type': bad or good (purely artificial factor)

I want to achieve the following plot:

(a) A bar plot with the letters on the x-axis and, on the y-axis, the sum
of each letter's frequencies over the 30 lines.
(b) Split each bar (for a letter) into two bars for 'good' and 'bad' types.
(c) Display the union of the top 8 most frequently used letters for both
types 'good' and 'bad'.

By point (c) I mean: if a,e,f,h,i,t,s,r are the most frequent letters of type
'good' and a,e,f,h,i,m,l,p are the most frequent letters of type 'bad', then
I would like my plot to feature the letters a,e,f,h,i,t,s,r,m,l,p.
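
To make the intended set concrete, the example in point (c) corresponds to a plain set union. The short sketch below uses base R's union() on the two example letter sets; the names good, bad and shown are only illustrative and do not appear in my code further down.

```r
# Example letter sets from the description of point (c).
good <- c("a", "e", "f", "h", "i", "t", "s", "r")
bad  <- c("a", "e", "f", "h", "i", "m", "l", "p")

# union() keeps each letter once, in order of first appearance.
shown <- union(good, bad)
print(shown)
```

The result contains 11 letters, because five of the eight letters are shared between the two types.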

Here is my code:

# There will be 30 lines and we want to record the frequency of each letter
# on each line.

lines <- c(rep(1:30, each=26))
letter <- c(rep(letters, times=30))

# We have taken the letter frequencies from
# http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html

freq <- c(8.12, 1.49, 2.71, 4.32, 12.02, 2.30, 2.03, 5.92, 7.31, 0.10, 0.69,
3.98, 2.61, 6.95, 7.68, 1.82, 0.11, 6.02, 6.28, 9.10, 2.88, 1.11, 2.09, 0.17,
2.11, 0.07)
freq <- freq/100


# We assume each line contains 80 letters and change the seed for each line
# for variability.

letterfreq <- integer()
for (i in 1:30) {
    set.seed(i)
    s<-data.frame(sample(letters, size = 80, replace = TRUE, prob = freq))
    names(s) <- "ltr"
    s$ltr <- factor(s$ltr, levels = letters)
    frq<-as.data.frame(table(s))
    letterfreq <- append(letterfreq, frq$Freq)
}

ltrfreq <- data.frame(lines, letter, letterfreq)

# Add an artificial factor column _type_: good/bad. So each pair
# (line, letter) has type 'good' or 'bad' with equal probability.
# Set the seed for reproducibility.

set.seed(999)
ltrfreq$type <-  factor(sample(c("good","bad"), size = 780, replace = TRUE,
    prob = c(0.5,0.5)))


# Here is the plot I want but this includes all 26 letters.

library(ggplot2)

# Note: in ggplot2 >= 3.3.0 the argument fun.y is deprecated in favour of fun.
ggplot(ltrfreq, aes(x=factor(letter), y=letterfreq, fill=type)) +
  stat_summary(fun.y=sum, position=position_dodge(), geom="bar")
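
A minimal sketch of one way the selection in point (c) might be computed, using base R only. The helper top_letters_union and the objects toy and keep are hypothetical names introduced for illustration; the sketch assumes a data frame with columns letter, letterfreq and type, as in ltrfreq, and it runs on a small made-up data set rather than on the simulated data above.

```r
# Hypothetical helper: union of the n most frequent letters within each type.
# Assumes columns named letter, letterfreq and type (as in ltrfreq).
top_letters_union <- function(df, n = 8) {
  # Sum the per-line counts for every (letter, type) pair.
  totals <- aggregate(letterfreq ~ letter + type, data = df, FUN = sum)
  # For each type, take the n letters with the largest totals.
  top_by_type <- lapply(split(totals, totals$type), function(d) {
    ranked <- d$letter[order(d$letterfreq, decreasing = TRUE)]
    as.character(ranked[seq_len(min(n, nrow(d)))])
  })
  # Union across types, sorted alphabetically.
  sort(unique(unlist(top_by_type, use.names = FALSE)))
}

# Small deterministic toy data set (made up, not the simulated data).
toy <- data.frame(
  letter     = rep(c("a", "b", "c", "d"), times = 2),
  letterfreq = c(9, 1, 5, 2,  1, 8, 6, 2),
  type       = rep(c("good", "bad"), each = 4)
)

keep <- top_letters_union(toy, n = 2)
print(keep)
```

With the real data, the plot could then presumably be restricted by passing subset(ltrfreq, letter %in% keep) to ggplot() in place of ltrfreq. Ties at the cut-off are broken by the ordering that order() returns, which may or may not be what is wanted.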

Best regards,
Kieran.


