[R] tm package: handling contractions

Fri Jan 27 18:14:21 CET 2012

This may not be the answer to your problem, but you could use gsub to swap the "pretty apostrophe" for the one tm recognizes. Also note that this may be due to your use of Word, which automatically inserts the "pretty apostrophe". The default setting in MS Word can be changed to avoid this.

#===============================
# using gsub
x <- "I didn’t know!"
x <- gsub("’", "'", x)
removePunctuation(x)
#===============================
# You could make that into a function and apply it to the corpus with tm_map
exchanger <- function(x) gsub("’", "'", x)
corp <- tm_map(corp, exchanger)
#===============================
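
A slightly more defensive variant of the same idea, offered as a sketch rather than a definitive recipe. It writes the curly quotes as Unicode escapes so the script does not depend on the encoding of the source file. The helper name normalize_apostrophes is made up for illustration. Two assumptions to note: the iconv step assumes your platform recognizes the encoding name "CP1252" (the byte 0x92 is the curly apostrophe in that code page), and the tm step assumes a tm release that provides content_transformer(), which newer versions expect around custom functions passed to tm_map.

```r
# Map typographic single quotes to the plain ASCII apostrophe (U+0027).
# "\u2018" is the left and "\u2019" the right single quotation mark.
normalize_apostrophes <- function(x) {
  gsub("[\u2018\u2019]", "'", x)
}

x <- "I didn\u2019t know!"
normalize_apostrophes(x)

# If the text arrived as raw Windows-1252 bytes, convert it to UTF-8
# before normalizing. Assumption: iconv on this platform knows "CP1252";
# if it does not, iconv returns NA.
bytes <- rawToChar(as.raw(c(0x64, 0x6f, 0x6e, 0x92, 0x74)))
y <- normalize_apostrophes(iconv(bytes, from = "CP1252", to = "UTF-8"))

# Applying the helper to a corpus. This part only runs if tm is installed,
# and assumes a tm version that exports content_transformer().
if (requireNamespace("tm", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource("we\u2019ve said it\u2019s fine"))
  corp <- tm::tm_map(corp, tm::content_transformer(normalize_apostrophes))
}
```

Running the normalization before removePunctuation means the apostrophe is treated like any other ASCII punctuation mark, so a contraction stays one token instead of being split at the curly quote.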

Cheers,
Tyler

----------------------------------------
> Date: Fri, 27 Jan 2012 09:50:51 -0500
> From: friendly at yorku.ca
> To: r-help at r-project.org
> Subject: [R] tm package: handling contractions
>
> I tried making a wordcloud of Obama's State of the Union address using
> the tm package to process the text
>
> sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
> sotu <- tolower(sotu)
> corp <-Corpus(VectorSource(paste(sotu, collapse=" ")))
>
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, stemDocument)
> corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
> tdm <- TermDocumentMatrix(corp)
> m <- as.matrix(tdm)
> v <- sort(rowSums(m),decreasing=TRUE)
> d <- data.frame(word = names(v),freq=v)
>
> wordcloud(d$word,d$freq)
>
> I ended up with a large number of contractions that were split at the
> "’" character, e.g., "don’t" --> "don'"
> e.g.,
>
> > sotu[grep("’", sotu)]
> [1] "qaeda’s" "taliban’s" "america’s" "they’re" "don’t"
> [6] "we’re" "aren’t" "we’ve" "patton’s" "what’s"
> [11] "let’s" "weren’t," "couldn’t" "people’s" "didn’t"
> [16] "we’ve" "we’ve" "we’ve" "i’m" "that’s"
> [21] "world’s" "what’s" "can’t" "that’s" "it’s"
> [26] "lock’s" "let’s" "you’re" "shouldn’t" "you’re"
> [31] "you’re" "it’s" "i’ll" "we’re" "don’t"
> [36] "we’ve" "it’s" "it’s" "it’s" "they’re"
> ...
> [201] "didn’t" "bush’s" "didn’t" "can’t" "there’s"
> [206] "i’m" "other’s" "we’re"
> >
>
> NB: What appears as the ' character above is actually the character hex 92,
> not hex 27, on my Windows system.
>
> This should be a common problem in text processing, but I don't see a
> transformation in the tm package that
> handles this nicely. Is there something I've missed?
>
> -Michael
>
> --
> Michael Friendly Email: friendly AT yorku DOT ca
> Professor, Psychology Dept.
> York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
> 4700 Keele Street Web: http://www.datavis.ca
> Toronto, ONT M3J 1P3 CANADA