[R] Using a text file as a removeWord dictionary in tm_map

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Tue Mar 3 16:47:47 CET 2015


You seem to be conflating the data input operation with your data processing. Stop and examine the in-memory representation of your data ("userStopList") and compare it with the expectations of your processing function ("tm_map"). Then adjust your input: choose a different input function, change the parameters you are using, or add a manipulation step in between that fixes the data so it suits the task.
Real-world data analysis is rarely handled cleanly by one or two function calls. Take responsibility for ensuring the quality of your data by looking at it.
I am not familiar with tm_map, but scan tends to be quite literal about your specification. I suspect that the elements in your userStopList have leading spaces in them because scan is not removing them, so tm_map is then only matching instances in your text where those spaces are present.
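A minimal sketch of that check, using a temporary file in place of the poster's stopDict.txt (path and contents here are illustrative, not the poster's actual file). Two things are worth noting: scan()'s text= argument parses the supplied string itself, so a file path belongs in the file= argument, and strip.white = TRUE drops the spaces that follow each comma:

```r
## Write a small demo stop-word file in the same format the poster described
tmp <- tempfile(fileext = ".txt")
writeLines("a, bunch, of, words, to, use", tmp)

## Wrong: 'text =' treats the string itself as the input, so scan
## parses the file *path*, not the file's contents
bad <- scan(text = tmp, what = character(), sep = ",", quiet = TRUE)

## Right: 'file =' reads the file; strip.white = TRUE removes the
## leading spaces after each comma, which would otherwise make the
## elements fail to match bare words in the corpus
userStopList <- scan(file = tmp, what = character(), sep = ",",
                     strip.white = TRUE, quiet = TRUE)
str(userStopList)  # always inspect: should be a clean character vector
userStopList
# [1] "a" "bunch" "of" "words" "to" "use"
```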
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On March 3, 2015 6:43:39 AM PST, Sun Shine <phaedrusv at gmail.com> wrote:
>Hi again
>
>I've now had the chance to try this out, and using scan() doesn't seem 
>to work either.
>
>This is what I used:
>
>1) I generated a plain text file called stopDict.txt. This file is of 
>the format: "a, bunch, of, words, to, use"
>
>2) I invoked scan(), like this:
>> userStopList <- scan(text = '~/path/to/stopDict.txt', what = " ", sep = ",")
>
>3) Then I used the externally generated list as stop words:
> > docs <- tm_map(docs, removeWords, userStopList)
>
>4) When I go to inspect the document, at least two of the user-defined 
>stop words are still in the text
>
>Is there a further argument I should be passing to scan(), or is the 
>stopDict.txt file not set up the correct way? I tried each term 
>separated by ' ' and ',', (e.g. 'all', 'the', 'text') but that didn't 
>work, neither does it seem to work when the whole list is enclosed 
>within quotes (e.g. "all, the, text").
>
>While it's not critical to be able to read in an externally generated 
>list, it sure would be helpful.
>
>Thanks.
>
>Sun
>
>
>On 02/03/15 07:36, Sun Shine wrote:
>> Thanks Jim.
>>
>> I thought that I was passing a vector, not realising I had converted 
>> this to a list object.
>>
>> I haven't come across the scan() function so far, so this is good to 
>> know.
>>
>> Good explanation - I'll give this a go when I can get back to that 
>> piece of work later today.
>>
>> Thanks again.
>>
>> Regards,
>>
>> Sun
>>
>>
>> On 01/03/15 21:13, jim holtman wrote:
>>> The 'read.table' call was creating a data.frame (not a vector), and
>>> applying 'c' to it converted it to a list.  You should always look at
>>> the object you are creating.  You probably want to use 'scan'.
>>>
>>> ======================
>>>> testFile <- 
>>>> "Although,this,query,applies,specifically,to,the,tm,package"
>>>> # read in with read.table create a data.frame
>>>> df_words <- read.table(text = testFile, sep = ',')
>>>> df_words  # not a vector
>>>          V1   V2    V3      V4           V5 V6  V7 V8      V9
>>> 1 Although this query applies specifically to the tm package
>>>> c(df_words)  # this results in a list
>>> $V1
>>> [1] Although
>>> Levels: Although
>>> $V2
>>> [1] this
>>> Levels: this
>>> $V3
>>> [1] query
>>> Levels: query
>>> $V4
>>> [1] applies
>>> Levels: applies
>>> $V5
>>> [1] specifically
>>> Levels: specifically
>>> $V6
>>> [1] to
>>> Levels: to
>>> $V7
>>> [1] the
>>> Levels: the
>>> $V8
>>> [1] tm
>>> Levels: tm
>>> $V9
>>> [1] package
>>> Levels: package
>>>> # now read with 'scan'
>>>> scan_words <- scan(text = testFile, what = '', sep = ',')
>>> Read 9 items
>>>> scan_words
>>> [1] "Although"     "this"         "query"        "applies"
>>> "specifically" "to"
>>> [7] "the"          "tm"           "package"
>>>>
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>>
>>>
>>> On Sat, Feb 28, 2015 at 8:46 AM, Sun Shine <phaedrusv at gmail.com>
>wrote:
>>>> Hi list
>>>>
>>>> Although this query applies specifically to the tm package, perhaps
>>>> it's something that others might be able to lend a thought to.
>>>>
>>>> Using tm to do some initial text mining, I want to include an
>>>> external (to R) generated dictionary of words that I want removed
>>>> from the corpus.
>>>>
>>>> I have created a comma separated list of terms in " " marks in a
>>>> stopList.txt plain UTF-8 file. I want to read this into R, so do:
>>>>
>>>>> stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')
>>>> When I want to load it as part of the removeWords function in tm, I do:
>>>>
>>>>> docs <- tm_map(docs, removeWords, stopDict)
>>>> which has no effect. Neither does:
>>>>
>>>>> docs <- tm_map(docs, removeWords, c(stopDict))
>>>> What am I not seeing/ doing?
>>>>
>>>> How do I pass a text file with pre-defined terms to the removeWords
>>>> transform of tm?
>>>>
>>>> Thanks for any ideas.
>>>>
>>>> Cheers
>>>>
>>>> Sun
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide 
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>
