[R] Removing words and initials with tm

Jim Lemon drjimlemon at gmail.com
Fri Apr 10 13:30:55 CEST 2015


Hi Sun,
Good thinking. Looking at your reply, I realized that you may be able to
run a spell checker over the output to pick up mangled words.
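
A rough sketch of that idea in base R, in case it helps. The word list
`known_words` is a tiny made-up stand-in for a real dictionary and
`flag_unknown` is a hypothetical helper, so treat this as an illustration
rather than a finished tool:

```r
# Stand-in dictionary; in practice this would be a real word list.
# `known_words` is invented for this example.
known_words <- c("it", "became", "clear", "because", "of", "the", "delay")

# Hypothetical helper: return the tokens that are not in the dictionary.
flag_unknown <- function(text, dictionary) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[nzchar(tokens)]          # drop empty strings
  unique(tokens[!tokens %in% dictionary])
}

# Text as it might look after "am" and "ec" were stripped from every word
mangled <- "it bec e clear b ause of the delay"
flagged <- flag_unknown(mangled, known_words)
print(flagged)
```

Note that a check like this cannot catch 'arrival' => 'rival', since the
leftover is itself a valid word.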

Jim


On Fri, Apr 10, 2015 at 9:17 PM, Sun Shine <phaedrusv at gmail.com> wrote:

>  Hey Jim
>
> So far I've re-run the process and sub'bed initials and proper names with
> blank space, and changed other names (including acronyms) to something less
> tricky (your e.g. #1 NMR is therefore "NucMagRes", etc.) *before* I
> converted to lower case. By and large, that seems to cut it, at least for
> my present purposes.
>
> I don't have a workaround for your e.g. #2 though!
>
> One really has to have a relatively decent handle on the scope of the
> variations and text content first. I'm not sure how one would do this kind
> of thing effectively on a large and unseen corpus.
>
> Anyway, thanks for your reply and thoughts.
>
> Sun
>
>
> On 10/04/15 11:38, Jim Lemon wrote:
>
> Hi Sun,
> In fact, case sensitivity is the default in functions like "sub". The
> problem may then become separating initials from acronyms if they are
> present in the corpus:
>
>  gsub("NM","","An NMR was performed on NM Jones")
> [1] "An R was performed on  Jones"
>
>  How you are going to deal with names like York may also be tricky:
>
>  gsub("York","","Reginald York took a holiday in New York.")
> [1] "Reginald  took a holiday in New ."
>
>  Jim
>
>
> On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>
>> Hi list
>>
>> Using the tm package, part of the pre-processing work is to remove words,
>> etc. from the corpus.
>>
>> I wish to remove people's names and also their initials, which are
>> peppered throughout the corpus. But some people's initials are the same
>> as parts of common words, so removing them mangles those words - e.g.
>> 'am': 'became' => 'bec e', 'ec': 'because' => 'b ause', or 'ar':
>> 'arrival' => 'rival' (which has a completely different meaning).
>>
>> Is there any way of doing this without leaving a trail of nonsense
>> half-terms behind? I suspect that it might have something to do with
>> regular expressions, but to be honest, I'm (currently) pretty crap with
>> those.
>>
>> Would it make a difference if I removed initials and names *prior* to
>> converting all text to lower case? That way I would remove 'AM', and
>> because 'became' is lower case, it should remain unaffected.
>>
>> Any recommendations on how best to proceed with this?
>>
>> Thanks as always.
>> Sun
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
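
For the original question further down the thread, one possible approach
is to anchor each set of initials with word boundaries and to do the
removal before converting to lower case. This is only a sketch: the
`initials` vector and the sample sentence are made up, and with tm the
gsub() call would presumably be wrapped in content_transformer() and
passed to tm_map().

```r
# Hypothetical list of initials to strip; apply this BEFORE tolower().
initials <- c("AM", "EC", "NM")

# \\b matches a word boundary, so "NM" is removed only when it stands
# alone, not when it is part of "NMR". Matching is case sensitive by
# default, so lower-case "became" and "because" are left untouched.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

text <- "An NMR was performed on NM Jones, who became ill because AM left."
cleaned <- gsub(pattern, "", text, perl = TRUE)

# Collapse the double spaces left behind, then convert to lower case.
cleaned <- tolower(gsub(" +", " ", cleaned))
print(cleaned)
```

This does nothing for the "York" case, where the same whole word is both
a surname and part of a place name; that still needs knowledge of the
corpus.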
