[R] working with texts

Thu Jul 2 22:52:07 CEST 2009

From the CRAN task view on natural language processing:

The package tm provides a comprehensive text mining framework for R.
The Journal of Statistical Software article "Text Mining
Infrastructure in R" gives a detailed overview and presents
techniques for count-based analysis methods, text clustering, text
classification and string kernels.

Worth looking into?
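
If tm turns out to be more than is needed, the count matrix described in the quoted message below can also be sketched in base R. This is only an illustration under assumptions: the two vectors are hypothetical stand-ins for readLines("titles.txt") and readLines("words.txt"), the helper name count_word is made up, matching is on whole words, and the words are assumed to contain no regular-expression metacharacters.

```r
# Hypothetical stand-ins for readLines("titles.txt") and readLines("words.txt")
titles <- c("a brief history of polio vaccines",
            "making science - between nature and society",
            "technology assessment and the sociopolitics of health technologies",
            "the policy of science and technology - evolution of research policy")
words <- c("technology", "science", "policy", "history")

# Count whole-word occurrences of one word in each line.
# gregexpr() returns -1 for a line without a match, so only positive
# match positions are counted.
count_word <- function(word, lines) {
  pattern <- paste0("\\b", word, "\\b")
  hits <- gregexpr(pattern, lines, perl = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Rows are words, columns are lines: counts[i, j] is the number of
# times word i appears in line j.
counts <- t(vapply(words, count_word, integer(length(titles)),
                   lines = titles))

# Possible next step toward co-word analysis: number of lines in which
# two words both occur at least once.
present <- (counts > 0) * 1L
cooc <- present %*% t(present)
```

Because of the word boundaries, "technologies" is not counted as "technology". For real files, replacing the two vectors with the readLines() calls should be the only change needed.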

-Don

At 12:59 PM -0700 7/2/09, Helter Two wrote:
>WinXP, R-2.9.1
>
>LS.,
>
>I have been trying to solve a (for me) tricky issue. No matter what I've
>tried, I just can't find a way to do this.
>This is the issue:
>
>I have a text file (ansi text) "titles.txt" with lines of text; here is
>an example of such a file:
>
>>>>>>
>a brief history of polio vaccines
>anti-vaccination movements and their interpretations
>early warning in the light of theories of technological change
>international mobility among nordic doctoral students
>land of hope and glory: exploring cochlear implantation in the netherlands
>making science - between nature and society
>medical innovations in historical-perspective
>photographing medicine - images and power in britain and america since 1840
>shifts in global immunisation goals (1984-2004): unfinished agendas and mixed results
>striking the mother lode in science - the importance of age, place, and time
>technology assessment and the sociopolitics of health technologies
>the policy of science and technology - evolution of research policy - france, the united-kingdom, the federal-republic-of-germany, japan, the united-states - french
>vaccine independence, local competences and globalisation: lessons from the history of pertussis vaccines
>external assessment and conditional financing of research in dutch universities
>histories of cochlear implantation
>lock in, the state and vaccine development: lessons from the history of the polio vaccines
>peerless science - peer-review and united-states science policy
>technology, science, and obstetric practice - the origins and transformation of cephalopelvimetry
>the rhetoric and counter-rhetoric of a ''bionic'' technology
>vaccine innovation and adoption: polio vaccines in the uk, the netherlands and west germany, 1955-1965
><<<<<
>
>Some of the lines in such a file are very long (not in this example).
>The file contains titles and abstracts of scientific articles.
>
>In addition to this file, I also have a file "words.txt" that includes a
>set of words I want to analyze. Part of this file:
>  >>>>>
>technology
>technological
>innovations
>science
>policy
>society
>history
><<<<<
>
>What I want is to create a matrix in which cell [i,j] contains the
>number of times word i (i.e., the ith word from "words.txt") appears in
>line j of "titles.txt".
>
>So, for the data above this would yield (barring any typos on my side):
>0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
>0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
>0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 2 1 0 0
>0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0
>0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
>
>This is the precursor to co-word analysis and some basic statistics on
>these titles and abstracts.
>I have always had a hard time working with text in R and still have no
>idea how to achieve the results above. I am probably overlooking
>something pretty straightforward. But right now, I am completely in the
>dark.
>
>Any help is very much appreciated,
>
>Peter Verbeet

-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062