[R] How to reduce the sparseness in a TDM to make a cluster plot readable?

Fri Sep 18 14:02:31 CEST 2020

Hello Jim

Thanks for that. I'll read up on it and will give it a go, either later 
today or tomorrow. I am assuming this will work for both tf and tf-idf 
weighted TDMs?

Much appreciated. :-)

Best wishes
Andy
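
For reference, a minimal sketch of the "top twenty or thirty terms" idea
discussed further down this thread. It uses only base R on a simulated count
matrix, so every object name here (m, term_totals, top_n, min_total) is
hypothetical and not taken from the original data. With tm, the matrix
would presumably come from as.matrix(tdm). Because it only ranks row sums,
the same steps should apply to a tf-idf weighted matrix, where the ranking is
by summed weight rather than by raw count.

```r
# Simulated stand-in for as.matrix(tdm): 200 terms (rows) x 5 documents (columns).
# Assumption: the real matrix has term names as row names, as tm provides.
set.seed(1)
m <- matrix(rpois(200 * 5, lambda = 3), ncol = 5,
            dimnames = list(paste0("term", 1:200), paste0("doc", 1:5)))

# Total weight of each term across all documents, most frequent first.
term_totals <- sort(rowSums(m), decreasing = TRUE)

# Keep only the top_n highest-ranked terms (30 here; adjust to taste).
top_n <- 30
top_terms <- names(head(term_totals, top_n))
m_top <- m[top_terms, , drop = FALSE]

# Cluster the reduced matrix so the dendrogram has only top_n leaves.
hc <- hclust(dist(m_top, method = "euclidean"), method = "complete")
plot(hc, main = "Hierarchical clustering (top terms)", xlab = "", sub = "")

# Threshold alternative, in the spirit of findFreqTerms(): keep every term
# whose total reaches min_total, and raise min_total until few enough remain.
min_total <- 20
m_freq <- m[rowSums(m) >= min_total, , drop = FALSE]
nrow(m_freq)  # number of terms that survive the threshold
```

The top-N version fixes the number of leaves directly, while the threshold
version needs a few tries to land on a readable size.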

On 18/09/2020 09:18, Jim Lemon wrote:
> Hi Andrew,
> From your last email, the answer to your problem may be the
> findFreqTerms() function. Just increase the number of times a term has
> to appear and check the result until you get the matrix size that you
> want.
>
> Jim
>
> On Fri, Sep 18, 2020 at 5:32 PM Andrew <phaedrusv using gmail.com> wrote:
>> Hi Abby
>>
>> Many thanks for reaching out with an offer of help. Very much appreciated.
>>
>> (1) The packages I'm using are 'tm' for text-mining and the TDM and for
>> the clustering it is 'cluster'
>> (2) Not sure where the problem is happening as it doesn't show up as an
>> error. Where it manifests is in the plotting, however logic would
>> suggest that it concerns the removal of sparse terms, so that would be
>> in the TDM process
>> (3) I don't think I can provide a reproducible example. When I practice
>> using data sets that packages provide, all is fine. The trouble is when
>> I apply it to my own data sets which are five documents, etc., as described.
>>
>> I think the nub of it is really to find a way that I can subset the TDM
>> to return the twenty or thirty most frequently used words, and then to
>> plot those using hclust. However, when searching on-line I haven't been
>> able to find any suggestions on how to do that, nor is there any mention
>> of using that approach in the books and tutorials I have.
>>
>> If you (or someone on this list) can advise on how I can sort the terms
>> in the TDM from most to least frequent, and then to subset the top
>> twenty or thirty most frequently occurring terms (preferably using tf as
>> well as tf-idf) and then I can plot that sub-set, then I think that that
>> would do the trick, and the terms would be plotted clearly and legibly.
>>
>> Thanks again for your offer of help. I hope that my reply helps clarify
>> rather than muddy the situation.
>>
>> Best wishes
>> Andy
>>
>>
>> On 17/09/2020 08:43, Abby Spurdle wrote:
>>> I'm not familiar with these subjects.
>>> And hopefully, someone who is, will offer some better suggestions.
>>>
>>> But to get things started, maybe...
>>> (1) What packages are you using (re: tdm)?
>>> (2) Where does the problem happen, in dist, hclust, the plot method
>>> for hclust, or in the package(s) you are using?
>>> (3) Do you think you could produce a small reproducible example,
>>> showing what is wrong, and explaining what you would like it to do instead?
>>>
>>> Note that if the problem relates to hclust, or the plot method, then
>>> you should be able to produce a much simpler example.
>>> e.g.
>>>
>>>       mycount.matrix <- matrix (rpois (25000, 20),, 5)
>>>       head (mycount.matrix, 3)
>>>       tail (mycount.matrix, 3)
>>>
>>>       plot (hclust (dist (mycount.matrix) ) )
>>>
>>> On Tue, Sep 15, 2020 at 6:54 AM Andrew <phaedrusv using gmail.com> wrote:
>>>> Hello all
>>>>
>>>> I am doing some text mining on a set of five plain text files and have
>>>> run into a snag when I run hclust in that there are just too many leaves
>>>> for anything to be read. It returns a solid black line.
>>>>
>>>> The texts have been converted into a TDM with dimensions 5,292 x 5
>>>> (terms by documents).
>>>>
>>>> My code for removing sparsity is as follows:
>>>>
>>>>    > tdm2 <- removeSparseTerms(tdm, sparse=0.99999)
>>>>
>>>>    > inspect(tdm2)
>>>>
>>>> <<TermDocumentMatrix (terms: 5292, documents: 5)>>
>>>> Non-/sparse entries: 10415/16045
>>>> Sparsity           : 61%
>>>> Maximal term length: 22
>>>> Weighting          : term frequency (tf)
>>>>
>>>> While the tf-idf weighting returns this when 0.99999 sparseness is removed:
>>>>
>>>>    > inspect(tdm.tfidf)
>>>> <<TermDocumentMatrix (terms: 5292, documents: 5)>>
>>>> Non-/sparse entries: 7915/18545
>>>> Sparsity           : 70%
>>>> Maximal term length: 22
>>>> Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
>>>>
>>>> I have experimented with lowering the value of the sparse argument,
>>>> and that helps a bit, for example:
>>>>
>>>>    > tdm2 <- removeSparseTerms(tdm, sparse=0.215)
>>>>    > inspect(tdm2)
>>>> <<TermDocumentMatrix (terms: 869, documents: 5)>>
>>>> Non-/sparse entries: 3976/369
>>>> Sparsity           : 8%
>>>> Maximal term length: 14
>>>> Weighting          : term frequency (tf)
>>>>
>>>> But, no matter what I do, the resulting plot is unreadable. The code for
>>>> plotting the cluster is:
>>>>
>>>>    > hc <- hclust(dist(tdm2, method = "euclidean"), method = "complete")
>>>>    > plot(hc, yaxt = 'n', main = "Hierarchical clustering")
>>>>
>>>> Can someone kindly either advise me what I am doing wrong and/or
>>>> signpost me to some detailed info on how to fix this?
>>>>
>>>> Many thanks in anticipation.
>>>>
>>>> Andy
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.