[R] FW: new to R: don't understand errors

Fridolin Wild fridolin.wild at wu-wien.ac.at
Wed Oct 4 11:58:51 CEST 2006


Hello Jerad,

> It was suggested I contact you for possible help with this issue. Well,
> as you can see for the emails below, that is what I was told at R-help.
> Any insight to my lsa problems (also listed below) would be of great
> help.

From what I see, the problem probably does lie in the
text files: for performance reasons, it was not possible to
include any "check" routines that exclude a file if it contains
no words (or only words below minDocFreq) and thus produces
an empty column vector.

I am pretty sure that you do not want to use minDocFreq
with a value like 50 (it would mean that a term in a document
is only included if it appears at least 50 times in *that*
document).

I will send you the alpha release of the updated lsa package
in a separate message. It also includes a parameter called
minGlobFreq, which filters out terms that appear fewer
than x times in the whole document collection. I guess that is
what you were looking for.
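
To make the difference between the two thresholds concrete, here
is a minimal base-R sketch on a toy collection. It follows the
descriptions given in this message and is not the code used inside
the lsa package; filter_within_doc, filter_global, and the toy
data are hypothetical names made up for illustration.

```r
# Toy collection: each element stands for one tokenized document
# (hypothetical data, not from the original thread).
docs <- list(
  d1 = c("dog", "dog", "cat"),
  d2 = c("dog", "bird"),
  d3 = c("cat", "cat", "cat")
)

# Term counts per document.
counts <- lapply(docs, table)

# Within-document threshold: keep a term in a document only if it
# occurs at least min_doc_freq times in *that* document.
filter_within_doc <- function(counts, min_doc_freq) {
  lapply(counts, function(tab) names(tab)[as.vector(tab) >= min_doc_freq])
}

# Collection-wide threshold: keep a term only if it occurs at least
# min_glob_freq times summed over all documents (assumption: this
# mirrors the description of minGlobFreq given in this message).
filter_global <- function(docs, min_glob_freq) {
  total <- table(unlist(docs, use.names = FALSE))
  names(total)[as.vector(total) >= min_glob_freq]
}

within2 <- filter_within_doc(counts, 2)
global3 <- filter_global(docs, 3)
```

With a within-document threshold of 2, document d2 keeps no terms
at all, which is exactly the empty-column situation described
above. The collection-wide threshold of 3 keeps "cat" and "dog"
without emptying any single document by itself.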

Concerning the sanitizing: if you set minDocFreq to 1
and set minWordLength to 1, you should not get an error
with your document collection, as you are then basically
taking everything (even a single character appearing
only once). It is probably not so problematic, as the
LSA step will group these low-frequency terms
in a lower-order factor anyway. Of course you will still get
an error if you use documents that are completely empty,
so delete all zero-byte documents beforehand.
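
Finding the empty files can be done with base R alone. The
following is only a sketch: non_empty_files is a hypothetical
helper (not part of lsa), and the demonstration runs in a
temporary directory rather than on a real corpus.

```r
# Return the paths of all non-empty regular files in `dir`.
# Hypothetical helper, not part of the lsa package.
non_empty_files <- function(dir) {
  paths <- list.files(dir, full.names = TRUE)
  info <- file.info(paths)
  paths[!info$isdir & info$size > 0]
}

# Small demonstration in a fresh temporary directory.
demo_dir <- tempfile("corpus")
dir.create(demo_dir)
writeLines("some text about dogs", file.path(demo_dir, "a.txt"))
file.create(file.path(demo_dir, "empty.txt"))  # zero-byte file

good <- non_empty_files(demo_dir)
basename(good)
```

The files that pass the check could then be copied to a separate
directory with file.copy() before the text matrix is built, which
leaves the original collection untouched.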

I am thinking about what to do with this sanitizing part.
It is not a good idea to integrate that into the
textmatrix method -- it would slow things down
tremendously.

So what about this idea: does it make sense to provide a
collection of sanitizing methods that help to select the
files you want to work with (copy them to a different
directory or just return a list with the filenames of
the ones that are "good")? What should we do with other
sanitizing options (deleting URLs from texts, deleting
short words, etc.)?
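
As a rough idea of what such helpers might look like, here is a
base-R sketch. Everything in it is an assumption for discussion:
sanitize_text and good_files are hypothetical names, the URL
pattern is deliberately simple, and none of this exists in the
lsa package.

```r
# Hypothetical helper: strip URLs and short words from one text
# and return the remaining lower-cased words.
sanitize_text <- function(text, min_word_length = 2) {
  text <- gsub("(https?|ftp)://[^[:space:]]+", " ", text)  # drop URLs
  words <- unlist(strsplit(tolower(text), "[^[:alnum:]]+"))
  words[nchar(words) >= min_word_length]
}

# Hypothetical helper: return the names of the files that still
# contain at least one word after sanitizing.
good_files <- function(paths, min_word_length = 2) {
  keep <- vapply(paths, function(p) {
    text <- paste(readLines(p, warn = FALSE), collapse = " ")
    length(sanitize_text(text, min_word_length)) > 0
  }, logical(1))
  unname(paths[keep])
}

# Demonstration: one normal file, one that contains only a link.
demo_dir <- tempfile("corpus")
dir.create(demo_dir)
writeLines("Dogs chase cats, see http://example.org/pets",
           file.path(demo_dir, "a.txt"))
writeLines("http://example.org/only-a-link",
           file.path(demo_dir, "b.txt"))

paths <- list.files(demo_dir, full.names = TRUE)
kept <- good_files(paths)
basename(kept)
```

Returning a list of filenames keeps this step separate from
building the text matrix, so the cost of the checks is only paid
by those who ask for them.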

Hope I could be of help,

Best,
Fridolin

-- 
Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746


