jeanhee.chung at yale.edu
Sat Jan 11 06:33:03 CET 2003
>1) I suggest you try a postscript() device, and convert later if you need
>to. Expect a very large file size.
Dear Dr. Ripley,
Thank you! Postscript was able to finish the job (bitmap killed itself.)
The file sizes are indeed large: 1.4G, taking over two hours to
display in gv, but ultimately viewable. I'm new to manipulating ps files,
but hopefully I can find a fast way to convert the files into a smaller
format. I found an archived message of yours that suggested not to use
pch="." as a symbol for graphing large datasets, and upon experimentation
I found that the default symbol, pch=21, seemed to produce the smallest
files for some sets of test data when compared with some other symbols.
Running "pch=21, cex=0.35" produced a fairly small point but consumed much
less space than pch="." Is this the best solution for producing plot
symbols that take up little room both on the plot and the hard drive?
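
A minimal sketch of how such a comparison could be run, assuming simulated test data rather than the real files. The helper name ps_size and the choice of symbols are illustrative only; the resulting sizes will depend on the data and the R version, so no particular ordering is claimed here.

```r
# Hypothetical test data: 20,000 random points (not the data discussed above).
set.seed(1)
n <- 20000
x <- rnorm(n)
y <- rnorm(n)

# Write one scatterplot to a temporary PostScript file and
# return the size of that file in bytes.
ps_size <- function(pch, cex = 1) {
  f <- tempfile(fileext = ".ps")
  on.exit(unlink(f))          # remove the temporary file afterwards
  postscript(f)
  plot(x, y, pch = pch, cex = cex)
  dev.off()
  file.size(f)
}

# Compare a few plotting symbols on the same data.
sizes <- c(
  dot   = ps_size("."),
  pch21 = ps_size(21, cex = 0.35),
  pch1  = ps_size(1),
  pch20 = ps_size(20)
)
print(sizes)
```

Running this on a sample of the real data, rather than on random normals, would give a better idea of which symbol is cheapest for a given dataset.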
>Sounds like the problem is in your X server and not in R. I've seen this
>with Xfree (and don't use that myself on Linux).
It's possible... however, I wouldn't know how to fix it from that end,
>2) Don't plot all the points. You say you have a `very large dataset'. In
>statistics, we give numbers, not vague descriptions. However, with what
>that means to me (many millions of rows) a scatterplot of a very large
>dataset is going to be mainly black at least in places. (We've
>experienced that with 1.4 million points, for example.) That's not a good
>way to display the data. Either use a density plot, or if you are
>interested in outliers, thin the centre. We did this by estimating a
>density phat, then randomly selecting points with probability min(1,
>const/phat(x)) for a suitable `const'
I have a set of text files, each containing a 450,000 x 41 matrix (18.45
million datapoints) and roughly 300M in size. Indeed, the scatterplots are
overprinted, but I am interested in getting a "feel" for the data before
charging ahead. The data (measurements on artificial phylogenetic trees)
were produced by simulation and although I have been running checks all
along I wanted to make sure that my simulations weren't producing any
strange outliers or oddly shaped distributions. On the other hand, I had no
real guess as to what the data would look like or even what variables would
show strong correlations. Since many of these datapoints are from repeats,
I was in fact able to discern a lot of pattern, rather than getting
Using both a density plot and a thinned plot may be the way to go, if I
don't find a way to shrink down the graphs. I hoped that "pairs" would be
a fast, one-line way to take in all my data at once, but of course nothing
has been that easy with all this data.
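
The thinning suggested in the quoted text (keep each point with probability min(1, const/phat(x))) might be sketched as follows for one pair of variables. This is only an illustration under stated assumptions: the data are simulated, phat is a crude 2-D histogram estimate on a 40 x 40 grid rather than a kernel density estimate, and const is arbitrarily set to the 5th percentile of phat. None of these choices comes from the original advice.

```r
# Hypothetical data standing in for two columns of the real matrix.
set.seed(42)
n <- 50000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)

# Crude density estimate: the fraction of points in each cell of an
# nbin x nbin grid (a 2-D histogram, not a kernel estimate).
nbin <- 40
ix <- findInterval(x, seq(min(x), max(x), length.out = nbin + 1),
                   all.inside = TRUE)
iy <- findInterval(y, seq(min(y), max(y), length.out = nbin + 1),
                   all.inside = TRUE)
cell   <- (iy - 1) * nbin + ix             # one index per grid cell
counts <- tabulate(cell, nbins = nbin^2)   # points per cell
phat   <- counts[cell] / n                 # estimated density at each point

# Assumed choice of const: points in the sparsest 5% of cells are always kept.
const <- as.numeric(quantile(phat, 0.05))

# Keep each point with probability min(1, const / phat).
keep <- runif(n) < pmin(1, const / phat)

cat("kept", sum(keep), "of", n, "points\n")
plot(x[keep], y[keep], pch = 21, cex = 0.35)
```

The dense centre is thinned heavily while points in sparse regions, where outliers would show up, are all retained. A larger const keeps more points overall.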