[R] Analysis of pre-calculated frequency distribution?

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Sun Nov 21 15:35:07 CET 2004


Sorry for the dumb question, but I can't work out how to do this.

Quick version, 

How can I re-bin a given frequency distribution using new breaks, without
reference to the original data? The given distribution has integer-valued
bins.
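
To make the question concrete, here is a minimal sketch of one way such
re-binning could be done, assuming a data frame with COUNT and FREQUENCY
columns. The toy values are copied from the first ten rows of the table in
this post; the breaks vector and the names new_bin and rebinned are
hypothetical, chosen only for illustration.

```r
# Toy frequency table (first ten rows of the table in this post)
dat <- data.frame(COUNT = 1:10,
                  FREQUENCY = c(5734, 1625, 793, 480, 294, 237, 205, 200, 123, 108))

# Hypothetical new breaks; cut() makes the intervals (0,1], (1,2], (2,5], (5,10]
breaks <- c(0, 1, 2, 5, 10)

# Map each original integer bin to a new bin, then sum the frequencies per new bin
new_bin <- cut(dat$COUNT, breaks = breaks)
rebinned <- tapply(dat$FREQUENCY, new_bin, sum)
rebinned
```

Note that tapply() reports NA (not 0) for a new bin that contains none of
the original bins, so empty bins would need handling separately.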


Long version,

I am loading a frequency table into R from a file. The original data set is
very large, and it is very simple to get a frequency distribution out of an
SQL database, so overall this is a convenient method for me. The point is
that I don't start with 'raw' data.
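
As a rough illustration of that loading step, assuming the database export
is a whitespace-separated file with a header line (the real file name and
layout are not given here, so inline text stands in for the file):

```r
# Inline text stands in for the exported file; with a real file one would
# pass its path as the first argument of read.table() instead of text=.
dat <- read.table(text = "COUNT FREQUENCY
1 5734
2 1625
3 793", header = TRUE)

str(dat)  # two integer columns, one row per distinct COUNT value
```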

The data looks like this...

> dat
             COUNT FREQUENCY
1                1 5734
2                2 1625
3                3  793
4                4  480
5                5  294
6                6  237
7                7  205
8                8  200
9                9  123
10              10  108
11              11   90
12              12   62
13              13   60
14              14   68
15              15   64
16              16   56
17              17   68
18              18   45
19              19   38
20              20   37
21              21   29
22              22   39
23              23   35
24              24   33
25              25   36
...
148            153    5
149            156    2
150            157    3
151            158    2
152            159    2
153            162    1
154            163    3
155            164    3
156            165    2
157            166    1
158            168    2
159            169    4
160            170    1
...
354           2106    1
355           2189    1
356           2194    1
357           2217    1
358           2246    1
359           2474    1
360           2801    1
361           3697    1
362           3702    1
363           7353    1
364           8738    1
365           9442    1
366          12280    1



This is a typical 'count / frequency' distribution in biology, where low
counts of a certain property are very frequent (across genomes, proteins,
ecosystems, etc.), and high counts of a certain property are very
rare.

In the above example a certain property occurs 12280 times with a
frequency of 1, and another property occurs 9442 times with the same
frequency. At the other extreme, a certain property occurs once
with a frequency of 5734, and another property occurs twice with a
frequency of 1625.
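
A small sketch of how the two columns relate, using three rows taken from
the table above. The variable names are hypothetical. The last step, which
expands the table back into a vector of raw counts with rep(), is only
practical when the sum of FREQUENCY is small enough to hold in memory.

```r
# Three rows taken from the table above
dat <- data.frame(COUNT = c(1, 2, 12280), FREQUENCY = c(5734, 1625, 1))

# Number of observed items: each row contributes FREQUENCY items
n_items <- sum(dat$FREQUENCY)

# Total number of occurrences: each of those items occurs COUNT times
n_occurrences <- sum(dat$COUNT * dat$FREQUENCY)

# Pseudo-raw data: every COUNT value repeated FREQUENCY times
raw <- rep(dat$COUNT, times = dat$FREQUENCY)
```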

This kind of distribution is variously known as a "Zipf", a "power law", a
"Pareto", a "scale free", a "heavy tailed" or an "80:20" distribution, or
colloquially "the dominance of the few over the many". The term I choose is
a "log linear" distribution, because that makes no assumptions about the
underlying cause of the overall shape.

People typically quote the curve in the form y ~ C x^(-a). I want to use
the binning method of parameter estimation given here...

http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm

(bin the data with exponentially increasing bin widths within the data
range).
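
A rough sketch of what binning with exponentially increasing widths could
look like on a frequency table, under several assumptions that are not
taken from the tutorial: breaks at powers of two, division of each bin
total by its width so that bins are comparable, geometric bin midpoints,
and an ordinary least-squares line on the log-log values as the estimate of
a. The toy values are the first ten rows of the table above.

```r
dat <- data.frame(COUNT = 1:10,
                  FREQUENCY = c(5734, 1625, 793, 480, 294, 237, 205, 200, 123, 108))

# Breaks at powers of two (assumed base): bins [1,2), [2,4), [4,8), [8,16)
n_bins <- floor(log2(max(dat$COUNT))) + 1
breaks <- 2^(0:n_bins)

# findInterval() returns i when breaks[i] <= COUNT < breaks[i + 1]
bin <- findInterval(dat$COUNT, breaks)
binned <- as.vector(tapply(dat$FREQUENCY,
                           factor(bin, levels = seq_len(n_bins)), sum))

# Normalise by bin width; for integer COUNT the width is also the number
# of integer values that fall in the bin
width <- diff(breaks)
density <- binned / width

# Geometric midpoint of each bin, used as the x value on the log scale
mid <- sqrt(breaks[-length(breaks)] * breaks[-1])

# Straight-line fit on log-log axes; bins with no data (NA) are dropped
ok <- !is.na(binned)
fit <- lm(log(density[ok]) ~ log(mid[ok]))
a_hat <- -unname(coef(fit)[2])  # estimate of a in y ~ C x^(-a)
```

With only four bins from ten rows the fitted value means little; the point
is just the mechanics of going from (COUNT, FREQUENCY) pairs to log-spaced
bins.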

But I can't work out how to re-bin my existing frequency data.

Sorry for the long question, 
all the best
Dan.



