number spaces from reals
follow newcomb-benford and zipf. An indications that both laws are equal for the first
digit distribution. More emirical evidence is given as is the mathematical
proof. The origin of compliance of the Reals with the laws comes from the fact
that the powers of two, in other words the exponent part for the normal reals
formula, give an Newcomb-Benford distribution and that Newcomb-Benford
distributions are scale invariant.
SATOCONOR .COM
J.G.
van der Galiën ‘Number Spaces Of Reals Follow Newcomb-Benford’s And Zipf’s Law’
4.1. (2005)
Full paper
SATOCONOR.COM Journal
of Computer Science
Number Spaces Of
Reals Follow Newcomb-Benford’s And Zipf’s Law
Newcomb-Benford’s
law is a special case of Zipf’s law, proven empirically and mathematically.
By Johan G. van der
Galiën (M.Sc.)
For comments: galien8@zonnet.nl
Abstract:
Non-regular Reals with an fixed 11 bits
exponent (E) and a variable mantissa (M) of 4, 8, 12, 16, 20 or 24 bits were
used to produce number number spaces between 1e-0307 and 1e+0308 in computer
memory. The number spaces were checked for Uniform, Newcomb-Benford and Zipfian
distribution of the first digits. The coefficient for Zipfian correlation from
Linear Regression was always around 0.9992 and converged with a increasing
number of mantissa bits. Uniform and Newcomb-Benford conformity was tested with
the Goodness-Of-Fit method. A Uniform distribution was totally out of the
question because the χ2‘s were always bigger than 104.
Newcomb-Benford conformity was initially good but seemed to diverge. These
empirical results can be explained in terms of that there always will be a
significant χ2 deviation if the sample size is large enough and the
difference between artificial and real data sets. Using the good fits with Zipf
and Newcomb-Benford, for the 8 and 16 bits mantissae, in a new kind of
randomness test is proposed. The origin of the good correlation for both laws
must come from the fact that the exponent part of the Reals formula, in other
words the integer powers of two, gives itself a nearly perfect Newcomb-Benford
distribution (χ2 ≈ 0). It must be this fact because the mantissa part
is of course uniformly distributed over the significant digits space between 1
and 2. Mathematical proof for this comes from interpreting the literature. For
it is proven that the Newcomb-Benford distributions are scale independent.
Empirical and mathematical proof is given that the Law of Newcomb-Benford is an
(approximate) special case of Zipf’s Law. This explains the fact that for 4-20
bits mantissae there is a good fit for both laws.
1. Introduction
The Law of Newcomb-Benford is also known as the
First Digit Law, the Most Significant Digit Law and the Significant Digit Law.
But most of the time it is called the Law of Benford. But this does not do
justice to Newcomb who discovered and described this law for the first time.1
Benford, seemingly unaware of this, also discovered this law and did
measurements on real data sets and published about it in 1938.2 So I
will call it the law of Newcomb-Benford in this paper, in the hope that
everybody who reads this will follow.
fD
= Log10(1+1/D) (1.1)
fD
is the relative frequency of the first digit D in the data set
D is the
first digit as D Î {1, 2, 3, 4, 5, 6, 7, 8, 9}
The
Law of Zipf was discovered from ranking word count frequencies. Zipf discovered
that they obeyed approximately the following formula:
fn
≈ cn-a (1.2)
fn
is the relative frequency for the word that ranks in the nth place
n is the
ranking (of course a positive non-zero integer)
The
harmonic number c and the a are positive Reals and constant for a given data
set
If we had infinite many words in the English
Language than a of (1.2) would be > 1 and c would be the value of the
Riemann Zeta function at a.3,4 These formula's are actually
probability mass functions and summations are instances of the so called
cumulative probability function. For these are discrete laws despite the fact
that (1.1) also works for any integer kind of initial digit sequence. (D can
then be the first two digits like 14 or the first four digits like 9876 and so
one.)
Reals can be divided in to Zeroes, Infinities, Normals, Denormals and
NaN’s. To understand this you need to think in the bits reserved for the sign
(S), exponent (E) and mantissa (M). For the C, C++, C#, PASCAL, PASCAL and
VISUAL BASIC Double (64 bits real) data type this is illustrated in Fig. 1.
|
S 1 bit |
E 11 bits |
M 52 bits |
E = 0 to
2047
M = 0 to
4503599627370495
Double
is a Log10(2P) = Log10(4503599627370496) = 15
– 16 digits significant real
Normals
0 < E < 2047 real = (-1)S2E-B(1+M/2P) B is the exponent bias is set to 1023 (1.3)
P is the mantissa resolution is 52
in the case of double
Denormals
E = 0 M<>0 real = (-1)SM21-B-P (1.4)
Zeroes E
= 0 M = 0 real = (-1)S0 = 0 (1.5)
Infinities
E = 2047 M = 0 (-1)SInfinity = ±Infinity (1.6)
NaN’s
(Not a Number) E = 2047 M <> 0
Range
Normals and Denormals is ±4.940656458412465e-0324 to ±1.797693134862315e+0308
Fig. 1: The specification of the 64 bits
Double real data type.5
I used 11 bits E and 4, 8, 12, 16, 20 and 24
bits M for my research described in this paper. The S was always zero, so all
the numbers are positive. How to calculate the values of the Reals is given in
the following example for 11 bits exponent (E) and 8 bits mantissa (M) (Fig.
2).
|
S = 0 |
E = 00000000100 |
M = 10110111 |
M-binary
= 10110111 M = 1*27+0*26+1*25+1*24+0*23+1*22+1*21+1*20
= 183
E-binary
= 00000000100 E = 1*22 = 4
I only
used the Normals, which is (2046/2048)*100 = 99.90% of the total Real number
space. The formula is then:
Real =
(-1)024-1023(1+183/28) = 3.05e-0307 (1.7)
Since
Log10(256) = 2.408 So the 11 bits E and 8 bits M Real data type has
2 to 3 significant digits
Fig. 2: How to calculate the value of a Real from a
bit stream
2. Methods and
Materials
From Reals (with the S bit is always zero, an
11 bits E and a M of 4, 8, 12, 16, 20 and 24 bits in the range of 1e-0307 to
1e+0308) the cumulative frequency of the most significant digits were measured
for the complete number space. The range assured that if the H0 (the
distribution is Uniform) was true the digits had an equal change to occur. A
Goodness-Of-Fit test with 8 degrees of freedom was applied for all mantissae
for the H0 (the distribution follows the First Digit Law).
Additionally Linear Regression was conducted on the nine Log10’s of
the relative frequencies by scatter plotting the data against Log10(D).
For correlation with Zipf. The programs were written in PASCAL and run on a
Pentium-4 2.8GHz 512Mb machine. To give you an indication: The longest run, for
a mantissa of 24 bits, took 13 hours 29 minutes and 14 seconds.
3. Results
|
Mantissa bits |
4 |
8 |
12 |
|
|
|
|
|
Correlation
Uniform
|
|
|
|
|
χ2 |
15987 |
255793 |
4092690 |
|
|
|
|
|
Correlation
Benford
|
|
|
|
|
χ2 |
0.006 |
0.002 |
0.025 |
|
|
|
|
|
Correlation Zipf
|
|
|
|
|
Slope
(a) |
-0.863724554823330 |
-0.863681236955406 |
-0.863677897472362 |
|
Intersection
(b) |
-0.503723354973614 |
-0.503744000092615 |
-0.503745353279953 |
|
Correlation
coefficient (R) |
-0.999226021417447 |
-0.999228913145435 |
-0.999228872179429 |
|
|
|
|
|
|
Significant
digits |
1-2 |
2-3 |
3-4 |
|
|
|
|
|
|
Number
space (number of Reals) |
32688 |
523005 |
8368082 |
|
|
|
|
|
|
Mantissa bits |
16 |
20 |
24 |
|
|
|
|
|
Correlation
Uniform
|
|
|
|
|
χ2 |
65482838 |
1047725347 |
7076226062 |
|
|
|
|
|
Correlation
Benford
|
|
|
|
|
χ2 |
0.252 |
4.026 |
64.299 |
|
|
|
|
|
Correlation Zipf
|
|
|
|
|
Slope
(a) |
-0.863673707385492 |
-0.863673667899045 |
-0.863673655978671 |
|
Intersection
(b) |
-0.503747335158292 |
-0.503747352336366 |
-0.503747358013820 |
|
Correlation
coefficient (R) |
-0.999229089528327 |
-0.999229090155236 |
-0.999229090810492 |
|
|
|
|
|
|
Significant
digits |
4-5 |
5-6 |
6-7 |
|
|
|
|
|
|
Number
space (number of Reals) |
133889326 |
2142229211 |
34275667382 |
Table 1: The results for the Newcomb-Benford
and Zipf analysis of 11 bits exponent Reals with different amounts of mantissa
bits.
----------------------------
Newcomb-Benford
and Zipf tests on the integer powers of two, Reals as Extended
----------------------------
Total
amount of powers tested = 32766
----------------------------
Test for
the fit with the Law of Newcomb-Benford
CHI-Newcomb-Benford
= 7.48707874884128e-0003
CHI 8
degrees of freedom 5% = 2.73
CHI 8
degrees of freedom 95% = 15.51
----------------------------
Test for
correlation with Zipf
Slope =
-8.63808338620714e-0001
Intersection
= -5.03682893499106e-0001
Correlation
coefficient (R) = -9.99220976268348e-0001
----------------------------
Time
begin = 8/30/2005 9:58:39 PM
Time
end = 8/30/2005 9:58:39 PM
----------------------------
e1 =
9863
e2 =
5772
e3 =
4092
e4 =
3175
e5 =
2597
e6 =
2193
e7 =
1899
e8 =
1677
e9 =
1498
----------------------------
Fig. 3: The output in the logfile from the
Zipf and Newcomb-Benford analysis of the integer powers of two between 2-16383
and 216383
4. Discussion
χ2 will always detect a
significant difference when a large enough sample is used. This is a well-known
phenomenon.7 The Goodness-Of-Fit test must be too strict in the case
of mantissae larger than 20 bits. In other words sample space bigger than
2142229211 Reals. The only conclusion can be that all tested mantissae maybe
comply with the Most-Significant-Digit Law. From Table 1 it is clear that the
first digit distribution converges to a small extent with Zipf by increasing
the mantissa resolution.
The χ2 Benford are below the 5%
probability for the 4-16 bits mantissae. In other words too good to be true.
But the Goodness-Of-Fit method is actually not mend for complete number spaces but
for relative small random samples drawn from them.
In my opinion if the size of the exponent and
the size of the mantissa increases so will the relative frequencies of the most
significant digits converge to the expected values and the Zipf correlation
coefficient will converge to –1.
The fact that the number spaces of the tested
Reals comply with Newcomb-Benford and Zipf can be understood from the fact that
the integer powers of two themselves have a Newcomb-Benford distribution.
Multiplying by a real number (1+M/2P) does not change this fact as
Hill demonstrated in his famous article. For he has proven that Newcomb-Benford
distributions are scale invariant.8 That it also complies with the
law of Zipf comes from the fact that the Newcomb-Benford Law is actually an
(approximate) special case of Zipf’s Law. (As I will demonstrate later on.)
Which itself is an approximate and experimental equation.
One can base a randomness test suite on the
Goodness-Of-Fit with Newcomb-Benford and the correlation with Zipf of 8 bits E
and 8 bits M or 8 bits E and 16 bits M etc. As long as the E MOD 8 = 0, M MOD 8
= 0 and if the amount of bits does not exceed the standard Integer data types
of the mentioned programming languages. Also care must be taken that two is not
raised to a power, for power part of the Real, exceeding the range of the used
Real data type. Then programming such a random test is easy. I did this for E
and M is 8 bits and also programmed it that it divides the sample from a
(Pseudo) Random Number Generator into 2 or up to 1000 parts. Averaging the
χ2 and R. A similar test based on the powers of two is also in
the suite. The
criterion of the R for a (near) infinite random sample is that it must be
(almost) equal to the R for the total data set. Because an infinite random file
will give the same R as the total data set.
As I postulated earlier Newcomb-Benford is an
(approximate) special case of Zipf. Here below is the empirical and
mathematical proof:
|
D (=digit) |
Log10(1+1/D) |
Log10(D) |
Log10(log10(1+1/D)) |
|
1 |
0.3010299957 |
0 |
-0.5213902277 |
|
2 |
0.1760912591 |
0.3010299957 |
-0.7542622013 |
|
3 |
0.1249387366 |
0.4771212547 |
-0.9033028900 |
|
4 |
0.0969100130 |
0.6020599913 |
-1.0136313480 |
|
5 |
0.0791812460 |
0.6989700043 |
-1.1013776680 |
|
6 |
0.0669467896 |
0.7781512504 |
-1.1742702440 |
|
7 |
0.0579919470 |
0.8450980400 |
-1.2366323100 |
|
8 |
0.0511525224 |
0.9030899870 |
-1.2911329450 |
|
9 |
0.0457574906 |
0.9542425094 |
-1.3395378010 |
|
Linear Regression results for log10(log10(1+1/D))
= alog10(D) + b |
|||
|
Correlation Coefficient (R) = |
-0.9992296195 |
||
|
|Slope| (a) = |
0.8636655870 |
||
|
Intersection with Y-axis (b) = |
-0.5037512926 |
||
Table 2: Empirical evidences that the Newcomb-Benford
distribution is a special case of Zipf's law.
From the good R = -0.9992 of Table 2 can be
deduced that the Benford first digit distribution law must be a special case of
Zipf's law. A perfect fit has an |R| = 1.10 The approximate formula
of Zipf's law applied for the Benford distribution can be deduced from the
result of Table 2 since:
log10(log10(1+1/D)) =
alog10(D) + b Þ log10(1+1/D) »10bD-a = cD-a (3.1)
fD = log10(1+1/D) » 0.3135080577D-0.8636655870 (3.2)
Here below is the mathematical proof that the
Law of Newcomb-Benford is an (approximate) special case of Zipf's Law.
Theorema:
The Laws of Newcomb-Benford and Zipf are approximately
equal for first digit distributions.
fD
= log10(1+1/D) » cD-a (3.1)
D Î {1, 2, 3, 4, 5, 6, 7, 8, 9}
x =
(1+1/D) (3.2)
log10(x)
= ln(x)/ln(10) (3.3)
ln(x) =
(x-1)/x+1/2((x-1)/x)2+1/3((x-1)/x)3+
…….. (Taylor serial for x ³ 1/2) (3.4)11
ln(x) =
(x-1)/x+(x-1)2/2x2+(x-1)3/3x3+ ……. Þ (3.5)
(1/D)/(1+1/D)+1/2((1/D)/(1+1/D))2+1/3((1/D)/(1+1/D))3+
……. Þ (3.6)
(1/(D+1)+1/2(1/(D+1))2+1/3(1/(D+1))3+
……. (3.7)
(1/(D+1)
> 1/2(1/(D+1))2 > 1/3(1/(D+1))3 > ……. (3.8)
Now some
gross approximations:
ln(1+1/D) »1/(D+1) » 1/D (3.9)
log10(1+1/D)
= ln(1+1/D)/ln(10) » 1/(Dln(10)) Þ (3.10)
fD » (1/ln(10))D-1 = 0.4342D-1
(3.11)
So a = 1
and c » 0.4342 which is of the same
magnitude found by Linear Regression (3.2).
This Theorema is also supported by the fact
that the first derivatives of (1.1) and (3.1) are almost the same. Although
formally you cannot calculate derivatives for these discrete laws, but I do it
only for the sake of argument.
First
Derivative (1.1) fD' = -(Ln(10)D2+Ln(10)D)-1 (3.12)
First
Derivative (3.1) fD' = -acD-a-1 If a = 1 and c = 1/Ln(10)
like in (3.11) then à fD' = -(Ln(10)D2)-1
(3.13)
The Ln(10)D term of (3.12) being the minor one.
But what is also clear from these derivatives that the formula's (1.1) and
(3.11) are not exactly the same. Hence I call Newcomb-Benford an (approximate)
special of Zipf.
From Fig. 3 is clear that not only lognormal
distributions, with a shape parameter > 1.2, comply with the Significant
Digit Law. Although Scott et al.
claim that this is crucial.7 The powers of two, over the tested
range, have an extraordinary compliance with Newcomb-Benford since the χ2
is only 0.007
5. Conclusions
The powers of two are remarkable because they
are certainly not lognormal still they comply experimentally with the Law of
Newcomb-Benford. The χ2 is even among the smallest of the ones
in this paper.
The final conclusion that data sets from all
possible Reals, from a range that ensures that if the data was Uniform the
first digits are all equally likely to occur, will give a good fit with
Newcomb-Benford AND Zipf comes from the facts that:
·
Newcomb-Benford is an (approximate) special
case of Zipf.
·
The
integer powers of two follow both laws.
·
Newcomb-Benford
AND Zipf distributions, as applied in this paper, are scale invariant.
·
The
Goodness of fit will always significantly deviate if the data set is large
enough.
·
Increasing
the mantissa resolution gives rise to convergence of the correlation
coefficient for Zipf.
-o0o-
1)
Newcomb S. 'Note on the frequency of use of the different digits in natural
numbers' Amer. J. Math. 4 39-40 (1881).
2) Benford F. 'The law of anomalous numbers' Proceedings
of the American Philosophical Society 78(4) 551-572 (1938)
3) Wikipedia 'Zipf's Law' http://en.wikipedia.org/wiki/Zipf's_law
4) Mathworld 'Zipf's Law' http://mathworld.wolfram.com/ZipfsLaw.html
5) De Soras L. 'Denormal numbers in floating
point signal processing applications' http://www.musicdsp.org/files/denormal.pdf
6) Not used anymore.
7) Scott P.D., Fasli M. 'Benford's Law: An
empirical investigation and a novel explanation' CSM Technical Report 349
http://cswww.essex.ac.uk/technical-reports/2001/CSM-349.pdf
8) Hill T.P. 'A statistical derivation of the
significant law' Statistical Science 10(4) 354-363 (1996)
9) Not used anymore.
10) William L. Hays 'Statistics' 5th Edition,
Harcourt Brace College Publishers (1994)
11) Efunda Engineering Fundamentals 'Taylor
Series Expansions of Logarithmic Functions'
http://www.efunda.com/math/taylor_series/logarithmic.cfm