number spaces from reals follow newcomb-benford and zipf

number spaces from reals follow newcomb-benford and zipf. An indications that both laws are equal for the first digit distribution. More emirical evidence is given as is the mathematical proof. The origin of compliance of the Reals with the laws comes from the fact that the powers of two, in other words the exponent part for the normal reals formula, give an Newcomb-Benford distribution and that Newcomb-Benford distributions are scale invariant.

Copyright & Disclaimer

 

 

 SATOCONOR .COM

J.G. van der Galiën ‘Number Spaces Of Reals Follow Newcomb-Benford’s And Zipf’s Law’ 4.1. (2005)

Full paper

SATOCONOR.COM Journal of Computer Science

 

 

Number Spaces Of Reals Follow Newcomb-Benford’s And Zipf’s Law

Newcomb-Benford’s law is a special case of Zipf’s law, proven empirically and mathematically.

By Johan G. van der Galiën (M.Sc.)

For comments: galien8@zonnet.nl

Version 1.0 September 7, 2005

Home of SATOCONOR.COM

 

Abstract:

Non-regular Reals with an fixed 11 bits exponent (E) and a variable mantissa (M) of 4, 8, 12, 16, 20 or 24 bits were used to produce number number spaces between 1e-0307 and 1e+0308 in computer memory. The number spaces were checked for Uniform, Newcomb-Benford and Zipfian distribution of the first digits. The coefficient for Zipfian correlation from Linear Regression was always around 0.9992 and converged with a increasing number of mantissa bits. Uniform and Newcomb-Benford conformity was tested with the Goodness-Of-Fit method. A Uniform distribution was totally out of the question because the χ2‘s were always bigger than 104. Newcomb-Benford conformity was initially good but seemed to diverge. These empirical results can be explained in terms of that there always will be a significant χ2 deviation if the sample size is large enough and the difference between artificial and real data sets. Using the good fits with Zipf and Newcomb-Benford, for the 8 and 16 bits mantissae, in a new kind of randomness test is proposed. The origin of the good correlation for both laws must come from the fact that the exponent part of the Reals formula, in other words the integer powers of two, gives itself a nearly perfect Newcomb-Benford distribution (χ2 ≈ 0). It must be this fact because the mantissa part is of course uniformly distributed over the significant digits space between 1 and 2. Mathematical proof for this comes from interpreting the literature. For it is proven that the Newcomb-Benford distributions are scale independent. Empirical and mathematical proof is given that the Law of Newcomb-Benford is an (approximate) special case of Zipf’s Law. This explains the fact that for 4-20 bits mantissae there is a good fit for both laws.

 

1. Introduction

The Law of Newcomb-Benford is also known as the First Digit Law, the Most Significant Digit Law and the Significant Digit Law. But most of the time it is called the Law of Benford. But this does not do justice to Newcomb who discovered and described this law for the first time.1 Benford, seemingly unaware of this, also discovered this law and did measurements on real data sets and published about it in 1938.2 So I will call it the law of Newcomb-Benford in this paper, in the hope that everybody who reads this will follow.

 

fD = Log10(1+1/D) (1.1)

fD is the relative frequency of the first digit D in the data set

D is the first digit as D Î {1, 2, 3, 4, 5, 6, 7, 8, 9}

 

The Law of Zipf was discovered from ranking word count frequencies. Zipf discovered that they obeyed approximately the following formula:

 

fn ≈ cn-a  (1.2)

fn is the relative frequency for the word that ranks in the nth place

n is the ranking (of course a positive non-zero integer)

The harmonic number c and the a are positive Reals and constant for a given data set

 

If we had infinite many words in the English Language than a of (1.2) would be > 1 and c would be the value of the Riemann Zeta function at a.3,4 These formula's are actually probability mass functions and summations are instances of the so called cumulative probability function. For these are discrete laws despite the fact that (1.1) also works for any integer kind of initial digit sequence. (D can then be the first two digits like 14 or the first four digits like 9876 and so one.)

Reals can be divided in to Zeroes, Infinities, Normals, Denormals and NaN’s. To understand this you need to think in the bits reserved for the sign (S), exponent (E) and mantissa (M). For the C, C++, C#, PASCAL, PASCAL and VISUAL BASIC Double (64 bits real) data type this is illustrated in Fig. 1.

 

S 1 bit

E 11 bits

M 52 bits

 

S = 0 or 1

E = 0 to 2047

M = 0 to 4503599627370495

 

Double is a Log10(2P) = Log10(4503599627370496) = 15 – 16 digits significant real

 

Normals 0 < E < 2047 real = (-1)S2E-B(1+M/2P)       B is the exponent bias is set to 1023 (1.3)

P is the mantissa resolution is 52 in the case of double

Denormals E = 0 M<>0 real = (-1)SM21-B-P (1.4)

Zeroes E = 0 M = 0 real = (-1)S0 = 0 (1.5)

Infinities E = 2047 M = 0 (-1)SInfinity = ±Infinity (1.6)

NaN’s (Not a Number) E = 2047 M <> 0

 

Range Normals and Denormals is ±4.940656458412465e-0324 to ±1.797693134862315e+0308

 

Fig. 1: The specification of the 64 bits Double real data type.5

 

I used 11 bits E and 4, 8, 12, 16, 20 and 24 bits M for my research described in this paper. The S was always zero, so all the numbers are positive. How to calculate the values of the Reals is given in the following example for 11 bits exponent (E) and 8 bits mantissa (M) (Fig. 2).

 

S = 0

E = 00000000100

M = 10110111

 

 

M-binary = 10110111 M = 1*27+0*26+1*25+1*24+0*23+1*22+1*21+1*20 = 183

 

E-binary = 00000000100 E = 1*22 = 4

 

I only used the Normals, which is (2046/2048)*100 = 99.90% of the total Real number space. The formula is then:

 

Real = (-1)024-1023(1+183/28) = 3.05e-0307 (1.7)

 

Since Log10(256) = 2.408 So the 11 bits E and 8 bits M Real data type has 2 to 3 significant digits

 

Fig. 2: How to calculate the value of a Real from a bit stream

 

2. Methods and Materials

From Reals (with the S bit is always zero, an 11 bits E and a M of 4, 8, 12, 16, 20 and 24 bits in the range of 1e-0307 to 1e+0308) the cumulative frequency of the most significant digits were measured for the complete number space. The range assured that if the H0 (the distribution is Uniform) was true the digits had an equal change to occur. A Goodness-Of-Fit test with 8 degrees of freedom was applied for all mantissae for the H0 (the distribution follows the First Digit Law). Additionally Linear Regression was conducted on the nine Log10’s of the relative frequencies by scatter plotting the data against Log10(D). For correlation with Zipf. The programs were written in PASCAL and run on a Pentium-4 2.8GHz 512Mb machine. To give you an indication: The longest run, for a mantissa of 24 bits, took 13 hours 29 minutes and 14 seconds.

 

3. Results

 

Mantissa bits

4

8

12

 

 

 

 

Correlation Uniform

 

 

 

χ2

15987

255793

4092690

 

 

 

 

Correlation Benford

 

 

 

χ2

0.006

0.002

0.025

 

 

 

 

Correlation Zipf

 

 

 

Slope (a)

-0.863724554823330

-0.863681236955406

-0.863677897472362

Intersection (b)

-0.503723354973614

-0.503744000092615

-0.503745353279953

Correlation coefficient (R)

-0.999226021417447

-0.999228913145435

-0.999228872179429

 

 

 

 

Significant digits

1-2

2-3

3-4

 

 

 

 

Number space (number of Reals)

32688

523005

8368082

 

 

 

 

Mantissa bits

16

20

24

 

 

 

 

Correlation Uniform

 

 

 

χ2

65482838

1047725347

7076226062

 

 

 

 

Correlation Benford

 

 

 

χ2

0.252

4.026

64.299

 

 

 

 

Correlation Zipf

 

 

 

Slope (a)

-0.863673707385492

-0.863673667899045

-0.863673655978671

Intersection (b)

-0.503747335158292

-0.503747352336366

-0.503747358013820

Correlation coefficient (R)

-0.999229089528327

-0.999229090155236

-0.999229090810492

 

 

 

 

Significant digits

4-5

5-6

6-7

 

 

 

 

Number space (number of Reals)

133889326

2142229211

34275667382

 

Table 1: The results for the Newcomb-Benford and Zipf analysis of 11 bits exponent Reals with different amounts of mantissa bits.

 

----------------------------

Newcomb-Benford and Zipf tests on the integer powers of two, Reals as Extended

----------------------------

Total amount of powers tested = 32766

----------------------------

Test for the fit with the Law of Newcomb-Benford

CHI-Newcomb-Benford = 7.48707874884128e-0003

CHI 8 degrees of freedom 5%  = 2.73

CHI 8 degrees of freedom 95% = 15.51

----------------------------

Test for correlation with Zipf

Slope = -8.63808338620714e-0001

Intersection = -5.03682893499106e-0001

Correlation coefficient (R) = -9.99220976268348e-0001

----------------------------

Time begin = 8/30/2005 9:58:39 PM

Time end   = 8/30/2005 9:58:39 PM

----------------------------

e1 = 9863

e2 = 5772

e3 = 4092

e4 = 3175

e5 = 2597

e6 = 2193

e7 = 1899

e8 = 1677

e9 = 1498

----------------------------

 

Fig. 3: The output in the logfile from the Zipf and Newcomb-Benford analysis of the integer powers of two between 2-16383 and 216383

 

4. Discussion

χ2 will always detect a significant difference when a large enough sample is used. This is a well-known phenomenon.7 The Goodness-Of-Fit test must be too strict in the case of mantissae larger than 20 bits. In other words sample space bigger than 2142229211 Reals. The only conclusion can be that all tested mantissae maybe comply with the Most-Significant-Digit Law. From Table 1 it is clear that the first digit distribution converges to a small extent with Zipf by increasing the mantissa resolution.

The χ2 Benford are below the 5% probability for the 4-16 bits mantissae. In other words too good to be true. But the Goodness-Of-Fit method is actually not mend for complete number spaces but for relative small random samples drawn from them.

In my opinion if the size of the exponent and the size of the mantissa increases so will the relative frequencies of the most significant digits converge to the expected values and the Zipf correlation coefficient will converge to –1.

The fact that the number spaces of the tested Reals comply with Newcomb-Benford and Zipf can be understood from the fact that the integer powers of two themselves have a Newcomb-Benford distribution. Multiplying by a real number (1+M/2P) does not change this fact as Hill demonstrated in his famous article. For he has proven that Newcomb-Benford distributions are scale invariant.8 That it also complies with the law of Zipf comes from the fact that the Newcomb-Benford Law is actually an (approximate) special case of Zipf’s Law. (As I will demonstrate later on.) Which itself is an approximate and experimental equation.

One can base a randomness test suite on the Goodness-Of-Fit with Newcomb-Benford and the correlation with Zipf of 8 bits E and 8 bits M or 8 bits E and 16 bits M etc. As long as the E MOD 8 = 0, M MOD 8 = 0 and if the amount of bits does not exceed the standard Integer data types of the mentioned programming languages. Also care must be taken that two is not raised to a power, for power part of the Real, exceeding the range of the used Real data type. Then programming such a random test is easy. I did this for E and M is 8 bits and also programmed it that it divides the sample from a (Pseudo) Random Number Generator into 2 or up to 1000 parts. Averaging the χ2 and R. A similar test based on the powers of two is also in the suite. The criterion of the R for a (near) infinite random sample is that it must be (almost) equal to the R for the total data set. Because an infinite random file will give the same R as the total data set.

 

As I postulated earlier Newcomb-Benford is an (approximate) special case of Zipf. Here below is the empirical and mathematical proof:

 

D (=digit)

Log10(1+1/D)

Log10(D)

Log10(log10(1+1/D))

1

0.3010299957

0

-0.5213902277

2

0.1760912591

0.3010299957

-0.7542622013

3

0.1249387366

0.4771212547

-0.9033028900

4

0.0969100130

0.6020599913

-1.0136313480

5

0.0791812460

0.6989700043

-1.1013776680

6

0.0669467896

0.7781512504

-1.1742702440

7

0.0579919470

0.8450980400

-1.2366323100

8

0.0511525224

0.9030899870

-1.2911329450

9

0.0457574906

0.9542425094

-1.3395378010

Linear Regression results for log10(log10(1+1/D)) = alog10(D) + b

Correlation Coefficient (R) =

-0.9992296195

|Slope| (a) =

0.8636655870

Intersection with Y-axis (b) =

-0.5037512926

 

Table 2: Empirical evidences that the Newcomb-Benford distribution is a special case of Zipf's law.

 

From the good R = -0.9992 of Table 2 can be deduced that the Benford first digit distribution law must be a special case of Zipf's law. A perfect fit has an |R| = 1.10 The approximate formula of Zipf's law applied for the Benford distribution can be deduced from the result of Table 2 since:

 

log10(log10(1+1/D)) = alog10(D) + b Þ log10(1+1/D) »10bD-a = cD-a (3.1)

 

fD = log10(1+1/D) » 0.3135080577D-0.8636655870 (3.2)

 

Here below is the mathematical proof that the Law of Newcomb-Benford is an (approximate) special case of Zipf's Law.

 

Theorema: The Laws of Newcomb-Benford and Zipf are approximately equal for first digit distributions.

 

fD = log10(1+1/D) » cD-a (3.1)

 

D Î {1, 2, 3, 4, 5, 6, 7, 8, 9}

 

x = (1+1/D) (3.2)

 

log10(x) = ln(x)/ln(10) (3.3)

 

ln(x) = (x-1)/x+1/2((x-1)/x)2+1/3((x-1)/x)3+ …….. (Taylor serial for x ³ 1/2) (3.4)11

 

ln(x) = (x-1)/x+(x-1)2/2x2+(x-1)3/3x3+ ……. Þ (3.5)

 

(1/D)/(1+1/D)+1/2((1/D)/(1+1/D))2+1/3((1/D)/(1+1/D))3+ ……. Þ (3.6)

 

(1/(D+1)+1/2(1/(D+1))2+1/3(1/(D+1))3+ ……. (3.7)

 

(1/(D+1) > 1/2(1/(D+1))2 > 1/3(1/(D+1))3 > ……. (3.8)

 

Now some gross approximations:

 

ln(1+1/D) »1/(D+1) » 1/D (3.9)

 

log10(1+1/D) = ln(1+1/D)/ln(10) » 1/(Dln(10)) Þ (3.10)

 

fD » (1/ln(10))D-1 = 0.4342D-1 (3.11)

 

So a = 1 and c » 0.4342 which is of the same magnitude found by Linear Regression (3.2).

 

This Theorema is also supported by the fact that the first derivatives of (1.1) and (3.1) are almost the same. Although formally you cannot calculate derivatives for these discrete laws, but I do it only for the sake of argument.

 

First Derivative (1.1) fD' = -(Ln(10)D2+Ln(10)D)-1 (3.12)

 

First Derivative (3.1) fD' = -acD-a-1 If a = 1 and c = 1/Ln(10) like in (3.11) then à fD' = -(Ln(10)D2)-1 (3.13)

 

The Ln(10)D term of (3.12) being the minor one. But what is also clear from these derivatives that the formula's (1.1) and (3.11) are not exactly the same. Hence I call Newcomb-Benford an (approximate) special of Zipf.

 

From Fig. 3 is clear that not only lognormal distributions, with a shape parameter > 1.2, comply with the Significant Digit Law. Although Scott et al. claim that this is crucial.7 The powers of two, over the tested range, have an extraordinary compliance with Newcomb-Benford since the χ2 is only 0.007

 

5. Conclusions

The powers of two are remarkable because they are certainly not lognormal still they comply experimentally with the Law of Newcomb-Benford. The χ2 is even among the smallest of the ones in this paper.

The final conclusion that data sets from all possible Reals, from a range that ensures that if the data was Uniform the first digits are all equally likely to occur, will give a good fit with Newcomb-Benford AND Zipf comes from the facts that:

·        Newcomb-Benford is an (approximate) special case of Zipf.

·        The integer powers of two follow both laws.

·        Newcomb-Benford AND Zipf distributions, as applied in this paper, are scale invariant.

·        The Goodness of fit will always significantly deviate if the data set is large enough.

·        Increasing the mantissa resolution gives rise to convergence of the correlation coefficient for Zipf.

-o0o-

 

Notes and References:

1) Newcomb S. 'Note on the frequency of use of the different digits in natural numbers' Amer. J. Math. 4 39-40 (1881).

2) Benford F. 'The law of anomalous numbers' Proceedings of the American Philosophical Society 78(4) 551-572 (1938)

3) Wikipedia 'Zipf's Law' http://en.wikipedia.org/wiki/Zipf's_law

4) Mathworld 'Zipf's Law' http://mathworld.wolfram.com/ZipfsLaw.html

5) De Soras L. 'Denormal numbers in floating point signal processing applications' http://www.musicdsp.org/files/denormal.pdf

6) Not used anymore.

7) Scott P.D., Fasli M. 'Benford's Law: An empirical investigation and a novel explanation' CSM Technical Report 349 http://cswww.essex.ac.uk/technical-reports/2001/CSM-349.pdf

8) Hill T.P. 'A statistical derivation of the significant law' Statistical Science 10(4) 354-363 (1996)

9) Not used anymore.

10) William L. Hays 'Statistics' 5th Edition, Harcourt Brace College Publishers (1994)

11) Efunda Engineering Fundamentals 'Taylor Series Expansions of Logarithmic Functions'

http://www.efunda.com/math/taylor_series/logarithmic.cfm