Nothing involving computers is ever as random as it seems. The only way to determine whether a random number generator is good enough is through careful testing.
Introduction
Computer-generated random numbers are used intensively in applications ranging from games to complex numerical integrations. If you use random numbers in a program, or try to write a random number generator, you will want to have some confidence that the numbers are adequately random. In this article and a succeeding one next month, we provide a framework and eleven tests for evaluating random number generators.
The beginning of wisdom about computer-generated random numbers is to remember always: computer-generated random numbers are not random. One synonym for "random" is "unpredictable." Given the starting point and the generator, sequences of random numbers generated on a computer are perfectly predictable. This explains why such numbers more formally are called "pseudorandom numbers." In this article, we nevertheless use the simpler term "random number."
Even though no test can determine whether a generator produces numbers that are truly random, a test can determine whether random numbers appear to be sufficiently random for their intended use. For example, computerized solitaire games use random number generators to draw cards. Given the starting point and the generator, the sequence of cards drawn is perfectly predictable. This observation, though, is not pertinent for someone playing the game who has not disassembled the generator. The relevant issue for people playing the game is whether the cards drawn appear to be unpredictable given the information that they have.
We are going to describe two kinds of tests. The first kind, described this month, are statistical tests on sequences of random numbers. These sequences are much shorter than a full cycle (defined in the next paragraph) because we do not want our users to spend the rest of eternity on any single test. These shorter sequences nevertheless serve their intended purpose.
The second kind of tests are full-cycle tests, which examine the properties of a generator. One of these properties is the cycle length of the generator, that is, how long a sequence it generates before it repeats. Since computers are finite-state machines, all random number generators eventually repeat the values in their sequences exactly. The period of a generator is the minimum number of random numbers it can produce before the sequence repeats. In general, we cannot write a useful program to determine cycle length by exhaustive search. For some generators, this repetition occurs after 2³²-1 random numbers or even less. For one generator that we study, this repetition occurs only after 3×10¹⁷¹ random numbers. This large a number is hard to contemplate, but one comparison is to the age of the universe, which is on the order of 10 billion years, or a mere 3×10²³ microseconds.
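For a generator with a tiny state space, exhaustive enumeration of the cycle is feasible, which makes plain why the approach is hopeless at full scale. Here is a minimal sketch, using an invented 16-bit linear congruential generator that is not one of the generators tested in this article:

#include <stdio.h>

/* Invented 16-bit LCG for illustration only; a = 137 and c = 187
   satisfy the standard full-period criteria for modulus 2^16. */
static unsigned Next(unsigned x)
{
    return (137 * x + 187) & 0xFFFFu;
}

/* Walk the sequence until the seed reappears. This brute-force
   count is practical only because the state space is tiny. */
static unsigned long FindPeriod(unsigned seed)
{
    unsigned x = Next(seed);
    unsigned long period = 1;

    while (x != seed) {
        x = Next(x);
        period++;
    }
    return period;
}

int main(void)
{
    printf("period = %lu\n", FindPeriod(1u));   /* prints 65536 */
    return 0;
}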
In the second article in this series, we present a program for a computationally demanding full-cycle test, the spectral test. This test is devoted to assessing the quality of linear congruential random number generators.
Overview of Statistical Tests
Each statistical test examines subsequences of random numbers from a generator for a specific pattern or set of patterns. While the patterns differ across the tests, the underlying issue being examined is the same. In each case, the issue is whether, given the number of observations in the subsequence, the computer-generated random numbers appear to be random, or patternless. While patterns will appear on occasion by chance, the tests examine whether a pattern appears more frequently than chance would explain. By design, each test is relatively good at detecting a specific pattern or set of patterns. Each test produces a probability of a test statistic less than or equal to some value. As we explain below in the context of specific tests, a group of such probabilities can be summarized in a further test to determine whether the computer-generated random numbers appear to be random. If this summary probability indicates the pattern appears too often to be explained by chance, then the test has revealed a deficiency of the generator. If that pattern is inconsistent with what you want to do with the generator, then the generator has failed the test.
No generator passes or fails the tests in an absolute sense. A generator passes or fails only within the bounds of what you deem acceptable. Anyone running these tests establishes the limits for passing or failing. General guidelines are available from the statistics literature and our experience, but ultimately the responsibility lies with the person running the tests.
There are countless possible statistical tests of random number generators. We do not claim that our set of tests is sufficient for all possible uses of random number generators. For example, if you write a simulation program and look for certain patterns in the output, you will want to know whether the random numbers to be used exhibit the same or related patterns. If the patterns at issue are not tested by our programs, you will have to write your own test. By doing so, you gain some confidence that any conclusions from the simulation are due to the simulation model, not the random number generator.
In this article, we present code that implements ten tests discussed by Knuth (1981). This is a reasonably comprehensive collection of tests, but certainly does not examine all possible patterns in the output.
A Frequency Test
We use a frequency test to illustrate the general principles of the tests. Most, if not all, random number generators attempt to generate numbers that are equally likely. Such a distribution of values is called "uniform" because all possible values are equally or uniformly likely. Frequency tests examine whether the frequency of different random numbers is consistent with the subsequences that would be produced by a uniform distribution.
Suppose that you are interested in using a random number generator to generate subsequences of, say, 100 numbers and want to be comfortable that the generated values will look approximately like a uniform distribution. You can use the Kolmogorov-Smirnov (KS) test to check for this property. The KS test uses the maximum deviation between the theoretical uniform distribution and the empirical distribution as a criterion to decide whether the sample was generated by a uniform distribution.
The general idea of the test is illustrated in Figure 1. Suppose that 10 random numbers are generated and sorted in increasing order. The tick marks on the horizontal line at the bottom of Figure 1 show one possible set of such numbers. The horizontal axis in the figure is the set of all possible random numbers, which in this example ranges from 0 to 100. The vertical axis shows the cumulative probability, the probability of that value or a smaller one occurring. If random numbers from 0 to 100 are equally likely, or equivalently the random numbers are uniformly distributed, the cumulative probability for this theoretical distribution is indicated by the 45-degree line in the figure. The empirical distribution of the 10 random numbers generated is a step function and is shown by the step curve in Figure 1. The vertical distance between the theoretical distribution and the empirical step function indicates the deviation between the two. Large deviations suggest that the sample was not generated by a uniform distribution.
Statistical theory indicates that it is preferable to calculate the maximum positive deviation, KnPlus, and the maximum unsigned negative deviation, KnMinus. Our subroutine KSFreq does so. The subroutine KSFreq in Listing 1 shows more precisely how the statistics for the KS test are computed. (The code for this article is too long for us to include all of it. We show illustrative parts. The rest is available in the fashion indicated at the beginning of this magazine.)
The first routine in ksfreq.c is IntCmprFun, a comparison routine passed to the C library function qsort in KSFreq. The call to subroutine KSFreq passes the address of a structure that contains SampleSize, the user-specified number of random numbers in the test, and RandFun, the address of the random number generator. On return, the structure contains the increment to TotNumGen, the total number of random numbers used in the set of tests, as well as the KS test statistics, KnPlus and KnMinus. The structure also contains the cumulative probabilities associated with statistics KnPlus and KnMinus, KnPlusProb, and KnMinusProb respectively.
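For readers without the full listing, the heart of the computation can be sketched as follows. This is a minimal, illustrative version with invented names, not the actual KSFreq interface; it follows Knuth's formulas K+ = sqrt(n) * max(j/n - U(j)) and K- = sqrt(n) * max(U(j) - (j-1)/n), taken over the sorted sample U(1) <= ... <= U(n), here assumed to lie in [0,1).

#include <math.h>
#include <stdlib.h>

/* Comparison routine for qsort, in the same spirit as IntCmprFun. */
static int DoubleCmp(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* knPlus is the largest amount by which the empirical step function
   exceeds the 45-degree line; knMinus is the largest amount by
   which it falls below. Both are scaled by sqrt(n). */
void ComputeKS(double *u, int n, double *knPlus, double *knMinus)
{
    int j;
    double dPlus = 0.0, dMinus = 0.0;

    qsort(u, (size_t)n, sizeof u[0], DoubleCmp);
    for (j = 1; j <= n; j++) {
        double above = (double)j / n - u[j - 1];
        double below = u[j - 1] - (double)(j - 1) / n;
        if (above > dPlus)
            dPlus = above;
        if (below > dMinus)
            dMinus = below;
    }
    *knPlus  = sqrt((double)n) * dPlus;
    *knMinus = sqrt((double)n) * dMinus;
}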
These cumulative probabilities, KnPlusProb and KnMinusProb, sometimes are called "f-values." We use that shorthand name for the computed probabilities of test statistics. An f-value is the probability of a statistic as small as or smaller than the observed one if the generator has the property assumed by the test. Routine KSFreq calls KSmirnov, available with the code for this article, to compute the f-values for KnPlus and KnMinus. The program freqtst, which includes KSFreq, could print these f-values for KnPlus and KnMinus and leave it at that. That is not as informative, though, as a further step.
Looking at one set of SampleSize random numbers is informative, but it is possible to do better. Suppose that you do a KS test on a set of SampleSize random numbers and get an f-value of 0.99 for KnPlus. This means that, if the distribution generating the random numbers is uniform, 99 percent of the time you would expect to get a maximum deviation between the theoretical and empirical distribution this small or smaller. Only one percent of the time would you expect to get a maximum deviation this large or larger. In short, the maximum deviation between the distributions is relatively large and not very likely if the random numbers are uniformly distributed.
Should you throw away the generator? If it were impossible to rerun the test, perhaps: given only the data available, it is not very likely that the generator is producing numbers consistent with a uniform distribution. Obviously, though, you can rerun the test. Suppose that a second KS test has an f-value of 0.10. This indicates that the maximum deviation between the theoretical and empirical distributions is relatively small and is quite consistent with a uniform distribution. Which f-value should you believe? Neither alone is sufficient to indicate what you ought to do.
Because you can rerun the test many times, you can run a set of KS tests and see whether the set of results is consistent with a uniform distribution. Sometimes you will get a relatively large maximum deviation. Sometimes not. If the random numbers are generated by a uniform distribution, any f-value that you see, whether 0.10 or 0.99 or any other value, is itself equally likely. This implies that, if the random numbers are consistent with a uniform distribution, the set of f-values itself has a uniform distribution.
Instead of calling KSFreq to do one KS test, the routine ExecKolSmirTest in Listing 2 does NumKSRuns tests. The value of NumKSRuns is specified by the user. The program freqtst does a KS test on the set of NumKSRuns probabilities from the KS tests. Looking at the probability of a set of results makes it far more likely that you will draw a correct conclusion about the generator.
The general strategy for testing random number generators, then, is the following. The underlying test examines a set of random numbers for some pattern. For example, the KS test looks at the maximum size of deviations between the theoretical and empirical distribution for SampleSize random numbers. Then, the programs run a KS test on the f-values from a set of such tests. This testing procedure makes it possible to be reasonably confident that subsequences of SampleSize random numbers behave as you would expect from uniformly distributed random numbers.
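As a sketch of this two-level structure, with illustrative names: ComputeKS is the sketch shown earlier, and KSProb stands in for a routine such as KSmirnov that returns the f-value of a K+ statistic for a given sample size.

#include <stdio.h>
#include <stdlib.h>

/* Assumed elsewhere: ComputeKS from the earlier sketch and a
   hypothetical KSProb(kn, n) returning P(K+ <= kn) for n numbers. */
void ComputeKS(double *u, int n, double *knPlus, double *knMinus);
double KSProb(double kn, int n);

void TwoLevelKSTest(double (*randFun)(void), int sampleSize, int numRuns)
{
    double *sample = malloc((size_t)sampleSize * sizeof *sample);
    double *fVals  = malloc((size_t)numRuns * sizeof *fVals);
    double knPlus, knMinus, finalPlus, finalMinus;
    int run, i;

    for (run = 0; run < numRuns; run++) {
        for (i = 0; i < sampleSize; i++)
            sample[i] = randFun();               /* one subsequence    */
        ComputeKS(sample, sampleSize, &knPlus, &knMinus);
        fVals[run] = KSProb(knPlus, sampleSize); /* this run's f-value */
    }
    /* If the generator is good, the f-values are themselves a
       uniform sample, so one final KS test summarizes all runs. */
    ComputeKS(fVals, numRuns, &finalPlus, &finalMinus);
    printf("final K+ = %f  K- = %f\n", finalPlus, finalMinus);

    free(sample);
    free(fVals);
}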
Additional Frequency Tests
The KS test is an example of a frequency test. The property examined in this test is the frequency of various random numbers and their consistency with a uniform distribution.
The Chi-Square Test
Another way to examine the frequency of various random numbers is a chi-square test on frequencies. The basic idea of this test is similar to the KS test. The major difference is that, rather than using each random number individually, the numbers are grouped into NumElements groups with nontrivial frequencies in each group. A commonly used rule of thumb is that each group should contain at least five observations on average. In general, the KS test is a better test of whether a set of random numbers is consistent with a uniform distribution, but the chi-square test can be useful if you intend to group the observations from the generator.
Listing 3 shows the code for ChiSqFreq, which calculates a chi-square test statistic on ChiSqData.NumVariates. This routine is broadly similar to KSFreq. Just as the KS test was called NumKSRuns times to generate input for a KS test, ExecChiSqTest in Listing 4 calls ChiSqFreq NUM_PROBS times in order to generate a set of probabilities that are then passed to KSCalc for a KS test.
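The underlying statistic is easy to sketch. Assuming a generator that returns doubles in [0,1), a minimal version (not the article's ChiSqFreq) looks like this:

#include <stdlib.h>

/* Group n deviates into numGroups equal-width cells and form
   V = sum (observed - expected)^2 / expected, which is compared
   with a chi-square distribution on numGroups - 1 degrees of
   freedom. Assumes randFun returns uniform doubles in [0,1). */
double ChiSqUniform(double (*randFun)(void), long n, int numGroups)
{
    long *count = calloc((size_t)numGroups, sizeof *count);
    double expected = (double)n / numGroups, v = 0.0;
    long i;
    int k;

    for (i = 0; i < n; i++) {
        k = (int)(randFun() * numGroups);   /* cell for this deviate */
        if (k >= numGroups)
            k = numGroups - 1;              /* guard against 1.0     */
        count[k]++;
    }
    for (k = 0; k < numGroups; k++)
        v += ((double)count[k] - expected) * ((double)count[k] - expected)
             / expected;
    free(count);
    return v;
}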
The Collision Test
Another frequency test is the collision test. In a way, this test is the inverse of the chi-square test. Rather than group the observations to create a nontrivial probability of a significant number of observations in each group, the collision test groups the observations so that repeat values are relatively unlikely. Repeat values are unlikely because the test sets up a large number of groups relative to the length of the subsequence and the possible values of the random numbers.
The number of bits used from the random numbers is set by the user. The number of random numbers is set at 16,384 (2¹⁴) and the number of groups is set at 1,048,576 (2²⁰). These particular values are suggested by Knuth (1981, p. 69). This test is called the "collision test" because the test examines these groups for an excessive number of repeated values, or collisions. As you might expect by now, the program then does a KS test on the f-values from a set of collision tests.
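A sketch of the central counting loop, under the same assumption of a [0,1) generator (the article's version works on a user-specified number of bits instead):

#include <stdlib.h>

#define NUM_BALLS (1L << 14)   /* 16,384 random numbers   */
#define NUM_URNS  (1L << 20)   /* 1,048,576 groups (urns) */

/* Throw NUM_BALLS values into NUM_URNS urns and count how often a
   value lands in an already-occupied urn. The collision count is
   then compared against its known distribution. */
long CountCollisions(double (*randFun)(void))
{
    unsigned char *occupied = calloc(NUM_URNS, 1);
    long collisions = 0;
    long i, urn;

    for (i = 0; i < NUM_BALLS; i++) {
        urn = (long)(randFun() * NUM_URNS);
        if (urn >= NUM_URNS)
            urn = NUM_URNS - 1;            /* guard against 1.0 */
        if (occupied[urn])
            collisions++;
        else
            occupied[urn] = 1;
    }
    free(occupied);
    return collisions;
}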
Tests on Nearby Values
Rather than examining the frequency of values in subsequences of random numbers, the remaining statistical tests examine the relationship between nearby values for patterns. All of these tests examine nearby values to see whether they are inconsistent with independence. Because random number generators use previous values to determine succeeding values, nearby values clearly are not independent. This sequential generation of random numbers does not imply, though, that the random numbers cannot appear to be independent. The tests in this section examine subsequences to see whether there is apparent sequential dependence that the tests can detect. The purpose of each test is to detect a pattern, and the tests look for different patterns.
The Serial Test
In many respects, the serial test is similar to the chi-square test on frequency. The difference is, whereas the chi-square relative frequency test examines whether individual values are consistent with a uniform distribution, the serial test examines whether sequential, non-overlapping pairs of random numbers have a uniform distribution. This test groups the random numbers into a relatively small number of groups.
The random numbers from the generator are mapped into a relatively small number of distinct integer values, from 10 to 181. This reduction in the number of distinct values is accomplished by the modulo function. With these numbers of distinct values, the number of groups of pairs also is reasonable, from 100 (10²) to 32,761 (181²). A chi-square value tests whether the different pairs are equally likely. As above, a KS test is applied to the f-values from a set of these tests.
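A sketch of the pair counting at the heart of the test, again assuming a [0,1) generator and scaling rather than the modulo reduction the article's code applies to integer output:

#include <stdlib.h>

/* Count non-overlapping pairs (q, r) of reduced values in a d-by-d
   table, then form the chi-square statistic on d*d - 1 degrees of
   freedom. d is the user's choice of distinct values (10 to 181). */
double SerialPairChiSq(double (*randFun)(void), long numPairs, int d)
{
    long cells = (long)d * d;
    long *count = calloc((size_t)cells, sizeof *count);
    double expected = (double)numPairs / cells, v = 0.0;
    long i;
    int q, r;

    for (i = 0; i < numPairs; i++) {
        q = (int)(randFun() * d);     /* first member of the pair  */
        r = (int)(randFun() * d);     /* second member of the pair */
        count[(long)q * d + r]++;
    }
    for (i = 0; i < cells; i++)
        v += ((double)count[i] - expected) * ((double)count[i] - expected)
             / expected;
    free(count);
    return v;
}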
The Serial Correlation Test
Another fairly general test that looks at the relationship between nearby values is the serial correlation test. The simplest version of this test, in Knuth (1981), calculates the correlation between one random number and the preceding one and examines whether this correlation is zero. This single correlation is a special case of a more general examination of correlations of one random number and many preceding ones. This is the test implemented in CalcSerCorChiSq in Listing 5.
Essentially, the routine computes correlations of one random number with preceding ones at various distances and then calculates a test statistic, ChiSqStat, that has a chi-square distribution. The f-values from these tests are then used in a KS test. Because you may be interested in extreme values, CalcSerCorChiSq also keeps track of the number of calculated coefficients more than two standard deviations from the theoretical mean of zero. These computations also indicate how you can scan for other extreme values that might interest you.
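As a minimal sketch of the building block, here is a lag-k sample correlation coefficient; the chi-square aggregation across lags in CalcSerCorChiSq is not reproduced here.

#include <math.h>

/* Ordinary sample correlation between u[i] and u[i + lag] over the
   n - lag usable pairs. Under independence its mean is near zero
   with standard deviation roughly 1/sqrt(n - lag). */
double SerialCorrelation(const double *u, long n, long lag)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    long i, m = n - lag;

    for (i = 0; i < m; i++) {
        double x = u[i], y = u[i + lag];
        sx  += x;      sy  += y;
        sxx += x * x;  syy += y * y;
        sxy += x * y;
    }
    return (m * sxy - sx * sy) /
           sqrt((m * sxx - sx * sx) * (m * syy - sy * sy));
}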
The Run Test
A rather different but very powerful test for sequential randomness is the run test. An example of a run of length three is 4, 5, 6, followed by a 2. This kind of run is known as a "run up." The run test examines subsequences of random numbers and counts the runs up that have lengths 1, 2, 3, 4, 5, and 6 or more. Based on a suggestion in Knuth (1981, p. 74, exercise 14), the program then does a chi-square test on the lengths of the runs. This test determines whether the runs up occur too frequently or not frequently enough for a set of uniformly distributed independent random numbers. As we do in general, we then do a KS test on the f-values from a set of such tests.
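The tallying of runs up is simple to sketch; the chi-square test that follows is not, because adjacent run lengths are not independent, which is what Knuth's suggested statistic adjusts for.

/* Tally "runs up" by length: count[0] holds runs of length 1, ...,
   count[5] holds runs of length 6 or more. A run ends as soon as
   the next value fails to exceed the current one. The caller must
   zero count[] before the call. */
void CountRunsUp(const double *u, long n, long count[6])
{
    long i, len = 1;

    for (i = 1; i < n; i++) {
        if (u[i] > u[i - 1]) {
            len++;                            /* run continues    */
        } else {
            count[len >= 6 ? 5 : len - 1]++;  /* close this run   */
            len = 1;                          /* next run starts  */
        }
    }
    count[len >= 6 ? 5 : len - 1]++;          /* close the last run */
}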
Other Tests
Other classic tests for serial dependence in random numbers also are included in the suite of tests. These tests are the gap test, the coupon-collector's test, the permutation test, the poker test, and the maximum-of-t test. These statistical tests examine subsequences for dependence of random numbers on other values. Detailed discussions are available in Knuth (1981) and Niederreiter (1992).
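To give the flavor of one of these, the maximum-of-t test rests on a simple fact: if V is the maximum of t independent uniforms on [0,1), then V to the power t is itself uniform, so the transformed maxima can be fed straight into the KS machinery. A hedged sketch, not the article's implementation:

#include <math.h>

/* Fill out[0..n-1] with pow(max of t deviates, t). If randFun is
   uniform on [0,1) and its values are independent, out is itself a
   uniform sample suitable for the KS test. */
void MaxOfT(double (*randFun)(void), int t, double *out, long n)
{
    long i;
    int j;

    for (i = 0; i < n; i++) {
        double v = randFun();
        for (j = 1; j < t; j++) {
            double x = randFun();
            if (x > v)
                v = x;               /* keep the running maximum */
        }
        out[i] = pow(v, (double)t);
    }
}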
Examples of Test Results
In the suite of tests, each test is executed by a stand-alone program with many common routines. These common routines range from simple ones that read integers from standard input to more complex routines that calculate the f-value for a KS test statistic. The common routines include a standard interface that makes it easy to add a random number generator to the set already included. As currently written, the testing programs include a variety of generators. The generators currently included in the suite are:
- Plauger ANSI C (Plauger, 1992)
- the generators in Dwyer (1995)
- Math77 (Jones and Seaton, p. 3.1-5)
- Knuth line 26 (Knuth, p. 102)
- Knuth line 27 (Knuth, p. 102)
- Fibonacci sequence (Knuth, p. 26)
- Knuth subtractive (Knuth, pp. 171-173)
- Marsaglia-Zaman-Tsang Ranmar (James, pp. 339-341)
- Marsaglia-Zaman-James Acarry (James, pp. 341-342)
- Linear Shift Register (Maier, 1991)
- Moshier drand (Wichmann and Hill)
- the infamous RANDU (Knuth, p. 104)
- Approximate Factoring Algorithm (Park and Miller)
- and the compiler's library generator
In addition, given the set of interface routines, it is reasonably straightforward to add tests, such as those suggested by Fishman and Moore (1982) and Marsaglia (1985).
Figure 2 shows the output for the serial test on two generators: the universally deplored generator RANDU and L'Ecuyer's combination generator from Dwyer (1995). As Figure 2 indicates, the parameters are the same for both tests. For RANDU, Figure 2 indicates that the positive KS statistic, K(100)+, is 0.000000, indicating that the maximum positive deviation is 0.000000. The f-value, i.e., the probability of this small or a smaller value, is 0.000001. Conversely, the probability of this large or a larger value is 0.999999. In sum, to six digits, there are no positive deviations.
The figure also indicates that the negative KS statistic, K(100)-, is 10.000000. The probability of this small or a smaller value is 1.000000, which indicates that the probability of this large or a larger value is 0.000000. In other words, this large a value is extremely unlikely and is quite inconsistent with a uniform distribution. Underlying these particular f-values is a set of f-values for all of the individual serial tests that are close to 1.0. Such an outcome is quite unlikely for 100 serial tests. Figure 2 also provides the test statistics for the combination generator. These f-values are quite consistent with uniformly distributed adjacent values.
Using the tests in this article, how do you know if a generator is good or bad? Should a good generator pass all of the tests? Or is a generator that passes some, say eight of the eleven, tests the best that can be done? As it turns out, many generators pass all of the tests, with no obvious speed penalty. Hence, our advice is: given the wide availability of generators that pass all of the tests, why settle for one that doesn't?
Next month we'll describe our implementation of the spectral test (Knuth, pp. 89-110; Hopkins). This test checks the quality of linear congruential generators. It employs a library of extra-precision (100-digit) arithmetic. As Knuth (p. 89) points out, "This is by far the most powerful test known and it deserves particular attention." We also will include a brief discussion of the math library. Also in next month's article we'll provide a table of the cycle lengths of all the generators mentioned previously. This will include the one with the cycle length of 3×10¹⁷¹.
References
[1] Jerry Dwyer, "Quick and Portable Random Number Generators," C/C++ Users Journal, 13(6), June 1995, pp. 33-44.
[2] George S. Fishman and Louis R. Moore, "A Statistical Evaluation of Multiplicative Random Number Generators with Modulus 2³¹-1," Journal of the American Statistical Association, 77 (1982), pp. 129-136.
[3] T.R. Hopkins, "A Revised Algorithm for the Spectral Test," Applied Statistics, 32(3), 1983, pp. 328-335 (in Fortran).
[4] F. James, "A review of pseudorandom number generators," Computer Physics Communications, 60, 1990, pp. 329-344.
[5] Lisa K. Jones and Laurie Seaton (eds.), Math77 Reference Manual, Language Systems Corporation, Sterling, Va., 1994.
[6] Donald E. Knuth, The Art of Computer Programming, Volume 2, Seminumerical Algorithms, 2nd Ed., Addison-Wesley, Reading, Mass., 1981.
[7] George Marsaglia, "A Current View of Random Number Generators," Computer Science and Statistics, Proceedings of the Sixteenth Symposium, L. Ballard (ed.), 1985, pp. 3-10.
[8] W.L. Maier, "A Fast Pseudo Random Number Generator," Dr. Dobb's Journal, 176, 1991.
[9] Harald Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 63, Society for Industrial and Applied Mathematics, Philadelphia, Pa., 1992.
[10] P.J. Plauger, The Standard C Library, Prentice Hall, Englewood Cliffs, 1992.
[11] Brian Wichmann and David Hill, "Building a Random-Number Generator," BYTE, March 1987, pp. 127-128.
Jerry Dwyer is a Professor of Economics at Clemson University and a visiting scholar at the Federal Reserve Bank of Atlanta. He has a Ph.D. from the University of Chicago and has been programming for twenty years. He primarily writes programs for statistical analysis. He can be reached at [email protected]. The views expressed are those of the authors and not necessarily those of the Federal Reserve Bank of Atlanta or the Federal Reserve System.
K.B. Williams received a B.A. in Mathematics from California State University at Sacramento in 1955. He is a veteran of forty years in the scientific applications programming trenches and is the founder of King Baker Enterprises, in Stillwater, Oklahoma. He consults in applied numeric techniques. He can be reached at 802 South Ridge Drive, Stillwater, OK 74074, or via e-mail at [email protected].