Rolling Two Dice 1200 Times Each To See That They Are Fair
Portuguese version here.
I have these two dice that a friend gifted me, and I love them. But I wanted to know whether they are fair enough to play board games with, and to show people that I use these dice because I like them, not because of some shady way to win more games.
So, I started rolling! I used the same table for every roll, trying to keep the procedure identical, because while truly fair dice should be fair on any surface, some types of bias may only show up on certain surfaces.
Ideal dice
When you roll a die, ideally you are randomly selecting one of the faces of the die. Each roll is also independent of the previous rolls; that is, the odds of getting a particular face shouldn't depend on what you've rolled in the past. And the probability distribution over the faces should be uniform; in other words, the probability of getting any face should be 1/6, or more generally 1/f, where f is the number of faces on the die.
So, if you roll a fair die a lot of times, count how many times each face comes up, and graph it in a histogram, your graph should look flat, like this:
But when you actually roll the die, your histogram ends up looking more like this:
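You don't even need a physical die to see this; a quick simulation of an ideal die produces the same kind of bumpy histogram. A minimal sketch (the sample size and variable names are my own choices, not from the notebook):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
rolls = rng.integers(1, 7, size=1200)  # 1200 rolls of an ideal six-sided die
faces, counts = np.unique(rolls, return_counts=True)

plt.bar(faces, counts)
plt.axhline(1200 / 6, color='red', linestyle='--', label='ideal: 200 per face')
plt.xlabel('Face')
plt.ylabel('Count')
plt.legend()
plt.show()

Each run gives a slightly different histogram, and none of them is perfectly flat.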
The question is: how many times do we need to roll the die to be confident that the real histogram is close enough to the ideal one, either confirming the die is fair or revealing its most common face?
Statistical Tests
For Pearson's χ² test, a common rule of thumb is to have at least five times as many rolls as there are sides on the die, so that the expected count for each face is at least five. Thus, for a six-sided die, you need at least 30 rolls for the test to be valid.
For the Anderson-Darling test, you need only around 20 samples to see some results.
Obviously, more rolls won't hurt if you have the patience for it, and the more rolls you tally up, the better the test will detect subtle biases. Extra data improves the statistical power of the test, which measures not how accurate the test result is but how likely the test is to catch a bias that really exists.
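To get a feel for what power means here, you can simulate a die with a known bias and count how often the χ² test catches it at different roll counts. A rough sketch, assuming a hypothetical die that lands on six 20% of the time instead of about 16.7%:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
biased = [0.16, 0.16, 0.16, 0.16, 0.16, 0.20]  # hypothetical bias toward the six

for n_rolls in (30, 120, 600, 1200):
    rejections = 0
    for _ in range(1000):  # 1000 simulated experiments per roll count
        rolls = rng.choice(6, size=n_rolls, p=biased)
        counts = np.bincount(rolls, minlength=6)
        _, p_value = stats.chisquare(counts)
        if p_value < 0.05:  # reject H0 at the 95% confidence level
            rejections += 1
    print(f'{n_rolls} rolls: bias detected in {rejections / 10:.1f}% of runs')

The detection rate grows with the number of rolls; that growth is exactly the power improving.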
So, to improve the power of my tests, I rolled each die 30 times per sample and wrote the results down like this:
# Sample 0
plus[0] = [2, 6, 6, 5, 4, 3, 6, 6, 4, 3, 6, 4, 1, 5, 3, 2, 6, 6, 4, 4, 5, 4, 6, 3, 5, 4, 3, 2, 4, 5]
minus[0] = [3, 1, 6, 6, 6, 5, 2, 2, 1, 3, 2, 3, 5, 5, 6, 3, 1, 6, 6, 1, 4, 1, 5, 5, 2, 2, 1, 4, 3, 3]
I repeated this 40 times, giving 1200 rolls per die, stacked the samples together, and started testing!
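In code, stacking the samples and tallying the per-face frequencies might look like the sketch below, assuming plus is the list of the 40 samples written down above (the variable names are mine, not necessarily the notebook's):

import numpy as np

samples_plus = np.array(plus)  # shape (40, 30): 40 samples of 30 rolls each
all_plus = samples_plus.ravel()  # all 1200 rolls in a single array

# freq_plus[i] counts how many times face i+1 came up over the 1200 rolls
freq_plus = np.bincount(all_plus, minlength=7)[1:]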
This is what my histograms look like:
Test Results
For Pearson's χ² test, we want to see whether my histogram is close enough to the ideal one; in other words, whether the frequencies of the sides are all the same. Therefore, we have the following hypotheses:
- H0: All sides of the die are equally likely.
- H1: One or more sides of the die have a frequency significantly different from the other sides.
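Concretely, the χ² statistic is the sum of (observed − expected)² / expected over the six faces; the bigger it is, the further the counts are from flat. A quick sketch with made-up counts (the real ones are in the notebook):

import numpy as np

freq = np.array([210, 185, 200, 195, 220, 190])  # hypothetical counts, sum = 1200
expected = freq.sum() / 6  # 200 rolls per face expected from a fair die
chi2 = ((freq - expected) ** 2 / expected).sum()
print(f'Χ² = {chi2}')  # 4.25 for these counts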
The code for the test:
from scipy import stats

# freq_plus holds the observed count of each face over all 1200 rolls
critical = [0.85, 0.9, 0.95, 0.975, 0.99]
statistic, p_value = stats.chisquare(freq_plus)
print(f'Χ² = {statistic}')
for crit in critical:
    print('')
    # compare the statistic against the χ² critical value (5 degrees of freedom)
    if statistic < stats.chi2.ppf(crit, 5):
        print(f'Fail to reject H0 by critical value with {100*crit}% confidence.')
    else:
        print(f'Reject H0 by critical value with {100*crit}% confidence.')
    if p_value < (1 - crit):
        print(f'Reject H0 by p-value with {100*crit}% confidence.')
    else:
        print(f'Fail to reject H0 by p-value with {100*crit}% confidence.')
Plus die:
Χ² = 5.92
Fail to reject H0 by critical value with 85.0% confidence.
Fail to reject H0 by p-value with 85.0% confidence.
Fail to reject H0 by critical value with 90.0% confidence.
Fail to reject H0 by p-value with 90.0% confidence.
Fail to reject H0 by critical value with 95.0% confidence.
Fail to reject H0 by p-value with 95.0% confidence.
Fail to reject H0 by critical value with 97.5% confidence.
Fail to reject H0 by p-value with 97.5% confidence.
Fail to reject H0 by critical value with 99.0% confidence.
Fail to reject H0 by p-value with 99.0% confidence.
Minus die:
Χ² = 4.74
Fail to reject H0 by critical value with 85.0% confidence.
Fail to reject H0 by p-value with 85.0% confidence.
Fail to reject H0 by critical value with 90.0% confidence.
Fail to reject H0 by p-value with 90.0% confidence.
Fail to reject H0 by critical value with 95.0% confidence.
Fail to reject H0 by p-value with 95.0% confidence.
Fail to reject H0 by critical value with 97.5% confidence.
Fail to reject H0 by p-value with 97.5% confidence.
Fail to reject H0 by critical value with 99.0% confidence.
Fail to reject H0 by p-value with 99.0% confidence.
For the Anderson-Darling test, we want to see whether the sums of the top faces in each sample are normally distributed, since, unlike single faces, the possible sums are not all equally likely. By the central limit theorem, the sum of 30 independent rolls of a fair die should be approximately normal. The values are standardized first, so we expect a mean of 0 and a standard deviation of 1. Therefore, we have the following hypotheses:
- H0: The standardized sums of the values in each sample of the die are normally distributed.
- H1: The standardized sums of the values in each sample of the die follow some other distribution.
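Before running the test, Z_plus has to be built from the sample sums. A minimal sketch, reusing the samples_plus array from the earlier sketch:

# sum each 30-roll sample, then standardize to mean 0 and standard deviation 1
sums_plus = samples_plus.sum(axis=1)  # 40 sums, one per sample
Z_plus = (sums_plus - sums_plus.mean()) / sums_plus.std()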
The code for the test:
from scipy import stats

# Z_plus holds the standardized sums of the 40 samples
statistic, crit, sign = stats.anderson(Z_plus, 'norm')
print(f"A² = {'%.4f' % statistic}")
print('')
for i in range(len(sign)):
    if statistic < crit[i]:
        print(f'Fail to reject H0 with {100-sign[i]}% confidence.')
    else:
        print(f'Reject H0 with {100-sign[i]}% confidence.')
Plus die:
A² = 0.4785
Fail to reject H0 with 85.0% confidence.
Fail to reject H0 with 90.0% confidence.
Fail to reject H0 with 95.0% confidence.
Fail to reject H0 with 97.5% confidence.
Fail to reject H0 with 99.0% confidence.
Minus die:
A² = 0.3270
Fail to reject H0 with 85.0% confidence.
Fail to reject H0 with 90.0% confidence.
Fail to reject H0 with 95.0% confidence.
Fail to reject H0 with 97.5% confidence.
Fail to reject H0 with 99.0% confidence.
Conclusion
With the power of statistics, now I can use my lovely dice in my board games and RPG sessions without people thinking I'm cheating!
You can see a notebook with all the rolls, graphs and tests here.