Benford's Law is a fascinating property of many naturally occurring sets of numbers: their leading digits follow a skewed, non-uniform distribution. It has been shown to apply to a wide variety of datasets, including electricity bills, stock prices, lengths of rivers, Fibonacci numbers, and factorials.
This repository contains a Jupyter notebook investigating whether the leading digits of weights in a neural network follow Benford's Law.
It appears that the weights of a network do not follow Benford's Law before training, approximately follow it after convergence, and then deviate from it again once the model starts overfitting.
A surprising result is that validation accuracy appears to peak around the time the leading-digit MAD (mean absolute deviation) from Benford's Law reaches its minimum! This has been observed on MNIST and Fashion MNIST with several different architectures, but it needs to be explored on more architectures and datasets.
From the Wikipedia page, Benford's Law "states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time."
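The percentages in that quote follow from the usual statement of the law, which gives the probability of leading digit d as log10(1 + 1/d). As a minimal sketch (the function name `benford_probabilities` is made up for this README, not taken from the notebook), the expected distribution can be tabulated like this:

```python
import math


def benford_probabilities():
    """Return {digit: expected probability} for leading digits 1-9."""
    # Benford's Law: P(d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


expected = benford_probabilities()
for digit, p in expected.items():
    print(f"{digit}: {p:.3f}")
```

The nine probabilities sum to 1, with digit 1 at roughly 30% and digit 9 below 5%, matching the figures quoted above.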
Here's a great Numberphile video talking about Benford's Law.
I compared the leading-digit distribution of the weights before training and after convergence for a convolutional neural network architecture adapted from the Keras documentation. I compared the distributions for the weights in just the first layer and in all layers of the network, on both MNIST and Fashion MNIST.
The leading digit was calculated by ignoring the weight sign and taking the first non-zero digit in the weight value.
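The rule above can be sketched in plain Python. This is an illustration under stated assumptions, not necessarily how the notebook does it: the helper names are hypothetical, the digit is read from a scientific-notation string, and "MAD" is assumed to mean the mean absolute deviation between observed and Benford frequencies across the nine digits.

```python
import math
from collections import Counter


def leading_digit(value):
    """Return the first non-zero digit of |value|, or None if there is none."""
    x = abs(float(value))  # ignore the sign
    if x == 0.0 or not math.isfinite(x):
        return None  # zero, inf and NaN have no leading digit
    # Scientific notation starts with the first significant digit, e.g.
    # 0.0034 -> '3.400000000000000e-03'. Rounding to 16 significant digits
    # makes a stored 0.3 (slightly below 0.3 in binary) report 3, not 2.
    return int(f"{x:.15e}"[0])


def leading_digit_distribution(values):
    """Return {digit: observed frequency} for digits 1-9."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("no non-zero finite values")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def mad_vs_benford(values):
    """Mean absolute deviation from Benford's Law (assumed meaning of MAD)."""
    observed = leading_digit_distribution(values)
    return sum(
        abs(observed[d] - math.log10(1 + 1 / d)) for d in range(1, 10)
    ) / 9


weights = [0.0123, -0.18, 0.0031, 1.7, -0.00095, 0.0]  # made-up values
print(leading_digit_distribution(weights))
print(mad_vs_benford(weights))
```

For a Keras model, `values` could presumably be built by flattening each array returned by `model.get_weights()`; zeros are skipped here because they have no leading digit.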
- Plot mean deviance over time as the network is trained
- See if sampling starting weights from Benford's Law improves convergence
- Check if results hold for different architectures/datasets
- Perform goodness-of-fit distribution tests
- Compare different weight initializations
- Return weight statistics before/after training
- Investigate using deviation as a measure of network fit
Inspired by this Reddit thread.