Benford's Law, also known as the Newcomb-Benford law or the first-digit law, is a surprising observation about the leading digits of numbers in real-world datasets. In many naturally occurring collections of data, smaller leading digits (like 1 and 2) are significantly more common than larger ones (like 8 and 9).

Datasets that often follow this pattern include:

Financial records
Scientific measurements
Astronomical distances
Street addresses

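Why does this happen?

Real-world data often come from growth and multiplication processes and span several orders of magnitude. Such data are approximately scale invariant: rescaling every value (for example, converting between currencies or units) leaves the distribution of leading digits essentially unchanged. The logarithmic distribution described by Benford's Law has this property, and it favors smaller leading digits.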
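
As a rough illustration of this idea (not part of the original analysis), the sketch below counts the leading digits of the first 1,000 powers of 2, a simple multiplicative growth sequence, and then rescales every value by an arbitrary factor of 7. The helper name `first_digit_proportions` and the choice of sequence and factor are assumptions made for the example.

```python
import math
from collections import Counter

def first_digit_proportions(values):
    """Return the proportion of each leading digit 1-9 among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return [counts[d] / total for d in range(1, 10)]

# Benford's expected proportions for digits 1-9
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# multiplicative growth: 2^0, 2^1, ..., 2^999 (exact Python integers)
powers = [2 ** n for n in range(1000)]
original = first_digit_proportions(powers)

# rescale every value by an arbitrary constant (a change of "units")
rescaled = first_digit_proportions([7 * v for v in powers])

for d, (b, o, r) in enumerate(zip(benford, original, rescaled), start=1):
    print(f"{d}: benford={b:.3f}  original={o:.3f}  rescaled={r:.3f}")
```

Both the original and the rescaled sequence should land close to Benford's proportions, which is what scale invariance predicts.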

How can Benford's Law be useful?

Benford's Law can be a quick and powerful tool for detecting anomalies or fraud in data. If a dataset is supposed to reflect real-world activity but deviates significantly from Benford's Law, the numbers may have been manipulated or fabricated.
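
To make the idea concrete, the following sketch uses a deliberately artificial "fabricated" dataset: every whole-dollar amount from 100 to 999 exactly once, which is roughly what evenly spread invented numbers might look like. This dataset is an assumption for illustration only and is unrelated to the P-Card data used later.

```python
import math
from collections import Counter

# hypothetical fabricated amounts: 100, 101, ..., 999 (each used once)
fabricated = range(100, 1000)

counts = Counter(int(str(amount)[0]) for amount in fabricated)
total = sum(counts.values())

for d in range(1, 10):
    actual = counts[d] / total
    expected = math.log10(1 + 1 / d)  # Benford's expected proportion
    print(f"{d}: actual={actual:.3f}  benford={expected:.3f}")
```

Every leading digit appears with proportion 1/9 (about 0.111), far from the roughly 0.301 that Benford's Law expects for a leading 1.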

💵 Real-world example using P-Card transactions

This notebook uses the DC government's purchase card (P-Card) transaction data. From Open Data DC:

In an effort to promote transparency and accountability, DC is providing Purchase Card transaction data to let taxpayers know how their tax dollars are being spent. Purchase Card transaction information is updated monthly. The Purchase Card Program Management Office is part of the Office of Contracting and Procurement.

The latest dataset is available at https://opendata.dc.gov/datasets/DCGIS::purchase-card-transactions/about.

Import packages

In [1]:

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

Read dataset

In [2]:

df = pd.read_csv('DC_PCard_Transactions.csv')
df.head(3)

Out[2]:

	AGENCY	TRANSACTION_DATE	TRANSACTION_AMOUNT	VENDOR_NAME	VENDOR_STATE_PROVINCE	MCC_DESCRIPTION	DCS_LAST_MOD_DTTM	OBJECTID
0	Office of Latino Affairs	2009/01/05 05:00:00+00	16.80	USPS 1050050275 QQQ	DC	Postage Services-Government Only	2009/04/28 20:57:31+00	1
1	Department of Mental Health	2009/01/05 05:00:00+00	229.50	WW GRAINGER 912	DC	Industrial Supplies, Not Elsewhere Classified	2009/04/28 20:57:31+00	2
2	District Department of Transportation	2009/01/05 05:00:00+00	3147.33	BRANCH SUPPLY	DC	Stationery, Office & School Supply Stores	2009/04/28 20:57:31+00	3

In [3]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534641 entries, 0 to 534640
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   AGENCY                 534641 non-null  object 
 1   TRANSACTION_DATE       534641 non-null  object 
 2   TRANSACTION_AMOUNT     534641 non-null  float64
 3   VENDOR_NAME            534602 non-null  object 
 4   VENDOR_STATE_PROVINCE  533012 non-null  object 
 5   MCC_DESCRIPTION        534623 non-null  object 
 6   DCS_LAST_MOD_DTTM      534641 non-null  object 
 7   OBJECTID               534641 non-null  int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 32.6+ MB

🧪 Benford's Law Analysis - First Digit

The code below grabs the first digits of the 'TRANSACTION_AMOUNT' column after converting the column into a string type.

In [4]:

# keep only amounts >= 1, which drops negative amounts and amounts with a leading zero
# take the first character of each amount as the leading digit and count occurrences
df_benford_first_digit = df['TRANSACTION_AMOUNT'] \
    [df['TRANSACTION_AMOUNT'] >= 1] \
    .astype(str).str[0] \
    .value_counts() \
    .to_frame(name="count") \
    .reset_index(names="first_digit") \
    .sort_values('first_digit')

# calculate proportions
df_benford_first_digit['actual_proportion'] = df_benford_first_digit['count'] / df_benford_first_digit['count'].sum()
df_benford_first_digit

Out[4]:

	first_digit	count	actual_proportion
0	1	150073	0.293817
1	2	99240	0.194295
2	3	62232	0.121840
3	4	51923	0.101656
4	5	42581	0.083366
5	6	30417	0.059551
6	7	27649	0.054132
8	8	23169	0.045361
7	9	23486	0.045982
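
The string-slicing approach relies on every amount being printed in plain decimal form and on amounts below 1 having been filtered out first. As an alternative sketch (not the notebook's method), the hypothetical helper `leading_digit` below uses the standard-library `decimal` module to read the first significant digit directly, so it also handles negative values, values below 1, and floats that Python prints in scientific notation.

```python
from decimal import Decimal

def leading_digit(amount):
    """Return the first significant digit of a number (0 if the number is zero)."""
    # str() gives a short decimal string such as '16.8' or '1e+16';
    # Decimal stores its coefficient without leading zeros, so digits[0]
    # is the first significant digit regardless of sign or exponent
    return Decimal(str(amount)).as_tuple().digits[0]

for amount in [16.80, 229.50, 3147.33, 0.05, -42.0, 1e16]:
    print(amount, "->", leading_digit(amount))
```

Whether amounts below 1 or negative amounts (refunds, for instance) should be included at all is a separate analysis decision; the helper only removes the technical reason for excluding them.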

Benford's proposed distribution of leading digit frequencies is given by

\begin{equation} P_i=\log _{10}\left(\frac{i+1}{i}\right) ; \quad i \in\{1,2,3, \ldots, 9\}, \end{equation}

where $P_i$ is the probability of finding $i$ as the leading digit in a given number.
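
As a quick sanity check on the formula, the sketch below (plain Python, independent of the notebook's DataFrame) evaluates $P_i$ for each digit and confirms that the nine probabilities sum to 1, which follows because the sum telescopes to $\log_{10}(10)$. The variable name `benford_first` is illustrative.

```python
import math

# P_i = log10((i + 1) / i) for i = 1..9
benford_first = {i: math.log10((i + 1) / i) for i in range(1, 10)}

for digit, p in benford_first.items():
    print(f"P_{digit} = {p:.6f}")

# the terms telescope: log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10) = 1
print(f"sum = {sum(benford_first.values()):.6f}")
```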

Create a new column that contains Benford's expected proportion for each leading digit.

In [5]:

# append a benford_proportion column that contains Benford's distribution
# (values are assigned by position, so the rows must be sorted by digit 1-9)
df_benford_first_digit['benford_proportion'] = [np.log10(1 + 1 / i) for i in np.arange(1, 10)]
df_benford_first_digit

Out[5]:

	first_digit	count	actual_proportion	benford_proportion
0	1	150073	0.293817	0.301030
1	2	99240	0.194295	0.176091
2	3	62232	0.121840	0.124939
3	4	51923	0.101656	0.096910
4	5	42581	0.083366	0.079181
5	6	30417	0.059551	0.066947
6	7	27649	0.054132	0.057992
8	8	23169	0.045361	0.051153
7	9	23486	0.045982	0.045757

Plot distributions

In [6]:

fig = px.bar(
    data_frame=df_benford_first_digit,
    x='first_digit',
    y=['actual_proportion', 'benford_proportion'],
    title='<b>Proportions of Leading Digits for P-Card Transactions</b><br>\
<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>',
    labels={
        'first_digit': 'First Digit',
    },
    height=500,
    barmode='group',
    template='simple_white'
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
)

fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'

fig.show()

Judging whether a dataset deviates from Benford's Law

There are several ways to judge whether a dataset deviates from Benford's Law, and the choice of method depends on the size and complexity of your data. Here are some common approaches:

Visual Inspection
Chi-Square Test ($\chi^2$)
Mean Absolute Deviation (MAD)
Sum of Squared Differences (SSD)

Visual inspection is the simplest method, but it does not produce a numeric measure for comparison. The bar chart above shows no significant deviations, but a visual judgment becomes ambiguous as deviations start to widen.

The other three methods are empirically based criteria for conformity to Benford's Law.

Although $\chi^2$ is the most common measure of conformity to Benford's Law, research suggests that $\chi^2$ is widely misused and misinterpreted.
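
One frequently discussed concern, offered here as an assumption about what this criticism refers to, is that the $\chi^2$ statistic grows in proportion to the sample size, so a large dataset can fail the test even when its proportions are close to Benford's. The sketch below applies the standard goodness-of-fit formula (sample size times the sum of squared differences between actual and expected proportions, each divided by the expected proportion) to the first-digit proportions from the table above, once with the actual number of transactions and once pretending that only 500 transactions had produced the same proportions. The value 15.507 is the 5% critical value for 8 degrees of freedom.

```python
import math

# actual first-digit proportions for digits 1-9, copied from the table above
actual = [0.293817, 0.194295, 0.121840, 0.101656, 0.083366,
          0.059551, 0.054132, 0.045361, 0.045982]
# Benford's expected proportions for digits 1-9
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_square(actual, expected, n):
    """Pearson goodness-of-fit statistic computed from proportions and sample size n."""
    return n * sum((a - e) ** 2 / e for a, e in zip(actual, expected))

CRITICAL_5PCT_DF8 = 15.507  # 5% critical value, 9 bins - 1 = 8 degrees of freedom

n_actual = 510770  # sum of the counts in the first-digit table
n_small = 500      # hypothetical smaller sample with identical proportions

for n in (n_small, n_actual):
    stat = chi_square(actual, expected, n)
    verdict = "reject" if stat > CRITICAL_5PCT_DF8 else "do not reject"
    print(f"N={n}: chi-square={stat:.1f} -> {verdict} conformity at the 5% level")
```

The proportions are identical in both runs; only the sample size changes the verdict, which is one reason size-independent measures are attractive for large datasets.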

This notebook covers Mean Absolute Deviation (MAD) and Sum of Squared Differences (SSD). Both are easy to calculate, provide a single numeric value summarizing the deviation, and are useful for comparing deviations across different datasets.

Mean Absolute Deviation (MAD)

The MAD measure is given by

\begin{equation} \mathrm{MAD}=\frac{\sum_{i=1}^K\left|AP_i-EP_i\right|}{K}, \end{equation}

where

$K$ is the number of digit bins (9 for the first digit; 10 for the second digit; 90 for the first two digits),
$i$ indexes the bins (the digits 1 to 9 for the first-digit test),
$AP_i$ is the actual proportion observed in bin $i$,
$EP_i$ is the expected proportion for bin $i$ according to Benford's Law.

In [7]:

mad = abs(df_benford_first_digit['actual_proportion'] - df_benford_first_digit['benford_proportion']).sum() / df_benford_first_digit.shape[0]
print(f'MAD value = {round(mad, 4)}')

MAD value = 0.0061
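
The same calculation can be written without pandas. The sketch below copies the rounded actual proportions from the first-digit table above and recomputes Benford's proportions from the formula; the function name `mean_absolute_deviation` is a hypothetical helper, not part of any library.

```python
import math

def mean_absolute_deviation(actual, expected):
    """Average absolute difference between actual and expected proportions."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same number of bins")
    return sum(abs(a - e) for a, e in zip(actual, expected)) / len(actual)

# actual first-digit proportions for digits 1-9, copied from the table above
actual = [0.293817, 0.194295, 0.121840, 0.101656, 0.083366,
          0.059551, 0.054132, 0.045361, 0.045982]
# Benford's expected proportions for digits 1-9
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]

mad_check = mean_absolute_deviation(actual, expected)
print(f"MAD value = {mad_check:.4f}")
```

Because the helper divides by the number of bins it is given, the same function works unchanged for the 10-bin second-digit test and the 90-bin first-two-digits test.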

Interpreting MAD value

Nigrini's study suggests the following MAD ranges and conclusions for first digits.

MAD Range	Conformity
0.000 to 0.006	Close conformity
0.006 to 0.012	Acceptable conformity
0.012 to 0.015	Marginally acceptable conformity
Above 0.015	Nonconformity

In [8]:

def benford_first_digit_interpretation_MAD(mad):
    if mad < 0.006:
        return 'Close Conformity'
    elif mad < 0.012:
        return 'Acceptable Conformity'
    elif mad < 0.015:
        return 'Marginal Conformity'
    else:
        return 'Nonconformity'

In [9]:

print(f'MAD value of {round(mad, 4)} can be interpreted as {benford_first_digit_interpretation_MAD(mad)}.')

MAD value of 0.0061 can be interpreted as Acceptable Conformity.

Sum of Squared Differences (SSD)

The SSD measure is given by

\begin{equation} \mathrm{SSD}=\sum_{i=1}^K\left(AP_i-EP_i\right)^2 \times 10^4, \end{equation}

where

$K$ is the number of digit bins (9 for the first digit; 10 for the second digit; 90 for the first two digits),
$i$ indexes the bins (the digits 1 to 9 for the first-digit test),
$AP_i$ is the actual proportion observed in bin $i$,
$EP_i$ is the expected proportion for bin $i$ according to Benford's Law.

In [10]:

ssd = sum(((df_benford_first_digit['actual_proportion'] - df_benford_first_digit['benford_proportion']) ** 2) * (10 ** 4))
print(f'SSD value = {round(ssd, 4)}')

SSD value = 5.3623
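
The factor of $10^4$ simply means that the differences are measured in percentage points before squaring. The plain-Python sketch below makes that explicit, using the rounded first-digit proportions from the table above; `sum_of_squared_differences` is a hypothetical helper name.

```python
import math

def sum_of_squared_differences(actual, expected):
    """Sum of squared differences, with proportions expressed in percentage points."""
    # (100 * a - 100 * e) ** 2 equals (a - e) ** 2 * 10**4
    return sum((100 * a - 100 * e) ** 2 for a, e in zip(actual, expected))

# actual first-digit proportions for digits 1-9, copied from the table above
actual = [0.293817, 0.194295, 0.121840, 0.101656, 0.083366,
          0.059551, 0.054132, 0.045361, 0.045982]
# Benford's expected proportions for digits 1-9
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]

ssd_check = sum_of_squared_differences(actual, expected)
print(f"SSD value = {ssd_check:.1f}")
```

Unlike MAD, SSD is not divided by the number of bins, and squaring gives extra weight to the largest single deviation (here, the digit 2).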

Interpreting SSD value

Kossovsky's study suggests the following SSD ranges and conclusions for first digits.

SSD Range	Conformity
0 to 2	Perfect conformity
2 to 25	Acceptable conformity
25 to 100	Marginal conformity
Above 100	Nonconformity

In [11]:

def benford_first_digit_interpretation_SSD(ssd):
    if ssd < 2:
        return 'Perfect Conformity'
    elif ssd < 25:
        return 'Acceptable Conformity'
    elif ssd < 100:
        return 'Marginal Conformity'
    else:
        return 'Nonconformity'

In [12]:

print(f'SSD value of {round(ssd, 1)} can be interpreted as {benford_first_digit_interpretation_SSD(ssd)}.')

SSD value of 5.4 can be interpreted as Acceptable Conformity.

Both the MAD and SSD measures indicate "Acceptable conformity" for the first digit.

🧪 Benford's Law Analysis - Second Digit

While Benford's Law is best known for the first digit, it also applies to the second digit. It can be extended to later digits as well, although the expected distribution flattens toward uniform the further you go, so deviations become harder to interpret (an analysis of the first two digits combined follows in the next section).

The code below grabs the second digit of the 'TRANSACTION_AMOUNT' column after converting the column into a string type.

In [13]:

# only keep transactions with an amount greater than or equal to $10
# retrieve the second digit and use value_counts to find frequency
# use reset_index() for a clean index from 0 to 9 (optional)
df_benford_second_digit = df['TRANSACTION_AMOUNT'] \
    [df['TRANSACTION_AMOUNT'] >= 10] \
    .astype(str).str[1] \
    .value_counts() \
    .to_frame(name="count") \
    .reset_index(names="second_digit") \
    .sort_values('second_digit') \
    .reset_index(drop=True)

# calculate proportions
df_benford_second_digit['actual_proportion'] = df_benford_second_digit['count'] / df_benford_second_digit['count'].sum()
df_benford_second_digit

Out[13]:

	second_digit	count	actual_proportion
0	0	79714	0.159995
1	1	47271	0.094878
2	2	50774	0.101909
3	3	42465	0.085232
4	4	48472	0.097289
5	5	59514	0.119451
6	6	38652	0.077579
7	7	40786	0.081862
8	8	37161	0.074586
9	9	53420	0.107220

In [14]:

# append a benford_proportion column that contains Benford's second-digit distribution
df_benford_second_digit['benford_proportion'] = [sum([np.log10(1 + 1 / (j + i)) 
    for j in np.arange(start=10, stop=99, step=10)]) for i in np.arange(10)]
df_benford_second_digit

Out[14]:

	second_digit	count	actual_proportion	benford_proportion
0	0	79714	0.159995	0.119679
1	1	47271	0.094878	0.113890
2	2	50774	0.101909	0.108821
3	3	42465	0.085232	0.104330
4	4	48472	0.097289	0.100308
5	5	59514	0.119451	0.096677
6	6	38652	0.077579	0.093375
7	7	40786	0.081862	0.090352
8	8	37161	0.074586	0.087570
9	9	53420	0.107220	0.084997
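
The list comprehension above encodes the second-digit form of Benford's Law: the probability that the second digit equals $d$ is the sum of the first-two-digit probabilities over all nine possible first digits, $\sum_{k=1}^{9}\log_{10}\left(1+\frac{1}{10k+d}\right)$. The plain-Python sketch below restates that calculation and checks that the ten probabilities sum to 1; the function name `benford_second_digit` is illustrative.

```python
import math

def benford_second_digit(d):
    """Probability that the second significant digit equals d (0-9)."""
    # sum over every possible first digit k of P(first two digits = 10k + d)
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

second = [benford_second_digit(d) for d in range(10)]
for d, p in enumerate(second):
    print(f"P(second digit = {d}) = {p:.6f}")

print(f"sum = {sum(second):.6f}")
# the gap between the most and least likely second digit is far narrower
# than for the first digit (about 0.301 versus 0.046)
print(f"max - min = {max(second) - min(second):.6f}")
```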

Plot distributions

In [15]:

fig = px.bar(
    data_frame=df_benford_second_digit,
    x='second_digit',
    y=['actual_proportion', 'benford_proportion'],
    title='<b>Proportions of Second Digits for P-Card Transactions</b><br>\
<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>',
    labels={
        'second_digit': 'Second Digit',
        'count': 'Actual Count',
    },
    height=500,
    barmode='group',
    template='simple_white',
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
    legend=dict(
        yanchor="top",
        y=0.9,
        xanchor="left",
        x=0.75
    ),
)

fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'

fig.show()

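The chart shows distinct deviations: the second digits 0, 5, and 9 occur noticeably more often than Benford's Law expects, while every other digit occurs less often.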

Judging deviation of the second digits using MAD

In [16]:

mad_second_digits = abs(df_benford_second_digit['actual_proportion'] - df_benford_second_digit['benford_proportion']).sum() / df_benford_second_digit.shape[0]
print(f'MAD value = {round(mad_second_digits, 5)}')

MAD value = 0.01706

Interpreting MAD value

Nigrini's study suggests the following MAD ranges and conclusions for the second digits. Note that the ranges differ from the previous table where only the first digits were used for analysis.

MAD Range	Conformity
0.000 to 0.008	Close conformity
0.008 to 0.010	Acceptable conformity
0.010 to 0.012	Marginally acceptable conformity
Above 0.012	Nonconformity

In [17]:

def benford_second_digit_interpretation_MAD(mad):
    if mad < 0.008:
        return 'Close Conformity'
    elif mad < 0.010:
        return 'Acceptable Conformity'
    elif mad < 0.012:
        return 'Marginal Conformity'
    else:
        return 'Nonconformity'

In [18]:

print(f'MAD value of {round(mad_second_digits, 4)} can be interpreted as {benford_second_digit_interpretation_MAD(mad_second_digits)}.')

MAD value of 0.0171 can be interpreted as Nonconformity.

⚠️ The second-digit test indicates nonconformity, which may point to manipulation and warrants a closer look!

🧪 Benford's Law Analysis - First Two Digits Combined

Benford's Law can also be applied to the first two digits combined. Because this test uses 90 bins (10 through 99) instead of 9 or 10, it gives a finer-grained view of the distribution and can point to specific leading-digit combinations that occur unusually often.

The code below grabs the first two digits of the 'TRANSACTION_AMOUNT' column after converting the column into a string type.

In [19]:

# only keep transactions with an amount greater than or equal to $10
# retrieve the first two digits and use value_counts to find frequency
# use reset_index() for a clean index from 0 to 89 (optional)
df_benford_first_two_digits = df['TRANSACTION_AMOUNT'] \
    [df['TRANSACTION_AMOUNT'] >= 10] \
    .astype(str).str[:2] \
    .value_counts() \
    .to_frame(name="count") \
    .reset_index(names="first_two_digits") \
    .sort_values('first_two_digits') \
    .reset_index(drop=True)

# calculate proportions
df_benford_first_two_digits['actual_proportion'] = df_benford_first_two_digits['count'] / df_benford_first_two_digits['count'].sum()
df_benford_first_two_digits

Out[19]:

	first_two_digits	count	actual_proportion
0	10	23603	0.047374
1	11	16148	0.032411
2	12	16886	0.033892
3	13	13807	0.027712
4	14	13892	0.027883
...	...	...	...
85	95	2853	0.005726
86	96	1689	0.003390
87	97	1784	0.003581
88	98	1570	0.003151
89	99	4190	0.008410

90 rows × 3 columns

In [20]:

# append a benford_proportion column that contains Benford's distribution
df_benford_first_two_digits['benford_proportion'] = [np.log10(1 + 1 / i) for i in np.arange(10, 100)]
df_benford_first_two_digits

Out[20]:

	first_two_digits	count	actual_proportion	benford_proportion
0	10	23603	0.047374	0.041393
1	11	16148	0.032411	0.037789
2	12	16886	0.033892	0.034762
3	13	13807	0.027712	0.032185
4	14	13892	0.027883	0.029963
...	...	...	...	...
85	95	2853	0.005726	0.004548
86	96	1689	0.003390	0.004501
87	97	1784	0.003581	0.004454
88	98	1570	0.003151	0.004409
89	99	4190	0.008410	0.004365

90 rows × 4 columns
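
The assignment above matches expected values to rows by position, which works because all 90 two-digit combinations from 10 to 99 occur in the data and the rows are sorted. As a sketch of a position-independent alternative, the expected proportions can be kept in a lookup keyed by the two-digit value; the variable names are illustrative. The check at the end confirms that the first-two-digit probabilities for 10 through 19 add up to the first-digit probability of a leading 1.

```python
import math

# expected proportion for each first-two-digit combination 10..99
benford_first_two = {d: math.log10(1 + 1 / d) for d in range(10, 100)}

print(f"P(10) = {benford_first_two[10]:.6f}")
print(f"P(99) = {benford_first_two[99]:.6f}")
print(f"sum over all 90 bins = {sum(benford_first_two.values()):.6f}")

# marginalizing over the second digit recovers the first-digit law
p_leading_one = sum(benford_first_two[d] for d in range(10, 20))
print(f"sum of P(10)..P(19) = {p_leading_one:.6f}, log10(2) = {math.log10(2):.6f}")
```

With pandas, such a lookup could be applied with something like `df_benford_first_two_digits['first_two_digits'].astype(int).map(benford_first_two)`, which stays correct even if some combination is absent from the data.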

Plot distributions

In [21]:

fig = px.bar(
    data_frame=df_benford_first_two_digits,
    x='first_two_digits',
    y=['actual_proportion', 'benford_proportion'],
    title='<b>Proportions of Leading Two Digits for P-Card Transactions</b><br>\
<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>',
    labels={
        'first_two_digits': 'First Two Digits',
    },
    height=500,
    barmode='group',
    template='simple_white',
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
    legend=dict(
        yanchor="top",
        y=0.9,
        xanchor="left",
        x=0.75
    ),
)

fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'

fig.show()

The chart again shows distinct deviations; for example, the combinations 10 and 99 occur more often than Benford's Law expects.

Judging deviation of the first two digits using MAD

In [22]:

mad_first_two_digits = abs(df_benford_first_two_digits['actual_proportion'] - df_benford_first_two_digits['benford_proportion']).sum() / df_benford_first_two_digits.shape[0]
print(f'MAD value = {round(mad_first_two_digits, 5)}')

MAD value = 0.00208

Interpreting MAD value

Nigrini's study suggests the following MAD ranges and conclusions for the first two digits. Note that the ranges differ from the previous tables for the first-digit and second-digit tests.

MAD Range	Conformity
0.0000 to 0.0012	Close conformity
0.0012 to 0.0018	Acceptable conformity
0.0018 to 0.0022	Marginally acceptable conformity
Above 0.0022	Nonconformity

In [23]:

def benford_interpretation_first_two_digits_MAD(mad):
    if mad < 0.0012:
        return 'Close Conformity'
    elif mad < 0.0018:
        return 'Acceptable Conformity'
    elif mad < 0.0022:
        return 'Marginal Conformity'
    else:
        return 'Nonconformity'

In [24]:

print(f'MAD value of {round(mad_first_two_digits, 5)} can be interpreted as {benford_interpretation_first_two_digits_MAD(mad_first_two_digits)}.')

MAD value of 0.00208 can be interpreted as Marginal Conformity.

Although the bar chart shows notable deviations, the MAD measure falls within the "marginal conformity" range.
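
The three interpretation functions differ only in their cut-off values, so they can be folded into one table-driven helper. The sketch below is one possible consolidation, using the threshold tables quoted above; the names `MAD_THRESHOLDS`, `LABELS`, and `interpret_mad` are hypothetical.

```python
from bisect import bisect_right

# upper bounds of the close / acceptable / marginal ranges for each test,
# taken from the MAD tables above
MAD_THRESHOLDS = {
    "first_digit": (0.006, 0.012, 0.015),
    "second_digit": (0.008, 0.010, 0.012),
    "first_two_digits": (0.0012, 0.0018, 0.0022),
}
LABELS = ("Close Conformity", "Acceptable Conformity",
          "Marginal Conformity", "Nonconformity")

def interpret_mad(mad, test):
    """Map a MAD value to a conformity label for the given digit test."""
    # bisect_right counts how many cut-offs are <= mad, which is the label index;
    # a value exactly on a cut-off falls into the higher (less conforming) range,
    # matching the strict '<' comparisons in the functions above
    return LABELS[bisect_right(MAD_THRESHOLDS[test], mad)]

for test, value in [("first_digit", 0.0061),
                    ("second_digit", 0.01706),
                    ("first_two_digits", 0.00208)]:
    print(f"{test}: MAD={value} -> {interpret_mad(value, test)}")
```

Keeping the cut-offs in one dictionary makes it harder to apply the first-digit thresholds to a second-digit MAD by mistake.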

😱 Nonconformity and marginal conformity... now what?

The summary of Benford's Law analysis is as follows:

Benford's Law Tested Digit(s)	Metric	Value	Interpretation
First Digit	MAD	0.0061	Acceptable conformity
First Digit	SSD	5.3623	Acceptable conformity
Second Digit	MAD	0.0171	Nonconformity
First Two Digits Combined	MAD	0.0021	Marginal conformity

Although we should be slightly concerned with "Nonconformity" from the second digit test, we can't draw a conclusion that fraud or manipulation exists without a follow-up analysis.
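
One possible first step in such a follow-up, sketched here as a suggestion and not as the author's procedure, is to rank the digit bins by how far they sit above Benford's expectation, so that the review can focus on the amounts behind the most overrepresented digits. The proportions are copied from the second-digit table above.

```python
import math

# actual second-digit proportions for digits 0-9, copied from the table above
actual = [0.159995, 0.094878, 0.101909, 0.085232, 0.097289,
          0.119451, 0.077579, 0.081862, 0.074586, 0.107220]
# Benford's expected second-digit proportions
expected = [sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
            for d in range(10)]

# signed deviation per digit: positive means the digit appears more often than expected
deviations = {d: actual[d] - expected[d] for d in range(10)}
ranked = sorted(deviations, key=deviations.get, reverse=True)

for d in ranked:
    print(f"second digit {d}: deviation = {deviations[d]:+.4f}")
```

The digits 0, 5, and 9 rank highest, so a natural next step would be to filter the transactions whose second digit is one of those and look at which vendors, agencies, or amounts drive the excess.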

Closing thoughts

Benford's Law tends to hold for datasets with specific characteristics, such as naturally occurring populations, financial data, and physical measurements that span several orders of magnitude. It does not apply to human-assigned numbers such as ID numbers or phone numbers.
Analyzing multiple digits (first, second, or the first two combined) can reveal deviations that the first-digit test alone misses.
Benford's Law is a statistical observation, not a strict rule, so deviations can occur for legitimate reasons.
Deviations from the law can indicate data manipulation or errors, which makes it a valuable screening tool for fraud detection.

Citations

Slepkov AD, Ironside KB, DiBattista D. Benford's Law: textbook exercises and multiple-choice testbanks. PLoS One. 2015 Feb 17;10(2):e0117972. doi: 10.1371/journal.pone.0117972. PMID: 25689468; PMCID: PMC4331362.

Benford's Law Application and Interpretation
