Benford's Law, also known as the Newcomb-Benford law or the first-digit law, is a surprising observation about the leading digits of numbers in real-world datasets. In many naturally occurring collections of data, smaller leading digits (like 1 and 2) are significantly more common than larger ones (like 8 and 9): a leading 1 is expected about 30% of the time, a leading 9 less than 5%.

Datasets that often follow this pattern include:

  • Financial records
  • Scientific measurements
  • Astronomical distances
  • Street addresses

Why does this happen?

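Real-world data often arises from growth and multiplication and spans several orders of magnitude. Such data tends to be spread evenly on a logarithmic scale, where the interval from 1 to 2 is wider than the interval from 9 to 10, so smaller leading digits occur more often. The pattern is also "scale invariant": it holds regardless of the units in which the data is measured.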
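One way to see this bias emerge from pure multiplicative growth is to tabulate the leading digits of a quantity that doubles at every step. The sketch below is illustrative only and is not part of the P-Card analysis; the function name `leading_digit_shares` and the choice of 1,000 doublings are assumptions made for the example.

```python
import math
from collections import Counter


def leading_digit_shares(values):
    """Return {digit: share} for the first digits of positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# A quantity that starts at 2 and keeps doubling: 2, 4, 8, ..., 2**1000.
# Python integers are exact, so str() gives the true leading digit.
doublings = [2 ** n for n in range(1, 1001)]
shares = leading_digit_shares(doublings)

# Compare each observed share with Benford's expected proportion.
for digit, share in shares.items():
    benford = math.log10(1 + 1 / digit)
    print(f"{digit}: observed {share:.3f}  Benford {benford:.3f}")
```

The observed shares should track the Benford column closely, even though nothing in the code mentions logarithms until the comparison step.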

How can Benford's Law be useful?

Benford's Law can be a quick and powerful tool for detecting anomalies or fraud in data. If a dataset supposedly reflects real-world data but deviates significantly from Benford's Law, it might indicate manipulated or fabricated numbers.


💵 Real-world example using P-Card transactions

This notebook uses the DC government's purchase card (P-Card) transaction data. From Open Data DC:

In an effort to promote transparency and accountability, DC is providing Purchase Card transaction data to let taxpayers know how their tax dollars are being spent. Purchase Card transaction information is updated monthly. The Purchase Card Program Management Office is part of the Office of Contracting and Procurement.

The latest dataset is available at https://opendata.dc.gov/datasets/DCGIS::purchase-card-transactions/about.

Import packages

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

Read dataset

In [2]:
df = pd.read_csv('DC_PCard_Transactions.csv')
df.head(3)
Out[2]:
   AGENCY                                 TRANSACTION_DATE        TRANSACTION_AMOUNT  VENDOR_NAME          VENDOR_STATE_PROVINCE  MCC_DESCRIPTION                                DCS_LAST_MOD_DTTM       OBJECTID
0  Office of Latino Affairs               2009/01/05 05:00:00+00               16.80  USPS 1050050275 QQQ  DC                     Postage Services-Government Only               2009/04/28 20:57:31+00         1
1  Department of Mental Health            2009/01/05 05:00:00+00              229.50  WW GRAINGER 912      DC                     Industrial Supplies, Not Elsewhere Classified  2009/04/28 20:57:31+00         2
2  District Department of Transportation  2009/01/05 05:00:00+00             3147.33  BRANCH SUPPLY        DC                     Stationery, Office & School Supply Stores      2009/04/28 20:57:31+00         3
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534641 entries, 0 to 534640
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   AGENCY                 534641 non-null  object 
 1   TRANSACTION_DATE       534641 non-null  object 
 2   TRANSACTION_AMOUNT     534641 non-null  float64
 3   VENDOR_NAME            534602 non-null  object 
 4   VENDOR_STATE_PROVINCE  533012 non-null  object 
 5   MCC_DESCRIPTION        534623 non-null  object 
 6   DCS_LAST_MOD_DTTM      534641 non-null  object 
 7   OBJECTID               534641 non-null  int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 32.6+ MB

🧪 Benford's Law Analysis - First Digit

The code below grabs the first digits of the 'TRANSACTION_AMOUNT' column after converting the column into a string type.

In [4]:
# keep only amounts >= 1, which drops negative amounts and amounts with a leading zero (below 1)
# retrieve the first digit and use value_counts to find frequency
df_benford_first_digit = df['TRANSACTION_AMOUNT'] \
    [df['TRANSACTION_AMOUNT'] >= 1] \
    .astype(str).str[0] \
    .value_counts() \
    .to_frame(name="count") \
    .reset_index(names="first_digit") \
    .sort_values('first_digit')

# calculate percentages
df_benford_first_digit['actual_proportion'] = df_benford_first_digit['count'] / df_benford_first_digit['count'].sum()
df_benford_first_digit
Out[4]:
   first_digit   count  actual_proportion
0            1  150073           0.293817
1            2   99240           0.194295
2            3   62232           0.121840
3            4   51923           0.101656
4            5   42581           0.083366
5            6   30417           0.059551
6            7   27649           0.054132
8            8   23169           0.045361
7            9   23486           0.045982
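The filter-and-extract chain in cell In [4] can be traced step by step in plain Python. This is a minimal sketch on a handful of amounts: the first three come from `df.head(3)`, and the remaining four are made-up values (not from the dataset) chosen to exercise the filter.

```python
from collections import Counter

# First three amounts come from df.head(3); the rest are made-up values
# added to exercise the filter (hypothetical, not from the dataset).
amounts = [16.80, 229.50, 3147.33, 0.75, -12.00, 1.05, 18.20]

# Step 1: keep amounts >= 1 (drops negatives and values with a leading zero).
kept = [a for a in amounts if a >= 1]

# Step 2: convert each amount to a string and take its first character.
first_digits = [str(a)[0] for a in kept]

# Step 3: count each digit and convert the counts to proportions.
counts = Counter(first_digits)
proportions = {d: c / len(first_digits) for d, c in sorted(counts.items())}

print(first_digits)   # ['1', '2', '3', '1', '1']
print(proportions)    # {'1': 0.6, '2': 0.2, '3': 0.2}
```

As in the pandas version, the digits are kept as strings, which is why the `first_digit` column sorts correctly only because every value is a single character.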

Benford's proposed distribution of leading digit frequencies is given by

\begin{equation} P_i=\log _{10}\left(\frac{i+1}{i}\right) ; \quad i \in\{1,2,3, \ldots, 9\}, \end{equation}

where $P_i$ is the probability of finding $i$ as the leading digit in a given number.
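
As a quick sanity check on this formula (a worked step added here for illustration), the nine probabilities sum to one because the product inside the logarithm telescopes:

\begin{equation} \sum_{i=1}^{9} P_i=\log _{10}\left(\prod_{i=1}^{9} \frac{i+1}{i}\right)=\log _{10}\left(\frac{10}{1}\right)=1 . \end{equation}

For example, $P_1=\log_{10} 2 \approx 0.301$ and $P_9=\log_{10}(10/9) \approx 0.046$, so a leading 1 is expected roughly 6.6 times as often as a leading 9.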

Create a new column that contains Benford's proposed distribution of leading digit frequencies.

In [5]:
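# append a benford_proportion column that contains Benford's distribution,
# computed from each row's own digit so the result does not depend on row order
digits = df_benford_first_digit['first_digit'].astype(int)
df_benford_first_digit['benford_proportion'] = np.log10(1 + 1 / digits)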
df_benford_first_digit
Out[5]:
   first_digit   count  actual_proportion  benford_proportion
0            1  150073           0.293817            0.301030
1            2   99240           0.194295            0.176091
2            3   62232           0.121840            0.124939
3            4   51923           0.101656            0.096910
4            5   42581           0.083366            0.079181
5            6   30417           0.059551            0.066947
6            7   27649           0.054132            0.057992
8            8   23169           0.045361            0.051153
7            9   23486           0.045982            0.045757
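
To put a number on how far the actual proportions sit from Benford's, two common summary measures are the mean absolute deviation (MAD) and the Pearson chi-square statistic. The sketch below is an addition to the original analysis, not part of it. It uses only the counts shown in Out[5] and the standard library, and the variable names are chosen for illustration.

```python
import math

# Observed first-digit counts for digits 1..9, copied from Out[5].
observed_counts = [150073, 99240, 62232, 51923, 42581, 30417, 27649, 23169, 23486]

total = sum(observed_counts)
actual = [c / total for c in observed_counts]
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Mean absolute deviation: average gap between actual and expected proportions.
mad = sum(abs(a - b) for a, b in zip(actual, benford)) / 9

# Pearson chi-square statistic on counts: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (obs - total * b) ** 2 / (total * b)
    for obs, b in zip(observed_counts, benford)
)

# Digit whose actual share is furthest from Benford's expectation.
worst_digit = max(range(1, 10), key=lambda d: abs(actual[d - 1] - benford[d - 1]))

print(f"MAD = {mad:.4f}")                       # MAD = 0.0061
print(f"chi-square = {chi_square:.1f}")
print(f"largest gap at digit {worst_digit}")    # largest gap at digit 2
```

The chi-square statistic here is far above 15.51, the 5% critical value for 8 degrees of freedom. With more than 500,000 transactions, though, the test flags even small gaps, so the MAD and the per-digit differences are usually the more informative readout for a dataset of this size.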

Plot distributions

In [6]:
fig = px.bar(
    data_frame=df_benford_first_digit,
    x='first_digit',
    y=['actual_proportion', 'benford_proportion'],
    title='<b>Proportions of Leading Digits for P-Card Transactions</b><br>\
<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>',
    labels={
        'first_digit': 'First Digit',
    },
    height=500,
    barmode='group',
    template='simple_white'
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
)

fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'

fig.show()