A Data Generating Process for Improper Payments

Author

Wade K. Copeland

Published

February 27, 2026

Statistical Methods

The proposed data generating process for improper payments involves a mixture distribution.

Payments are generated using a truncated gamma distribution X_i \sim \Gamma_T(\alpha, \theta) (see https://www.kingcopeland.com/truncated-gamma-rvs-py/ for details). Unlike the gamma distribution, the truncated gamma distribution has no closed-form expressions that give \alpha and \theta from its first two moments, so both parameters are solved for numerically given target values of E[X] and CV[X] = \frac{SD[X]}{E[X]}.
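To make the numerical step concrete, here is a minimal sketch of one way to match the mean and coefficient of variation of a truncated gamma distribution. It is not the package's implementation: the function names truncated_gamma_moments and solve_alpha_theta are hypothetical, the moments are approximated with Simpson's rule, and the solver is a simple fixed-point iteration that assumes \alpha > 1 and mild truncation. A general-purpose root finder would be more robust under heavy truncation.

```python
import math


def truncated_gamma_moments(alpha, theta, lower, upper, n=4000):
    """Return (mean, CV) of a gamma(alpha, theta) truncated to [lower, upper].

    Integrals are approximated with Simpson's rule on n (even) subintervals.
    Assumes alpha > 1 so the density is zero at x = 0.
    """
    h = (upper - lower) / n

    def integrand(x, k):
        # x**k times the unnormalized gamma density x**(alpha - 1) * exp(-x / theta).
        if x <= 0.0:
            return 0.0
        return math.exp((alpha - 1 + k) * math.log(x) - x / theta)

    def simpson(k):
        total = integrand(lower, k) + integrand(upper, k)
        for i in range(1, n):
            total += (4 if i % 2 else 2) * integrand(lower + i * h, k)
        return total * h / 3

    # The normalizing constant cancels in the ratios below.
    m0, m1, m2 = simpson(0), simpson(1), simpson(2)
    mean = m1 / m0
    sd = math.sqrt(m2 / m0 - mean**2)
    return mean, sd / mean


def solve_alpha_theta(mean_target, cv_target, lower, upper, tol=1e-8, max_iter=200):
    """Hypothetical solver: rescale alpha and theta until the moments match."""
    # Start from the untruncated gamma solution: E[X] = alpha * theta, CV = 1 / sqrt(alpha).
    alpha = 1.0 / cv_target**2
    theta = mean_target * cv_target**2
    for _ in range(max_iter):
        mean, cv = truncated_gamma_moments(alpha, theta, lower, upper)
        if abs(mean / mean_target - 1) < tol and abs(cv / cv_target - 1) < tol:
            break
        ratio = (cv / cv_target) ** 2
        alpha *= ratio                          # corrects the CV
        theta *= (mean_target / mean) / ratio   # corrects the mean
    # Returns the last iterate; a fuller version would report non-convergence.
    return alpha, theta
```

A call such as solve_alpha_theta(100.0, 0.5, 0.0, 1000.0) returns the pair (\alpha, \theta), which can then be checked by passing it back to truncated_gamma_moments.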

Improper payments build directly on the payment amounts. Let Y_i = B_iX_iZ_i, where B_i \sim Unif(a, b) with 0 \le a \le b \le 1 is the fraction of payment i that is improper, and Z_i \sim Bin(1, p) indicates whether payment i is improper. Here 1 is the number of trials and p is the probability of observing an improper payment (Bain and Engelhardt 1992, 4:92–95, 109–10).
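The construction can be sketched with the Python standard library alone. Because Z_i = 0 forces Y_i = 0, the result mixes a point mass at zero with the continuous amounts B_iX_i. The sketch below is illustrative rather than the package's code: the function name and parameter names are hypothetical, the three components are drawn independently (an assumption the text does not state), and truncation is handled by simple rejection sampling, which may differ from the method described at the linked page.

```python
import random


def simulate_improper_payments(alpha, theta, lower, upper, frac_low, frac_high,
                               p, size, seed=None):
    """Return a list of (X, B, Z, Y) tuples with Y = B * X * Z."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        # Payment amount: redraw until the gamma variate lands in [lower, upper].
        # random.gammavariate takes the shape first and the scale second.
        x = rng.gammavariate(alpha, theta)
        while not (lower <= x <= upper):
            x = rng.gammavariate(alpha, theta)
        # Fraction of the payment that is improper.
        frac = rng.uniform(frac_low, frac_high)
        # Bernoulli indicator: 1 with probability p, otherwise 0.
        z = 1 if rng.random() < p else 0
        rows.append((x, frac, z, frac * x * z))
    return rows
```

Rejection sampling is cheap when little probability mass falls outside the truncation bounds and becomes slow when most of it does.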

Example

Suppose we want to generate 100,000 payments with a maximum payment amount of $1,000, a mean payment amount of $100, and a standard deviation equal to half the mean (i.e., a coefficient of variation of 0.5). The improper portion of each improper payment is between 40% and 60% of the payment amount, and the probability that a given payment is improper is 10%.
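As a rough check on these targets, the untruncated gamma distribution gives a useful reference point, since its mean is \alpha\theta and its coefficient of variation is 1/\sqrt{\alpha}. The subscript 0 below marks these reference values; treating them as starting values for a numerical solver is an assumption, not necessarily what the package does.

```latex
\alpha_0 = \frac{1}{CV[X]^2} = \frac{1}{0.5^2} = 4,
\qquad
\theta_0 = E[X] \cdot CV[X]^2 = 100 \times 0.25 = 25,
\qquad
\frac{1000}{\theta_0} = 40 .
```

The upper bound of $1,000 sits 40 scale units above zero, where a gamma distribution with shape 4 has almost no probability mass, so the truncated solution should lie very close to these values.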

We can accomplish this in Python using the improper_payments_dgp Python package (Copeland 2026). The package's improper_payments_dgp function simulates improper payment data. It accepts the following arguments:

mean_target: The target mean payment amount.
cv_target: The target coefficient of variation (i.e., CV[X] = SD[X]/E[X]) for the payment amount.
A: The minimum payment amount.
B: The maximum payment amount.
b: Lower and upper bounds of the uniform distribution, given as a pair such as (0.4, 0.6) with both values in [0, 1], for the percentage of each payment that is improper.
p_improper: Probability that a given payment is improper.
size: The number of payments to generate.
random_state: The seed for random number generation to create reproducible results.
- Set to an integer for consistent results.
- Set to None if reproducibility is not required.

from improper_payments_dgp import improper_payments_dgp

pop_data = improper_payments_dgp(mean_target = 100, cv_target = 1/2, A = 0, B = 1000, b = (0.4, 0.6), p_improper = 0.1, size = 100000, random_state = 123)

The function returns the population data as a pandas DataFrame with the following columns:

X: Random variate(s) of a truncated gamma distribution for the payment amount.
B: Random variate(s) of a uniform distribution for the fraction of the payment that is improper.
Z: Random variate(s) of a binomial distribution with one trial that indicate whether a payment is improper (1) or not (0).
Y: Random variate(s) for the improper payment amount.

pop_data

	X	B	Z	Y
0	116.246213	0.539294	0	0.0
1	35.021219	0.457228	0	0.0
2	59.890232	0.445370	0	0.0
3	55.471881	0.510263	0	0.0
4	54.394463	0.543894	0	0.0
...	...	...	...	...
99995	133.513989	0.476092	0	0.0
99996	79.941630	0.504328	0	0.0
99997	91.298149	0.573550	0	0.0
99998	22.203274	0.513245	0	0.0
99999	92.776933	0.522579	0	0.0

100000 rows × 4 columns

print(
    f"The mean payment amount is ${pop_data.X.mean():.2f} with a total payment amount of ${pop_data.X.sum():,.2f}.\n"
    f"The coefficient of variation for payment amounts is {pop_data.X.std() / pop_data.X.mean():.2%}.\n"
    f"The minimum and maximum percentages of improper payments are {pop_data.B.min():.2%} and {pop_data.B.max():.2%}, respectively.\n"
    f"The probability of an improper payment is {pop_data.Z.mean():.2%}.\n"
    f"The mean improper payment amount (conditional on being improper) is ${pop_data.Y[pop_data.Y > 0].mean():.2f} with a total improper payment amount of ${pop_data.Y.sum():,.2f}."
)

The mean payment amount is $99.97 with a total payment amount of $9,996,861.11.
The coefficient of variation for payment amounts is 49.98%.
The minimum and maximum percentages of improper payments are 40.00% and 60.00%, respectively.
The probability of an improper payment is 9.95%.
The mean improper payment amount (conditional on being improper) is $59.66 with a total improper payment amount of $593,688.44.

import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.hist(pop_data.Y[pop_data.Y > 0], bins = "fd", density = False, alpha = 0.6, color = "steelblue", edgecolor = 'white')
plt.title("Distribution of Improper Payments")
plt.xlabel("Improper Payment Amount ($)")
plt.ylabel("Frequency")
plt.grid(True, alpha = 0.2)
plt.show()

Session Information

All of the files needed to reproduce these results can be downloaded from the Git repository https://github.com/wkingc/improper-payments-dgp-py.

-----
improper_payments_dgp       0.1.6
matplotlib                  3.10.8
pandas                      3.0.0
session_info                v1.0.1
-----
Python 3.14.2 (main, Dec  5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.6.3.2)]
macOS-26.3-arm64-arm-64bit-Mach-O
-----
Session information updated at 2026-02-27 17:12

References

Bain, Lee J., and Max Engelhardt. 1992. Introduction to Probability and Mathematical Statistics. Vol. 4. Belmont, CA: Duxbury Press.

Copeland, Wade K. 2026. “improper-payments-dgp: A Python package to simulate data for improper payments.” https://pypi.org/project/improper-payments-dgp/.