A Data Generating Process for Improper Payments

Author

Wade K. Copeland

Published

February 27, 2026

Statistical Methods

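The proposed data generating process for improper payments is a mixture distribution: each payment is either proper, with an improper amount of zero, or improper, with a fraction of the payment amount being improper.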

Payments are generated from a truncated gamma distribution, X_i \sim \Gamma_T(\alpha, \theta) (see https://www.kingcopeland.com/truncated-gamma-rvs-py/ for details). Unlike the gamma distribution, the first and second moments of the truncated gamma distribution don’t have closed-form solutions, so \alpha and \theta are solved for numerically given target values of E[X] and CV[X] = \frac{SD[X]}{E[X]}.
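One way to set up this numerical solve is sketched below. It is an illustration under stated assumptions, not necessarily the method used in the linked post or in the package: the truncated moments are written in terms of the gamma CDF (which is itself evaluated numerically), and the two moment equations are handed to SciPy's least-squares solver. The function names truncated_gamma_moments and solve_alpha_theta are hypothetical.

```python
import numpy as np
from scipy import optimize, stats


def truncated_gamma_moments(alpha, theta, lower, upper):
    """Mean and CV of a gamma(alpha, scale=theta) truncated to [lower, upper]."""
    def mass(shape):
        # Probability that an untruncated gamma(shape, theta) lies in [lower, upper].
        return (stats.gamma.cdf(upper, shape, scale=theta)
                - stats.gamma.cdf(lower, shape, scale=theta))

    # Partial-moment identity: the integral of x^k * f(x; alpha, theta) over
    # [lower, upper] equals theta^k * Gamma(alpha + k) / Gamma(alpha) * mass(alpha + k).
    m1 = alpha * theta * mass(alpha + 1) / mass(alpha)
    m2 = alpha * (alpha + 1) * theta**2 * mass(alpha + 2) / mass(alpha)
    sd = np.sqrt(m2 - m1**2)
    return m1, sd / m1


def solve_alpha_theta(mean_target, cv_target, lower, upper):
    """Find (alpha, theta) whose truncated mean and CV match the targets."""
    def residuals(params):
        mean, cv = truncated_gamma_moments(params[0], params[1], lower, upper)
        # Relative errors keep the two equations on a comparable scale.
        return [mean / mean_target - 1.0, cv / cv_target - 1.0]

    # Start from the moment-matching solution of the untruncated gamma.
    start = [1.0 / cv_target**2, mean_target * cv_target**2]
    fit = optimize.least_squares(residuals, start, bounds=(1e-8, np.inf))
    return fit.x[0], fit.x[1]


# Illustrative targets: mean 100, CV 0.5, support [0, 1000].
alpha, theta = solve_alpha_theta(100.0, 0.5, 0.0, 1000.0)
print(f"alpha = {alpha:.4f}, theta = {theta:.4f}")
```

When the upper bound sits far in the right tail, as with these illustrative targets, the solution stays close to the untruncated values \alpha = 1/CV[X]^2 and \theta = E[X] \cdot CV[X]^2. Tighter truncation moves the solution away from that starting point.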

Improper payments are then defined as Y_i = B_iX_iZ_i, where B_i \sim Unif(a, b) with 0 \le a \le b \le 1 is the fraction of the payment that is improper, and Z_i \sim Bin(1, p) indicates whether the payment is improper, with 1 the number of trials and p the probability of observing an improper payment (Bain and Engelhardt 1992, 4:92–95, 109–10).
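The construction Y_i = B_iX_iZ_i can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation. It assumes that B_i, X_i, and Z_i are drawn independently (the text does not say so explicitly), it uses simple rejection sampling for the truncation, and the function name simulate_improper_payments and the parameter values \alpha = 4, \theta = 25 are placeholders chosen for the example.

```python
import random


def simulate_improper_payments(alpha, theta, lower, upper,
                               frac_low, frac_high, p, size, seed=None):
    """Return a list of (X, B, Z, Y) tuples with Y = B * X * Z.

    Assumes X, B, and Z are drawn independently of one another.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        # Truncated gamma by rejection: redraw until X lies in [lower, upper].
        # random.gammavariate takes the shape first and the scale second.
        x = rng.gammavariate(alpha, theta)
        while not (lower <= x <= upper):
            x = rng.gammavariate(alpha, theta)
        b = rng.uniform(frac_low, frac_high)  # fraction of the payment that is improper
        z = 1 if rng.random() < p else 0      # Bin(1, p) indicator of an improper payment
        rows.append((x, b, z, b * x * z))
    return rows


# Placeholder parameters for illustration only.
rows = simulate_improper_payments(alpha=4.0, theta=25.0, lower=0.0, upper=1000.0,
                                  frac_low=0.4, frac_high=0.6, p=0.1,
                                  size=20_000, seed=123)
share_improper = sum(z for _, _, z, _ in rows) / len(rows)
mean_payment = sum(x for x, _, _, _ in rows) / len(rows)
print(f"share improper: {share_improper:.2%}, mean payment: {mean_payment:.2f}")
```

Rejection sampling is easy to read but becomes wasteful when the truncation bounds cut off much of the gamma distribution's mass. An inverse-CDF draw is a common alternative in that case. Because of the independence assumption, results from this sketch need not match those of the package.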

Example

Suppose we want to generate 100,000 payments with a maximum payment amount of $1,000, a mean payment amount of $100, and a standard deviation of half the mean payment amount (i.e., a coefficient of variation of 0.5). The improper portion of each improper payment is between 40% and 60% of the payment amount, and the probability that a payment is improper is 10%.

We can accomplish this in Python using the improper_payments_dgp package (Copeland 2026). The package's improper_payments_dgp function simulates improper payment data. It accepts the following arguments:

  • mean_target: The target mean payment amount.
  • cv_target: The target coefficient of variation (i.e., CV[X] = SD[X]/E[X]) for the payment amount.
  • A: The minimum payment amount.
  • B: The maximum payment amount.
  • b: Lower and upper bounds of the uniform distribution for the percentage of each payment that is improper, given as a pair such as (0.4, 0.6), with both values between 0 and 1.
  • p_improper: Probability that a given payment is improper.
  • size: The number of payments to generate.
  • random_state: The seed for random number generation to create reproducible results.
    • Set to an integer for consistent results.
    • Set to None if reproducibility is not required.
from improper_payments_dgp import improper_payments_dgp

pop_data = improper_payments_dgp(
    mean_target=100,
    cv_target=1/2,
    A=0,
    B=1000,
    b=(0.4, 0.6),
    p_improper=0.1,
    size=100000,
    random_state=123,
)

Below is the population data. The function returns a pandas DataFrame with the following columns:

  • X: Random variate(s) of a truncated gamma distribution for the payment amount.
  • B: Random variate(s) of a uniform distribution for the percent of the payment that is improper.
  • Z: Random variate(s) of a binomial distribution with one trial that indicate whether a payment is improper.
  • Y: Random variate(s) for the improper payment amount.
pop_data
                X         B  Z    Y
0      116.246213  0.539294  0  0.0
1       35.021219  0.457228  0  0.0
2       59.890232  0.445370  0  0.0
3       55.471881  0.510263  0  0.0
4       54.394463  0.543894  0  0.0
...           ...       ... ..  ...
99995  133.513989  0.476092  0  0.0
99996   79.941630  0.504328  0  0.0
99997   91.298149  0.573550  0  0.0
99998   22.203274  0.513245  0  0.0
99999   92.776933  0.522579  0  0.0

100000 rows × 4 columns

cv = pop_data.X.std() / pop_data.X.mean()  # sample coefficient of variation
improper = pop_data.Y[pop_data.Y > 0]      # amounts for improper payments only

print(
    f"The mean payment amount is ${pop_data.X.mean():.2f} "
    f"with a total payment amount of ${pop_data.X.sum():,.2f}.\n"
    f"The coefficient of variation for payment amounts is {cv:.2%}.\n"
    f"The minimum and maximum percentages of improper payments are "
    f"{pop_data.B.min():.2%} and {pop_data.B.max():.2%}, respectively.\n"
    f"The probability of an improper payment is {pop_data.Z.mean():.2%}.\n"
    f"The mean improper payment amount (conditional on being improper) is "
    f"${improper.mean():.2f} with a total improper payment amount of "
    f"${pop_data.Y.sum():,.2f}."
)
The mean payment amount is $99.97 with a total payment amount of $9,996,861.11.
The coefficient of variation for payment amounts is 49.98%.
The minimum and maximum percentages of improper payments are 40.00% and 60.00%, respectively.
The probability of an improper payment is 9.95%.
The mean improper payment amount (conditional on being improper) is $59.66 with a total improper payment amount of $593,688.44.
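As a rough plausibility check on the payment amounts and the improper-payment rate reported above, the sample values can be compared with their targets in units of standard error. This is a back-of-the-envelope sketch: it treats the target mean, CV, and probability as the true population values, and it plugs in the rounded figures from the printed output.

```python
import math

n = 100_000
mean_target, cv_target, p_target = 100.0, 0.5, 0.10

# Standard error of the sample mean: SD[X] / sqrt(n), with SD[X] = CV[X] * E[X].
se_mean = cv_target * mean_target / math.sqrt(n)

# Standard error of the sample proportion of improper payments.
se_prop = math.sqrt(p_target * (1.0 - p_target) / n)

# Distance of the reported sample values from their targets, in standard errors.
z_mean = (99.97 - mean_target) / se_mean
z_prop = (0.0995 - p_target) / se_prop

print(f"mean: {z_mean:+.2f} SE, proportion: {z_prop:+.2f} SE")
```

Both values lie well within two standard errors of their targets, which is what we would expect if the draws follow the stated distributions.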
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.hist(pop_data.Y[pop_data.Y > 0], bins = "fd", density = False, alpha = 0.6, color = "steelblue", edgecolor = 'white')
plt.title("Distribution of Improper Payments")
plt.xlabel("Improper Payment Amount ($)")
plt.ylabel("Frequency")
plt.grid(True, alpha = 0.2)
plt.show()

Session Information

All of the files needed to reproduce these results can be downloaded from the Git repository https://github.com/wkingc/improper-payments-dgp-py.

-----
improper_payments_dgp       0.1.6
matplotlib                  3.10.8
pandas                      3.0.0
session_info                v1.0.1
-----
Python 3.14.2 (main, Dec  5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.6.3.2)]
macOS-26.3-arm64-arm-64bit-Mach-O
-----
Session information updated at 2026-02-27 17:12

References

Bain, Lee J., and Max Engelhardt. 1992. Introduction to Probability and Mathematical Statistics. Vol. 4. Belmont, CA: Duxbury Press.
Copeland, Wade K. 2026. “improper-payments-dgp: A Python package to simulate data for improper payments.” https://pypi.org/project/improper-payments-dgp/.