Title
Generalized Data Thinning Using Sufficient Statistics
Abstract
Sample splitting is one of the most tried-and-true tools in the data scientist's toolbox. It breaks a data set into two independent parts, allowing one to perform valid inference after an exploratory analysis or after training a model. A recent paper (Neufeld et al., 2023) provided a remarkable alternative to sample splitting, which the authors showed to be attractive in situations where sample splitting is not possible. Their method, called convolution-closed data thinning, proceeds very differently from sample splitting, and yet it also produces two statistically independent data sets from the original. In this talk, we will show that sufficiency is the key underlying principle that makes their approach possible. This insight leads naturally to a new framework that unifies sample splitting and convolution-closed data thinning under a single procedure. Furthermore, we show that our framework greatly widens the class of distributions for which thinning is possible.
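To make the idea of thinning concrete, here is a minimal sketch of the Poisson case, a standard textbook instance of convolution-closed thinning and not code from the talk or the cited paper. If X ~ Poisson(lam) and X1 given X is drawn as Binomial(X, eps), then X1 and X2 = X - X1 are independent Poisson variables with means eps*lam and (1 - eps)*lam. The function name `thin_poisson`, the parameter `eps`, and the simulation settings are illustrative assumptions.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts that sum back to x.

    Assumes each entry of x is a Poisson draw. Given x, the first part is
    Binomial(x, eps); the second part is the remainder.
    """
    x = np.asarray(x)
    x1 = rng.binomial(n=x, p=eps)  # elementwise binomial draw
    x2 = x - x1                    # remainder, so x1 + x2 == x
    return x1, x2


# Illustrative simulation (hypothetical settings).
rng = np.random.default_rng(0)
lam, eps, n = 10.0, 0.5, 100_000
x = rng.poisson(lam=lam, size=n)
x1, x2 = thin_poisson(x, eps, rng)

# Empirical checks: the parts should be nearly uncorrelated, with means
# close to eps * lam and (1 - eps) * lam.
corr = np.corrcoef(x1, x2)[0, 1]
print("mean of x1:", x1.mean())
print("mean of x2:", x2.mean())
print("sample correlation:", corr)
```

Unlike sample splitting, which assigns whole observations to one part or the other, this construction splits every observation, so it remains usable even when there is only a single observation.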