Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability
arXiv:2507.17316v1 Announce Type: new
Abstract: We consider the problem of estimating a discrete distribution $p$ with support of size $K$ and provide both upper and lower bounds with high probability in KL divergence. We prove that in the worst case, for any estimator $widehat{p}$, with probability at least $delta$, $text{KL}(p | widehat{p}) geq Cmax{K,ln(K)ln(1/delta) }/n $, where $n$ is the sample size and $C > 0$ is a constant. We introduce a computationally efficient estimator $p^{text{OTB}}$, based on Online to Batch conversion and suffix averaging, and show that with probability at least $1 – delta$ $text{KL}(p | widehat{p}) leq C(Klog(log(K)) + ln(K)ln(1/delta)) /n$.
Furthermore, we also show that with sufficiently many observations relative to $log(1/delta)$, the maximum likelihood estimator $bar{p}$ guarantees that with probability at least $1-delta$ $$
1/6 chi^2(bar{p}|p) leq 1/4 chi^2(p|bar{p}) leq text{KL}(p|bar{p}) leq C(K + log(1/delta))/n,, $$ where $chi^2$ denotes the $chi^2$-divergence.
Abstract: We consider the problem of estimating a discrete distribution $p$ with support of size $K$ and provide both upper and lower bounds with high probability in KL divergence. We prove that in the worst case, for any estimator $widehat{p}$, with probability at least $delta$, $text{KL}(p | widehat{p}) geq Cmax{K,ln(K)ln(1/delta) }/n $, where $n$ is the sample size and $C > 0$ is a constant. We introduce a computationally efficient estimator $p^{text{OTB}}$, based on Online to Batch conversion and suffix averaging, and show that with probability at least $1 – delta$ $text{KL}(p | widehat{p}) leq C(Klog(log(K)) + ln(K)ln(1/delta)) /n$.
Furthermore, we also show that with sufficiently many observations relative to $log(1/delta)$, the maximum likelihood estimator $bar{p}$ guarantees that with probability at least $1-delta$ $$
1/6 chi^2(bar{p}|p) leq 1/4 chi^2(p|bar{p}) leq text{KL}(p|bar{p}) leq C(K + log(1/delta))/n,, $$ where $chi^2$ denotes the $chi^2$-divergence.
Dirk van der Hoeven, Julia Olkhovskaia, Tim van Erven
Go to original source