Publication · Preprint · 2017

The Cramer Distance as a Solution to Biased Wasserstein Gradients

Bellemare, Marc G.; Danihelka, Ivo; Dabney, Will; Mohamed, Shakir; Lakshminarayanan, Balaji; Hoyer, Stephan; Munos, Rémi
Open Access · English · Published: 30 May 2017
The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated in, among other settings, ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not have unbiased sample gradients. As an alternative, we propose the Cramér distance, which possesses all three desired properties.
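The abstract's central contrast can be illustrated numerically. The sketch below is not from the paper's code; it is a minimal 1-D illustration comparing the empirical 1-Wasserstein distance, whose expectation stays strictly positive even when the two distributions are identical, with a U-statistic estimator of the energy distance (in one dimension, twice the squared Cramér distance), which is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)

def wasserstein_1d(x, y):
    # Empirical 1-Wasserstein distance between equal-size 1-D samples:
    # mean absolute difference of the sorted samples.
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def energy_ustat(x, y):
    # U-statistic estimator of the energy distance
    #   E(P, Q) = 2 E|X - Y| - E|X - X'| - E|Y - Y'|,
    # which is unbiased (it can go negative on finite samples).
    n, m = len(x), len(y)
    xy = np.abs(x[:, None] - y[None, :]).mean()
    # Diagonal terms are zero, so summing the full matrix and dividing
    # by n(n-1) averages over distinct pairs only.
    xx = np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))
    yy = np.abs(y[:, None] - y[None, :]).sum() / (m * (m - 1))
    return 2 * xy - xx - yy

m = 10  # small sample size makes the bias easy to see
w_est, e_est = [], []
for _ in range(20000):
    x = rng.normal(size=m)
    y = rng.normal(size=m)  # same distribution: both true distances are 0
    w_est.append(wasserstein_1d(x, y))
    e_est.append(energy_ustat(x, y))

print(f"mean empirical W1:  {np.mean(w_est):.4f}")  # clearly positive: biased
print(f"mean energy U-stat: {np.mean(e_est):.4f}")  # near zero: unbiased
```

Because the expectation of the empirical Wasserstein loss differs from the true loss, gradients of that sample loss are biased too; the energy/Cramér estimator avoids this, which is the property the paper formalizes.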
Keywords: Computer Science - Learning; Statistics - Machine Learning
37 references, page 1 of 3

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, to appear.

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, pages 1196-1217.

Chung, K.-J. and Sobel, M. J. (1987). Discounted MDP's: Distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 25(1):49-62.

Cover, T. M. and Thomas, J. A. (1991). Elements of information theory. John Wiley & Sons.

Danihelka, I., Lakshminarayanan, B., Uria, B., Wierstra, D., and Dayan, P. (2017). Comparison of maximum likelihood and GAN-based training of Real NVPs. arXiv preprint arXiv:1705.05263.

Dedecker, J. and Merlevède, F. (2007). The empirical distribution function for dependent variables: asymptotic and nonasymptotic results in Lp. ESAIM: Probability and Statistics, 11:102-114.

Dudley, R. M. (2002). Real analysis and probability, volume 74. Cambridge University Press.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Esfahani, P. M. and Kuhn, D. (2015). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116.

Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. (2015). Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems.

Gao, R. and Kleywegt, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199.

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13:723-773.
