In multi-class text classification, the performance (effectiveness) of a classifier is usually measured by micro-averaged and macro-averaged F1 scores. However, the scores themselves do not tell us how reliable they are in terms of forecasting the classifier's future pe...