Product Reviews for Ordinal Quantification

This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. The goal of quantification is not to predict the class label of each individual instance, but to estimate the distribution of labels in unlabeled sets of data.

The data is extracted from the McAuley data set of Amazon product reviews, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to three protocols that are designed for the evaluation of quantification methods. The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with equal probability. The second protocol, APP-OQ(50%), is a variant thereof, where only the smoothest 50% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task. The third protocol considers "real" distributions of labels. These distributions stem from actual products in the original data set.

The data is represented by a RoBERTa embedding. In our experience, logistic regression classifiers work well with this representation. You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is already public and we provide all of our extraction scripts.

Extraction scripts and experiments: https://github.com/mirkobunse/regularized-oq
Original data by McAuley: https://jmcauley.ucsd.edu/data/amazon/
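
The APP and APP-OQ(50%) protocols described above can be illustrated with a minimal sketch. It is not the authors' extraction code (that lives in the repository linked above), and it makes two assumptions: label distributions are drawn uniformly from the probability simplex by normalizing exponential draws, and "smoothness" is measured by the sum of squared second-order differences between neighboring class prevalences. The function names (sample_prevalences, jaggedness, app_samples, app_oq) are hypothetical and chosen for illustration only.

```python
import random


def sample_prevalences(n_classes, rng):
    """Draw one label distribution uniformly at random from the simplex."""
    # Normalizing i.i.d. Exp(1) draws yields a Dirichlet(1, ..., 1) sample,
    # which is a uniform draw over all possible label distributions.
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def jaggedness(p):
    """Assumed smoothness measure: sum of squared second-order differences.

    Lower values mean that neighboring classes have more similar prevalences.
    """
    return sum(
        (-p[i - 1] + 2 * p[i] - p[i + 1]) ** 2
        for i in range(1, len(p) - 1)
    )


def app_samples(n_samples, n_classes=5, seed=0):
    """APP: draw n_samples label distributions with equal probability."""
    rng = random.Random(seed)
    return [sample_prevalences(n_classes, rng) for _ in range(n_samples)]


def app_oq(samples, keep_fraction=0.5):
    """APP-OQ: keep only the smoothest fraction of the given APP samples."""
    n_keep = int(len(samples) * keep_fraction)
    return sorted(samples, key=jaggedness)[:n_keep]


# Usage: 1000 APP distributions over 5 star ratings, then the smoothest half.
app = app_samples(1000)
smooth = app_oq(app, keep_fraction=0.5)
print(len(app), len(smooth))
```

Each resulting vector gives the target share of 1-star to 5-star reviews in one evaluation sample. Reviews would then be drawn from the pool so that the sample matches these shares. That drawing step is omitted here.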

References

M. Bunse, A. Moreo, F. Sebastiani, M. Senz (2022). Ordinal Quantification through Regularization.
J. McAuley, C. Targett, Q. Shi, A. van den Hengel (2015). Image-based recommendations on styles and substitutes.

Related Organizations

TU Dortmund University
Germany

Keywords

machine learning, classification, data analysis, prevalence estimation
